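I cannot log in to Baidu Tuiguang Assistant (百度推广助手). It reports: "The device encountered a fatal error. Unable to write to the error log." See the screenshot below for details. Does anyone know how to fix this?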

Posted: 2018-07-13 02:44


Sun System Fault Analysis and Diagnosis
SAMSUNG SDS CHINA, Inc.
Beijing DC Business Operation
Alan Jin
CHINA
Part Number
Sun Platform Trouble Shooting
May 2005
SAMSUNG SDSC Internal Confidential
Contact Email
SUNMARMOT
Sun System Fault Analysis and Diagnosis Manual

This is a preliminary document, prepared ahead of the final commercial release of the software described here. The information in this document represents Alan Jin's current view of the issues discussed as of the date of publication. Alan Jin does not guarantee the accuracy of the information after the date of publication. This document is for reference only. SAMSUNG makes no warranties, express, implied, or statutory, as to the information in this document.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or stored in or introduced into a retrieval system, for any purpose, without the express written permission of SAMSUNG and Alan Jin.

Alan Jin may hold patents, patent applications, trademarks, copyrights, or other intellectual property rights covering the subject matter of this document. Except as expressly provided in a written license agreement from Alan Jin, possession of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

© Alan Jin. All rights reserved. The Sun logo and Solaris are registered trademarks or trademarks of Sun Microsystems in the United States and/or other countries or regions. The names of actual companies and products mentioned here may be trademarks of their respective owners.

Contents

I. Diagnosing and handling common system problems
  System analysis
  System diagnosis
  1. Fault type description
    1.1 Error reporting mechanisms
    1.2 Typical error types
      Software errors
      Hardware-corrected errors
      Recoverable errors
      Fatal errors
      System Watchdog Reset
      Critical Errors
  2. Locating and handling common hardware faults
    2.1 Common CPU and memory faults and their handling
      Common CPU and memory faults
      Common CPU faults
      Common memory faults
      How to locate common CPU/memory faults
    2.2 Locating and handling common disk faults
    2.3 Locating common Critical faults
    2.4 Diagnosing and maintaining Fire and V series servers
  3. Common file system problems and their handling
    3.1 Handling common Volume Manager problems
    3.2 Analyzing and handling common Sun SDS problems
  4. Common network faults and their handling
II. Using the Sun explorer software
  1. Brief introduction to the Sun explorer package
  2. Installing the Sun explorer package
    2.1 How to obtain the Sun Explorer package
    2.2 Installing and using the Sun Explorer package
  3. Collecting system information with explorer
    3.1 Initializing explorer
    3.2 Collecting general system information with explorer
    3.3 Collecting information about one aspect of the system with explorer
  4. Analyzing explorer results and identifying the point of failure
    4.1 Description of the explorer output directories
  5. New features in explorer 5.1
  5. Using the standalone 3510 analysis package
  6. Worked explorer example
III. Resolving faults with official Sun resources and tools
  1 Analyzing kernel core dumps and system panics
    1.1 Causes of a panic
    1.2 The panic process
    1.3 Two typical kinds of panic
    1.4 How to configure the system kernel core dump
    1.5 How to collect a core dump when the system hangs or panics
    1.6 How to analyze core files with adb, strings, and ACT
  2 Locating faults with official Sun resources
    2.1 Locating faults with Sunsolve
    2.2 Managing patches with Sun BigAdmin
    2.3 Commonly used Sun resource sites
  3 Analyzing the system with VTS
    3.1 VTS architecture
    3.2 Installing VTS
    3.3 Testing the system with VTS

I. Diagnosing and handling common system problems

How do we locate a fault when the system fails? A system fault can usually be resolved in two steps.

System analysis
This step collects the relevant system information and compares it against the normal state.
1. Describe the system environment.
2. Describe the fault symptoms.
3. Identify how the system at the time of the fault differs from normal operation.
4. Review the changes made to the system.

System diagnosis
This step analyzes the collected information, runs the corresponding tests and repairs, and ends with an analysis report on the problem.
5. Look for similar fault cases.
6. Test the cause suggested by the closest matching case.
7. Take action to correct the error.

Example: an E450 fault-handling record

1. Environment
============
Sun E450: Solaris 8

2. Problem Description
==================
The customer's machine often goes down on its own. Error messages appear continually on the E450 system.

3. Action Taken
============
Got messages from the customer:
WARNING: [AFT1] CP event on CPU2 (caused Data access error on CPU3), errID 0x00004e2f.f5ab3bdf
AFSR 0x00080<CP> AFAR 0xf4a8928
AFSR.PSYND 0x0080(Score 95) AFSR.ETS 0x00
UDBH 0x0000 UDBH.ESYND 0x00 UDBL 0x0000 UDBL.ESYND 0x00
unix: [AFT2] errID 0x00004e2f.f5ab3bdf PA=0xf4a8928
E$tag 0xd400fe9 E$State: Owner E$parity 0x0e
Note: It is a common mistake to replace the wrong CPU on an E450. (IMPORTANT: CPU 2 corresponds to socket J0301.)

4. Conclusion/Suspect
=================
The messages indicate a parity error with score 95 on CPU2, which causes UE errors and a data access error on CPU3. CPU2 in socket J0301 therefore needs to be replaced.

5. Current Action Plan
==================
1. Replace CPU2 in socket J0301.
2. Suggested onsite part: 501-MHz UltraSPARC IIi Module
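The triage used in this record, together with the rule of thumb given in 2.1.4 (a Score of 90 or above suggests the CPU, a value such as 05 suggests memory), can be sketched in Python. This is an illustration only, not a Sun tool: the function name triage_aft1, the regular expressions, the low cutoff of 5, and the "undetermined" result for in-between scores are all assumptions made for the example, and AFT message layouts differ between platforms and Solaris releases.

```python
import re

# Assumed patterns, modelled on the E450 message quoted above.
EVENT_RE = re.compile(r"\[AFT1\]\s+(\w+) event on CPU(\d+)")
SCORE_RE = re.compile(r"\(Score\s+(\d+)\)")


def triage_aft1(message, cpu_cutoff=90, memory_cutoff=5):
    """Return (event, cpu, score, suspect) for one AFT1 message, or None.

    cpu_cutoff follows the "90 or above" rule of thumb in 2.1.4;
    memory_cutoff is a hypothetical reading of "05 or a very low value".
    """
    event = EVENT_RE.search(message)
    score = SCORE_RE.search(message)
    if event is None or score is None:
        return None
    value = int(score.group(1))
    if value >= cpu_cutoff:
        suspect = "CPU"
    elif value <= memory_cutoff:
        suspect = "memory"
    else:
        suspect = "undetermined"  # the manual gives no rule for mid-range scores
    return event.group(1), int(event.group(2)), value, suspect


# Shortened copy of the message from the E450 record.
sample = (
    "WARNING: [AFT1] CP event on CPU2 (caused Data access error on CPU3), "
    "errID 0x00004e2f.f5ab3bdf AFSR.PSYND 0x0080(Score 95) AFSR.ETS 0x00"
)
print(triage_aft1(sample))  # ('CP', 2, 95, 'CPU')
```

The result only narrows the search. As the note in the record warns, the CPU number still has to be mapped to the correct physical socket before any part is replaced.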
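1. Fault type description

Error reporting mechanisms
  - Bus Errors
  - Interrupts
  - Resets
Types of errors
  - Software errors
  - Hardware-corrected errors
  - Recoverable errors
  - Fatal errors
  - Critical errors

1.1 Error reporting mechanisms

Bus errors
A bus error is reported when the processor references virtual or physical address space and a hardware problem prevents it from completing the access. Typical bus errors occur in the following situations:
  - During instruction fetch
  - SBus direct virtual memory access (DVMA) read/write operations
  - Synchronous/asynchronous data access
  - Memory management unit (MMU) operations

Interrupts for reporting
During normal operation, asynchronous external conditions can cause the processor to report an interrupt. Typical interrupt conditions:
  - A device starts up or becomes ready
  - Error diagnosis
  - A change in power state

System resets
A reset tries to bring the system into a defined state. Typical resets include:
  - Power-on
  - Watchdog (monitoring)
  - System software

1.2 Typical error types

1.2.1 Software errors
These errors are caused by software rather than hardware. The processor detects and reports this type of error. Typical software errors are program errors and operating system bugs.

1.2.2 Hardware-corrected errors
For error logging, a hardware-corrected error is always recorded through an interrupt. No recovery action is normally needed. A memory bit error is corrected automatically by ECC (Error Checking and Correcting) and reported in the error log.

1.2.3 Recoverable errors
Recoverable errors are caused by hardware. They are usually reported as a bus error to the requesting device, together with a special interrupt that broadcasts the error. The error is normally caught by a trap, then handled and logged by the interrupt handler.
Typical recoverable errors: a non-critical device loses power or becomes unavailable. An abnormal power loss on a disk can also produce recoverable errors, which are recorded in /var/adm/messages.

1.2.4 Fatal errors
Every fatal error triggers a system reset, because normal operation can no longer be guaranteed. A parity error on the system backplane is a typical fatal error. When the system reboots repeatedly and reports Fatal Reset, the cause is usually the motherboard or a CPU.

1.2.5 System Watchdog Reset
When a fatal error is detected, a system watchdog reset begins. The watchdog reset affects all CPUs and I/O devices. Writes in progress may be lost, but the state of main memory is unchanged and continues to be refreshed after the watchdog reset.

1.2.6 Critical Errors
Critical errors require the system to be shut down and powered off immediately, and are signaled through a high-level broadcast interrupt. Typical critical errors include:
  - An AC/DC failure
  - Temperature warning
  - Fan failure

2. Locating and handling common hardware faults

2.1 Common CPU and memory faults and their handling

2.1.1 Common CPU and memory faults

2.1.2 Common CPU faults

CPU ESD damage
When a CPU suffers ESD (electrostatic discharge) damage, ask Sun directly to replace the CPU.

CPU physical (VHDM) connection faults
These faults usually occur during initial installation.

CPU errors caused by the system environment
These faults usually have one of the following causes. Environmental alarms normally appear in /var/adm/messages and can also be checked with prtdiag.
1. The ambient temperature is too high, so the CPU sensor raises an alarm and the machine reboots.
2. The ambient humidity is too high, so condensation forms on the CPU or memory and causes a fault. Check the surface of the suspect CPU or memory module for marks left by dried water.
3. There is a lot of dust in the operating environment, which impairs cooling. This is usually easy to fix: clean the dust out of the equipment and check the environment. Dust can occasionally also cause ECC errors. When expanding an older system, cleaning the equipment first is strongly recommended.

CPU and memory faults caused by power, grounding, and electromagnetic interference
These faults are generally abnormal reboots caused by an unstable power supply in the operating environment or by equipment that is not grounded.

Faults caused by OBP and KU patches
When handling a CPU fault, check the versions of the system OBP and the system patches. If a new CPU or memory module is installed in a machine that runs an old OBP version or patch level, the system may well report errors or a Fatal Reset. When troubleshooting, look up the relevant patches and their notes on sunsolve.sun.com and, if necessary, upgrade the OBP and the patches.

ECC errors

2.1.3 Common memory faults

Memory ECC errors, including ECC correctable events (CE, FRC, RCE).

Memory ECC errors, including uncorrectable events (UE/RUE/FRU).

Example: Blade 1500 showing UE Events on local memory.
May 2 21:42:37 zenith SUNW,UltraSPARC-IIIi: [ID 665471 kern.warning] WARNING: [AFT1] Uncorrectable memory (UE) Event detected by CPU0 Privileged Data Access at TL=0, errID 0x434792
May 2 21:42:37 zenith AFSR 0x<PRIV,UE>.0000000a AFAR 0x14020
May 2 21:42:37 zenith Fault_PC 0x Esynd 0x000a DIMM0 DIMM1
May 2 21:42:38 zenith SUNW,UltraSPARC-IIIi: [ID 763290 kern.notice] [AFT1] errID 0x434792 Two Bits were in error
May 2 21:42:38 zenith SUNW,UltraSPARC-IIIi: [ID 558597 kern.notice] [AFT2] errID 0x434792 E$tag PA=0x14000 does not match AFAR=0x14000
May 2 21:42:38 zenith SUNW,UltraSPARC-IIIi: [ID 500588 kern.notice] [AFT2] errID 0x434792 PA=0x14000
May 2 21:42:38 zenith E$tag 0x02000 E$state Invalid E$indx 0.
May 2 21:42:38 zenith SUNW,UltraSPARC-IIIi: [ID 894887 kern.notice] [AFT2] E$Data (0x00) 0xe09e2.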
ECC 0x0be May 2 21:42:38 zenith SUNW,UltraSPARC-IIIi: [ID 894887 kern.notice] [AFT2] E$Data (0x10) 0x.xaa2000 ECC 0x030 May 2 21:42:38 zenith SUNW,UltraSPARC-IIIi: [ID 894887 kern.notice] [AFT2] E$Data (0x20) 0xfffe344 0xc2000 ECC 0x0ba May 2 21:42:38 zenith SUNW,UltraSPARC-IIIi: [ID 894887 kern.notice] [AFT2] E$Data (0x30) 0xb138a000.81c7e008 0x81e00 ECC 0x0de May 2 21:42:38 zenith SUNW,UltraSPARC-IIIi: [ID 558597 kern.notice] [AFT2] errID 0x434792 E$tag PA=0x14000 does not match AFAR=0x14000 May 2 21:42:38 zenith SUNW,UltraSPARC-IIIi: [ID 500588 kern.notice] [AFT2] errID 0x434792 PA=0x14000 May 2 21:42:38 zenith E$tag 0x06000 E$state Invalid E$indx 1. The above UE error with syndrome bit 0x0a show that there are 2 bits flipped in a nibble, giving this a high chance that this is due to a bad memory or datapaths. Example: V440 showing RUE/FRU errors. Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 903886 kern.warning] WARNING: [AFT1] Uncorrectable remote memory/cache (RUE) Event detected by CPU2 Privileged Data Access at TL=0, errID 0xefd76e8 Feb 17 10:00:21 barrington AFSR 0x&PRIV,RUE&. AFAR 0xcf0c070 Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 903886 kern.warning] WARNING: [AFT1] Uncorrectable remote memory/cache (RUE) Event detected by CPU2 Privileged Data Access at TL=0, errID 0xefd76e8 Feb 17 10:00:21 barrington AFSR 0x&PRIV,RUE&. AFAR 0xcf0c070 Feb 17 10:00:21 barrington Fault_PC 0x10098f94 J_REQ 3 Feb 17 10:00:21 barrington C3/P0/B1: B1/D0 B1/D1 (applicable only if corresponding FRU Event also logged) Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 875905 kern.info] [AFT2] errID 0xefd76e8 E$tag PA=0xce8c040SUNMARMOTSun 系统故障分析与诊断手册does not match AFAR=0xcf0c040 Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 604857 kern.info] [AFT2] errID 0xefd76e8 PA=0xce8c040 Feb 17 10:00:21 barrington E$tag 0xc8b3a E$state Exclusive E$indx 0. 
Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0xe8c080 0xe8bfc0 ECC 0x16e Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0xffffffff.ffffffff 0x00000 ECC 0x0ed Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0x00 ECC 0x03e Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2] E$Data (0x30) 0x00806cab. 0x00000 ECC 0x0bd Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 875905 kern.info] [AFT2] errID 0xefd76e8 E$tag PA=0xcc8c040 does not match AFAR=0xcf0c040 Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 604857 kern.info] [AFT2] errID 0xefd76e8 PA=0xcc8c040 Feb 17 10:00:21 barrington E$tag 0xc8b32 E$state Exclusive E$indx 1. Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0x00 ECC 0x000 Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0x00 ECC 0x000 Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0xc8c0a0 0xc8bfe0 ECC 0x1d3 Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2] E$Data (0x30) 0xffffffff.ffffffff 0x00000 ECC 0x0ed Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 875905 kern.info] [AFT2] errID 0xefd76e8 E$tag PA=0xcdcc040 does not match AFAR=0xcf0c040 Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 604857 kern.info] [AFT2] errID 0xefd76e8 PA=0xcdcc040 Feb 17 10:00:21 barrington E$tag 0xc8b37 E$state Exclusive E$indx 2. 
Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0xdcc080 0xdcbfc0 ECC 0x01b Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0xffffffff.ffffffff 0x00000 ECC 0x0ed Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0x00 ECC 0x03e Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2] E$Data (0x30) 0x00804cab. 0x00000 ECC 0x079 Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 604857 kern.info] [AFT2] errID 0xefd76e8 PA=0xcf0c040 Feb 17 10:00:21 barrington E$tag 0xc8b3c E$state Invalid E$indx 3. Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2] E$Data (0x00) 0x00 ECC 0x03e Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2] E$Data (0x10) 0x000c0 0x00000 ECC 0x0bc Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 895151 kern.info] [AFT2] E$Data (0x20) 0x00 ECC 0x000 Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 819380 kern.info] [AFT2] E$Data (0x30) 0xc0 ECC 0x0b7 *Bad* Esynd=0x003 Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 929717 kern.info] [AFT2]SUNMARMOTSun 系统故障分析与诊断手册D$ data not available Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 755759 kern.warning] WARNING: [AFT1] Uncorrectable memory (FRU) Event detected by CPU3 at TL=0, errID 0xefdf76c Feb 17 10:00:21 barrington AFSR 0x004b4&FRU& AFAR 0x. INVALID Feb 17 10:00:21 barrington Fault_PC 0x10034c3c Esynd 0x00b4 J_AID 2 Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 571007 kern.notice] [AFT1] errID 0xefdf76c Two Bits were in error Feb 17 10:00:21 barrington SUNW,UltraSPARC-IIIi: [ID 642995 kern.warning] WARNING: [AFT1] Uncorrectable memory (FRU) Event detected by CPU3 at TL=0, errID 0xf3774 Feb 17 10:00:21 barrington AFSR 0x00403&FRU& AFAR 0x. 
INVALID
Feb 17 10:00:21 barrington Fault_PC 0x1000721c Esynd 0x0003 J_AID 2

The error above shows CPU2 trying to access a memory module attached to CPU3. CPU2 was informed (RUE) that the remote CPU3 hit a UE error while accessing that memory. CPU3 knows that CPU2 requested data from its local memory and that the access hit a UE, hence the FRU event.

2.1.4 How to locate common CPU/memory faults

1. Check whether the status LEDs on the CPU/Memory board are normal.
2. Look for related error messages in /var/adm/messages. To tell whether memory or a CPU caused the error, focus on the Score value: a Score of 90 or higher most likely points to a CPU fault, while 05 or another very low value most likely points to memory.
3. On machines that use an ALOM card, such as the V440, check the system status in ALOM.

Example
ALOM BOOTMON v1.0.0
ALOM Build Release: 017
Reset register: e0000000 EHRS ESRS LLRS
ALOM POST 1.0
Dual Port Memory Test, PASSED.
TTY External - Internal Loopback Test
TTY External - Internal Loopback Test, PASSED.
TTYC - Internal Loopback Test
TTYC - Internal Loopback Test, PASSED.
TTYD - Internal Loopback Test
TTYD - Internal Loopback Test, PASSED.
Memory Data Lines Test
Memory Data Lines Test, PASSED.
Memory Address Lines Test
Slide address bits to test open address lines
Test for shorted address lines
Memory Address Lines Test, PASSED.
Memory Parity Test
Memory Parity Test, PASSED.
Boot Sector FLASH CRC Test
Boot Sector FLASH CRC Test, PASSED.
Return to Boot Monitor for Handshake
ALOM POST 1.0
Status = 00007fff
Returned from Boot Monitor and Handshake
Clearing Memory Cells
Memory Clean Complete
Loading the runtime image...
Sun(tm) Advanced Lights Out Manager 1.0 (bear)
Full VxDiag Tests
BASIC TOD TEST
Read the TOD Clock: MON SEP 30 19:09:26 2002
Wait, 1 - 3 seconds
Read the TOD Clock: MON SEP 30 19:09:28 2002
BASIC TOD TEST, PASSED
ETHERNET CPU LOOPBACK TEST
50 BYTE PACKET - a 0 in field of 1's.
50 BYTE PACKET - a 1 in field of 0's.
900 BYTE PACKET - pseudo-random data.
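ETHERNET CPU LOOPBACK TEST, PASSED
Full VxDiag Tests - PASSED
Status summary - Status = 7FFF
  VxDiag    - PASSED
  POST      - PASSED
  LOOPBACK  - PASSED
  I2C       - PASSED
  EPROM     - PASSED
  FRU PROM  - PASSED
  ETHERNET  - PASSED
  MAIN CRC  - PASSED
  BOOT CRC  - PASSED
  TTYD      - PASSED
  TTYC      - PASSED
  MEMORY    - PASSED
  MPC860    - PASSED
Please login:

For detailed steps, see 2.4 Diagnosing and maintaining Fire and V series servers.

4. At the ok prompt, run a maximum-level self-test to check the CPUs.
5. Run prtdiag -v to check whether the CPU and memory status is normal.
6. Collect explorer output and analyze it.
7. Use VTS to stress-test the CPUs and memory.

2.2 Locating and handling common disk faults

2.2.1 Physical disk damage

Physical damage usually means that the system can no longer recognize the disk, or that the disk keeps reporting hard errors. Such a disk can be treated as failed and unusable. The following checks show whether a disk is physically damaged. (Note: on Fire and V series servers, first make sure the disk is online.)
  - The disk status LED is amber.
  - probe-scsi-all at the ok prompt does not detect the disk.
  - The format command in the running system shows the disk as unformatted or unavailable.
  - /var/adm/messages contains frequent disk error messages.
  - iostat -En in the running system shows a large number of hard errors for the disk.
  - Check with showlogs on the ALOM card, or check syslog in the running system.

2.2.2 Disk soft faults

Common signs of a disk soft fault:
  - The file system sometimes reports errors when the system reboots.
  - iostat -En in the running system shows soft errors.
  - /var/adm/messages contains disk error messages, but they do not appear often.
  - The disk status is normal in Sun SDS or Veritas Volume Manager.
  - VTS finds no disk errors on the device.
  - Check with showlogs on the ALOM card, or check syslog in the running system.

EG.
---------------------------------------------
System Disks:
---------------------------------------------
Disk  Status  Service  OK-to-Remove
---------------------------------------------
HDD0  OK      OFF      OFF
HDD1  OK      OFF      OFF
HDD2  OK      OFF      OFF
HDD3  OK      OFF      OFF
The showlogs command also indicates any events

If the disk passes all the checks above and iostat -En shows no hard errors, the soft errors can be repaired by, for example, synchronizing the file system.

EG. How to replace an internal disk in a Fire or V series server

Replace a Sun StorEdge[TM] Volume Manager controlled internal drive in a Sun Fire[TM] V880

Problem Statement
As the internal drives in a V880 are Fibre Channel, luxadm has to be used to remove traces of the old WWN from the system. "luxadm probe" can be used to determine the "enclosure" name containing these internal drives.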
# luxadm probe
Found Enclosure:
SUNWGS INT FCBPL Name:LoopA Logical Path:/dev/es/ses0 Node WWN:d7d8

With the enclosure name, you can determine the status of this drive with "luxadm display <encl name>":

# luxadm display LoopA
SUNWGS INT FCBPL
DISK STATUS
SLOT  DISKS (Node WWN)
0     On (O.K.) f87099
1     On (O.K.) f39796
2     On (Login failed)
3     Not Installed
4     Not Installed
5     Not Installed
6     Not Installed
7     Not Installed
8     Not Installed
9     Not Installed
10    Not Installed
11    Not Installed
SUBSYSTEM STATUS
FW Revision:9218 Box ID:0 Node WWN:d7d8 Enclosure Name:LoopA
SSC100's - 0=Base Bkpln, 1=Base LoopB, 2=Exp Bkpln, 3=Exp LoopB
SSC100 #0: O.K.(11.A)
SSC100 #1: O.K.(11.A)
SSC100 #2: Not Installed
SSC100 #3: Not Installed
Temperature Sensors - 0 Base, 1 Expansion
0:21C 1:Not Installed

We'll be replacing slot 2 in the example above.

Resolution
Before replacing any disk under VxVM control, it should be in either a 'failed' or a 'removed' state. If the disk does not show up as "failed was", as shown here:

# vxdisk list
DEVICE    TYPE    DISK      GROUP   STATUS
c1t0d0s2  sliced  rootdisk  rootdg  online
c1t1d0s2  sliced  disk01    rootdg  online
c1t2d0s2  sliced  -         -       online
-         -       disk02    rootdg  failed was:c1t2d0s2

then you should run 'vxdiskadm' and choose option #4 to remove the disk for replacement. After running 'vxdiskadm', the output should look like this:

# vxdisk list
DEVICE    TYPE    DISK      GROUP   STATUS
c1t0d0s2  sliced  rootdisk  rootdg  online
c1t1d0s2  sliced  disk01    rootdg  online
c1t2d0s2  sliced  -         -       online
-         -       disk02    rootdg  removed was:c1t2d0s2

1. Once the disk is in one of the two states shown above, put the disk into the "offline" state with the following command:
# vxdisk offline c1t2d0s2
2. At this point Volume Manager is prepared for the removal of the drive. Now run "luxadm remove_device <encl>,<slot>" to remove the WWN entries and device links for the failed drive:
# luxadm remove_device LoopA,s2
This command will prompt you to physically remove the drive from the V880.
3. Now run "luxadm insert_device <encl>,<slot>", which will prompt you to physically insert the drive in the V880.
# luxadm insert_device LoopA,s2
4. After the new disk has been inserted, verify that the disk is seen in format and run the following command to force Volume Manager to look for the new disk:
# vxdctl enable
5. Running "vxdisk list", you should see the new disk in an "error" state:
# vxdisk list
DEVICE    TYPE    DISK      GROUP   STATUS
c1t0d0s2  sliced  rootdisk  rootdg  online
c1t1d0s2  sliced  disk01    rootdg  online
c1t2d0s2  sliced  -         -       error
-         -       disk02    rootdg  removed was:c1t2d0s2
6. You can now use vxdiskadm option #5 to replace the Volume Manager disk using the newly inserted drive. Note that you will be told that "Access is disabled" for this new disk (because it is still "offline"), and will be asked whether or not you wish to "enable access" to it. Answer 'yes' to this question.

2.3 Locating common Critical faults

The three Critical faults seen most often are:
  - An AC/DC failure
  - Temperature warning
  - Fan failure

A Critical fault can usually be located with the following steps:
1. Check whether the system status LEDs are normal.
2. Look for related error messages in /var/adm/messages, for example about fans or temperature.
3. Check the system with prtdiag -v.
4. On machines with an ALOM card, check the system status in ALOM.
5. On older machines, use POST to test the system.
6. Use VTS to check the status of an individual device (for example, a fan).

2.4 Diagnosing and maintaining Fire and V series servers

Sun's low-end and midrange servers are currently dominated by machines such as the V440, V240, F480, and F280, so Fire and V series servers are used as the example here. The same steps can also be used to locate faults on servers such as the Fire V1280, E, E450, and N 1405.

A system fault can usually be located with the following steps:
1. Hardware indicators such as system LEDs
2. POST diagnostics testing
3. OBP diagnostics testing
4. ALOM system controller commands and logs
5. Solaris OS utilities
6. Solaris OS system messages
7. Solaris OS applications

How to check system status with ALOM

ALOM uses the I2C bus to monitor and handle the following:
  - Voltages
  - Temperatures
  - Fans
  - Generating events
  - Host System XIR/Reset
  - Component status and indicators

2.4.1 How to set up the ALOM card

Step 1: Connect to the ALOM card.
Step 2: Connect to the ALOM card through the serial port.
Step 3: Configure the basic ALOM settings.
Step 4: Save the basic ALOM settings.
Step 5: Log out of ALOM.
Step 6: Log in to ALOM remotely with telnet.
Step 7: Work from the ALOM sc> prompt.

2.4.2 How to run checks with the ALOM card

Step 1: Connect to ALOM through the serial port.
Step 2: Reset the SC. The system self-test starts.
Step 3: When the self-test completes and has produced its report, type console to reach the system.
Step 4: Enter the user name and password (admin/***), ready to run poweron and enter the system.
Step 5: Enter the system.
Step 6: Return from the system to ALOM.
Step 7: At the sc> prompt, use showlogs to view the self-test results.
Step 8: At the sc> prompt, use poweroff to shut down the system.

2.4.3 Complete ALOM command list

ALOM Console Interface Commands

Permission Needed  Command          Description
c                  console          Provides access to the host's console
c                  break            Drops the host system to the boot PROM level
c                  consolehistory   Displays commands executed and resulting output on the console
u                  useradd          Adds a user to the ALOM environment
u                  usershow         Shows the configured users on the ALOM environment
u                  userpassword     Sets the password for a user on the ALOM environment
u                  userperm         Sets the permissions for a user on the ALOM environment
u                  userdel          Removes a user from the ALOM environment
a                  flashupdate      Updates the main and bootmon firmware for ALOM. See the "Perform Installation of the ALOM Image" section for additional details.
a                  resetsc          Resets the ALOM system controller card
a                  setdate          Sets the date and time when the OS is not running
a                  setdefaults      Sets the system controller parameters back to a default state
a                  setsc            Sets individual configuration parameters for the system controller
a                  setupsc          Sets configuration information for the system controller interactively
r                  poweroff         Powers off the host system
r                  poweron          Powers on the host system if not already powered on. If a FRU is specified, the FRU will be powered on
r                  removefru        Performs hot-plug operation on a FRU
r                  reset            Resets the host system
r                  bootmode         Overrides host system's behavior during system initialization by overriding the OBP diag-switch? parameter
READ-ONLY          showlogs         Displays the ALOM system controller's log. It displays the last 20 lines of the buffer when no options are specified. This is reset when the system controller is reset.
READ-ONLY          setlocator       Sets the locator LED on the host system
READ-ONLY          showlocator      Shows the status of the locator LED on the host system
READ-ONLY          showenvironment  Shows the environmental status of system components
READ-ONLY          showfru          Displays FRU information
READ-ONLY          showplatform     Shows the state of the host system
READ-ONLY          showsc           Displays the configuration and version information of t if no parameter is specified, all are printed
READ-ONLY          shownetwork      Shows the network configuration for the system controller
READ-ONLY          showdate         Shows the current date in universal time coordinated (UTC)
READ-ONLY          password         Sets the password for the current user
READ-ONLY          showusers        Shows the users logged into the system controller
READ-ONLY          logout           Logs out of the system controller
READ-ONLY          help             Obtains help on commands

Checking ALOM status from the running system with scadm

The scadm command is located under /usr/platform/`uname -i`/sbin. The following table shows the subcommands that can be issued through the scadm command.

The scadm Administrative Tool Sub-Commands

Command       Description
help          Prints a usage statement for ALOM
date          Prints or sets the date for ALOM
set           Sets a variable
show          Shows the value of a variable
resetrsc      Resets the ALOM system controller card
download      Downloads the firmware to the ALOM system controller card. See the "Perform Installation of the ALOM Image" section for additional details.
send_event    Sends a message about an event
modem_setup   Configures the modem for the serial port. Not supported on the Sun Fire V440 server
useradd       Adds a user for ALOM
userdel       Deletes a user from ALOM
usershow      Shows the user accounts on ALOM
userpassword  Sets the password for a specified user
userperm      Sets the permissions for a specified user
shownetwork   Shows the network configuration for ALOM
loghistory    Shows the log history for ALOM; the log history is cleared when the system controller is reset
version       Shows the version for ALOM

Common problems with RSC and ALOM cards on Fire and V series servers

The following systems will have the Service Processor Configuration:

RSC
----
Sun Fire V880
Sun Fire V480
Sun Fire 280R

ALOM
----
Sun Fire V210
Sun Fire V240
Sun Fire V250
Sun Fire V440
Netra 240

Problem Statement
Output from rsc-config command:

# /usr/platform/SUNW,Sun-Fire-880/rsc/rsc-config
Continue with RSC setup (y|n): y
Set RSC date/time now (y|n|?) [y]: y
Server Hostname [kennedy]:
Edit customer info field (y|n|?) [n]:
Enable RSC Ethernet Interface (y|n|s|?) [n]: y
RSC IP Mode (config|dhcp|?) [dhcp]: config
RSC IP Address []: 192.19.42.70
RSC IP Netmask [255.255.255.0]:
RSC IP Gateway []: 192.19.42.1
Enable RSC Alerts (y|n|s|?) [n]:
Enable RSC Modem Interface (y|n|s|?) [n]:
Enable RSC Serial Port Interface (y|n|s|?) [n]:
Setup RSC User Account (y|n|?) [y]:
Username []: jjg
User Permissions (c,u,a,r|none|?)
[cuar]:
-------------------
Verifying Selections
-------------------
General Setup
-------------
Set RSC date now = y
Server Hostname = kennedy
Set Customer Info = n
Is this correct (y|n): y
Ethernet Setup
--------------
IP Mode = config
IP Address = 192.19.42.70
IP Netmask = 255.255.255.0
IP Gateway = 192.19.42.1
Is this correct (y|n): y
Alert Setup
-----------
Alerts disabled
Is this correct (y|n): y
Modem Setup
-----------
Modem disabled, ppp disabled
Is this correct (y|n): y
Serial Port Setup
-----------------
Serial port disabled
Is this correct (y|n): y
User Setup
----------
User Name = jjg
User Permissions = cuar
Is this correct (y|n): y
This script will now update RSC, continue? (y|n): y
Updating flash, this takes a few minutes
rscadm: RSC did not respond during boot initialization
ERROR: during update flash
********************************
SETUP SCRIPT FAILED!
********************************
Please re-run the install script. Make sure inputs to make sure they are valid.
ERROR: unable to find RSC serial device
********************************
SETUP SCRIPT FAILED!
********************************
Please re-run the install script. Make sure inputs to make sure they are valid.
rscadm: RSC firmware not responding
Disabling ethernet interface:
rscadm: RSC firmware not responding
ERROR: during set ip_mode
********************************
SETUP SCRIPT FAILED!
********************************
Please re-run the install script. Make sure inputs to make sure they are valid.
Disabling RSC alert engine:
rscadm: RSC firmware not responding
ERROR: during set page_enabled
********************************
SETUP SCRIPT FAILED!
********************************
Please re-run the install script. Make sure inputs to make sure they are valid.

Resolution
Note: Try solution A first. If A doesn't work, do solution B.

A. Either reset rsc:
# /usr/platform/SUNW,Sun-Fire-880/rsc/rscadm resetrsc

B. Or reseat the rsc card as follows:
1. Power off the server.
2. Remove the power cords from the power supplies.
3. Wait 15-20 seconds before going to step 4.
4. Remove and reinsert the card.
5. Plug the power cords back into the power supplies.
6. Power up the system.
7. Log in as root and run the command:
# /usr/platform/SUNW,Sun-Fire-880/rsc/rsc-config

Keywords: V880, rsc-config, firmware, RSC, responding

How to run a maximum-level self-test with POST

If break from the SC gets no response, an XIR may be needed. On ALOM systems, use the reset -x command. If this succeeds and the system reaches OBP, the maximum-level self-test can be configured at the OBP level.

{1} ok setenv auto-boot? true
{1} ok setenv diag-level max
{1} ok setenv diag-switch? true
{1} ok setenv post-trigger all-resets
{1} ok setenv obdiag-trigger all-resets
{1} ok sync

If the system does not run the self-test, turn the server keyswitch to the diagnostics position at power-on to run the maximum-level self-test.

Sample Error Message
The following is an example of a fatal reset on a Sun Fire V440 server:

Fatal Error Reset
CPU 00.0002 AFSR 00.0000 JETO PRIV OM TO AFAR ec0.c180

In the example, the CPU reports an error using the status register and address. The code at the end of the second line indicates the type of error that has occurred.
Collecting this data is necessary for troubleshooting the actual problems occurring on the system.

List of Common Solaris OS Check Commands

System Health Check Form

Customer name: Samsung
Customer's contact person / telephone:
Check time: Mar.17, 10:00
Samsung SDS On-site Engineer: Alan Jin, Wang Hao
Hostname: SAMSUNG TEST    S/N:    Model:

Item | Method | Result or Problem Description

Site Environment Check
  L-N voltage (labelled live-to-ground in the original) | | 220.2
  N-G voltage (neutral-to-ground) | | 1.4
  Site temperature | | 22°C
  Site humidity | | 40%

Indicator Light Check
  Attention light on server front panel | | OFF
  Fault light on server front panel | | OFF

Server cleanness (state whether the unit needs cleaning, and where) | | No
Cable connection (check for loose cables) | | OK

Firmware Version and OBP Version Information Collection
  OBP | # prtdiag -v | OBP 4.5.210207
  Hard disk firmware | # luxadm display WWN |
  Solaris version | # uname -a | SunOS 5.8 Generic_
  Eeprom | # eeprom | OK
  Solaris kernel type | # isainfo -kv | 64-bit sparcv9 kernel modules

Hardware Check
  Processor/CPU | # prtdiag -v, # more messages
  Memory | # prtdiag -v, # wsinfo, # more messages
  Disk | # prtdiag -v, # iostat -E, # format
  Tape drive | # mt -f /dev/rmt/ status
  DVD/CDROM | # iostat -En
  LAN status | # ifconfig -a, # snoop
  Fan status | # prtdiag -v
  Power supply status | # more messages, # prtdiag -v
  I/O | # more messages, # luxadm probe
  Results as recorded: OK, OK, OK, N/A, OK, OK, OK, then OK, OK, OK, OK

System Configuration Check
  System configuration | # prtfru -x, # prtconf -vp
  Disk configuration | # format, # iostat -En, # luxadm probe, # luxadm probe -p, # luxadm display WWN
  Network card configuration | # ifconfig -a
  Network status check | # snoop
  Software installed | # pkginfo -i
  Kernel patch | # uname -a
  Patches installed | # showrev -p
  System kernel parameters | # more /etc/system
  Logical volume configuration (VxVM) | # vxdisk list
  Logical volume configuration (Sun SDS) | # metastat, # metadb -i
  Disk mirror configuration | # metastat, # metadb -i
  Swap information | # swap -l
  File system information | # df -h, # du -h
  Crontab information | # crontab -l
  Results as recorded: OK, OK, OK, Generic_, OK, OK, N/A, N/A, No mirror, OK, OK, 1

System Log File Check
  System log | /var/adm/messages | Empty; the system may have been modified
  Last boot time | last-20-reboot.out | Wed Aug 13 16:37
  System boot log | last-20-reboot.out | OK
  Last 100 logins | last-100-login.out | OK

System Backup Check
  Volume group (SDS) configuration backup exists? | # metastat, # metadb -i | N/A
  Volume group (VxVM) configuration backup exists? | # vxdisk list | N/A

Other Checks
  Kernel dump existence | # ls /var/crash/`uname -n`/* | NO
  System processes | # ps -axu | OK
  System and CPU temperature record | # prtdiag -v | OK
  Solaris OS system messages, Solaris OS applications | see Chapters 2 and 3

3. Common File System Problems and Their Handling

Repairing a File System with fsck

1. fsck is arguably the most frequently run tool on the system (more than 90% of its runs are started by the system itself). It checks file system integrity, covering the superblock, the cylinder group blocks, the inode table and the data area. The check works by redundancy: the same information is recorded in more than one place and the copies are compared. During repair, the recorded information is adjusted to match what is actually on disk.

The lost+found directory: during fsck, files whose parent directory cannot be found are placed in this directory and named after their inode number.

At boot the system runs fsck over the file systems and reports the result for each one, for example "/dev/rdsk/c0t0d0s7 stable". The trailing word is the file system state. clean means the file system has not been used since it was unmounted; stable means it has been used but is intact. If instead a long report is printed, including fragment percentages and the like, the file system has inconsistencies and fsck should be run manually once the system is up. After an unclean shutdown (for whatever reason) many different situations can show up at the next boot. fsck does not run on file systems whose state is clean, stable or logging.

2. Three commonly used fsck options:
-o f     Force a check regardless of whether the file system is in the clean (or similar) state.
-o p     Check and repair non-interactively; for some problems fsck exits immediately.
-o b=xx  Repair a damaged superblock by copying the backup superblock at block xx over the primary superblock.
Solaris keeps many backup copies of the superblock. b=32 usually works; if it does not, run newfs -N /dev/rdsk/cxtxdxsx to list the superblock locations. Any of the backup copies can be used.

3. Some error prompts
(1) RECONNECT: a directory entry is lost; the file can be placed in lost+found and moved later. Answer yes.
(2) SUPERBLK bad (note: bad, not wrong): repair as described above. If it is merely wrong, repairing is optional.
(3) CLEAR: delete the inode. This may lose data.
(4) REMOVE: delete a file; the file name is usually given (file=....).
(5) ADJUST: adjust the link count, because the actual count differs from the recorded one. Answer yes.
(6) SALVAGE: the free list count is incorrect. Answer yes.
(Do not run these repairs on a mounted file system; doing so may damage it.)

Points to Note When Repairing a File System with fsck

fsck is not safe for every file system that shows a problem. On certain devices, such as arrays from particular production batches, running fsck -y directly during maintenance can destroy all data on the system. See the relevant Sun Alert Report for details.

3.1 Volume Manager Common Problem Handling

Volume Manager State Descriptions

Keyword(s): SEVM, VxVM, Volume Manager, Veritas Volume Manager, Sun Enterprise Volume Manager[TM], volume maintenance

Document Body
For Volume Manager 2.X and 3.X versions.

VxVM Volume States

CLEAN - The volume is not started (kernel state DISABLED) and its plexes are synchronized.
ACTIVE - The volume has been started or was in use when the machine was rebooted. If the volume is DISABLED, the plexes cannot be guaranteed to be consistent, but will be consistent when the volume is started.
EMPTY - The volume contents are not initialized. The kernel state is always DISABLED when the volume is EMPTY.
SYNC - The volume is either in read-writeback recovery mode (ENABLED) or was in this mode when the machine was rebooted (DISABLED). With read-writeback recovery, plex consistency is recovered by reading data blocks of one plex and writing the data to all other writable plexes. If the volume is ENABLED, the plexes are being resynchronized. If the volume is DISABLED, it was resynchronizing when the machine was rebooted and the plexes need to be resynchronized.
NEEDSYNC - The volume will require a resynchronization operation the next time it is started.

RAID-5 Volume States

RAID-5 volumes have their own set of volume states:

CLEAN - The volume is not started and its parity is good. The RAID-5 plex stripes are consistent.
ACTIVE - The volume has been started or was in use when the machine was rebooted. If the volume is DISABLED, the parity can't be guaranteed to be synchronized.
EMPTY - The volume contents are not initialized. The kernel state is always DISABLED when the volume is EMPTY.
SYNC - The volume is either undergoing a parity resynchronization or was doing so when the machine was rebooted.
NEEDSYNC - The volume will require a parity resynchronization operation the next time it is started.
REPLAY - The volume is in a transient state as part of a log replay. A log replay occurs when it becomes necessary to use logged parity and data.

Plex States and Plex Kernel States

Plexes that are associated with a volume have one of the following states:
* EMPTY
* CLEAN
* ACTIVE
* STALE
* OFFLINE
* TEMP
* TEMPRM
* TEMPRMSD
* IOFAIL
A Dirty Region Logging or RAID-5 log plex is a special case, as its state is always set to LOG.

EMPTY Plex State
When a volume is created and the plex isn't initialized, the plex is in the EMPTY state.

CLEAN Plex State
A plex is in the CLEAN state when it is known to contain a good copy (mirror) of the volume. When all the plexes of a volume are CLEAN, no action is required.

ACTIVE Plex State
A plex can be in the ACTIVE state in two situations:
* When the volume is started and the plex fully participates in normal volume I/O (meaning the plex contents change as the contents of the volume change).
* When the volume was stopped as a result of a system crash and the plex was active at the moment of the crash.
In the latter case, a system failure may leave the plex contents in an inconsistent state. When a volume is started, VxVM performs a recovery action to guarantee that the contents of the plexes marked ACTIVE are made identical.

NOTE - ACTIVE should be the most common state for plexes on a well-running system.

STALE Plex State
If there is a possibility that a plex doesn't have the complete and current volume contents, the plex is placed in the STALE state. Also, if I/O errors occur on a plex, the kernel stops using and updating this plex, and the operation sets the state of the plex to STALE.
To re-attach the plex to the volume, run
# vxplex -g (disk group) att volume-name plex-name
or, in the GUI, highlight the plex and choose Advanced Options -> Plex -> Attach Plex. This syncs the data and sets the plex to the ACTIVE state.
To force a plex into the STALE state, run
# vxplex -g (disk group) det volume-name plex-name
or highlight the plex and choose Advanced Options -> Plex -> Detach Plex.

OFFLINE Plex State
The command
# vxmend -g diskgroup off volname plexname
detaches a plex from a volume and sets the plex state to OFFLINE. Although the detached plex is still associated with the volume, changes to the volume aren't reflected in the plex while it is in the OFFLINE state. Running
# vxplex -g (disk group) att volume-name plex-name
or highlighting the plex and choosing Advanced Options -> Plex -> Attach Plex sets the plex state to STALE, and data recovery starts after the vxvol start operation.

TEMP Plex State
A utility sets the plex state to TEMP at the start of an operation and to an appropriate state at the end of the operation. For example, attaching a plex to an enabled volume requires copying the volume contents to the plex before it can be considered fully attached. If the system goes down for any reason, a TEMP plex state indicates that the operation is incomplete; a subsequent vxvol start will dissociate plexes in the TEMP state.

TEMPRM Plex State
A TEMPRM plex state resembles a TEMP state except that at the completion of the operation, the TEMPRM plex is removed. If the system goes down for any reason, a TEMPRM plex state indicates that the operation is incomplete; a subsequent vxvol start will dissociate the plexes and remove the TEMPRM plex.

TEMPRMSD Plex State
The TEMPRMSD plex state is used by vxassist when attaching new plexes. If the operation doesn't complete, the plex and its subdisks are removed.

IOFAIL Plex State
The IOFAIL plex state is associated with persistent state logging.
On the detection of a failure of an ACTIVE plex, vxconfigd places that plex in the IOFAIL state so that it is disqualified from the recovery selection process at volume start time.

Plex Kernel States

DISABLED - The plex may not be accessed.
DETACHED - A write to the volume is not reflected to the plex. A read request from the volume will never be satisfied from the plex. Plex operations and ioctl functions are accepted.
ENABLED - A write request to the volume will be reflected to the plex. A read request from the volume will be satisfied from the plex.

Volume Manager Status Check Commands

# vxdisk list
# vxprint

Example: the system shows one disk object in the DISABLED state. How should it be handled?

# vxprint | more
Disk group: rootdg
TY  NAME         ASSOC       KSTATE    STATE
dg  rootdg       rootdg      -         -
dm  rootdisk     c1t0d0s2    -         -
dm  rootmirr     c1t1d0s2    -         -
pl  rootvol-02   -           DISABLED  -
sd  rootmirr-02  rootvol-02  ENABLED   -
v   flash        fsgen       ENABLED   ACTIVE
pl  flash-01     flash       ENABLED   ACTIVE
sd  rootdisk-03  flash-01    ENABLED   -
pl  flash-02     flash       ENABLED   ACTIVE
sd  rootmirr-01  flash-02    ENABLED   -
v   rootvol      root        ENABLED   ACTIVE
pl  rootvol-01   rootvol     ENABLED   ACTIVE
sd  rootdisk-B0  rootvol-01  ENABLED   -
sd  rootdisk-02  rootvol-01  ENABLED   -
v   swapvol      swap        ENABLED   ACTIVE
pl  swapvol-01   swapvol     ENABLED   ACTIVE
sd  rootdisk-01  swapvol-01  ENABLED   -
pl  swapvol-02   swapvol     ENABLED   ACTIVE
sd  rootmirr-03  swapvol-02  ENABLED   -

One plex (rootvol-02) appears to be disabled, yet the VxVM graphical interface reports no error.

# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. c1t0d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
   /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w6f88a1,0
1. c1t1d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
   /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w71cf91,0
2. c1t2d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
   /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w6f3a21,0
3. c1t3d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
   /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w71cf21,0
4. c1t4d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
   /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w6f
5. c1t5d0 <SUN72G cyl 14087 alt 2 hd 24 sec 424>
   /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w71ceb1,0

format sees every disk, so the hardware is present. Deleting rootvol-02 and re-creating the mirror of rootvol restores the volume:

# vxedit -g rootdg rm rootvol-02
# vxmake -g rootdg plex rootvol-02 sd=rootmirr-02
# vxprint -lp

3.2 Sun SDS Common Problem Analysis and Handling

Solstice DiskSuite Maintenance Overview

Creating the metadb

There are two ways to create it:
1. Graphical interface: metatool
2. Command line:
(1) First use metadb to create the meta state database (metaDB). Example:
# metadb -a -f c0t0d0s7 c0t0d1s7 c0t0d2s7
(2) Then use metainit to create the RAID devices.

Before any metadevice is created, the state database (metaDB, meta state database) that holds the critical metadevice information must exist. If the metaDB is destroyed, the data on all metadevices becomes inaccessible. For that reason several replicas of the metaDB are normally created and spread over slices on different disks, so that no single failure causes data loss.

General usage of metadb:
# metadb -a -f c0t0d0s7 c0t0d1s7 c0t0d2s7
-a adds metaDB replicas; the -f option is required when the metaDB is created for the first time. The arguments that follow are the three slices that hold the replicas. If no replica count is given, one replica is created on each slice by default.

Once the metaDB exists, metadevices can be created with metainit:

metainit [options] concat/stripe numstripes width component... [-i interlace] width component... [-i interlace] [-h hotspare_pool]

Examples:
1. Create a concatenation metadevice, /dev/md/dsk/d7, made of 4 slices concatenated together:
# metainit d7 4 1 c0t1d0s0 1 c0t2d0s0 1 c0t3d0s0 1 /dev/dsk/c0t4d0s0
2. Create a stripe metadevice, /dev/md/dsk/d15, striped across two slices:
# metainit d15 1 2 c0t1d0s2 c0t2d0s2 -i 32k
3. Stripe and concatenate (stripe + concatenation device), /dev/md/dsk/d75:
# metainit d75 2 3 c0t1d0s2 c0t2d0s2 c0t3d0s2 -i 16k 3 c1t1d0s2 c1t2d0s2 c1t3d0s2 -i 32k

A mirror is normally built by creating the submirrors separately and then attaching the second submirror to the mirror with metattach.

Examples:
1. Create a two-way mirror, /dev/md/dsk/d50 (an N-way mirror has N submirrors):
# metainit d51 1 1 c0t1d0s2     (create the first submirror)
# metainit d52 1 1 c0t2d0s2     (create the second submirror)
# metainit d50 -m d51           (create a mirror with a single submirror)
# metattach d50 d52             (attach the other submirror)
2. Create a RAID5 device, d80:
# metainit d80 -r c1t0d0s2 c1t1d0s2 c1t3d0s2 -i 20k
3. Create a hot spare pool:
# metainit hsp001 c2t2d0s2 c3t2d0s2 c1t2d0s2
4. Create a logging (trans) device:
# metainit d1 -t d10 d20

Routine Maintenance

Use metadb to view the state of the database:
# metadb
        flags           first blk       block count
     a m  p  luo        16              1034            /dev/dsk/c0t0d0s7

Use metastat to view the state of the metadevices:
# metastat
d0: Mirror
    Submirror 0: d1
      State: Okay
    Submirror 1: d2
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 8194168 blocks
d1: Submirror of d0
    State: Okay
    Size: 8194168 blocks
    Stripe 0:
        Device      Start Block  Dbase  State  Hot Spare
        c0t0d0s0    0            No     Okay
d2: Submirror of d0
    State: Okay
    Size: 8194168 blocks
    Stripe 0:
        Device      Start Block  Dbase  State  Hot Spare
        c0t0d0s3    0            No     Okay

If metastat shows a mirror or RAID5 device in an inconsistent state, or reports "Needs maintenance", use metareplace to recover. metareplace resynchronizes the submirrors of a mirror, or rebuilds the out-of-sync component of a RAID5 device from the information on the remaining components. For example, if the data on slice c1t4d0s2 of d11 is inconsistent, run
# metareplace -e d11 c1t4d0s2
to restore consistency. The -e option switches the component state to "available" and starts the resynchronization. How long the resync takes depends on the size of the device; progress can be followed with metastat while it runs.

Replacing a Disk Under Sun Cluster 2.2

1. Check the diskset state and the metadb state.
2. Check the metadevice state and save the configuration.
3. Detach the affected metadevices.
4. Delete the detached metadevices.
5. Delete the metadb replicas on the failed disk.
6. Remove the failed disk from the diskset.
7. Replace the failed disk and rebuild the DID.
8. Add the disk to the diskset and partition it.
9. Create the diskset metadb replicas and rebuild the affected metadevices.
10. Restore the mirrors.

Check the diskset state and the metadb state

# metaset -s hwhlr10
Set name = hwhlr10, Set number = 1
Host                Owner
  hwhlr-ph1         Yes
  hwhlr-ph2
Drive               Dbase
  /dev/did/dsk/d2   Yes
  /dev/did/dsk/d3   Yes
  /dev/did/dsk/d4   Yes
  /dev/did/dsk/d5   Yes

# metadb -s hwhlr10
        flags      first blk   block count
     a m  luo      16          1034          /dev/did/dsk/d2s7
     a    luo      16          1034          /dev/did/dsk/d3s7
     a    luo                                /dev/did/dsk/d2s7
     a    luo                                /dev/did/dsk/d3s7
     a    luo      16          1034          /dev/did/dsk/d4s7
     a    luo      16          1034          /dev/did/dsk/d5s7
     a    luo                                /dev/did/dsk/d4s7
     a    luo                                /dev/did/dsk/d5s7

Check the metadevice state and save the configuration

# metastat -s hwhlr10
hwhlr10/d10: Trans
    State: Okay
    Size:
blocks
    Master Device: hwhlr10/d11
    Logging Device: hwhlr10/d14
hwhlr10/d11: Mirror
    Submirror 0: hwhlr10/d12
      State: Okay
    Submirror 1: hwhlr10/d13
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: blocks
hwhlr10/d12: Submirror of hwhlr10/d11
    State: Okay
    Size: blocks
    Stripe 0: (interlace: 32 blocks)
        Device              Start Block  Dbase  State  Hot Spare
        /dev/did/dsk/d2s0   0            No     Okay
        /dev/did/dsk/d3s0   0            No     Okay
hwhlr10/d13: Submirror of hwhlr10/d11
    State: Okay
    Size: blocks
    Stripe 0: (interlace: 32 blocks)
        Device              Start Block  Dbase  State  Hot Spare
        /dev/did/dsk/d4s0   0            No     Okay
hwhlr10/d14: Logging device for hwhlr10/d10
    State: Okay
    Size: 353142 blocks
hwhlr10/d14: Mirror
    Submirror 0: hwhlr10/d15
      State: Okay
    Submirror 1: hwhlr10/d16
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 353400 blocks
hwhlr10/d15: Submirror of hwhlr10/d14
    State: Okay
    Size: 353400 blocks
    Stripe 0: (interlace: 32 blocks)
        Device              Start Block  Dbase  State  Hot Spare
        /dev/did/dsk/d2s6   0            No     Okay
        /dev/did/dsk/d3s6   0            No     Okay
hwhlr10/d16: Submirror of hwhlr10/d14
    State: Okay
    Size: 353400 blocks
    Stripe 0: (interlace: 32 blocks)
        Device              Start Block  Dbase  State  Hot Spare
        /dev/did/dsk/d4s6   0            No     Okay
        /dev/did/dsk/d5s6   0            No     Okay
hwhlr10/d100: Mirror
    Submirror 0: hwhlr10/d101
      State: Okay
    Submirror 1: hwhlr10/d102
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 80104 blocks
hwhlr10/d101: Submirror of hwhlr10/d100
    State: Okay
    Size: 80104 blocks
    Stripe 0: (interlace: 32 blocks)
        Device              Start Block  Dbase  State  Hot Spare
        /dev/did/dsk/d2s4   0            No     Okay
        /dev/did/dsk/d3s4   0            No     Okay
hwhlr10/d102: Submirror of hwhlr10/d100
    State: Okay
    Size: 80104 blocks
    Stripe 0: (interlace: 32 blocks)
        Device              Start Block  Dbase  State  Hot Spare
        /dev/did/dsk/d4s4   0            No     Okay
        /dev/did/dsk/d5s4   0            No     Okay

# metastat -s hwhlr10 -p
hwhlr10/d10 -t hwhlr10/d11 hwhlr10/d14
hwhlr10/d11 -m hwhlr10/d12 hwhlr10/d13 1
hwhlr10/d12 1 2 /dev/did/dsk/d2s0 /dev/did/dsk/d3s0 -i 32b
hwhlr10/d13 1 2 /dev/did/dsk/d4s0 /dev/did/dsk/d5s0 -i 32b
hwhlr10/d14 -m hwhlr10/d15 hwhlr10/d16 1
hwhlr10/d15 1 2 /dev/did/dsk/d2s6 /dev/did/dsk/d3s6 -i 32b
hwhlr10/d16 1 2 /dev/did/dsk/d4s6 /dev/did/dsk/d5s6 -i 32b
hwhlr10/d100 -m hwhlr10/d101 hwhlr10/d102 1
hwhlr10/d101 1 2 /dev/did/dsk/d2s4 /dev/did/dsk/d3s4 -i 32b
hwhlr10/d102 1 2 /dev/did/dsk/d4s4 /dev/did/dsk/d5s4 -i 32b

Detach the affected metadevices
# metadetach -s hwhlr10 d12
# metadetach -s hwhlr10 d15
# metadetach -s hwhlr10 d101

Delete the detached metadevices
# metaclear -s hwhlr10 d12
# metaclear -s hwhlr10 d15
# metaclear -s hwhlr10 d101

Delete the metadb replicas on the failed disk
# metadb -s hwhlr10 -d /dev/did/dsk/d2s7

Remove the failed disk from the diskset
# scdidadm -d /dev/did/rdsk/d2

Replace the failed disk and rebuild the DID
# scdidadm -R d2

Add the disk to the diskset and partition it
# scdidadm -a /dev/did/dsk/d2

Create the diskset metadb replicas and rebuild the affected metadevices
# metadb -s hwhlr10 -ac 3 /dev/did/dsk/d2s7
# metainit -s hwhlr10 d12 1 2 /dev/did/dsk/d2s0 /dev/did/dsk/d3s0
# metainit -s hwhlr10 d15 1 2 /dev/did/dsk/d2s6 /dev/did/dsk/d3s6
# metainit -s hwhlr10 d101 1 2 /dev/did/dsk/d2s4 /dev/did/dsk/d3s4

Restore the mirrors
# metattach -s hwhlr10 d100 d101
# metattach -s hwhlr10 d14 d15
# metattach -s hwhlr10 d11 d12

Note: a Cluster 2.2 environment is entirely different from a Cluster 3.0 environment. For the corresponding Cluster 3.0 procedure, refer to the Sun documentation.

Example 2. Handling a Single Failed Disk on a Sun N1405

1. Environment
=============
Sun N1405
OS: Solaris 8

2. Problem Description
==================
One HDD has failed.

3. Action Taken
============
1. Got messages from customer:
# metastat
d0: Trans
    State: Okay
    Size:
blocks
    Master Device: d1
    Logging Device: d2
d1: Mirror
    Submirror 0: d3
      State: Okay
    Submirror 1: d4
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: blocks
d3: Submirror of d1
    State: Okay
    Size: blocks
    Stripe 0:
        Device      Start Block  Dbase  State  Hot Spare
        c0t0d0s0    0            No     Okay
d4: Submirror of d1
    State: Needs maintenance
    Invoke: metareplace d1 c0t1d0s0 <new device>
    Size: blocks
    Stripe 0:
        Device      Start Block  Dbase  State        Hot Spare
        c0t1d0s0    0            No     Maintenance
d2: Logging device for d0
    State: Okay
    Size: 131678 blocks
d2: Mirror
    Submirror 0: d5
      State: Okay
    Submirror 1: d6
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 131936 blocks
d5: Submirror of d2
    State: Okay
    Size: 131936 blocks
    Stripe 0:
        Device      Start Block  Dbase  State  Hot Spare
        c0t0d0s3    0            No     Okay
d6: Submirror of d2
    State: Needs maintenance
    Invoke: metareplace d2 c0t1d0s3 <new device>
    Size: 131936 blocks
    Stripe 0:
        Device      Start Block  Dbase  State        Hot Spare
        c0t1d0s3    0            No     Maintenance
d7: Mirror
    Submirror 0: d8
      State: Okay
    Submirror 1: d9
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 3147616 blocks
d8: Submirror of d7
    State: Okay
    Size: 3147616 blocks
    Stripe 0:
        Device      Start Block  Dbase  State  Hot Spare
        c0t0d0s1    0            No     Okay
d9: Submirror of d7
    State: Okay
    Size: 3147616 blocks
    Stripe 0:
        Device      Start Block  Dbase  State  Hot Spare
        c0t1d0s1    0            No     Okay
d30: Mirror
    Submirror 0: d31
      State: Okay
    Submirror 1: d32
      State: Okay
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 207328 blocks
d31: Submirror of d30
    State: Okay
    Size: 207328 blocks
    Stripe 0:
        Device      Start Block  Dbase  State  Hot Spare
        c0t0d0s4    0            No     Okay
d32: Submirror of d30
    State: Okay
    Size: 207328 blocks
    Stripe 0:
        Device      Start Block  Dbase  State  Hot Spare
        c0t1d0s4    0            No     Okay

# iostat -En
c0t0d0 Soft Errors: 2 Hard Errors: 0 Transport Errors: 0
Vendor: FUJITSU Product: MAG3182L SUN18G Revision: 1111 Serial No:
Size: 18.11GB < bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 2
Illegal Request: 0 Predictive Failure Analysis: 0
c0t1d0 Soft Errors: 31 Hard Errors: 25 Transport Errors: 5
Vendor: FUJITSU Product: MAG3182L SUN18G Revision: 1111 Serial No:

4. Conclusion/Suspect
=================
I suspect that HDD c0t1d0 has failed.

5. Current Action Plan
==================
1. Check for and confirm the boot disk error.
2. Replace boot disk c0t1d0.
3. Rebuild the device entries:
# devfsadm
4. Label the new disk with format:
# format
5. Resync using SDS:
# metareplace -e d1 c0t1d0s0
(Submirror d6 on c0t1d0s3 is in the same Needs maintenance state, so d2 presumably needs the same metareplace -e step.)
6. Check the metadevice status:
# metastat

3.3 Common Problem Analysis and Handling in a Sun Cluster Environment

Sun Cluster 2.2 Common Commands

Start Sun Cluster
# scadmin startcluster hwhlr-ph1 hwhlr
# scadmin startnode

Stop Sun Cluster
# scadmin stopnode

Switch over Sun Cluster
# haswitch hwhlr-ph2 hwhlr10
# scadmin switch hwhlr hwhlr-ph2 hwhlr10

View Sun Cluster status
# hastat
Getting Information from all the nodes ......

HIGH AVAILABILITY CONFIGURATION AND STATUS
-------------------------------------------
LIST OF NODES CONFIGURED IN <hwhlr> CLUSTER
    hwhlr-ph1 hwhlr-ph2
CURRENT MEMBERS OF THE CLUSTER
    hwhlr-ph1 is a cluster member
    hwhlr-ph2 is a cluster member
CONFIGURATION STATE OF THE CLUSTER
    Configuration State on hwhlr-ph1: Stable
    Configuration State on hwhlr-ph2: Stable
UPTIME OF NODES IN THE CLUSTER
    uptime of hwhlr-ph1: 4:35am up 20 min(s), 1 user, load average: 1.06, 0.74, 0.45
    uptime of hwhlr-ph2: 4:35am up 12 min(s), 1 user, load average: 1.09, 0.66, 0.33
LOGICAL HOSTS MASTERED BY THE CLUSTER MEMBERS
    Logical Hosts Mastered on hwhlr-ph1: hwhlr10
    Logical Hosts for which hwhlr-ph1 is Backup Node: None
    Logical Hosts Mastered on hwhlr-ph2: None
    Logical Hosts for which hwhlr-ph2 is Backup Node: hwhlr10
LOGICAL HOSTS IN MAINTENANCE STATE
    None
STATUS OF PRIVATE NETS IN THE CLUSTER
    Status of Interconnects on hwhlr-ph1:
        interconnect0: selected
        interconnect1: up
    Status of private nets on hwhlr-ph1:
        To hwhlr-ph1 - UP
        To hwhlr-ph2 - UP
    Status of Interconnects on hwhlr-ph2:
        interconnect0: selected
        interconnect1: up
    Status of private nets on hwhlr-ph2:
        To hwhlr-ph1 - UP
        To hwhlr-ph2 - UP
STATUS OF PUBLIC NETS IN THE CLUSTER
    Status of Public Network On hwhlr-ph1:
        bkggrp  r_adp      status  fo_time  live_adp
        nafo0   hme0:qfe1  OK      NEVER    hme0
    Status of Public Network On hwhlr-ph2:
        bkggrp  r_adp      status  fo_time  live_adp
        nafo0   hme0:qfe1  OK      NEVER    hme0
STATUS OF DATA SERVICES RUNNING IN THE CLUSTER
    Status Of Registered Data Services
        hlr: On
        oracle: On
    Status Of Data Services Running On hwhlr-ph1
        No Status Method for Data Service "hlr"
        Data Service "oracle":
            Database Status on hwhlr-ph1:
            ora7 -
    Status Of Data Services Running On hwhlr-ph2
        Data Service "hlr": Not being managed on this system
        Data Service "oracle": Not being managed on this system
RECENT ERROR MESSAGES FROM THE CLUSTER
    Recent Error Messages on hwhlr-ph1
        May 22 04:25:33 hwhlr-ph1 ID[SUNWcluster.clustd.1920]: hwhlr node 1 (hwhlr-ph2) is a cluster member
        May 22 04:25:33 hwhlr-ph1 ID[SUNWcluster.clustd.1940]: hwhlr cluster reconf #6 finished
        May 22 04:30:21 hwhlr-ph1 unix: st4: <HP DDS-3 4MM DAT>
        May 22 04:30:21 hwhlr-ph1 unix: st4 at glm0:
        May 22 04:30:21 hwhlr-ph1 unix: target 4 lun 0
        May 22 04:30:21 hwhlr-ph1 unix: st4 is /pci@1f,4000/scsi@3/st@4,0
        May 22 04:35:39 hwhlr-ph1 explorer: Explorer started
    Recent Error Messages on hwhlr-ph2
        May 22 04:32:47 hwhlr-ph2 unix: ses69 at glm4:
        May 22 04:32:47 hwhlr-ph2 unix: target 5 lun 0
        May 22 04:32:47 hwhlr-ph2 unix: ses69 is /pci@1f,4000/scsi@4/ses@5,0
        May 22 04:32:57 hwhlr-ph2 unix: Vendor 'Symbios', product 'StorEDGE', (unknown capacity)
        May 22 04:34:27 hwhlr-ph2 last message repeated 7 times
        May 22 04:34:58 hwhlr-ph2 unix: pseudo-device: lockstat0
        May 22 04:34:58 hwhlr-ph2 unix: lockstat0 is /pseudo/lockstat@0

View Sun Cluster configuration
# scconf -p
Checking node status...
Current Configuration for Cluster hwhlr
Hosts in cluster: hwhlr-ph1 hwhlr-ph2
Private Network Interfaces for
    hwhlr-ph1: hme1 qfe0
    hwhlr-ph2: hme1 qfe0
Quorum Device Information
Logical Host Timeout Value :
    Step10 :1200
    Step11 :1200
    Logical Host :600
Logical Host : hwhlr10
    Node List : hwhlr-ph1 hwhlr-ph2
    Disk Groups : hwhlr10
    Logical Address : hwhlr10
        Logical Interface : 1
        Network Interface : hme0 (hwhlr-ph1) hme0 (hwhlr-ph2)
    Logical Address : hwhlr9
        Logical Interface : 2
        Network Interface : hme0 (hwhlr-ph1) hme0 (hwhlr-ph2)
    Automatic Switchover : no

Sun Cluster 3.0 Common Commands

View Sun Cluster status
# /usr/cluster/bin/scstat
------------------------------------------------------------------
-- Cluster Nodes --
                    Node name   Status
                    ---------   ------
  Cluster node:     test01      Online
  Cluster node:     test02      Online
------------------------------------------------------------------
-- Cluster Transport Paths --
                    Endpoint      Endpoint      Status
                    --------      --------      ------
  Transport path:   test01:hme3   test02:hme3   Path online
  Transport path:   test01:hme2   test02:hme2   Path online
------------------------------------------------------------------
-- Quorum Summary --
  Quorum votes possible: 3
  Quorum votes needed:   2
  Quorum votes present:  3

-- Quorum Votes by Node --
                    Node Name   Present  Possible  Status
                    ---------   -------  --------  ------
  Node votes:       test01      1        1         Online
  Node votes:       test02      1        1         Online

-- Quorum Votes by Device --
                    Device Name          Present  Possible  Status
                    -----------          -------  --------  ------
  Device votes:     /dev/did/rdsk/d4s2   1        1         Online
------------------------------------------------------------------
-- Device Group Servers --
                          Device Group   Primary   Secondary
                          ------------   -------   ---------
  Device group servers:   datadg         test01    test02
  Device group servers:   rmt/2
  Device group servers:   rmt/1
  Device group servers:   rmt/4
  Device group servers:   rmt/3

-- Device Group Status --
                          Device Group   Status
                          ------------   ------
  Device group status:    datadg         Online
  Device group status:    rmt/2          Offline
  Device group status:    rmt/1          Offline
  Device group status:    rmt/4          Offline
  Device group status:    rmt/3          Offline
------------------------------------------------------------------
-- Resource Groups and Resources --
               Group Name   Resources
               ----------   ---------
  Resources:   oracle-rg    test hastorage-res oracle-server oracle-lsn

-- Resource Groups --
               Group Name   Node Name   State
               ----------   ---------   -----
  Group:       oracle-rg    test01      Online
  Group:       oracle-rg    test02      Offline

-- Resources --
               Resource Name   Node Name   State     Status Message
               -------------   ---------   -----     --------------
  Resource:    test            test01      Online    Online - LogicalHostname online.
  Resource:    test            test02      Offline   Offline
  Resource:    hastorage-res   test01      Online    Online
  Resource:    hastorage-res   test02      Offline   Offline
  Resource:    oracle-server   test01      Online    Online
  Resource:    oracle-server   test02      Offline   Offline
  Resource:    oracle-lsn      test01      Online    Online
  Resource:    oracle-lsn      test02      Offline   Offline
------------------------------------------------------------------
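Returning to the SDS disk-failure example in section 3.2: metastat marks a failed submirror with "State: Needs maintenance" and prints a suggested command on an "Invoke:" line. The following is a minimal sketch, not part of the original manual, of how that output could be filtered so the affected metadevices stand out in a long listing. The function name needs_maintenance is hypothetical, and the sketch assumes the layout shown in the example (metadevice names unindented, attributes indented).

```shell
#!/bin/sh
# Sketch: list metadevices that metastat reports as "Needs maintenance",
# together with the command suggested on their "Invoke:" line.
# Reads metastat output on stdin, so it can be tried on a saved copy:
#   metastat > /tmp/metastat.out; needs_maintenance < /tmp/metastat.out
# needs_maintenance is an illustrative name, not a Solaris command.
needs_maintenance() {
    awk '
        # An unindented line whose first field ends in ":" (e.g. "d4: Submirror of d1")
        # starts the block for a new metadevice.
        $0 !~ /^[ \t]/ && $1 ~ /:$/ { dev = substr($1, 1, length($1) - 1) }

        # Remember which metadevice blocks contain a failed state.
        /State: Needs maintenance/ { bad[dev] = 1 }

        # Print the suggested command for blocks already marked as failed.
        /Invoke: / && (dev in bad) {
            line = $0
            sub(/^.*Invoke: */, "", line)
            print dev ": " line
        }
    '
}
```

The function only reads text; it does not run metareplace itself, so the suggested command can be reviewed before anything is changed on the system.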

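In the same example, the suspicion that c0t1d0 had failed rested on the error counters in the iostat -En output (25 hard errors and 5 transport errors, against none on c0t0d0). As a rough sketch under the assumption that each disk's summary line has the form "cXtYdZ Soft Errors: n Hard Errors: n Transport Errors: n", the check could be automated as below; disks_with_errors is a hypothetical helper name, and treating any non-zero hard or transport count as suspect is an assumption made here for illustration, not a threshold stated in the manual.

```shell
#!/bin/sh
# Sketch: flag disks whose "iostat -En" summary line shows hard or transport errors.
# Reads iostat -En output on stdin, e.g.:  iostat -En | disks_with_errors
# disks_with_errors is an illustrative name, not a Solaris command.
disks_with_errors() {
    awk '
        /Soft Errors:/ && /Hard Errors:/ {
            hard = 0; transport = 0
            # Walk the fields and pick up the number that follows each label.
            for (i = 1; i < NF; i++) {
                if ($i == "Hard" && $(i + 1) == "Errors:") hard = $(i + 2)
                if ($i == "Transport" && $(i + 1) == "Errors:") transport = $(i + 2)
            }
            # Soft errors alone are ignored; only hard/transport counts are reported.
            if (hard + 0 > 0 || transport + 0 > 0)
                print $1 " hard=" hard " transport=" transport
        }
    '
}
```

A disk flagged this way still needs to be cross-checked against /var/adm/messages and metastat before it is replaced.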