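[Summary] A 12th-generation Dell PowerEdge R420 server suddenly hung and stopped responding. iDRAC was still reachable, but a reset issued through iDRAC had no effect. An identical machine had hung once before; no useful system logs were captured at the time, so I did not look into it further.
This time the log contained an error: "CPU 1 has an internal error (IERR)". The system uses keepalived for high availability, so losing one node does not affect service. There was no rush, which made it a good opportunity to track down the cause.
While searching Google I also called the Dell Gold support line (400-886-8618). After hearing my description, the support engineer made the following suggestions: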
(1) In the BIOS, go to System Profile Settings -> System Profile and set it to Performance
(2) Upgrade the BIOS version: BIOS download page
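The Google results also said that power management on Dell 12th-generation servers is problematic and recommended using the acpi-cpufreq power management module: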
# modprobe -r p4-clockmod
# modprobe acpi-cpufreq
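Before swapping modules it helps to know which one is actually loaded. A minimal sketch of such a check follows; `module_loaded` and the `PROC_MODULES` override are hypothetical names introduced here for illustration, not part of any Dell or Red Hat tooling. It relies on the fact that /proc/modules lists module names with underscores even when modprobe is given hyphens.

```shell
# Return success if the named kernel module appears in /proc/modules.
# modprobe accepts "acpi-cpufreq", but the kernel lists it as "acpi_cpufreq",
# so hyphens are normalised to underscores before matching.
# PROC_MODULES is a hypothetical override so the function can be tried
# against a sample file instead of the live system.
module_loaded() {
    local name
    name="$(printf '%s' "$1" | sed 's/-/_/g')"
    grep -q "^${name} " "${PROC_MODULES:-/proc/modules}"
}
```

It could be used as `module_loaded p4-clockmod && modprobe -r p4-clockmod`, so the unload is only attempted when the module is present.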
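Since iDRAC could not restart the machine, I asked the data center's remote hands to power-cycle it. It came back up, so the power supply and motherboard appear to be fine. From here on everything could be handled through iDRAC.
Taking it step by step, I first changed System Profile to Performance in the BIOS.
Then I upgraded the BIOS from 1.5.2 to 2.1.2.
The process was as follows: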
# ./BIOS_R5R32_LN_2.1.2.BIN
Collecting inventory...
....
Running validation...
BIOS
The version of this Update Package is newer than the currently installed version.
Software application name: BIOS
Package version: 2.1.2
Installed version: 1.5.2
Continue? Y/N:Y
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER DELL PRODUCTS WHILE UPDATE IS IN PROGRESS.
THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
.............................................................................
The BIOS image file is successfully loaded. To successfully apply the BIOS update, do not shut down, cold reboot, power cycle, or turn off the system before the BIOS update is complete.
Reboot the system for the update to take effect.
Note: If OMSA is installed on the system, the OMSA data manager service stops if it is already running.
Would you like to reboot your system now?
Continue? Y/N:Y
Broadcast message from root@sudops.com (/dev/pts/0) at 23:16 ...
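The update package compares its own version with the installed one before proceeding. The same comparison can be sketched in shell, for example to check a group of machines before scheduling an upgrade. The function names `version_lt` and `bios_needs_update` are hypothetical; the sketch assumes GNU `sort -V` is available and that `dmidecode -s bios-version` is run as root.

```shell
# Return success when version $1 is strictly older than version $2.
# Relies on GNU sort's -V (version sort) ordering.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Return 0 if the installed BIOS is older than the target version in $1,
# 1 if it is the same or newer, 2 if dmidecode could not be run.
bios_needs_update() {
    local installed
    installed="$(dmidecode -s bios-version)" || return 2
    version_lt "$installed" "$1"
}
```

For the upgrade described here, `bios_needs_update 2.1.2` would succeed on a machine still running 1.5.2.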
After the reboot I logged in over SSH and found many entries like these in dmesg:
p4-clockmod: Warning: EST-capable CPU detected. The acpi-cpufreq module offers voltage scaling in addition of frequency scaling. You should use that instead of p4-clockmod, if possible.
p4-clockmod: Warning: EST-capable CPU detected. The acpi-cpufreq module offers voltage scaling in addition of frequency scaling. You should use that instead of p4-clockmod, if possible.
So the fix found on Google does seem to be necessary, and I ran the two commands:
# modprobe -r p4-clockmod
# modprobe acpi-cpufreq
FATAL: Error inserting acpi_cpufreq (/lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.ko): No such device
The load failed. At first it looked as if the module file could not be found, yet the file is clearly there (note that the error is "No such device", not a missing file):
# ls -l /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/*
-rwxr--r--. 1 root root 23672 Nov 9 2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.ko
-rwxr--r--. 1 root root 5824 Nov 9 2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/mperf.ko
-rwxr--r--. 1 root root 12160 Nov 9 2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/p4-clockmod.ko
-rwxr--r--. 1 root root 18552 Nov 9 2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/pcc-cpufreq.ko
-rwxr--r--. 1 root root 41704 Nov 9 2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/powernow-k8.ko
-rwxr--r--. 1 root root 13120 Nov 9 2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/speedstep-lib.ko
# modprobe -l acpi-cpufreq
kernel/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.ko
Back to Google...
I found an article on jaseywang.me explaining that no cpufreq module can be loaded while the BIOS is in Performance mode:
1. Performance Per Watt(DAPC): System DBPM(DAPC)
No module can be loaded in this mode:
# cpuspeed
Error: Could not find any CPUFreq controlled CPU cores to manage
# /etc/init.d/cpuspeed status
cpuspeed is stopped
2. Performance Per Watt(OS): OS DBPM
After boot, the system loads acpi_cpufreq automatically:
# lsmod | grep cpu
cpufreq_ondemand 10544 24
acpi_cpufreq 7891 1
freq_table 4881 2 cpufreq_ondemand,acpi_cpufreq
mperf 1557 1 acpi_cpufreq
# /etc/init.d/cpuspeed status
Frequency scaling enabled using ondemand governor
3. Performance: Maximum Performance
No module can be loaded in this mode either
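So I went back into the BIOS and changed System Profile to Performance Per Watt(OS): OS DBPM.
After another reboot the dmesg output was clean. The problem appears to be solved, although only time will tell.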
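Besides reading dmesg, the result can be confirmed by asking sysfs which cpufreq driver is bound to the CPU. A minimal sketch, assuming the standard sysfs cpufreq layout under /sys/devices/system/cpu; the function name `cpufreq_status` and the `SYSFS_CPU` override are hypothetical and exist only for illustration and testing.

```shell
# Print the cpufreq driver and governor bound to cpu0.
# Returns 1 when no cpufreq driver is bound, which is what one would expect
# under the DAPC and Performance profiles described above.
# SYSFS_CPU is a hypothetical override pointing at a fake sysfs tree.
cpufreq_status() {
    local base="${SYSFS_CPU:-/sys/devices/system/cpu}/cpu0/cpufreq"
    if [ ! -r "$base/scaling_driver" ]; then
        echo "no cpufreq driver bound"
        return 1
    fi
    echo "driver=$(cat "$base/scaling_driver") governor=$(cat "$base/scaling_governor")"
}
```

Under the OS DBPM profile, and given the lsmod and cpuspeed output quoted above, the function should report acpi-cpufreq with the ondemand governor.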
While troubleshooting I found the cpufreq_setup documentation particularly useful:
https://access.redhat.com/site/documentation/zh-CN/Red_Hat_Enterprise_Linux/6/html/Power_Management_Guide/cpufreq_setup.html
Also, the iDRAC command line really does offer a lot of options.
For example, the SEL log retrieved through iDRAC looks like this:
racadm>>getsel
racadm getsel
-------------------------------------------------------------------------------
Record: 2
Date/Time: 05/22/2014 12:44:33
Source: system
Severity: Critical
Description: CPU 1 has an internal error (IERR).
-------------------------------------------------------------------------------
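Since the first hang went unexplained for lack of logs, it may be worth pulling critical SEL entries out automatically. The sketch below filters `getsel` output with awk. It assumes each record has the field layout shown above, with Severity appearing before Description; `sel_critical` is a hypothetical helper name.

```shell
# Read racadm getsel output on stdin and print "date time<TAB>description"
# for every record whose Severity is Critical.
# Assumes Severity precedes Description within a record, as in the sample above.
sel_critical() {
    awk '
        { sub(/\r$/, "") }                      # tolerate CRLF line endings
        /^Date\/Time:/  { sub(/^Date\/Time:[ \t]*/, "");  when = $0 }
        /^Severity:/    { sub(/^Severity:[ \t]*/, "");    sev = $0 }
        /^Description:/ { sub(/^Description:[ \t]*/, "");
                          if (sev == "Critical") print when "\t" $0 }
    '
}
```

A possible use is piping the command output into it, for example `racadm getsel | sel_critical` on a host where racadm is installed.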
Other help output:
/admin1-> help
[Usage]
show [<options>] [<target>] [<properties>] [<propertyname>== <propertyvalue>]
set [<options>] [<target>] <propertyname>=<value>
cd [<options>] [<target>]
create [<options>] <target> [<property of new target>=<value>] [<property of new target>=<value>]
delete [<options>] <target>
exit [<options>]
reset [<options>] [<target>]
start [<options>] [<target>]
stop [<options>] [<target>]
version [<options>]
help [<options>] [<help topics>]
load -source <URI> [<options>] [<target>]
dump -destination <URI> [<options>] [<target>]
/admin1-> racadm
racadm>>help
racadm help
help [subcommand] -- display usage summary for a subcommand
arp -- display the networking ARP table
clearasrscreen -- clear the last ASR (crash) screen
closessn -- close a session
clrraclog -- clear the RAC log
clrsel -- clear the System Event Log (SEL)
config -- Deprecated: modify RAC configuration properties
coredump -- display the last RAC coredump
coredumpdelete -- delete the last RAC coredump
eventfilters -- Alerts configuration commands
fwupdate -- update the RAC firmware
get -- display RAC configuration properties
getconfig -- Deprecated: display RAC configuration properties
getled -- Get the state of the LED on a module.
getniccfg -- display current network settings
getraclog -- display the RAC log
getractime -- display the current RAC time
getsel -- display records from the System Event Log (SEL)
getsensorinfo -- display system sensors
getssninfo -- display session information
getsvctag -- display service tag information
getsysinfo -- display general RAC and system information
gettracelog -- display the RAC diagnostic trace log
getuscversion -- display the current USC version details
getversion -- display the current version details
ifconfig -- display network interface information
inlettemphistory -- inlet temperature history operations
lclog -- LCLog operations
frontpanelerror -- hide LCD errors - color amber to blue
netstat -- display routing table and network statistics
ping -- send ICMP echo packets on the network
ping6 -- send ICMP echo packets on the network
racdump -- display RAC diagnostic information
racreset -- perform a RAC reset operation
racresetcfg -- restore the RAC configuration to factory defaults
remoteimage -- make a remote ISO image available to the server
serveraction -- perform system power management operations
set -- modify RAC configuration properties
setled -- Set the state of the LED on a module.
setniccfg -- modify network configuration properties
sshpkauth -- manage SSH PK authentication keys on the RAC
sslcertdelete -- delete an SSL certificate on the iDRAC
sslcertview -- view SSL certificate information
sslcsrgen -- generate a certificate CSR from the RAC
sslresetcfg -- resets the web certificate to default and restarts the web server.
testemail -- test RAC e-mail notifications
testtrap -- test RAC SNMP trap notifications
testalert -- test RAC SNMP - FQDN trap notifications
traceroute -- print the route packets trace to network host
traceroute6 -- print the route packets trace to network host
usercertview -- view user certificate information
vflashpartition -- manage partitions on the vFlash SD card
vflashsd -- perform vFlash SD Card initialization
vmdisconnect -- disconnect Virtual Media connections
vmkey -- Deprecated: perform vFlash operations
license -- License Manager commands
debug -- Field Service Debug Authorization facility commands
raid -- Monitoring and Inventory of H/W RAID connected to the server.
hwinventory -- Monitoring and Inventory of H/W NICs connected to the server.
nicstatistics -- Statistics for NICs connected to the server.
fcstatistics -- Statistics for FCs connected to the server.
update -- Platform Update of the devices on the server
jobqueue -- Jobqueue of of the jobs currently scheduled
systemconfig -- Backup &/or Restore of iDRAC Config and Firmware
Groups
idRacInfo -- Information about iDRAC being queried
cfgRemoteHosts -- Properties for configuration of the SMTP server
cfgUserAdmin -- Information about iDRAC users
cfgEmailAlert -- Parameters to configure e-mail alerting capabilities
cfgSessionManagement -- Information of the session Properties
cfgSerial -- Provides configuration parameters for the iDRAC
cfgOobSnmp -- Configuration of the SNMP agent and trap capabilities
cfgRacTuning -- Configuration for various iDRAC properties.
ifcRacManagedNodeOs -- Properties of the managed server OS
cfgRacSecurity -- Configure SSL certificate signing request settings
cfgRacVirtual -- Configuration Properties for iDRAC Virtual Media
cfgActiveDirectory -- Configuration of the iDRAC Active Directory feature
cfgLDAP -- Configuration properties for LDAP settings
cfgLdapRoleGroup -- Configuration of role groups for LDAP
cfgLogging -- Group Description for group cfgLogging
cfgStandardSchema -- Configuration of AD standard schema settings
cfgIpmiSerial -- Properties to configure the IPMI serial interface
cfgIpmiSol -- Configuration the SOL capabilities of the system
cfgIpmiLan -- Configuration the IPMI over LAN of the system
cfgIpmiPef -- Configuration the platform event filters
cfgServerPower -- Provides power management features
cfgServerPowerSupply -- Provides information related to the power supplies
cfgVFlashSD -- Configure the properties for the vFlash SD card
cfgVFlashPartition -- Configure partitions on the vFlash SD Card
cfgUserDomain -- Configure the Active Directory user domain names
cfgSmartCard -- Properties to access iDRAC using a smart card
cfgServerInfo -- Configuration of first boot device
cfgSensorRedundancy -- Configure the power supply redundancy
cfgLanNetworking -- Parameters to configure the iDRAC NIC
cfgStaticLanNetworking -- Parameters to configure the iDRAC NIC
cfgNetTuning -- Group Description for group cfgNetTuning
cfgIPv6LanNetworking -- Configuration of the IPv6 over LAN networking
cfgIPv6StaticLanNetworking -- Configuration of the IPv6 over LAN networking
cfgIPv6URL -- Configuration of the iDRAC IPv6 URL.
For Help on configuring the properties of a group - racadm help config
-----------------------------------------------------------------------