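[Summary] A 12th-generation Dell PowerEdge R420 server suddenly hung and stopped responding. iDRAC was still reachable, but a reset issued through iDRAC had no effect at all. I remember an identical machine going down once before; since I could not capture any useful system logs that time, I did not pay much attention to it.
This time an error showed up in the log: "CPU 1 has an internal error (IERR)". Because the system uses keepalived for high availability, losing one node does not affect the service, so there was no rush and it was a good opportunity to track down the cause.
While searching Google I also called Dell gold support at 400-886-8618. After hearing my description, the support engineer made the following suggestions: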
(1) In the BIOS, go to System Profile Settings -> System Profile and change it to Performance
(2) Upgrade the BIOS version: BIOS download page
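Before applying suggestion (2) it can help to confirm that the installed BIOS really is older than the package. The following is a minimal sketch, not a Dell tool: the function names `version_lt` and `bios_update_verdict` are made up for illustration, and the comparison assumes GNU `sort` with the `-V` option.

```shell
#!/bin/sh
# Hypothetical helper (not a Dell tool): succeeds when version $1 is strictly
# older than version $2. Relies on GNU sort's -V (version sort) option.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Print a short verdict for an installed version ($1) against a target ($2).
bios_update_verdict() {
    if version_lt "$1" "$2"; then
        echo "BIOS $1 is older than $2: update applies"
    else
        echo "BIOS $1 is already at or above $2"
    fi
}
```

On the server itself this could be called as `bios_update_verdict "$(dmidecode -s bios-version)" 2.1.2`, run as root, assuming `dmidecode` is installed.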
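Google results also say that power management on Dell 12th-generation servers is problematic, and recommend using the acpi-cpufreq power management module:
# modprobe -r p4-clockmod
# modprobe acpi-cpufreq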
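A rough way to see which cpufreq driver the kernel actually ended up with is to read sysfs. The sketch below assumes the usual path `/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver`; the function name `cpufreq_driver` and its optional root-prefix argument are made up for illustration (the prefix only exists so the function can be tried against a fake directory tree).

```shell
#!/bin/sh
# Hypothetical helper: print the cpufreq driver bound to cpu0, or "none"
# when the kernel exposes no cpufreq directory for that CPU.
# $1 is an optional prefix in front of /sys (empty on a real system).
cpufreq_driver() {
    root="${1:-}"
    f="$root/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "none"
    fi
}
```

Calling `cpufreq_driver` with no argument reads the live system. A result of `none` would suggest that no cpufreq driver is currently managing the CPU.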
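Since iDRAC could not restart the machine, I asked the data center's remote hands to power-cycle it. To my surprise it came back up, so the power supply and motherboard appear to be fine. From here on it gets easier, because everything else can be done through iDRAC.
Taking it one step at a time, I first changed System Profile to Performance in the BIOS.
Then I upgraded the BIOS from version 1.5.2 to 2.1.2.
The process looked like this: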
# ./BIOS_R5R32_LN_2.1.2.BIN
Collecting inventory...
....
Running validation...
BIOS
The version of this Update Package is newer than the currently installed version.
Software application name: BIOS
Package version: 2.1.2
Installed version: 1.5.2
Continue? Y/N:Y
Executing update...
WARNING: DO NOT STOP THIS PROCESS OR INSTALL OTHER DELL PRODUCTS WHILE UPDATE IS IN PROGRESS. THESE ACTIONS MAY CAUSE YOUR SYSTEM TO BECOME UNSTABLE!
.............................................................................
The BIOS image file is successfully loaded. To successfully apply the BIOS update, do not shut down, cold reboot, power cycle, or turn off the system before the BIOS update is complete.
Reboot the system for the update to take effect.
Note: If OMSA is installed on the system, the OMSA data manager service stops if it is already running.
Would you like to reboot your system now?
Continue? Y/N:Y
Broadcast message from root@sudops.com (/dev/pts/0) at 23:16 ...
After the reboot I logged in over ssh and found many entries like these in dmesg:
p4-clockmod: Warning: EST-capable CPU detected. The acpi-cpufreq module offers voltage scaling in addition of frequency scaling. You should use that instead of p4-clockmod, if possible.
p4-clockmod: Warning: EST-capable CPU detected. The acpi-cpufreq module offers voltage scaling in addition of frequency scaling. You should use that instead of p4-clockmod, if possible.
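To get a feel for how often this warning appears, the kernel log can be filtered and counted. This is only a sketch: the function name `count_p4_warnings` is hypothetical, and it assumes each warning occupies a single log line.

```shell
#!/bin/sh
# Hypothetical helper: count kernel log lines carrying the p4-clockmod
# warning. Reads log text on stdin, so it works with `dmesg` or a saved file.
count_p4_warnings() {
    # grep -c prints the count; it exits non-zero on zero matches,
    # so "|| true" keeps the function's exit status clean.
    grep -c 'p4-clockmod: Warning: EST-capable CPU detected' || true
}
```

Typical use would be `dmesg | count_p4_warnings`, once before and once after changing the BIOS profile, to compare the two counts.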
So the fix found on Google does look necessary, and I ran the two commands:
# modprobe -r p4-clockmod
# modprobe acpi-cpufreq
FATAL: Error inserting acpi_cpufreq (/lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.ko): No such device
It failed. My first reading was that the file could not be found, but the file is clearly right there, so how could it be missing? (Strictly speaking the message is "No such device", not "No such file or directory", which suggests the module loaded from disk but found no hardware interface it could drive.)
# ls -l /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/*
-rwxr--r--. 1 root root 23672 Nov 9 2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.ko
-rwxr--r--. 1 root root 5824 Nov 9 2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/mperf.ko
-rwxr--r--. 1 root root 12160 Nov 9 2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/p4-clockmod.ko
-rwxr--r--. 1 root root 18552 Nov 9 2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/pcc-cpufreq.ko
-rwxr--r--. 1 root root 41704 Nov 9 2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/powernow-k8.ko
-rwxr--r--. 1 root root 13120 Nov 9 2011 /lib/modules/2.6.32-220.el6.x86_64/kernel/arch/x86/kernel/cpu/cpufreq/speedstep-lib.ko
# modprobe -l acpi-cpufreq
kernel/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.ko
Back to Google...
I found an article on jaseywang.me which explains that no cpufreq module can be loaded in Performance mode:
1. Performance Per Watt(DAPC): System DBPM(DAPC)
No cpufreq module can be loaded in this mode:
# cpuspeed
Error: Could not find any CPUFreq controlled CPU cores to manage
# /etc/init.d/cpuspeed status
cpuspeed is stopped
2. Performance Per Watt(OS): OS DBPM
After boot, the system has loaded acpi_cpufreq automatically:
# lsmod | grep cpu
cpufreq_ondemand 10544 24
acpi_cpufreq 7891 1
freq_table 4881 2 cpufreq_ondemand,acpi_cpufreq
mperf 1557 1 acpi_cpufreq
# /etc/init.d/cpuspeed status
Frequency scaling enabled using ondemand governor
3. Performance: Maximum Performance
No cpufreq module can be loaded in this mode either
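The three profiles above can be told apart from the OS side by checking whether `acpi_cpufreq` shows up in `lsmod`. A small sketch of that check follows; the function name `has_module` is hypothetical, and it reads `lsmod`-style text on stdin.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the module named in $1 appears in the
# first column of lsmod-style output read from stdin.
# Use the underscore form that lsmod prints, e.g. acpi_cpufreq.
has_module() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit found ? 0 : 1 }'
}
```

Usage would be something like `lsmod | has_module acpi_cpufreq && echo "OS controls frequency scaling"`. Matching only the first column avoids a false hit on lines such as the `freq_table` entry, where `acpi_cpufreq` is merely listed as a user of another module.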
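So I went back into the BIOS and changed System Profile to Performance Per Watt(OS): OS DBPM.
After another reboot dmesg looks normal. The problem appears to be solved, though only time will tell!
While troubleshooting I found the cpufreq_setup usage guide particularly valuable: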
https://access.redhat.com/site/documentation/zh-CN/Red_Hat_Enterprise_Linux/6/html/Power_Management_Guide/cpufreq_setup.html
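Once the OS controls frequency scaling, the governor in use on each CPU can be read from sysfs. The sketch below assumes the standard `scaling_governor` files under `/sys/devices/system/cpu/cpuN/cpufreq/`; the function name `list_governors` and its optional root-prefix argument are invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: print "<cpu> <governor>" for every CPU that exposes
# a cpufreq directory. $1 is an optional prefix in front of /sys
# (empty on a real system).
list_governors() {
    root="${1:-}"
    for d in "$root"/sys/devices/system/cpu/cpu[0-9]*/cpufreq; do
        # Skip CPUs without cpufreq support (and an unmatched glob).
        [ -r "$d/scaling_governor" ] || continue
        cpu="${d%/cpufreq}"   # strip the trailing /cpufreq
        cpu="${cpu##*/}"      # keep only the cpuN component
        printf '%s %s\n' "$cpu" "$(cat "$d/scaling_governor")"
    done
}
```

Running `list_governors` with no argument inspects the live system. Empty output would be consistent with a BIOS profile that keeps frequency control away from the OS.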
Also, the Dell iDRAC command line really does offer a lot of options.
For example, the SEL log retrieved through iDRAC looks like this:
racadm>>getsel
racadm getsel
-------------------------------------------------------------------------------
Record: 2
Date/Time: 05/22/2014 12:44:33
Source: system
Severity: Critical
Description: CPU 1 has an internal error (IERR).
-------------------------------------------------------------------------------
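When the SEL holds many records, it can be handy to pull out only the critical ones. The following is a sketch under the assumption that a saved copy of the `getsel` output has one `Field: value` pair per line with dashed separators between records; the function name `critical_sel_events` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read saved `racadm getsel` output on stdin and print
# "<date/time> | <description>" for each record whose Severity is Critical.
critical_sel_events() {
    awk '
        { sub(/\r$/, "") }                       # tolerate CRLF captures
        /^-+$/          { when = ""; sev = ""; next }   # record separator
        /^Date\/Time:/  { sub(/^Date\/Time:[ \t]*/, ""); when = $0; next }
        /^Severity:/    { sub(/^Severity:[ \t]*/, "");   sev = $0;  next }
        /^Description:/ {
            sub(/^Description:[ \t]*/, "")
            if (sev == "Critical") print when " | " $0
            next
        }
    '
}
```

With the log saved to a file, for example `sel.txt` (an assumed name), the call would be `critical_sel_events < sel.txt`.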
Other help output:
/admin1-> help
[Usage]
show [<options>] [<target>] [<properties>]
[<propertyname>== <propertyvalue>]
set [<options>] [<target>] <propertyname>=<value>
cd [<options>] [<target>]
create [<options>] <target> [<property of new target>=<value>]
[<property of new target>=<value>]
delete [<options>] <target>
exit [<options>]
reset [<options>] [<target>]
start [<options>] [<target>]
stop [<options>] [<target>]
version [<options>]
help [<options>] [<help topics>]
load -source <URI> [<options>] [<target>]
dump -destination <URI> [<options>] [<target>]
/admin1-> racadm
racadm>>help
racadm help
help [subcommand] -- display usage summary for a subcommand
arp -- display the networking ARP table
clearasrscreen -- clear the last ASR (crash) screen
closessn -- close a session
clrraclog -- clear the RAC log
clrsel -- clear the System Event Log (SEL)
config -- Deprecated: modify RAC configuration properties
coredump -- display the last RAC coredump
coredumpdelete -- delete the last RAC coredump
eventfilters -- Alerts configuration commands
fwupdate -- update the RAC firmware
get -- display RAC configuration properties
getconfig -- Deprecated: display RAC configuration properties
getled -- Get the state of the LED on a module.
getniccfg -- display current network settings
getraclog -- display the RAC log
getractime -- display the current RAC time
getsel -- display records from the System Event Log (SEL)
getsensorinfo -- display system sensors
getssninfo -- display session information
getsvctag -- display service tag information
getsysinfo -- display general RAC and system information
gettracelog -- display the RAC diagnostic trace log
getuscversion -- display the current USC version details
getversion -- display the current version details
ifconfig -- display network interface information
inlettemphistory -- inlet temperature history operations
lclog -- LCLog operations
frontpanelerror -- hide LCD errors - color amber to blue
netstat -- display routing table and network statistics
ping -- send ICMP echo packets on the network
ping6 -- send ICMP echo packets on the network
racdump -- display RAC diagnostic information
racreset -- perform a RAC reset operation
racresetcfg -- restore the RAC configuration to factory defaults
remoteimage -- make a remote ISO image available to the server
serveraction -- perform system power management operations
set -- modify RAC configuration properties
setled -- Set the state of the LED on a module.
setniccfg -- modify network configuration properties
sshpkauth -- manage SSH PK authentication keys on the RAC
sslcertdelete -- delete an SSL certificate on the iDRAC
sslcertview -- view SSL certificate information
sslcsrgen -- generate a certificate CSR from the RAC
sslresetcfg -- resets the web certificate to default and restarts the web server.
testemail -- test RAC e-mail notifications
testtrap -- test RAC SNMP trap notifications
testalert -- test RAC SNMP - FQDN trap notifications
traceroute -- print the route packets trace to network host
traceroute6 -- print the route packets trace to network host
usercertview -- view user certificate information
vflashpartition -- manage partitions on the vFlash SD card
vflashsd -- perform vFlash SD Card initialization
vmdisconnect -- disconnect Virtual Media connections
vmkey -- Deprecated: perform vFlash operations
license -- License Manager commands
debug -- Field Service Debug Authorization facility commands
raid -- Monitoring and Inventory of H/W RAID connected to the server.
hwinventory -- Monitoring and Inventory of H/W NICs connected to the server.
nicstatistics -- Statistics for NICs connected to the server.
fcstatistics -- Statistics for FCs connected to the server.
update -- Platform Update of the devices on the server
jobqueue -- Jobqueue of of the jobs currently scheduled
systemconfig -- Backup &/or Restore of iDRAC Config and Firmware
Groups
idRacInfo -- Information about iDRAC being queried
cfgRemoteHosts -- Properties for configuration of the SMTP server
cfgUserAdmin -- Information about iDRAC users
cfgEmailAlert -- Parameters to configure e-mail alerting capabilities
cfgSessionManagement -- Information of the session Properties
cfgSerial -- Provides configuration parameters for the iDRAC
cfgOobSnmp -- Configuration of the SNMP agent and trap capabilities
cfgRacTuning -- Configuration for various iDRAC properties.
ifcRacManagedNodeOs -- Properties of the managed server OS
cfgRacSecurity -- Configure SSL certificate signing request settings
cfgRacVirtual -- Configuration Properties for iDRAC Virtual Media
cfgActiveDirectory -- Configuration of the iDRAC Active Directory feature
cfgLDAP -- Configuration properties for LDAP settings
cfgLdapRoleGroup -- Configuration of role groups for LDAP
cfgLogging -- Group Description for group cfgLogging
cfgStandardSchema -- Configuration of AD standard schema settings
cfgIpmiSerial -- Properties to configure the IPMI serial interface
cfgIpmiSol -- Configuration the SOL capabilities of the system
cfgIpmiLan -- Configuration the IPMI over LAN of the system
cfgIpmiPef -- Configuration the platform event filters
cfgServerPower -- Provides power management features
cfgServerPowerSupply -- Provides information related to the power supplies
cfgVFlashSD -- Configure the properties for the vFlash SD card
cfgVFlashPartition -- Configure partitions on the vFlash SD Card
cfgUserDomain -- Configure the Active Directory user domain names
cfgSmartCard -- Properties to access iDRAC using a smart card
cfgServerInfo -- Configuration of first boot device
cfgSensorRedundancy -- Configure the power supply redundancy
cfgLanNetworking -- Parameters to configure the iDRAC NIC
cfgStaticLanNetworking -- Parameters to configure the iDRAC NIC
cfgNetTuning -- Group Description for group cfgNetTuning
cfgIPv6LanNetworking -- Configuration of the IPv6 over LAN networking
cfgIPv6StaticLanNetworking -- Configuration of the IPv6 over LAN networking
cfgIPv6URL -- Configuration of the iDRAC IPv6 URL.
For Help on configuring the properties of a group - racadm help config
-----------------------------------------------------------------------
