Hardware Error: Memory Error Reports
2018-01-22 17:51
On 192.168.219.90, running dmesg | grep -i error showed that this machine has a memory problem, as shown below:
[Hardware Error]: MC4 Error (node 1): L3 cache tag error.
[Hardware Error]: Error Status: Corrected error, no action required.
[Hardware Error]: MC4_ADDR: 0x00000018edfd9100
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: SNP
[Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8cf6cb900
[Hardware Error]: Error Status: Corrected error, no action required.
[Hardware Error]: MC4_ADDR: 0x00000008cf6cb900
[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
[Hardware Error]: MC4 Error (node 2): DRAM ECC error detected on the NB.
EDAC amd64 MC2: CE ERROR_ADDRESS= 0x8cf6cb900
[Hardware Error]: Error Status: Corrected error, no action required.
[Hardware Error]: MC4_ADDR: 0x00000008cf6cb900
[Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)
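When the log is long, it can help to count how many MC4 errors each node reported before digging further. The following is a minimal sketch, not part of the original troubleshooting: the function name summarize_mc4_errors is hypothetical, and it assumes the log lines use the same "MC4 Error (node N)" wording as the output above.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and prints one
# "node N: COUNT" line per node that appears in an "MC4 Error (node N)" line.
summarize_mc4_errors() {
    sed -n 's/.*MC4 Error (node \([0-9][0-9]*\)).*/\1/p' |
        sort -n |
        uniq -c |
        awk '{ printf "node %s: %s\n", $2, $1 }'
}
```

A typical call would be dmesg | summarize_mc4_errors, which usually needs root on systems that restrict access to the kernel log.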
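Further investigation showed that the 5th memory module is faulty, so we need to contact the private cloud team to request a repair.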
grep [0-9] /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc1/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc2/csrow2/ch0_ce_count:146
/sys/devices/system/edac/mc/mc2/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc3/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc4/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc5/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc6/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc7/csrow2/ch1_ce_count:0
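A line whose count is non-zero indicates that corrected memory errors were recorded there.
mc: the memory controller index (on this machine, one per CPU node).
csrow: the chip-select row under that controller.
ch*: the channel within that row.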
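With many controllers, the zero counters drown out the one that matters. The sketch below prints only the non-zero counters. It is an illustration rather than the author's method: the function name nonzero_ce_counts and the EDAC_ROOT variable are hypothetical, and the default path assumes the same sysfs layout as the output above.

```shell
#!/bin/sh
# Assumption: EDAC counters live under the standard sysfs path used above.
EDAC_ROOT="${EDAC_ROOT:-/sys/devices/system/edac/mc}"

# Hypothetical helper: print "path:count" for every corrected-error
# counter whose value is greater than zero.
nonzero_ce_counts() {
    root="${1:-$EDAC_ROOT}"
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        # Skip the unexpanded glob when no EDAC driver is loaded.
        [ -r "$f" ] || continue
        count=$(cat "$f")
        # Ignore non-numeric content instead of failing on it.
        if [ "$count" -gt 0 ] 2>/dev/null; then
            printf '%s:%s\n' "$f" "$count"
        fi
    done
}

nonzero_ce_counts
```

On the machine in this post, the only line printed would be the mc2/csrow2/ch0_ce_count entry with a count of 146.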
Then check the memory slots with dmidecode:
[root@customer log]# dmidecode -t memory |grep 'Locator: DIMM'
Locator: DIMM01
Locator: DIMM02
Locator: DIMM03
Locator: DIMM04
Locator: DIMM05
Locator: DIMM06
Locator: DIMM07
Locator: DIMM08
Locator: DIMM09
Locator: DIMM10
Locator: DIMM11
Locator: DIMM12
Locator: DIMM13
Locator: DIMM14
Locator: DIMM15
Locator: DIMM16
Locator: DIMM17
Locator: DIMM18
Locator: DIMM19
Locator: DIMM20
Locator: DIMM21
Locator: DIMM22
Locator: DIMM23
Locator: DIMM24
Locator: DIMM25
Locator: DIMM26
Locator: DIMM27
Locator: DIMM28
Locator: DIMM29
Locator: DIMM30
Locator: DIMM31
Locator: DIMM32
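The locator list alone does not show which slots are actually populated. As a rough sketch, the Size and Locator fields of each Memory Device block can be paired up. The function name dimm_sizes is hypothetical, and the code assumes that Size appears before Locator inside each block, which may not hold for every dmidecode version.

```shell
#!/bin/sh
# Hypothetical helper: reads "dmidecode -t memory" output on stdin and
# prints "LOCATOR<TAB>SIZE" for each memory device.
# "Bank Locator:" lines are not matched because they do not start with "Locator:".
dimm_sizes() {
    awk -F': ' '
        /^[[:space:]]*Size:/    { size = $2 }
        /^[[:space:]]*Locator:/ { print $2 "\t" size }
    '
}
```

It would be used as dmidecode -t memory | dimm_sizes, run as root like the command above.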
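Check the memory through the server console:
Layout of the memory slots on the motherboard:
Combined with the error log: kernel: EDAC MC1: 16107 CE error on CPU#1Channel#2_DIMM#1 (channel:2slot:1
The problem should be in memory slot DIMM_F1.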
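Solution:
The last step is to pull the memory module out of the faulty F1 slot, or move it to another memory slot. After that, the system boots without reporting the error.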