[知其然不知其所以然-18] handle_mm_fault, bad page等错误现场分析
2016-02-25 15:23
645 查看
Someone using centos 7 with 3.10.123.13.1 says there is a bug report:
[3550602.146301] kernel BUG at mm/memory.c:3765!
/* THP should already have been handled */
BUG_ON(pmd_numa(*pmd));
After checking the git log of mm/memory.c, we found exactly the fix for this problem:
The main scenario here is that, someone may modify the attribute of a normal pmd
to a numa pmd when handle_mm_fault is checking the value of old_pmd,
so when we comes to fetch current pmd value by *pmd, it might be changed to a
numa one, so it is triggered. The key to this problem is that, either handle_mm_fault
or modification of numa_pmd is protected by down_read(mmap_sem), so there is no
exclusive lock between them, they are not aware of each others.
see comments of lock_pte_protection:
do_migrate_pages->check_range->change_prot_numa->change_pte_range->ptep_set_numa
Then there is other bug reported such as:
First , let;s check
reclaimed, say, if a process need to exit, he must make sure all
MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS have been
reclaimed. According to our test, the number of MM_ANONPAGES
is 5, whilc the number of MM_SWAPENTS is -5, this is ridiculous.
then we see:
finds the counter of swap_map for this swap page is zero, which means,
this swap slot has already been released by others:
since its value is 00053980, composed of swap type and swap offset.
So let's start the kdump parser.
according to crash dmesg, the problem is firstly triggered when enable the swap partition:
let' check swap_info array, which contains the swap areas in the system.
I've seen similar bugs reported at:
http://linux-kernel.2935.n7.nabble.com/unused-swap-offset-bad-page-map-td698180.html
however per Hugh's suggestion, this problem described above is due to incorrect setting of pte entry in meremap,
but in our 3.10 this bug has not been introduced yet, so we have to check other clues.
and also per Linus and Hugh's experience:
Well, it looks like we have some trouble on finding out the crime culprit.
Let's launch the crash kdump, and we first check current running process
on each cpu, and we found the last panic process is kswapd, so:
let's dis the function of page_lock_anon_vma_read and we get:
let's backtrace page_lock_anon_vma_read:
then
1.anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
PAGE_MAPPING_ANON equals 1.
2. we acuires value from r12 pointer.
the r14 is the value of root_anon_vma
then
and the panic place for EIP:
it is trying to access edi,
sem is invalid, according to register context,we know
edi is 0600380100804d3c, so it means sem is 0600380100804d3c, this is absolutely wrong.
since the r14 comes from r12, let's check r12-1 value , which is the value of
struct anon_vma *root_anon_vma, and r12 is ffff88204d1d5841:
WTF? obviously there is memory overlap around ffff88204d1d5841, let's check this struct:
Since anon_vma is highly connected with swap, so we check the counter of swap_map again,
let's figure out how swap_map increases.
initial value for swap_map entry is zero, when a process's pages are swaped out
by shrink_list, a spare swap slot must be found out to store the page content.
Besides swap error, there is also a suspected warning on mm->rss_conter:
[3626234.775485] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:1 val:5 [3626234.775529] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:2 val:-5It is too hard for us to track the root cause because the problem maybe a chain reaction for other
bugs. And according to search result, here is an possible fix for this issue:
http://www.serverphorums.com/read.php?12,950567
which says:
I saw a bunch of similar errors on 3.14.x up to and including 3.14.4,
running Java (Tomcat) and Postgres on Xen PV. Have not seen it since
"mm: use paravirt friendly ops for NUMA hinting ptes" landed in 3.14.5.
402e194dfc5b38d99f9c65b86e2666b29adebf8c in stable,
29c7787075c92ca8af353acd5301481e6f37082f upstream
This bug can break XEN on numa, so it can also break KVM, since it is a general bug.
[3550602.146301] kernel BUG at mm/memory.c:3765!
/* THP should already have been handled */
BUG_ON(pmd_numa(*pmd));
After checking the git log of mm/memory.c, we found exactly the fix for this problem:
commit 107437febd495a50e2cd09c81bbaa84d30e57b07 Author: Rik van Riel <riel@redhat.com> Date: Tue Apr 29 15:36:15 2014 -0400 mm/numa: Remove BUG_ON() in __handle_mm_fault() Changing PTEs and PMDs to pte_numa & pmd_numa is done with the mmap_sem held for reading, which means a pmd can be instantiated and turned into a numa one while __handle_mm_fault() is examining the value of old_pmd. If that happens, __handle_mm_fault() should just return and let the page fault retry, instead of throwing an oops. This is handled by the test for pmd_trans_huge(*pmd) below.
The main scenario here is that, someone may modify the attribute of a normal pmd
to a numa pmd when handle_mm_fault is checking the value of old_pmd,
so when we comes to fetch current pmd value by *pmd, it might be changed to a
numa one, so it is triggered. The key to this problem is that, either handle_mm_fault
or modification of numa_pmd is protected by down_read(mmap_sem), so there is no
exclusive lock between them, they are not aware of each others.
see comments of lock_pte_protection:
/* * For a prot_numa update we only hold mmap_sem for read so there is a * potential race with faulting where a pmd was temporarily none. This * function checks for a transhuge pmd under the appropriate lock. It * returns a pte if it was successfully locked or NULL if it raced with * a transhuge insertion. */for example, the following path:
do_migrate_pages->check_range->change_prot_numa->change_pte_range->ptep_set_numa
Then there is other bug reported such as:
[3557075.805696] kvm: zapping shadow pages for mmio generation wraparound [3625477.198897] Adding 4194300k swap on /dev/mapper/os-swap. Priority:-1 extents:1 across:4194300k FS [3626233.759633] traps: ceilometer-agen[3417] general protection ip:7f184c3f039e sp:7fff42d5fbc0 error:0 in libc-2.17.so[7f184c372000+1b6000] [3626234.775485] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:1 val:5 [3626234.775529] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:2 val:-5 [3645969.174217] systemd-journald[1082]: Vacuuming done, freed 0 bytes [3801900.193751] swap_free: Unused swap offset entry 00053980 [3801900.193773] BUG: Bad page map in process qemu-kvm pte:0a730000 pmd:4d6635067 [3801900.193791] addr:00007f4d20481000 vm_flags:a0120073 anon_vma:ffff88204d1d5840 mapping: (null) index:7f4d20481 [3801900.193820] CPU: 7 PID: 45513 Comm: qemu-kvm Not tainted 3.10.0-123.13.1.el7.x86_64 #1 [3801900.193822] Hardware name: To be filled by O.E.M. Tecal RH5885 V3/BC61BLCA, BIOS BLISV301 07/25/2014 [3801900.193824] 00007f4d20481000 00000000934bf4bb ffff88084c071998 ffffffff815e232e [3801900.193834] ffff88084c0719e0 ffffffff81167b28 000000000a730000 00000007f4d20481 [3801900.193839] ffff8804d6635408 000000000a730000 00007f4d20481000 00007f4d20600000 [3801900.193844] Call Trace: [3801900.193855] [<ffffffff815e232e>] dump_stack+0x19/0x1b [3801900.193862] [<ffffffff81167b28>] print_bad_pte+0x1a8/0x240 [3801900.193866] [<ffffffff81169490>] unmap_page_range+0x5b0/0x860 [3801900.193870] [<ffffffff811697c1>] unmap_single_vma+0x81/0xf0 [3801900.193873] [<ffffffff8116a889>] unmap_vmas+0x49/0x90 [3801900.193877] [<ffffffff8117353c>] exit_mmap+0xac/0x1a0 [3801900.193884] [<ffffffff8105b187>] mmput+0x67/0xf0 [3801900.193903] [<ffffffffa05a84d1>] vhost_dev_cleanup+0x231/0x260 [vhost_net] [3801900.193908] [<ffffffffa05ab2ae>] vhost_net_release+0x4e/0xc0 [vhost_net] [3801900.193915] [<ffffffff811b1239>] __fput+0xe9/0x270 [3801900.193919] [<ffffffff811b150e>] ____fput+0xe/0x10 [3801900.193925] [<ffffffff810823e4>] task_work_run+0xc4/0xe0 [3801900.193928] [<ffffffff81063deb>] do_exit+0x2cb/0xa60 [3801900.193934] [<ffffffff81070a53>] ? __sigqueue_free.part.11+0x33/0x40 [3801900.193937] [<ffffffff8107104c>] ? __dequeue_signal+0x13c/0x220 [3801900.193941] [<ffffffff810645ff>] do_group_exit+0x3f/0xa0 [3801900.193955] [<ffffffff81074010>] get_signal_to_deliver+0x1d0/0x6e0 [3801900.193961] [<ffffffff81012437>] do_signal+0x57/0x600 [3801900.193965] [<ffffffff81072ddc>] ? do_send_sig_info+0x6c/0xa0 [3801900.193969] [<ffffffff81012a49>] do_notify_resume+0x69/0xb0 [3801900.193974] [<ffffffff815f2cd2>] int_signal+0x12/0x17 [3801900.193976] Disabling lock debugging due to kernel taint
First , let;s check
[3626234.775485] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:1 val:5 [3626234.775529] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:2 val:-5corresponding codes are:
__mmdrop->check_mm
static void check_mm(struct mm_struct *mm) { int i; for (i = 0; i < NR_MM_COUNTERS; i++) { long x = atomic_long_read(&mm->rss_stat.count[i]); if (unlikely(x)) printk(KERN_ALERT "BUG: Bad rss-counter state " "mm:%p idx:%d val:%ld\n", mm, i, x); } }The code here is to check if all the resident page frames for mm have been
reclaimed, say, if a process need to exit, he must make sure all
MM_FILEPAGES, MM_ANONPAGES, MM_SWAPENTS have been
reclaimed. According to our test, the number of MM_ANONPAGES
is 5, whilc the number of MM_SWAPENTS is -5, this is ridiculous.
then we see:
[3801900.193751] swap_free: Unused swap offset entry 00053980According to backtrace, this should happen when:
static unsigned long zap_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, struct zap_details *details) { int rss[NR_MM_COUNTERS]; do { //in the memory, not swap if (pte_present(ptent)) { //normal pages pte_set(0) } //swap, if it is file page, it could only be a nonlinear page. //else it is an anon page. if (pte_file(ptent)) { if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) print_bad_pte(vma, addr, ptent, NULL); } else { //entry is composed of (swap_type(swap area), swap_offset(pgoff inside swap_area)) swp_entry_t entry = pte_to_swp_entry(ptent); if (unlikely(!free_swap_and_cache(entry))) print_bad_pte(vma, addr, ptent, NULL); } } }The problem is that, swap_info_get(entry) in free_swap_and_cache(entry)
finds the counter of swap_map for this swap page is zero, which means,
this swap slot has already been released by others:
static struct swap_info_struct *swap_info_get(swp_entry_t entry) { struct swap_info_struct *p; unsigned long offset, type; if (!entry.val) goto out; type = swp_type(entry); if (type >= nr_swapfiles) goto bad_nofile; p = swap_info[type]; if (!(p->flags & SWP_USED)) goto bad_device; offset = swp_offset(entry); if (offset >= p->max) goto bad_offset; if (!p->swap_map[offset]) goto bad_free; spin_lock(&p->lock); return p; bad_free: printk(KERN_ERR "swap_free: %s%08lx\n", Unused_offset, entry.val); goto out; bad_offset: printk(KERN_ERR "swap_free: %s%08lx\n", Bad_offset, entry.val); goto out; bad_device: printk(KERN_ERR "swap_free: %s%08lx\n", Unused_file, entry.val); goto out; bad_nofile: printk(KERN_ERR "swap_free: %s%08lx\n", Bad_file, entry.val); out: return NULL; }
since its value is 00053980, composed of swap type and swap offset.
So let's start the kdump parser.
crash /usr/lib/debug/lib/module/3.10.0-123.13.1.e7.x86_64/vmlinux vmcore
according to crash dmesg, the problem is firstly triggered when enable the swap partition:
[3625477.198897] Adding 4194300k swap on /dev/mapper/os-swap. Priority:-1 extents:1 across:4194300k FS
[3626233.759633] traps: ceilometer-agen[3417] general protection ip:7f184c3f039e sp:7fff42d5fbc0 error:0 in libc-2.17.so[7f184c372000+1b6000]
[3626234.775485] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:1 val:5 [3626234.775529] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:2 val:-5
[3645969.174217] systemd-journald[1082]: Vacuuming done, freed 0 bytes
[3801900.193751] swap_free: Unused swap offset entry 00053980
[3801900.193773] BUG: Bad page map in process qemu-kvm pte:0a730000 pmd:4d6635067
let' check swap_info array, which contains the swap areas in the system.
I've seen similar bugs reported at:
http://linux-kernel.2935.n7.nabble.com/unused-swap-offset-bad-page-map-td698180.html
however per Hugh's suggestion, this problem described above is due to incorrect setting of pte entry in meremap,
but in our 3.10 this bug has not been introduced yet, so we have to check other clues.
I just did a quick diff of 3.11-rc7/mm against 3.10, and here's a line in mremap which worries me. That set_pte_at() is operating on anything that isn't pte_none(), so the pte_mksoft_dirty() looks prone to corrupt a swap entry.
and also per Linus and Hugh's experience:
> That said, google does find "swap_free: Unused swap offset entry" > reports from over the years. Most of them seem to be single-bit > errors, though (ie when the entry is 00000100 or similar I'm more > inclined to blame a bit error Yes, historically they have usually represented either single-bit errors, or corruption of page tables by other kernel data. The swap subsystem discovers it, but it's rarely an error of swap.
Well, it looks like we have some trouble on finding out the crime culprit.
Let's launch the crash kdump, and we first check current running process
on each cpu, and we found the last panic process is kswapd, so:
crash> bt -a | grep kswap PID: 476 TASK: ffff88085345c440 CPU: 27 COMMAND: "kswapd3" #13 [ffff880852ca9e28] kswapd at ffffffff811572db crash> set 476 PID: 476 COMMAND: "kswapd3" TASK: ffff88085345c440 [THREAD_INFO: ffff880852ca8000] CPU: 27 STATE: TASK_RUNNING (PANIC) crash> bt PID: 476 TASK: ffff88085345c440 CPU: 27 COMMAND: "kswapd3" #0 [ffff880852ca96f0] machine_kexec at ffffffff810411c1 #1 [ffff880852ca9748] crash_kexec at ffffffff810cf2b2 #2 [ffff880852ca9818] oops_end at ffffffff815eae48 #3 [ffff880852ca9840] die at ffffffff8101632b #4 [ffff880852ca9870] do_general_protection at ffffffff815ea74e #5 [ffff880852ca98a0] general_protection at ffffffff815ea068 [exception RIP: down_read_trylock+9] RIP: ffffffff8108a8e9 RSP: ffff880852ca9950 RFLAGS: 00010206 RAX: 0000000000000000 RBX: ffff88204d1d5840 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0600380100804d3c RBP: ffff880852ca9950 R8: ffffffff81944ae0 R9: ffff880852ca9ba0 R10: 00000000000000f0 R11: ffffea006b203a40 R12: ffff88204d1d5841 R13: ffffea0078f117c0 R14: 0600380100804d3c R15: 0000000000000001 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #6 [ffff880852ca9958] page_lock_anon_vma_read at ffffffff811779c5 #7 [ffff880852ca9988] page_referenced at ffffffff81177c47 #8 [ffff880852ca9a00] shrink_page_list at ffffffff811541c4 #9 [ffff880852ca9b38] shrink_inactive_list at ffffffff81154e6a #10 [ffff880852ca9c00] shrink_lruvec at ffffffff81155965 #11 [ffff880852ca9cf8] shrink_zone at ffffffff81155d76 #12 [ffff880852ca9d50] balance_pgdat at ffffffff8115701c #13 [ffff880852ca9e28] kswapd at ffffffff811572db #14 [ffff880852ca9ec8] kthread at ffffffff81085aff #15 [ffff880852ca9f50] ret_from_fork at ffffffff815f296c crash>according to dmesg this is a 'general protection fault',
[4401153.261568] device tap292007ca-7d entered promiscuous mode [4401153.285726] qbr292007ca-7d: port 2(tap292007ca-7d) entered forwarding state [4401153.285754] qbr292007ca-7d: port 2(tap292007ca-7d) entered forwarding state [4401161.284223] general protection fault: 0000 [#1] SMP [4401161.284675] Modules linked in: vhost_net macvtap macvlan tun nf_conntrack_ipv6 nf_defrag_ipv6 xt_mac xt_physdev xt_multiport xt_set xt_conntrack iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw ip_set_hash_ip ip_set nfnetlink veth ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables openvswitch vxlan ip_tunnel gre 8021q garp mrp bridge stp llc bonding sg iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core ixgbe i2c_i801 ses enclosure i2c_core ptp pps_core mdio mei_me lpc_ich dca mei mfd_core shpchp wmi acpi_cpufreq acpi_pad mperf nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sd_mod lpfc crc32c_intel crc_t10dif [4401161.289521] crct10dif_common scsi_transport_fc ahci libahci scsi_tgt libata megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [4401161.290475] CPU: 27 PID: 476 Comm: kswapd3 Tainted: G B -------------- 3.10.0-123.13.1.el7.x86_64 #1 [4401161.291334] Hardware name: To be filled by O.E.M. Tecal RH5885 V3/BC61BLCA, BIOS BLISV301 07/25/2014 [4401161.292217] task: ffff88085345c440 ti: ffff880852ca8000 task.ti: ffff880852ca8000 [4401161.293088] RIP: 0010:[<ffffffff8108a8e9>] [<ffffffff8108a8e9>] down_read_trylock+0x9/0x30 [4401161.294006] RSP: 0018:ffff880852ca9950 EFLAGS: 00010206 [4401161.294462] RAX: 0000000000000000 RBX: ffff88204d1d5840 RCX: 0000000000000000 [4401161.295298] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0600380100804d3c [4401161.296172] RBP: ffff880852ca9950 R08: ffffffff81944ae0 R09: ffff880852ca9ba0 [4401161.297061] R10: 00000000000000f0 R11: ffffea006b203a40 R12: ffff88204d1d5841 [4401161.297949] R13: ffffea0078f117c0 R14: 0600380100804d3c R15: 0000000000000001 [4401161.298841] FS: 0000000000000000(0000) GS:ffff88207fc60000(0000) knlGS:0000000000000000 [4401161.299725] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [4401161.300173] CR2: 000000000154e578 CR3: 00000000018d0000 CR4: 00000000001427e0 [4401161.301057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [4401161.301944] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [4401161.302828] Stack: [4401161.303276] ffff880852ca9980 ffffffff811779c5 ffffea0078f117c0 0000000000000000 [4401161.304165] ffff880852ca9bb8 ffff88204d1d5841 ffff880852ca99f8 ffffffff81177c47 [4401161.305046] 0000000000000297 ffff880852ca9da8 ffff880852ca9ac8 0000000000000000 [4401161.305936] Call Trace: [4401161.306380] [<ffffffff811779c5>] page_lock_anon_vma_read+0x55/0x110 [4401161.306832] [<ffffffff81177c47>] page_referenced+0x1c7/0x350 [4401161.307278] [<ffffffff811541c4>] shrink_page_list+0x4b4/0xab0 [4401161.307724] [<ffffffff81154e6a>] shrink_inactive_list+0x1ea/0x580 [4401161.308159] [<ffffffff81155965>] shrink_lruvec+0x375/0x710 [4401161.308598] [<ffffffff81155d76>] shrink_zone+0x76/0x1a0 [4401161.309039] [<ffffffff8115701c>] balance_pgdat+0x48c/0x5e0 [4401161.309483] [<ffffffff811572db>] kswapd+0x16b/0x430 [4401161.309938] [<ffffffff81086ac0>] ? wake_up_bit+0x30/0x30 [4401161.310382] [<ffffffff81157170>] ? balance_pgdat+0x5e0/0x5e0 [4401161.310849] [<ffffffff81085aff>] kthread+0xcf/0xe0 [4401161.311281] [<ffffffff81085a30>] ? kthread_create_on_node+0x140/0x140 [4401161.311765] [<ffffffff815f296c>] ret_from_fork+0x7c/0xb0 [4401161.312256] [<ffffffff81085a30>] ? kthread_create_on_node+0x140/0x140 [4401161.312728] Code: 00 00 00 ba 01 00 00 00 48 89 de e8 12 fe ff ff eb ce 48 c7 c0 f2 ff ff ff eb c5 e8 02 34 fd ff 66 90 0f 1f 44 00 00 55 48 89 e5 <48> 8b 07 48 89 c2 48 83 c2 01 7e 07 f0 48 0f b1 17 75 f0 48 f7 [4401161.314361] RIP [<ffffffff8108a8e9>] down_read_trylock+0x9/0x30 [4401161.314836] RSP <ffff880852ca9950>
let's dis the function of page_lock_anon_vma_read and we get:
0xffffffff81177970 <page_lock_anon_vma_read>: nopl 0x0(%rax,%rax,1) [FTRACE NOP] 0xffffffff81177975 <page_lock_anon_vma_read+5>: push %rbp 0xffffffff81177976 <page_lock_anon_vma_read+6>: mov %rsp,%rbp 0xffffffff81177979 <page_lock_anon_vma_read+9>: push %r14 0xffffffff8117797b <page_lock_anon_vma_read+11>: push %r13 0xffffffff8117797d <page_lock_anon_vma_read+13>: mov %rdi,%r13 0xffffffff81177980 <page_lock_anon_vma_read+16>: push %r12 0xffffffff81177982 <page_lock_anon_vma_read+18>: push %rbx 0xffffffff81177983 <page_lock_anon_vma_read+19>: mov 0x8(%rdi),%r12 0xffffffff81177987 <page_lock_anon_vma_read+23>: mov %r12,%rax 0xffffffff8117798a <page_lock_anon_vma_read+26>: and $0x3,%eax 0xffffffff8117798d <page_lock_anon_vma_read+29>: cmp $0x1,%rax 0xffffffff81177991 <page_lock_anon_vma_read+33>: je 0xffffffff811779a8 <page_lock_anon_vma_read+56> 0xffffffff81177993 <page_lock_anon_vma_read+35>: xor %ebx,%ebx 0xffffffff81177995 <page_lock_anon_vma_read+37>: mov %rbx,%rax 0xffffffff81177998 <page_lock_anon_vma_read+40>: pop %rbx 0xffffffff81177999 <page_lock_anon_vma_read+41>: pop %r12 0xffffffff8117799b <page_lock_anon_vma_read+43>: pop %r13 0xffffffff8117799d <page_lock_anon_vma_read+45>: pop %r14 0xffffffff8117799f <page_lock_anon_vma_read+47>: pop %rbp 0xffffffff811779a0 <page_lock_anon_vma_read+48>: retq 0xffffffff811779a1 <page_lock_anon_vma_read+49>: nopl 0x0(%rax) 0xffffffff811779a8 <page_lock_anon_vma_read+56>: mov 0x18(%rdi),%eax 0xffffffff811779ab <page_lock_anon_vma_read+59>: test %eax,%eax 0xffffffff811779ad <page_lock_anon_vma_read+61>: js 0xffffffff81177993 <page_lock_anon_vma_read+35> 0xffffffff811779af <page_lock_anon_vma_read+63>: mov -0x1(%r12),%r14 0xffffffff811779b4 <page_lock_anon_vma_read+68>: lea -0x1(%r12),%rbx 0xffffffff811779b9 <page_lock_anon_vma_read+73>: add $0x8,%r14 0xffffffff811779bd <page_lock_anon_vma_read+77>: mov %r14,%rdi 0xffffffff811779c0 <page_lock_anon_vma_read+80>: callq 0xffffffff8108a8e0 <down_read_trylock> 0xffffffff811779c5 <page_lock_anon_vma_read+85>: test %eax,%eax
let's backtrace page_lock_anon_vma_read:
0xffffffff81177983 <page_lock_anon_vma_read+19>: mov 0x8(%rdi),%r12
0xffffffff81177987 <page_lock_anon_vma_read+23>: mov %r12,%rax
anon_mapping = (unsigned long) ACCESS_ONCE(page->mapping);
struct anon_vma *page_lock_anon_vma_read(struct page *page)
struct page { /* First double word block */ unsigned long flags; /* Atomic flags, some possibly * updated asynchronously */ struct address_space *mapping; /* If low bit clear, points toso r12 comes from unsigned long anon_mapping, thus r12-1 is value of anon_vma.
then
0xffffffff811779af <page_lock_anon_vma_read+63>: mov -0x1(%r12),%r14which is
root_anon_vma = ACCESS_ONCE(anon_vma->root);because
1.anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
PAGE_MAPPING_ANON equals 1.
2. we acuires value from r12 pointer.
the r14 is the value of root_anon_vma
then
0xffffffff811779b9 <page_lock_anon_vma_read+73>: add $0x8,%r14
0xffffffff811779bd <page_lock_anon_vma_read+77>: mov %r14,%rdi
struct anon_vma { struct anon_vma *root; /* Root of this anon_vma tree */ struct rw_semaphore rwsem; /* W: modification, R: walking the list */so r14 and rdi is the address of rwsem embedded in root_anon_vma.
and the panic place for EIP:
RIP: 0010:[<ffffffff8108a8e9>] [<ffffffff8108a8e9>] down_read_trylock+0x9/0x30
down_read_trylock(&root_anon_vma->rwsem)</span>
it is trying to access edi,
0xffffffff8108a8e9 <down_read_trylock+9>: mov (%rdi),%rax
static inline int __down_read_trylock(struct rw_semaphore *sem) { long result, tmp; asm volatile("# beginning __down_read_trylock\n\t" " mov %0,%1\n\t" "1:\n\t" " mov %1,%2\n\t" " add %3,%2\n\t" " jle 2f\n\t" LOCK_PREFIX " cmpxchg %2,%0\n\t" " jnz 1b\n\t" "2:\n\t" "# ending __down_read_trylock\n\t" : "+m" (sem->count), "=&a" (result), "=&r" (tmp) : "i" (RWSEM_ACTIVE_READ_BIAS) : "memory", "cc"); return result >= 0 ? 1 : 0; }so the panic IP is 'move %0,%1' thus trying to put sem->count to result, but
sem is invalid, according to register context,we know
edi is 0600380100804d3c, so it means sem is 0600380100804d3c, this is absolutely wrong.
since the r14 comes from r12, let's check r12-1 value , which is the value of
struct anon_vma *root_anon_vma, and r12 is ffff88204d1d5841:
crash> rd ffff88204d1d5840 ffff88204d1d5840: 0600380100804d34 4M...8..After plus 8 it just right equals the bad addr of r14: 0600380100804d3c
WTF? obviously there is memory overlap around ffff88204d1d5841, let's check this struct:
struct anon_vma { root = 0x600380100804d34, rwsem = { count = 1099511654660, wait_lock = { raw_lock = { { head_tail = 0, tickets = { head = 0, tail = 0 } } } }, wait_list = { next = 0x0, prev = 0x996e000000000020 } }, refcount = { counter = 680270649 }, rb_root = { rb_node = 0x84b9568a801bb81c } }which is quite confusing, let's rd this address range:
crash> struct anon_vma -o struct anon_vma { [0] struct anon_vma *root; [8] struct rw_semaphore rwsem; [40] atomic_t refcount; [48] struct rb_root rb_root; } SIZE: 56
crash> rd ffff88204d1d5840 38 ffff88204d1d5840: 0600380100804d34 0000010000006904 4M...8...i...... ffff88204d1d5850: 0000000000000000 0000000000000000 ................ ffff88204d1d5860: 996e000000000020 84b92543288c1b39 .....n.9..(C%.. ffff88204d1d5870: 84b9568a801bb81c 0000568a801bb81c .....V.......V..so anon_vma is corrupted.
Since anon_vma is highly connected with swap, so we check the counter of swap_map again,
let's figure out how swap_map increases.
initial value for swap_map entry is zero, when a process's pages are swaped out
by shrink_list, a spare swap slot must be found out to store the page content.
static unsigned long shrink_page_list(struct list_head *page_list) { spin_lock(&swap_lock); while (!list_empty(page_list)) { //it is a anon page needed to be swaped out if (PageAnon(page) && !PageSwapCache(page)) { //try to find a free swap entry //check each swap area for (type = swap_list.next) { //try to find a free swap entry inside current swap area si = swap_info[type]; spin_lock(&si->lock); spin_unlock(&swap_lock); //this is extremely obscure to find a free page slot. offset = scan_swap_map(si, SWAP_HAS_CACHE); spin_unlock(&si->lock); } SetPageDirty(page); } //update pte with swap_entry try_to_unmap(page); //it is dirty set by previous step pageout(page); } }
Besides swap error, there is also a suspected warning on mm->rss_conter:
[3626234.775485] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:1 val:5 [3626234.775529] BUG: Bad rss-counter state mm:ffff8810528b8640 idx:2 val:-5It is too hard for us to track the root cause because the problem maybe a chain reaction for other
bugs. And according to search result, here is an possible fix for this issue:
http://www.serverphorums.com/read.php?12,950567
which says:
I saw a bunch of similar errors on 3.14.x up to and including 3.14.4,
running Java (Tomcat) and Postgres on Xen PV. Have not seen it since
"mm: use paravirt friendly ops for NUMA hinting ptes" landed in 3.14.5.
402e194dfc5b38d99f9c65b86e2666b29adebf8c in stable,
29c7787075c92ca8af353acd5301481e6f37082f upstream
commit 29c7787075c92ca8af353acd5301481e6f37082f Author: Mel Gorman <mgorman@suse.de> Date: Fri Apr 18 15:07:21 2014 -0700 mm: use paravirt friendly ops for NUMA hinting ptes David Vrabel identified a regression when using automatic NUMA balancing under Xen whereby page table entries were getting corrupted due to the use of native PTE operations. Quoting him Xen PV guest page tables require that their entries use machine addresses if the preset bit (_PAGE_PRESENT) is set, and (for successful migration) non-present PTEs must use pseudo-physical addresses. This is because on migration MFNs in present PTEs are translated to PFNs (canonicalised) so they may be translated back to the new MFN in the destination domain (uncanonicalised). pte_mknonnuma(), pmd_mknonnuma(), pte_mknuma() and pmd_mknuma() set and clear the _PAGE_PRESENT bit using pte_set_flags(), pte_clear_flags(), etc. In a Xen PV guest, these functions must translate MFNs to PFNs when clearing _PAGE_PRESENT and translate PFNs to MFNs when setting _PAGE_PRESENT. His suggested fix converted p[te|md]_[set|clear]_flags to using paravirt-friendly ops but this is overkill. He suggested an alternative of using p[te|md]_modify in the NUMA page table operations but this is does more work than necessary and would require looking up a VMA for protections. This patch modifies the NUMA page table operations to use paravirt friendly operations to set/clear the flags of interest. Unfortunately this will take a performance hit when updating the PTEs on CONFIG_PARAVIRT but I do not see a way around it that does not break Xen.
This bug can break XEN on numa, so it can also break KVM, since it is a general bug.
相关文章推荐
- CocoaPod使用
- Android中Butter Knife库使用
- 从网址截取域名
- 【SPOJ-INVCNT】Inversion Count【树状数组】【逆序对】
- Android中Fragment的生命周期与返回栈的管理
- mysql执行脚本无法插入中文
- mysql分表的3种方法
- Spring Bean配置:注解配置
- hosts+httpd-vhosts.conf,總結
- JVM内存分析
- android中的空指针异常NullpointExcepiton
- IOS之UITabBarController
- 编译android内核源码
- install Apache on CentOS 7
- dialog使用方法(同一页面,调用一个js代码,实现多个不同样式的弹窗)
- 数据库访问性能优化
- 启用始终脱机模式下提供对文件更快的访问权限
- Repeater获取勾选数据,拼接成SQL实现批量删除+存储过程
- 数据库E-commerce管理
- OpenCV求解的矩阵: CvMat与cvSolve函数