
[Knowing the What but Not the Why - 16] page reclaim and hibernation failure

2016-01-05 23:25
This article is motivated by how hibernation computes the set of pages to save when the system suspends to disk. First of all, for hibernation the user can specify an upper bound on the size of the image generated on disk:

cat /sys/power/image_size
1604927488
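(The file is writable, too; for example, echo 0 > /sys/power/image_size asks the kernel to make the image as small as it possibly can.)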
Before generating the image, the kernel's hibernation module first counts the pages currently in use by the system. If that count is smaller than the image_size value in sysfs, fine: all of those in-use pages are written into the image. But if the user asks for a very small image_size, some "compression" is required. It is called compression, but there is no actual page-compaction step; instead, the kernel tries to drop as many pages as possible that do not need to be saved. Which pages can be dropped? In one sentence: the reclaimable ones.

The function hibernate_preallocate_memory computes the number of pages that need saving and allocates the image pages. Here we mainly care about how the page count is computed, since that is where this article's topic, page reclaim, comes in; the details are explained in the annotated (and heavily abridged) code below:

int hibernate_preallocate_memory(void)
{
	// 1. Count the pages currently in use (excluding the buddy
	//    freelists (marked by mark_free_pages), nosave regions, etc.)
	saveable = count_data_pages();

	// 2. Total manageable pages: the saveable pages plus the free pages.
	//    (In the full kernel code this loop also accumulates the image
	//    metadata pages into "size", which step 3 below relies on.)
	count = saveable;
	for_each_populated_zone(zone)
		if (!is_highmem(zone))
			count += zone_page_state(zone, NR_FREE_PAGES);

	// 3. Compute the maximum number of saveable pages to leave in memory.
	max_size = (count - (size + PAGES_FOR_IO)) / 2
			- 2 * DIV_ROUND_UP(reserved_size, PAGE_SIZE);

	// 4. Get the user-specified image size, in pages.
	size = DIV_ROUND_UP(image_size, PAGE_SIZE);

	// 5. If the user is generous, store all the saveable pages.
	if (size >= saveable) {
		pages += preallocate_image_memory(saveable - pages, avail_normal);
		goto out;
	}

	// 6. Otherwise, subtract the reclaimable pages.
	pages = minimum_image_size(saveable);

	// 7. If that is still bigger than the user-defined number, fall
	//    back to the smallest image size actually achievable.
	if (size < pages)
		size = min_t(unsigned long, pages, max_size);

	// 8. Reclaim some pages.
	shrink_all_memory(saveable - size);

	// 9. The core preallocation: by occupying another "alloc" page
	//    frames ourselves, the number of saveable pages left in the
	//    system drops by the same amount.
	alloc = count - max_size;
	pages = preallocate_image_memory(alloc, avail_normal);
	/*
	 * There are max_size saveable pages now, and we want to reduce
	 * this number down to size, so occupy another max_size - size pages.
	 */
	alloc = max_size - size;
	size = preallocate_image_memory(alloc, avail_normal);

	// 10. Only about "size" saveable pages remain in memory now; later
	//     they can be copied into the image pages preallocated above.
	//     Quite a scheme.
	free_unnecessary_pages();
}
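To get a feel for the arithmetic, here is a small stand-alone sketch replaying steps 3, 4, 8 and 9 with made-up numbers. Only the image_size value is taken from the sysfs output at the top; everything else, including dropping the reserved_size term, is an illustrative assumption:

#include <stdio.h>

#define PAGES_FOR_IO		1024UL	/* the kernel's value for 4 KB pages */
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

int main(void)
{
	/* Made-up numbers, all in 4 KB pages: */
	unsigned long saveable = 1000000;	/* pages currently in use */
	unsigned long count    = 2000000;	/* saveable + free pages */
	unsigned long meta     = 5000;		/* image metadata ("size" at step 3) */

	/* Step 3, with the reserved_size term dropped for simplicity: */
	unsigned long max_size = (count - (meta + PAGES_FOR_IO)) / 2;

	/* Step 4, using the image_size value shown at the top: */
	unsigned long size = DIV_ROUND_UP(1604927488UL, 4096UL);

	printf("max_size            = %lu pages\n", max_size);		/* 996988 */
	printf("size                = %lu pages\n", size);		/* 391828 */
	/* size < saveable, so we cannot simply save everything: */
	printf("shrink_all_memory   : %lu pages\n", saveable - size);	/* 608172 */
	/* and step 9's two preallocations occupy: */
	printf("first preallocation : %lu pages\n", count - max_size);	/* 1003012 */
	printf("second preallocation: %lu pages\n", max_size - size);	/* 605160 */
	return 0;
}

After the two preallocations, at most "size" saveable pages (here roughly 1.5 GB worth) are left in memory, which is exactly what the user asked for via image_size.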


Step 6 computes the number of pages in the theoretically smallest achievable image:

static unsigned long minimum_image_size(unsigned long saveable)
{
	unsigned long size;

	size = global_page_state(NR_SLAB_RECLAIMABLE)
		+ global_page_state(NR_ACTIVE_ANON)
		+ global_page_state(NR_INACTIVE_ANON)
		+ global_page_state(NR_ACTIVE_FILE)
		+ global_page_state(NR_INACTIVE_FILE)
		- global_page_state(NR_FILE_MAPPED);

	return saveable <= size ? 0 : saveable - size;
}
The size above is the total number of reclaimable pages; to describe the computation, we can borrow the function's own comment:

* [number of saveable pages] - [number of pages that can be freed in theory]
*
* where the second term is the sum of (1) reclaimable slab pages, (2) active
* and (3) inactive anonymous pages, (4) active and (5) inactive file pages,
* minus mapped file pages.
The question worth asking is: why subtract NR_FILE_MAPPED? First, for the page cache it is better to keep the pages in the image: once the hibernation image has been restored, any page cache that is gone has to be re-read from disk, which is expensive, so we should not try to reclaim page-cache pages. But then why not simply leave NR_FILE_MAPPED out of the sum instead of subtracting it? Is it because NR_FILE_MAPPED is already contained in NR_ACTIVE_FILE or NR_INACTIVE_FILE, and therefore has to be excluded from the reclaimable total? Let us first figure out what NR_ACTIVE_FILE and NR_INACTIVE_FILE actually count. To begin, here are the /proc/meminfo fields that feed into minimum_image_size:

[chenyu@localhost ~]$ cat /proc/meminfo
Active(anon):    2125528 kB NR_ACTIVE_ANON
Inactive(anon):   889316 kB NR_INACTIVE_ANON
Active(file):    2154472 kB NR_ACTIVE_FILE
Inactive(file):  1232632 kB NR_INACTIVE_FILE
Mapped:           271348 kB NR_FILE_MAPPED
SReclaimable:    1154612 kB NR_SLAB_RECLAIMABLE
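Plugging these numbers into minimum_image_size gives size = 1154612 + 2125528 + 889316 + 2154472 + 1232632 - 271348 = 7285212 kB, i.e. 1821303 4-KB pages (roughly 7 GB) that can in theory be freed; the function then returns saveable minus that, or 0 if saveable is no bigger.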


At this point I have to complain a little: grepping the kernel source for NR_ACTIVE_FILE, I could not find any obvious place where the zone counter behind this enum is incremented or decremented. NR_FILE_MAPPED, by contrast, has clearly visible update sites; for instance, the counter is bumped when a page is added to a user's mmap'd virtual address space:

void page_add_file_rmap(struct page *page)
{
	__inc_zone_page_state(page, NR_FILE_MAPPED);
}
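For completeness, the counter comes back down on the unmap side; abridged to the same degree, the counterpart looks like this:

void page_remove_rmap(struct page *page)
{
	/* abridged: only the file-backed branch is shown */
	__dec_zone_page_state(page, NR_FILE_MAPPED);
}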


For NR_ACTIVE_FILE, on the other hand, the update sites are well hidden, so we can simply use git log to trace the history of how NR_ACTIVE_FILE was introduced and see how it ends up affecting the zone's page counts. The actual updates turn out to live in the memory-reclaim path: the counters are adjusted when pages are moved between the active_list and the inactive_list. The result is given directly below, with annotations:

static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
			  struct scan_control *sc, unsigned long *lru_pages)
{
	// 1. Plug the block layer so the bios we are about to submit
	//    get batched rather than sent out one by one.
	blk_start_plug(&plug);
	// 2. Traverse from low (LRU_INACTIVE_ANON = LRU_BASE) to
	//    high (LRU_ACTIVE_FILE).
	for_each_evictable_lru(lru)
		shrink_list(lru, nr_to_scan, lruvec, sc);
	// 3. Unplug, flushing the batched bios.
	blk_finish_plug(&plug);
}

static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
				 struct lruvec *lruvec, struct scan_control *sc)
{
	if (is_active_lru(lru))
		shrink_active_list(nr_to_scan, lruvec, sc, lru);
	else
		shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
}

static void shrink_active_list(unsigned long nr_to_scan,
			       struct lruvec *lruvec,
			       struct scan_control *sc,
			       enum lru_list lru)
{
	// 1. Hot path: scan at most nr_to_scan pages; every page that can
	//    be isolated (nobody else is holding it) counts as valid; the
	//    number taken is returned in nr_taken and the pages themselves
	//    are linked on l_hold.
	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
				     &nr_scanned, sc, isolate_mode, lru);
	// 2. The active-page counter drops by nr_taken.
	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
	// 3. Put the pages on a temporary inactive list.
	while (!list_empty(&l_hold))
		list_add(&page->lru, &l_inactive);
	// 4. Move the temporary inactive pages to lruvec->lists[lru - LRU_ACTIVE].
	//    Here lru is LRU_ACTIVE_ANON or LRU_ACTIVE_FILE; say it is
	//    LRU_ACTIVE_FILE, then lru - LRU_ACTIVE = LRU_BASE + LRU_FILE = 2,
	//    i.e. LRU_INACTIVE_FILE. Internally this does
	//        list_move(&page->lru, &lruvec->lists[lru]);
	//    and since lruvec holds struct list_head lists[NR_LRU_LISTS],
	//    the formerly active pages land on lists[LRU_INACTIVE_FILE];
	//    finally the inactive counter is increased via
	//        __mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
	move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
}
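The index arithmetic in step 4 is easier to follow with the enum from include/linux/mmzone.h in front of you (kernels of this era):

#define LRU_BASE	0
#define LRU_ACTIVE	1
#define LRU_FILE	2

enum lru_list {
	LRU_INACTIVE_ANON = LRU_BASE,				/* 0 */
	LRU_ACTIVE_ANON   = LRU_BASE + LRU_ACTIVE,		/* 1 */
	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,		/* 2 */
	LRU_ACTIVE_FILE   = LRU_BASE + LRU_FILE + LRU_ACTIVE,	/* 3 */
	LRU_UNEVICTABLE,					/* 4 */
	NR_LRU_LISTS
};

So LRU_ACTIVE_FILE - LRU_ACTIVE = 2 = LRU_INACTIVE_FILE, which is exactly the destination list.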

static noinline_for_stack unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
		     struct scan_control *sc, enum lru_list lru)
{
	// 1. Same isolation step as for the active list.
	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
				     &nr_scanned, sc, isolate_mode, lru);
	// 2. Same counter update as for the active list.
	__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
	// 3. Actually reclaim the pages.
	shrink_page_list(&page_list);
}
The flow covered above mainly pulls aged pages off the active_list and adds them to the inactive_list (incrementing NR_INACTIVE_FILE and decrementing NR_ACTIVE_FILE). So when does NR_ACTIVE_FILE increase? According to the page-reclaim chapter of Understanding the Linux Kernel, 3rd Edition, a page is typically marked active via:

 mark_page_accessed

This function is called whenever a page is referenced by a user-space process, by the file-system layer, or by a device driver. Typical call sites include: demand paging of anonymous mappings (do_anonymous_page), demand paging of the page cache (filemap_nopage), demand paging of IPC shared memory (shmem_nopage), reading data from a file (do_generic_file_read), swapping a page back in (do_swap_page), and looking up a block-device buffer in the page cache (__find_get_block). mark_page_accessed is what moves a page from the inactive_list to the active_list.

There are also a few other functions, such as refile_inactive_list, that drag pages from the active_list back to the inactive_list, and so on.
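For reference, here is the core of mark_page_accessed as it looked in kernels of this vintage (abridged): the first access only sets PG_referenced, and a second access while the page is still on the inactive list is what actually promotes it:

void mark_page_accessed(struct page *page)
{
	if (!PageActive(page) && !PageUnevictable(page) &&
	    PageReferenced(page) && PageLRU(page)) {
		/* second access: promote to the active list,
		 * bumping NR_ACTIVE_FILE for file pages */
		activate_page(page);
		ClearPageReferenced(page);
	} else if (!PageReferenced(page)) {
		/* first access: just remember that it was referenced */
		SetPageReferenced(page);
	}
}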

Earlier we saw that NR_FILE_MAPPED changes in page_add_file_rmap, i.e. it grows when user space mmaps a file. So how does NR_FILE_MAPPED relate to NR_ACTIVE_FILE and NR_INACTIVE_FILE? In most cases the former is a subset of the latter two combined; in one failure we investigated, however, Mapped was larger than Inactive(file) + Active(file):
https://bugzilla.kernel.org/show_bug.cgi?id=97201
This failure can easily be reproduced with:

sync; echo 3 > /proc/sys/vm/drop_caches

that is, first flush the dirty pages to disk, then reclaim the clean page cache.
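After running it, compare the Mapped, Active(file), and Inactive(file) lines of /proc/meminfo: Mapped ends up larger than the other two combined, exactly the situation described in the bug report.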