
Linux Kernel Development, 3rd Edition: Reading Notes (6)

2012-12-24 16:01
Chapter 11: Timers and Time Management

1. Frequency of the Timer Interrupt



2. The global variable jiffies holds the number of ticks that have occurred since the system booted.

jiffies is declared in <linux/jiffies.h>:

extern unsigned long volatile jiffies;

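On 32-bit architectures jiffies overlays the low 32 bits of the 64-bit variable jiffies_64; on 64-bit architectures the two refer to the same thing. Most code uses jiffies. Only time-management code uses jiffies_64, which will not overflow in any practical lifetime.

Because the 32-bit jiffies value can wrap around, comparisons should go through these macros, which stay correct across the wrap: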

#define time_after(unknown, known) ((long)(known) - (long)(unknown) < 0)

#define time_before(unknown, known) ((long)(unknown) - (long)(known) < 0)

#define time_after_eq(unknown, known) ((long)(unknown) - (long)(known) >= 0)

#define time_before_eq(unknown, known) ((long)(known) - (long)(unknown) >= 0)
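To see why the macros matter, here is a small userspace sketch (not kernel code) that compares a naive check with a wraparound-safe one. The names ticks_after(), naive_expired(), and safe_expired() are made up for this illustration, and the macro subtracts first and casts afterwards so that the wrap happens in well-defined unsigned arithmetic.

```c
#include <limits.h>

/* Wraparound-safe "a is after b": the unsigned subtraction wraps,
 * and the sign of the result tells us which side is ahead. */
#define ticks_after(a, b) ((long)((b) - (a)) < 0)

/* Naive check: wrong once the timeout has wrapped past zero. */
int naive_expired(unsigned long now, unsigned long timeout)
{
    return now > timeout;
}

/* Safe check, same idea as time_after(jiffies, timeout). */
int safe_expired(unsigned long now, unsigned long timeout)
{
    return ticks_after(now, timeout);
}
```

With now = ULONG_MAX - 5 and a timeout ten ticks later (which wraps to 4), the naive check reports the timeout as already expired while the safe check does not.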

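3. The timer interrupt handler is split into an architecture-dependent part and an architecture-independent part (tick_periodic()).

4. The current time of day (the wall time) is defined in kernel/time/timekeeping.c: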

struct timespec xtime;
The timespec  data structure is defined in <linux/time.h>  as:
struct timespec {
__kernel_time_t tv_sec;      /* seconds */
long tv_nsec;                /* nanoseconds */
};
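xtime.tv_sec holds the number of seconds that have elapsed since January 1, 1970 (UTC), a date known as the epoch.

Reading or writing xtime requires xtime_lock, which is a seqlock rather than a plain spinlock.

Writing xtime: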

write_seqlock(&xtime_lock);

/* update xtime ... */

write_sequnlock(&xtime_lock);

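Reading xtime. The loop repeats until the reader is sure that no write took place while it was reading:

unsigned long seq;
do {
        unsigned long lost;
        seq = read_seqbegin(&xtime_lock);       /* snapshot the sequence counter */
        usec = timer->get_offset();
        lost = jiffies - wall_jiffies;
        if (lost)
                usec += lost * (1000000 / HZ);
        sec = xtime.tv_sec;
        usec += (xtime.tv_nsec / 1000);
} while (read_seqretry(&xtime_lock, seq));      /* retry if a writer intervened */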
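The retry loop is easier to follow with the sequence counter made explicit. The following is a minimal userspace sketch of the seqlock idea using C11 atomics, assuming a single writer (the kernel's write_seqlock() additionally serializes writers with a spinlock). struct seq_time and both functions are invented for this illustration and are not kernel APIs.

```c
#include <stdatomic.h>

struct seq_time {
    atomic_uint seq;    /* even: data stable, odd: write in progress */
    atomic_long sec;
    atomic_long nsec;
};

/* Writer: bump the counter to odd, update, bump it back to even. */
void seq_time_write(struct seq_time *t, long sec, long nsec)
{
    atomic_fetch_add(&t->seq, 1);
    atomic_store(&t->sec, sec);
    atomic_store(&t->nsec, nsec);
    atomic_fetch_add(&t->seq, 1);
}

/* Reader: retry if a write was in progress or completed meanwhile. */
void seq_time_read(struct seq_time *t, long *sec, long *nsec)
{
    unsigned int start;

    do {
        start = atomic_load(&t->seq);
        *sec = atomic_load(&t->sec);
        *nsec = atomic_load(&t->nsec);
    } while ((start & 1u) || atomic_load(&t->seq) != start);
}
```

Readers never block the writer; they only pay for a retry when a write actually overlaps, which suits data that is read often and written rarely.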


User space obtains the wall time through gettimeofday(), which is implemented by sys_gettimeofday() (kernel/time.c):

asmlinkage long sys_gettimeofday(struct timeval *tv, struct timezone *tz)
{
        if (likely(tv)) {
                struct timeval ktv;
                do_gettimeofday(&ktv);
                if (copy_to_user(tv, &ktv, sizeof(ktv)))
                        return -EFAULT;
        }
        if (unlikely(tz)) {
                if (copy_to_user(tz, &sys_tz, sizeof(sys_tz)))
                        return -EFAULT;
        }
        return 0;
}


settimeofday() sets the wall time; the caller must have the CAP_SYS_TIME capability.

5. Timers

Timers are represented by struct timer_list, defined in <linux/timer.h>:

struct timer_list {
struct list_head entry;           /* entry in linked list of timers */
unsigned long expires;            /* expiration value, in jiffies */
void (*function)(unsigned long);  /* the timer handler function */
unsigned long data;               /* lone argument to the handler */
struct tvec_t_base_s *base;       /* internal timer field, do not touch */
};


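The first step in creating a timer is to define it:

struct timer_list my_timer;

Initialize the timer's internal fields:

init_timer(&my_timer);

Then fill in the remaining fields as needed:

my_timer.expires = jiffies + delay; /* timer expires in delay ticks */

my_timer.data = 0; /* zero is passed to the timer handler */

my_timer.function = my_timer_function; /* function to run when timer expires */

The handler that runs when the timer expires must match this prototype:

void my_timer_function(unsigned long data);

Finally, activate the timer: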

add_timer(&my_timer);

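mod_timer() changes the expiration of a timer. It also works on a timer that has been initialized but not yet activated; either way, the timer is active when mod_timer() returns.

To cancel a timer before it expires:

del_timer(&my_timer);

To cancel the timer and wait for a handler running on another processor to finish:

del_timer_sync(&my_timer); /* unlike del_timer(), cannot be used in interrupt context */

The kernel runs timers in bottom-half context, as a softirq, after the timer interrupt completes. The timer interrupt handler calls update_process_times(), which calls run_local_timers():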

void run_local_timers(void)
{
        hrtimer_run_queues();
        raise_softirq(TIMER_SOFTIRQ);   /* raise the timer softirq */
        softlockup_tick();
}


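The TIMER_SOFTIRQ softirq is handled by run_timer_softirq(), which runs every expired timer on the current processor. Timers are stored in linked lists, and to keep lookups cheap the kernel partitions them into five groups according to their expiration value.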
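As a rough illustration of the five-group idea, the sketch below picks a group from the number of ticks left until expiry. It assumes the classic 2.6-era layout of one group with 256 slots followed by four groups of 64 slots each; the constant names mirror that layout, but timer_group() is a made-up helper and the real kernel code does more (it also picks a slot inside the group and cascades timers downward as time advances).

```c
#define TVR_BITS 8   /* assumed: first group covers 2^8 ticks */
#define TVN_BITS 6   /* assumed: each later group adds 6 more bits */

/* Return the group (1..5) for a timer that expires in ticks_left ticks. */
int timer_group(unsigned long ticks_left)
{
    if (ticks_left < (1UL << TVR_BITS))
        return 1;
    if (ticks_left < (1UL << (TVR_BITS + TVN_BITS)))
        return 2;
    if (ticks_left < (1UL << (TVR_BITS + 2 * TVN_BITS)))
        return 3;
    if (ticks_left < (1UL << (TVR_BITS + 3 * TVN_BITS)))
        return 4;
    return 5;
}
```

Timers that are about to fire sit in the fine-grained first group, so the softirq only has to look at one slot per tick; far-off timers are kept in coarse groups and sorted more precisely only as they get closer.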

6. Delaying Execution

Busy looping:

unsigned long timeout = jiffies + 10;        /* ten ticks */
while (time_before(jiffies, timeout))
        ;

This spins the processor doing nothing. The following variant lets other processes run while waiting:

unsigned long delay = jiffies + 5*HZ;
while (time_before(jiffies, delay))
        cond_resched();


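Delays that do not depend on jiffies can be much shorter than one tick:

void udelay(unsigned long usecs)

void ndelay(unsigned long nsecs)

void mdelay(unsigned long msecs)

udelay() is implemented as a busy loop; the iteration count is derived from the BogoMIPS value measured at boot.

A better way to delay:

schedule_timeout() puts the task to sleep until at least the requested time has elapsed, then wakes it. Usage:

/* set task's state to interruptible sleep */
set_current_state(TASK_INTERRUPTIBLE);
/* take a nap and wake up in "s" seconds */
schedule_timeout(s * HZ);


Code that waits for an event but also wants a time limit can call schedule_timeout() instead of schedule().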
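Since schedule_timeout() and the jiffies-based loops all take a tick count, the wall-clock duration has to be converted first. Below is a small userspace sketch of that arithmetic, assuming HZ is 100 for the example; both helper names are hypothetical (the kernel has its own conversion helpers).

```c
#define HZ 100   /* assumed tick rate for this sketch */

/* Whole seconds to ticks, as in schedule_timeout(s * HZ). */
unsigned long secs_to_ticks(unsigned long secs)
{
    return secs * HZ;
}

/* Milliseconds to ticks, rounded up so the wait is never too short. */
unsigned long msecs_to_ticks(unsigned long msecs)
{
    return (msecs * HZ + 999) / 1000;
}
```

At HZ = 100 a tick lasts 10 ms, so a 15 ms request rounds up to two ticks. This granularity is why sub-tick delays need udelay() and friends instead.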

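Chapter 12: Memory Management

1. The page is the smallest unit of virtual memory. Most 32-bit architectures use 4KB pages; the book cites 8KB pages for most 64-bit architectures, although x86-64 uses 4KB base pages.

2. Each physical page is represented by a struct page, defined in <linux/mm_types.h>. A simplified definition: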

struct page {
unsigned long         flags;
atomic_t              _count;
atomic_t              _mapcount;
unsigned long         private;
struct address_space  *mapping;
pgoff_t               index;
struct list_head      lru;
void                  *virtual;
};


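flags: stores the status of the page. Each bit represents one status, so at least 32 flags are available; the values are defined in <linux/page-flags.h>.

_count: the usage count of the page, that is, how many references to it exist. When it reaches -1 nobody is using the page, and it becomes available for a new allocation. Kernel code should call page_count() rather than read this field directly; page_count() returns 0 when the page is free and a positive value when it is in use. A page may be used by the page cache (mapping points to the associated address_space object), as private data (pointed to by private), or as a mapping in a process's page table.

virtual: the page's virtual address. For high memory that is not permanently mapped, this field is NULL.

struct page is associated with physical pages, not virtual pages. It describes physical memory, not the data stored in it.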

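3. The kernel divides pages into zones. Linux has four primary memory zones (defined in <linux/mmzone.h>):

ZONE_DMA: pages that can undergo DMA.

ZONE_DMA32: pages that can undergo DMA, but are accessible only by 32-bit devices.

ZONE_NORMAL: normal, regularly mapped pages.

ZONE_HIGHMEM: high memory, that is, pages not permanently mapped into the kernel's address space.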
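Zone boundaries are architecture-specific. As a concrete case, the sketch below assumes the classic 32-bit x86 layout (DMA below 16MB, normal memory from 16MB up to 896MB, high memory above that) and classifies a physical address accordingly. The enum and function names are invented for this userspace illustration.

```c
#define MB (1024ULL * 1024ULL)

enum sketch_zone { SK_ZONE_DMA, SK_ZONE_NORMAL, SK_ZONE_HIGHMEM };

/* Classify a physical address under the assumed x86-32 layout. */
enum sketch_zone zone_for_phys_x86_32(unsigned long long phys)
{
    if (phys < 16 * MB)
        return SK_ZONE_DMA;       /* reachable by ISA DMA devices */
    if (phys < 896 * MB)
        return SK_ZONE_NORMAL;    /* permanently mapped by the kernel */
    return SK_ZONE_HIGHMEM;       /* mapped only on demand */
}
```

On architectures where every page can be permanently mapped and every device can reach all of memory, some of these zones are simply empty.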



Each zone is represented by struct zone (<linux/mmzone.h>):

struct zone {
unsigned long            watermark[NR_WMARK];
unsigned long            lowmem_reserve[MAX_NR_ZONES];
struct per_cpu_pageset   pageset[NR_CPUS];
spinlock_t               lock;
struct free_area         free_area[MAX_ORDER];
spinlock_t               lru_lock;
struct zone_lru {
struct list_head list;
unsigned long    nr_saved_scan;
} lru[NR_LRU_LISTS];
struct zone_reclaim_stat reclaim_stat;
unsigned long            pages_scanned;
unsigned long            flags;
atomic_long_t            vm_stat[NR_VM_ZONE_STAT_ITEMS];
int                      prev_priority;
unsigned int             inactive_ratio;
wait_queue_head_t        *wait_table;
unsigned long            wait_table_hash_nr_entries;
unsigned long            wait_table_bits;
struct pglist_data       *zone_pgdat;
unsigned long            zone_start_pfn;
unsigned long            spanned_pages;
unsigned long            present_pages;
const char               *name;
};
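lock: a spinlock that protects the structure from concurrent access. It protects only the structure itself, not the pages that reside in the zone.

watermark: holds the minimum, low, and high watermarks for the zone, which the kernel uses to judge how much of the zone's memory has been consumed.

name: the name of the zone, initialized in mm/page_alloc.c. The three zones are named DMA, Normal, and HighMem.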

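4. Allocating pages (<linux/gfp.h>):

struct page * alloc_pages(gfp_t gfp_mask, unsigned int order)

This function allocates 2^order (that is, 1 << order) physically contiguous pages and returns a pointer to the struct page of the first one, or NULL on failure.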
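Because the allocator works in powers of two, a byte count has to be rounded up to an order before calling it. The following userspace sketch shows that calculation, assuming 4KB pages; order_for_size() is a hypothetical helper written for this note (the kernel provides its own get_order()).

```c
#define PAGE_SHIFT 12                  /* assumed: 4KB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Smallest order such that (1 << order) pages hold at least size bytes. */
unsigned int order_for_size(unsigned long size)
{
    unsigned long pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
    unsigned int order = 0;

    while ((1UL << order) < pages)
        order++;
    return order;
}
```

A request for five pages therefore ends up as order 3, meaning eight pages are actually handed out. That rounding is one reason to prefer byte-sized allocators for small or oddly sized requests.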

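To convert a page to its logical address:

void * page_address(struct page *page)

If you do not need the struct page, __get_free_pages() works like alloc_pages() but returns the logical address of the first page directly:

unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)

If you need only a single page:

struct page * alloc_page(gfp_t gfp_mask)

unsigned long __get_free_page(gfp_t gfp_mask)

To get a single page filled with zeros:

unsigned long get_zeroed_page(unsigned int gfp_mask)

Freeing pages: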

void __free_pages(struct page *page, unsigned int order)

void free_pages(unsigned long addr, unsigned int order)

void free_page(unsigned long addr)

Example of allocating pages:

unsigned long page;
page = __get_free_pages(GFP_KERNEL, 3);
if (!page) {
        /* insufficient memory: you must handle this error! */
        return -ENOMEM;
}
/* 'page' is now the address of the first of eight contiguous pages ... */

Example of freeing them:

free_pages(page, 3);
/*
 * our pages are now freed and we should no
 * longer access the address stored in 'page'
 */


5. kmalloc()

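kmalloc() allocates kernel memory in byte-sized chunks. It is declared in <linux/slab.h> and returns a pointer to the allocated memory on success, or NULL on failure.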

void * kmalloc(size_t size, gfp_t flags);

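Example:

struct dog *p;
p = kmalloc(sizeof(struct dog), GFP_KERNEL);
if (!p)
        /* handle error ... */

The flags argument has type gfp_t, defined in <linux/types.h>. The flags control how memory is allocated and fall into three categories: action modifiers, zone modifiers, and type flags.

Action modifiers: specify how the kernel is supposed to allocate the requested memory, for example whether it may sleep.

Zone modifiers: specify which zone the memory should be allocated from.

Type flags: predefined combinations of the two kinds above. These are what callers should normally use.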
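The relationship between modifiers and type flags is plain bit-masking. The sketch below mimics how two common type flags are composed in 2.6-era kernels (GFP_KERNEL as "may wait, may do I/O, may call into the filesystem", GFP_ATOMIC as "high priority, must not sleep"). All SK_ names and the bit values are hypothetical stand-ins for this userspace illustration, not the kernel's definitions.

```c
/* Hypothetical action-modifier bits, chosen only for this sketch. */
#define SK_GFP_WAIT 0x10u   /* allocator may sleep */
#define SK_GFP_HIGH 0x20u   /* may dip into emergency reserves */
#define SK_GFP_IO   0x40u   /* may start disk I/O */
#define SK_GFP_FS   0x80u   /* may start filesystem I/O */

/* Type flags are ready-made combinations of modifiers. */
#define SK_GFP_ATOMIC (SK_GFP_HIGH)
#define SK_GFP_KERNEL (SK_GFP_WAIT | SK_GFP_IO | SK_GFP_FS)

/* An allocation may block only if the "wait" modifier is set. */
int sk_may_sleep(unsigned int flags)
{
    return (flags & SK_GFP_WAIT) != 0;
}
```

This is why the atomic flavour is the one to use from interrupt handlers and other contexts that cannot sleep, while the normal flavour is the default choice in process context.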









kfree(), declared in <linux/slab.h>, frees memory previously allocated with kmalloc():

void kfree(const void *ptr)

Example:

char *buf;
buf = kmalloc(BUF_SIZE, GFP_ATOMIC);
if (!buf)
        /* error allocating memory ! */
/* ... use buf ... */
kfree(buf);


6. vmalloc() (declared in <linux/vmalloc.h>, defined in mm/vmalloc.c)

void * vmalloc(unsigned long size)

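vmalloc() works like kmalloc(), except that the memory it returns is contiguous only in virtual address space, not necessarily in physical memory. Most kernel code uses kmalloc(), which is cheaper: vmalloc() has to set up page table entries to map the physically scattered pages into a contiguous virtual range.

vmalloc() may sleep, so it cannot be called from interrupt context or anywhere else blocking is not allowed.

To free the memory: void vfree(const void *addr)

Example:

char *buf;
buf = vmalloc(16 * PAGE_SIZE); /* get 16 pages */
if (!buf)
        /* error! failed to allocate memory */
/*
 * buf now points to at least a 16*PAGE_SIZE bytes
 * of virtually contiguous block of memory
 */

After you finish with the memory, make sure to free it:

vfree(buf);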


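7. The Slab Layer

The slab layer is a generic data-structure caching layer. Caching frequently used data structures avoids memory fragmentation and improves performance when objects are allocated and freed often.

The slab layer divides objects into groups called caches, one cache per object type. Each cache is then divided into slabs; a slab consists of one or more physically contiguous pages, and a cache typically contains several slabs.

Each slab is in one of three states: full, partial, or empty.



Each cache is represented by a kmem_cache structure, which holds three lists: slabs_full, slabs_partial, and slabs_empty.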
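The three states follow directly from how many objects in a slab are in use. Here is a tiny userspace sketch of that classification; the enum and function are illustrative names, not kernel definitions.

```c
enum sketch_slab_state { SK_SLAB_EMPTY, SK_SLAB_PARTIAL, SK_SLAB_FULL };

/* Classify a slab by how many of its objects are currently allocated. */
enum sketch_slab_state slab_state_of(unsigned int inuse, unsigned int capacity)
{
    if (inuse == 0)
        return SK_SLAB_EMPTY;     /* every object is free */
    if (inuse >= capacity)
        return SK_SLAB_FULL;      /* no free objects left */
    return SK_SLAB_PARTIAL;       /* some allocated, some free */
}
```

An allocation is served from a partial slab if one exists, then from an empty slab, and only when neither is available does the cache grow by a new slab.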

The slab descriptor:

struct slab {
struct list_head  list;       /* full, partial, or empty list */
unsigned long     colouroff;  /* offset for the slab coloring */
void              *s_mem;     /* first object in the slab */
unsigned int      inuse;      /* allocated objects in the slab */
kmem_bufctl_t     free;       /* first free object, if any */
};
The slab layer allocates a new slab with kmem_getpages(), which calls __get_free_pages():

static void *kmem_getpages(struct kmem_cache *cachep, gfp_t flags, int nodeid)

kmem_freepages() releases the pages again.

A new cache is created with:

struct kmem_cache * kmem_cache_create(const char *name,
                                      size_t size,
                                      size_t align,
                                      unsigned long flags,
                                      void (*ctor)(void *));
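On success it returns a pointer to the new cache; on failure it returns NULL. It must not be called from interrupt context because it can sleep.

To destroy a cache:

int kmem_cache_destroy(struct kmem_cache *cachep);

It returns 0 on success and a nonzero value on failure.

Once a cache exists, objects are obtained from it with: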

void * kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags);
Example:

A global pointer refers to the task_struct cache:

struct kmem_cache *task_struct_cachep;

The cache is created in fork_init():

task_struct_cachep = kmem_cache_create("task_struct",
                                       sizeof(struct task_struct),
                                       ARCH_MIN_TASKALIGN,
                                       SLAB_PANIC | SLAB_NOTRACK,
                                       NULL);

When a process calls fork(), a new process descriptor must be created. This happens in dup_task_struct(), which is called from do_fork():

struct task_struct *tsk;
tsk = kmem_cache_alloc(task_struct_cachep, GFP_KERNEL);
if (!tsk)
        return NULL;
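After a task dies, if it has no children waiting on it, its process descriptor is freed and returned to the task_struct_cachep slab cache. This is done in free_task_struct(), which calls:

kmem_cache_free(task_struct_cachep, tsk);

Because process descriptors are part of the core kernel and are always needed, the task_struct_cachep cache is never destroyed.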
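The create/alloc/free cycle can be imitated in user space with a simple free list. The sketch below is only an analogy under strong simplifications (no slabs, no constructors, no locking, malloc() as the backing allocator), and every name in it is made up for this note. What it does show is the core benefit: a freed object is recycled instead of going back to the general allocator.

```c
#include <stdlib.h>

struct obj_cache {
    size_t obj_size;
    void *free_list;   /* freed objects, linked through their first word */
};

void obj_cache_init(struct obj_cache *c, size_t obj_size)
{
    /* Each object must be able to hold the free-list link. */
    c->obj_size = obj_size < sizeof(void *) ? sizeof(void *) : obj_size;
    c->free_list = NULL;
}

void *obj_cache_alloc(struct obj_cache *c)
{
    void *obj = c->free_list;

    if (obj) {                       /* reuse a cached object */
        c->free_list = *(void **)obj;
        return obj;
    }
    return malloc(c->obj_size);      /* cache empty: grow */
}

void obj_cache_free(struct obj_cache *c, void *obj)
{
    *(void **)obj = c->free_list;    /* push onto the free list */
    c->free_list = obj;
}

void obj_cache_destroy(struct obj_cache *c)
{
    while (c->free_list) {
        void *obj = c->free_list;
        c->free_list = *(void **)obj;
        free(obj);
    }
}
```

Allocating, freeing, and allocating again returns the same address the second time, which is the reuse the slab layer provides for hot object types such as process descriptors.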

8. Statically Allocating on the Stack

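Static allocations on the stack must stay small, because the kernel stack is small and fixed in size: historically two pages per process, which the book gives as 8KB on 32-bit architectures and 16KB on 64-bit architectures.

A compile-time option selects single-page kernel stacks. In that configuration interrupt handlers run on their own separate stack rather than on the stack of the interrupted process.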

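9. High Memory Mappings

To map a high-memory page permanently into the kernel's address space (<linux/highmem.h>):

void *kmap(struct page *page);

The function works on both high and low memory. It may sleep, so it is usable only in process context. Because the number of permanent mappings is limited, unmap the page when it is no longer needed:

void kunmap(struct page *page);

Temporary mappings:

These can be used where sleeping is not allowed, such as in interrupt handlers.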

void *kmap_atomic(struct page *page, enum km_type type);

void kunmap_atomic(void *kvaddr, enum km_type type);

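10. The New percpu Interface

The 2.6 kernel introduced a new per-CPU interface. It is declared in <linux/percpu.h>; the definitions live there, in mm/slab.c, and in <asm/percpu.h>.

To define a per-CPU variable at compile time:

DEFINE_PER_CPU(type, name);

To refer to a per-CPU variable declared elsewhere:

DECLARE_PER_CPU(type, name);

Use get_cpu_var() and put_cpu_var() to access the current processor's copy; get_cpu_var() disables kernel preemption and put_cpu_var() re-enables it: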

get_cpu_var(name)++; /* increment name on this processor */

put_cpu_var(name); /* done; enable kernel preemption */

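To access another processor's copy (note that per_cpu() neither disables preemption nor provides any locking):

per_cpu(name, cpu)++; /* increment name on the given processor */

Per-CPU data allocated dynamically at runtime: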

void *alloc_percpu(type); /* a macro */

void *__alloc_percpu(size_t size, size_t align);

void free_percpu(const void *);

get_cpu_var(ptr); /* return a void pointer to this processor's copy of ptr */

put_cpu_var(ptr); /* done; enable kernel preemption */

Example:

void *percpu_ptr;
unsigned long *foo;
percpu_ptr = alloc_percpu(unsigned long);
if (!percpu_ptr)
        /* error allocating memory .. */
foo = get_cpu_var(percpu_ptr);
/* manipulate foo .. */
put_cpu_var(percpu_ptr);
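Conceptually, per-CPU data is an array with one slot per processor, where each processor touches only its own slot. The userspace sketch below models that with a plain array and an explicit cpu argument; SK_NR_CPUS, hit_on(), and total_hits() are hypothetical names, and real per-CPU code would obtain the CPU implicitly and disable preemption around the access.

```c
#define SK_NR_CPUS 4   /* assumed processor count for this sketch */

/* One counter per processor; no lock is needed as long as each
 * processor only ever modifies its own slot. */
unsigned long hits[SK_NR_CPUS];

/* Count an event on the given processor. */
void hit_on(unsigned int cpu)
{
    hits[cpu]++;
}

/* Readers that want a global figure sum all the slots. */
unsigned long total_hits(void)
{
    unsigned long sum = 0;
    unsigned int cpu;

    for (cpu = 0; cpu < SK_NR_CPUS; cpu++)
        sum += hits[cpu];
    return sum;
}
```

The design trades a slightly more expensive read (a sum over all slots) for updates that need no locking and cause no cache-line bouncing between processors.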