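Out of the Chaos: Notes on Kernel Boot

2016-09-23 00:02

First, a big shout-out to this site: https://www.gitbook.com/book/0xax/linux-insides/details

It is excellent. I bow to it, three times over.

The bootloader loads the kernel into memory

I am not concerned with what happens before the bootloader; what interests me is how the kernel gets loaded into memory and everything after that. It was from that book that I learned kernel loading follows a protocol of its own, which floored me. The original text is in the boot protocol document, for anyone who wants the details.

First, here is how the document puts it.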

When using bzImage, the protected-mode kernel was relocated to 0x100000 ("high memory"), and the kernel real-mode block (boot sector, setup, and stack/heap) was made relocatable to any address between 0x10000 and end of low memory. Unfortunately, in protocols 2.00 and 2.01 the 0x90000+ memory range is still used internally by the kernel; the 2.02 protocol resolves that problem.

Now look at the diagram it provides.

        ~                        ~
        |  Protected-mode kernel |
100000  +------------------------+
        |  I/O memory hole       |
0A0000  +------------------------+
        |  Reserved for BIOS     |  Leave as much as possible unused
        ~                        ~
        |  Command line          |  (Can also be below the X+10000 mark)
X+10000 +------------------------+
        |  Stack/heap            |  For use by the kernel real-mode code.
X+08000 +------------------------+
        |  Kernel setup          |  The kernel real-mode code.
        |  Kernel boot sector    |  The kernel legacy boot sector.
X       +------------------------+
        |  Boot loader           |  <- Boot sector entry point 0000:7C00
001000  +------------------------+
        |  Reserved for MBR/BIOS |
000800  +------------------------+
        |  Typically used by MBR |
000600  +------------------------+
        |  BIOS use only         |
000000  +------------------------+

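Seeing this nearly moved me to tears. Look how dedicated these engineers are: they even drew the picture for you.

Let me add a few words. The parts of this diagram that we (well, I) care about are

one, Kernel boot sector + Kernel setup

and two, Protected-mode kernel

Remember what we ran into when looking at how bzImage is built: bzImage consists of two parts

setup.bin

vmlinux.bin

They map one to one. The Kernel boot sector is the first 512 bytes of setup.bin.

Ah, the world suddenly got a little quieter...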

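Jump, jump, jump: the details I never sorted out in all these years

After a month or two of work, I summarized everything in this one diagram. I hope you like it.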

              +------------------------+
              |                        |
              |  *relocated            |
              |                        |
              |                        |
              |  Protected-mode kernel |
              |                        |
              |                        |
1000000 + off |                        |
              +........................+  5th jump: after compressed kernel relocated
              |                        |
              |                        |
              |                        |  9th jump: to start_kernel()
              |                        |
              |                        |  8th jump: to the c code world
              |                        |
              |                        |  7th jump: use virtual address
              |                        |
              |                        |
              |                        |
1000000 (16M) +------------------------+  6th jump: to startup_64 in arch/x86/kernel/head_64.S
              |                        |
              ~                        ~
              |                        |
              +------------------------+
              |  page table (24k size) |
              |                        |
              +........................+
              |  stack (4k or 16k size)|
              +........................+
              |  heap  (64k or 4M size)|
              +........................+
              |                        |
              |                        |
              |                        |
              |  Protected-mode kernel |
              |                        |
              |                        |
100200        |  *startup_64           |   4th jump: from 32bit protected mode to long mode
              |                        |
              |  *startup_32           |
100000  (1M)  +------------------------+   3rd jump: from header.S to head_64.S
              |                        |
              ~                        ~
              |                        |
X+10000       +------------------------+
              |  Stack/heap            |
X+08000       +------------------------+
              |  Kernel setup          |   2nd jump: real mode to protected mode
X+512         +------------------------+   1st jump: boot loader to header.S
              |  Kernel boot sector    |
X             +------------------------+
              |  Boot loader           |
001000  (4K)  +------------------------+
              |                        |
              ~                        ~
              |                        |
000000        +------------------------+

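People climb upward and water flows downward. Apparently code climbs upward too.

Jump 1: from the bootloader to setup

Once the kernel has been loaded from disk into memory, it is time for the first jump into the kernel we compiled ourselves. So where does that first jump land?

I have used the kernel for years, built it for years, even written a fair number of patches, and I did not know! The answer is, once again, in the boot protocol.

Here is the key passage; the rest is left for you to explore.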

The kernel is started by jumping to the kernel entry point, which is located at segment offset 0x20 from the start of the real mode kernel. This means that if you loaded your real-mode kernel code at 0x90000, the kernel entry point is 9020:0000.

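A segment offset of 0x20 is 0x20 * 16 = 0x200 bytes, so execution starts at offset 0x200 = 512 in setup.bin. A quick look at the code:

"arch/x86/boot/header.S"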

# offset 512, entry point

.globl  _start
_start:

Got it?
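The entry-point rule quoted above can be checked with a few lines of arithmetic. This is only a sketch that assumes nothing beyond the rule in the boot protocol; the function name entry_point is made up for illustration.

```python
def entry_point(load_addr):
    """Return (segment, offset, linear) of the real-mode kernel entry point.

    load_addr is the physical address where the boot loader placed the
    real-mode block; it must be 16-byte aligned to serve as a segment base.
    """
    if load_addr % 16 != 0:
        raise ValueError("real-mode load address must be paragraph aligned")
    segment = (load_addr >> 4) + 0x20   # "segment offset 0x20"
    offset = 0
    linear = (segment << 4) + offset    # real-mode address translation
    return segment, offset, linear


seg, off, lin = entry_point(0x90000)
print(f"{seg:04X}:{off:04X} -> {lin:#x}")
```

For a load address of 0x90000 the linear address is 0x200 bytes past the start of the block, which is exactly where _start sits in setup.bin.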

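Jump 2: entering protected mode

We are now happily inside the kernel's world. At this point, however, the CPU is still running in real mode, and the kernel has to make some preparations before entering protected mode. You can think of protected mode as simply a more capable mode of the CPU.

After _start in arch/x86/boot/header.S, a little simple initialization runs and then control reaches main(). This function is defined in arch/x86/boot/main.c and looks like this: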

"arch/x86/boot/main.c"

void main(void)
{
/* First, copy the boot header into the "zeropage" */
copy_boot_params();

...

/* Do the last things and invoke protected mode */
go_to_protected_mode();
}

Leaving out the irrelevant parts, the function we care about most here is go_to_protected_mode().

"arch/x86/boot/pm.c"

void go_to_protected_mode(void)
{
/* Hook before leaving real mode, also disables interrupts */
realmode_switch_hook();

/* Enable the A20 gate */
if (enable_a20()) {
puts("A20 gate not responding, unable to boot...\n");
die();
}

/* Reset coprocessor (IGNNE#) */
reset_coprocessor();

/* Mask all interrupts in the PIC */
mask_all_interrupts();

/* Actual transition to protected mode... */
setup_idt();
setup_gdt();
protected_mode_jump(boot_params.hdr.code32_start,
(u32)&boot_params + (ds() << 4));
}

This one is long too; what we care about most is, again, the last call.

"arch/x86/boot/pmjump.S"

GLOBAL(protected_mode_jump)
movl    %edx, %esi      # Pointer to boot_params table

xorl    %ebx, %ebx      # ebx = 0
movw    %cs, %bx        # bx = cs
shll    $4, %ebx        # ebx = cs << 4, the linear base of the setup code
addl    %ebx, 2f        # *(2f) = (cs << 4) + *(2f)
# Magic (神)!
jmp 1f                  # Short jump to serialize on 386/486
1:

movw    $__BOOT_DS, %cx # cx = selector for the new data segments
movw    $__BOOT_TSS, %di# di = TSS selector

movl    %cr0, %edx
orb $X86_CR0_PE, %dl    # Protected mode
movl    %edx, %cr0      # cr0 = cr0 | X86_CR0_PE;

# Transition to 32-bit mode
.byte   0x66, 0xea      # ljmpl opcode
2:  .long   in_pm32         # offset
.word   __BOOT_CS       # segment
ENDPROC(protected_mode_jump)

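Besides the commented instructions, two points deserve emphasis.

1. The final far jump transfers control to __BOOT_CS : in_pm32. It also loads the cs segment register, and that is the moment protected-mode addressing truly takes effect.

2. The line I marked as magic ("神"). It patches the value stored at 2f at run time. At build time that value is only an offset relative to the start of the setup code. In setup_gdt(), the __BOOT_CS segment is set up to cover 4G of physical memory starting at 0, but where in memory the bootloader will place setup.bin is not known in advance. So the code adds its own "offset within the segment" to the "current segment base" (cs << 4) to compute the "actual address" and jumps to the right place in memory.

Brilliant, simply brilliant.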
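The run-time patch of the far-jump operand boils down to one addition. A minimal sketch follows; the segment value and the offset of in_pm32 are hypothetical numbers chosen for illustration, and patch_far_jump is not a kernel function.

```python
def patch_far_jump(link_time_offset, cs):
    """Mimic `addl %ebx, 2f`, where ebx = cs << 4.

    link_time_offset is the value the assembler stored at 2: (the offset of
    in_pm32 inside the setup code); the result is the flat 32-bit address
    that the ljmpl will use with the base-0 __BOOT_CS segment.
    """
    return (link_time_offset + (cs << 4)) & 0xFFFFFFFF


# Hypothetical: setup loaded at real-mode segment 0x1000, in_pm32 at 0x3a00.
target = patch_far_jump(0x3A00, 0x1000)
print(hex(target))
```

The same code loaded at a different segment yields a different target, which is the whole point of patching at run time rather than at link time.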

Jump 3: into the 32-bit protected-mode kernel, startup_32

OK, this jump goes from setup.bin to compressed/vmlinux.bin.

It follows directly from the previous jump.

"arch/x86/boot/pmjump.S"

.code32
.section ".text32","ax"
GLOBAL(in_pm32)
# Set up data segments for flat 32-bit mode
movl    %ecx, %ds
movl    %ecx, %es
movl    %ecx, %fs
movl    %ecx, %gs
movl    %ecx, %ss
# The 32-bit code sets up its own stack, but this way we do have
# a valid stack if some debugging hack wants to use it.
addl    %ebx, %esp

# Set up TR to make Intel VT happy
ltr %di

# Clear registers to allow for future extensions to the
# 32-bit boot protocol
xorl    %ecx, %ecx
xorl    %edx, %edx
xorl    %ebx, %ebx
xorl    %ebp, %ebp
xorl    %edi, %edi

# Set up LDTR to make Intel VT happy
lldt    %cx

jmpl    *%eax           # Jump to the 32-bit entrypoint
ENDPROC(in_pm32)

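That is long, but I only care about the last instruction.

First, what exactly is in eax? The setup code is built with the register calling convention (-mregparm=3), so the first argument arrives in eax and the second in edx, which is why protected_mode_jump begins with movl %edx, %esi. eax has not been modified since protected_mode_jump() was entered, so it still holds the function's first argument.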

"arch/x86/boot/pm.c"

protected_mode_jump(boot_params.hdr.code32_start,
(u32)&boot_params + (ds() << 4));

This code32_start is in fact copied over from header.S.

"arch/x86/boot/header.S"

code32_start:               # here loaders can put a different
# start address for 32-bit code.
.long   0x100000    # 0x100000 = default for big kernel

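Wow, it could hardly be more explicit. Now recall where the bootloader put the protected-mode kernel: at physical address 0x100000. I am floored.

So let's see what sits at 0x100000, that is, which code is at the very start of vmlinux.bin.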

"arch/x86/boot/compressed/vmlinux.lds.S"

#ifdef CONFIG_X86_64
OUTPUT_ARCH(i386:x86-64)
ENTRY(startup_64)
#else
OUTPUT_ARCH(i386)
ENTRY(startup_32)
#endif

SECTIONS
{
/* Be careful parts of head_64.S assume startup_32 is at
* address 0.
*/
. = 0;
.head.text : {
_head = . ;
HEAD_TEXT
_ehead = . ;
}
}

What is placed at the start is the HEAD_TEXT section, defined as follows:

/* Section used for early init (in .S files) */
#define HEAD_TEXT  *(.head.text)

Now look at head_64.S:

"arch/x86/boot/compressed/head_64.S"

__HEAD
.code32
ENTRY(startup_32)

where __HEAD is defined as

#define __HEAD      .section    ".head.text","ax"

So the first code executed in compressed/vmlinux.bin is the code at the startup_32 symbol.

Jump 4: entering long mode, kernel startup_64

/*
* Setup for the jump to 64bit mode
*
* When the jump is performend we will be in long mode but
* in 32bit compatibility mode with EFER.LME = 1, CS.L = 0, CS.D = 1
* (and in turn EFER.LMA = 1).  To jump into 64bit mode we use
* the new gdt/idt that has __KERNEL_CS with CS.L = 1.
* We place all of the values on our mini stack so lret can
* used to perform that far jump.
*/
pushl   $__KERNEL_CS
leal    startup_64(%ebp), %eax
pushl   %eax

/* Enter paged protected Mode, activating Long Mode */
movl    $(X86_CR0_PG | X86_CR0_PE), %eax /* Enable Paging and Protected mode */
movl    %eax, %cr0

/* Jump from 32bit compatibility mode into 64bit mode. */
lret

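The jump itself is simple: push the segment selector and the address onto the stack, then use lret to switch to the new mode and land in startup_64.

Worth noting is that along the way the code installs a new GDT and sets up page tables.

Jump 5: moving the kernel to prepare for decompression

Judging from the code, there is not much here. The purpose appears to be to get ready for decompressing the kernel.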

"arch/x86/boot/compressed/head_64.S"

/*
* Jump to the relocated address.
*/
leaq    relocated(%rbx), %rax
jmp *%rax

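The compressed kernel is moved to a region of memory above 16M. That location is configurable.

Jump 6: into the decompressed kernel, yes, the vmlinux you have seen before

Put simply, the kernel is decompressed to the 16M physical address, and then we jump there.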

"arch/x86/boot/compressed/head_64.S"

/*
* Jump to the decompressed kernel.
*/
jmp *%rax

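The jump itself is trivial, but several preparatory steps come before it. Let's annotate a few of them.

parse_elf()

One of the interesting ones is parse_elf(). The vmlinux that gets compressed is an ELF file, so after decompression the loadable segments listed in its program headers still have to be copied to the locations the ELF file specifies.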

Elf file type is EXEC (Executable file)
Entry point 0x1000000
There are 5 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000200000 0xffffffff81000000 0x0000000001000000
                 0x0000000000da0000 0x0000000000da0000  R E    200000
  LOAD           0x0000000001000000 0xffffffff81e00000 0x0000000001e00000
                 0x0000000000143000 0x0000000000143000  RW     200000
  LOAD           0x0000000001200000 0x0000000000000000 0x0000000001f43000
                 0x0000000000019018 0x0000000000019018  RW     200000
  LOAD           0x000000000135d000 0xffffffff81f5d000 0x0000000001f5d000
                 0x000000000016c000 0x00000000002ef000  RWE    200000
  NOTE           0x0000000000a38e08 0xffffffff81838e08 0x0000000001838e08
                 0x0000000000000204 0x0000000000000204         4

Just look at the first program header: its virtual address is 0xffffffff81000000 and its physical address is 0x1000000.

The corresponding code in parse_elf() is

dest = (void *)(phdr->p_paddr);
memmove(dest, output + phdr->p_offset, phdr->p_filesz);

As you can see, the relevant region of the file is moved to the place given by its physical address.
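The copy loop can be modelled on a toy image. This is a rough sketch, not the kernel's parse_elf(): Phdr, load_segments and the bytearray standing in for RAM are all invented for illustration, and only the p_paddr-based placement shown above is reproduced.

```python
from collections import namedtuple

PT_LOAD = 1  # ELF program header type for loadable segments
Phdr = namedtuple("Phdr", "p_type p_offset p_paddr p_filesz")


def load_segments(image, phdrs, memory, mem_base):
    """Copy every PT_LOAD segment from `image` to its physical address.

    `memory` is a bytearray standing in for RAM that starts at `mem_base`.
    """
    for ph in phdrs:
        if ph.p_type != PT_LOAD:
            continue  # NOTE and other headers are not copied
        dest = ph.p_paddr - mem_base
        memory[dest:dest + ph.p_filesz] = image[ph.p_offset:ph.p_offset + ph.p_filesz]


# Toy image: 4 bytes of "text" at file offset 8, to be placed at 0x1000000.
image = bytes(8) + b"\x48\x8d\x25\xf1"
ram = bytearray(16)
load_segments(image, [Phdr(PT_LOAD, 8, 0x1000000, 4)], ram, 0x1000000)
print(ram[:4].hex())
```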

Now let's see how that address is defined. The key is in the linker script.

. = __START_KERNEL;
phys_startup_64 = ABSOLUTE(startup_64 - LOAD_OFFSET);

/* Text and read-only data */
.text :  AT(ADDR(.text) - LOAD_OFFSET) {
_text = .;

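The key is the .text section: its virtual address is defined by __START_KERNEL, and its physical (load) address by AT(ADDR(.text) - LOAD_OFFSET). I won't work through the numbers here; they are easy to see.

Having got this far, I find this rather interesting. The role of the physical address is now clear, but the role of the virtual address is not yet. No hurry, it will become clear in due course.

Where did we jump to?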

jmp *%rax

rax holds the address returned by the extract_kernel() function. Without KASLR, that address is 0x1000000.

Back to the linker script:

#define LOAD_OFFSET __START_KERNEL_map
#define __START_KERNEL      (__START_KERNEL_map + __PHYSICAL_START)

. = __START_KERNEL;
phys_startup_64 = ABSOLUTE(startup_64 - LOAD_OFFSET);

/* Text and read-only data */
.text :  AT(ADDR(.text) - LOAD_OFFSET) {
_text = .;
/* bootstrapping code */
HEAD_TEXT

load address of .text = __START_KERNEL - LOAD_OFFSET
                      = __START_KERNEL_map + __PHYSICAL_START - __START_KERNEL_map
                      = __PHYSICAL_START = 0x1000000

And the .text section begins with HEAD_TEXT, that is, startup_64 from arch/x86/kernel/head_64.S.
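The same arithmetic can be replayed with concrete numbers. A small sketch, assuming __START_KERNEL_map is 0xffffffff80000000 (its usual x86_64 value) and __PHYSICAL_START is 0x1000000 as in this build; the Python names are stand-ins for the linker-script symbols.

```python
START_KERNEL_MAP = 0xFFFFFFFF80000000   # __START_KERNEL_map on x86_64
PHYSICAL_START = 0x1000000              # __PHYSICAL_START in this build (16M)

LOAD_OFFSET = START_KERNEL_MAP
START_KERNEL = START_KERNEL_MAP + PHYSICAL_START


def load_address(virt):
    """AT(ADDR(section) - LOAD_OFFSET): virtual address -> load address."""
    return virt - LOAD_OFFSET


print(hex(START_KERNEL), hex(load_address(START_KERNEL)))
```

Both numbers agree with the first LOAD program header in the readelf output: VirtAddr 0xffffffff81000000 and PhysAddr 0x1000000.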

Jump 7: switching to virtual addresses

/* Ensure I am executing from virtual addresses */
movq    $1f, %rax
jmp *%rax
1:

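After all this, we finally see the kernel use virtual addresses. Put that way it is still not very concrete, so let's look at the addresses shown by the bochs debugger.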

(0) [0x000000000100012e] 0010:000000000100012e (unk. ctxt): mov rax, 0xffffffff81000137 ; 48c7c037010081
<bochs:3> n
Next at t=308261737
(0) [0x0000000001000135] 0010:0000000001000135 (unk. ctxt): jmp rax                   ; ffe0
<bochs:4>
Next at t=308261738
(0) [0x0000000001000137] 0010:ffffffff81000137 (unk. ctxt): mov eax, 0x80000001       ; b801000080

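Watch the address of the next instruction: at the mov and the jmp it is still 0x100012e and 0x1000135, and after the jump it becomes 0xffffffff81000137, while the physical address in brackets simply continues at 0x1000137.

Jump 8: finally, C code

I did not expect this many jumps. It is about time we saw some C code.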

movq    initial_code(%rip),%rax
pushq   $0     # fake return address to stop unwinder
pushq   $__KERNEL_CS   # set correct cs
pushq   %rax        # target address in negative space
lretq

Again a far return is used: it jumps to the address stored in initial_code. And what is that?

GLOBAL(initial_code)
.quad   x86_64_start_kernel

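Yes, that is the one.

Jump 9: meeting the start_kernel of my dreams

I thought eight jumps would be enough, but something still felt missing. Without reaching the start_kernel I had been dreaming of, how could this count as a complete boot? So let's add a ninth jump to round things off.

It is very simple, just two function calls:

x86_64_start_kernel() -> x86_64_start_reservations() -> start_kernel()

There, from the bootloader to start_kernel(), the path is now complete.

Segments and pages

Segments prepared for protected mode

Before entering protected mode, the segments, that is the GDT, have to be prepared. Let's look at this part closely.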

"arch/x86/boot/pm.c"

static void setup_gdt(void)
{
/* There are machines which are known to not boot with the GDT
being 8-byte unaligned.  Intel recommends 16 byte alignment. */
static const u64 boot_gdt[] __attribute__((aligned(16))) = {
/* CS: code, read/execute, 4 GB, base 0 */
[GDT_ENTRY_BOOT_CS] = GDT_ENTRY(0xc09b, 0, 0xfffff),
/* DS: data, read/write, 4 GB, base 0 */
[GDT_ENTRY_BOOT_DS] = GDT_ENTRY(0xc093, 0, 0xfffff),
/* TSS: 32-bit tss, 104 bytes, base 4096 */
/* We only have a TSS here to keep Intel VT happy;
we don't actually use it for anything. */
[GDT_ENTRY_BOOT_TSS] = GDT_ENTRY(0x0089, 4096, 103),
};
/* Xen HVM incorrectly stores a pointer to the gdt_ptr, instead
of the gdt_ptr contents.  Thus, make it static so it will
stay in memory, at least long enough that we switch to the
proper kernel GDT. */
static struct gdt_ptr gdt;

gdt.len = sizeof(boot_gdt)-1;
gdt.ptr = (u32)&boot_gdt + (ds() << 4);

asm volatile("lgdtl %0" : : "m" (gdt));
}

The comments are fairly clear. This table lives in the Kernel setup part.

Also worth noting: right after the switch to protected mode, paging is not yet enabled.
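To see what the three boot_gdt entries look like as raw 64-bit descriptors, the packing can be reproduced by hand. The masks and shifts below are how I recall the kernel's GDT_ENTRY() macro in arch/x86/include/asm/segment.h; treat that as an assumption and check it against your tree. gdt_entry is a Python stand-in, not kernel code.

```python
def gdt_entry(flags, base, limit):
    """Pack a segment descriptor in the style of the GDT_ENTRY() macro."""
    return (((base & 0xFF000000) << (56 - 24)) |   # base bits 31..24
            ((flags & 0x0000F0FF) << 40) |         # access byte and flag nibble
            ((limit & 0x000F0000) << (48 - 16)) |  # limit bits 19..16
            ((base & 0x00FFFFFF) << 16) |          # base bits 23..0
            (limit & 0x0000FFFF))                  # limit bits 15..0


boot_cs = gdt_entry(0xC09B, 0, 0xFFFFF)
boot_ds = gdt_entry(0xC093, 0, 0xFFFFF)
boot_tss = gdt_entry(0x0089, 4096, 103)
print(f"{boot_cs:#018x} {boot_ds:#018x} {boot_tss:#018x}")
```

The code and data entries come out as the familiar flat 4G descriptors, base 0 with a 20-bit limit of 0xfffff counted in 4K units.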

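Segments and pages in long mode

Before entering long mode, the code prepares new segments, fills in the page tables, and turns paging on.

The GDT looks like this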

"arch/x86/boot/compressed/head_64.S"

.data
gdt:
.word   gdt_end - gdt
.long   gdt
.word   0
.quad   0x0000000000000000  /* NULL descriptor */
.quad   0x00af9a000000ffff  /* __KERNEL_CS */
.quad   0x00cf92000000ffff  /* __KERNEL_DS */
.quad   0x0080890000000000  /* TS descriptor */
.quad   0x0000000000000000  /* TS continued */
gdt_end:

It is much the same as the one used when first entering protected mode. No big difference (mostly because I can't read it).
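The raw descriptors can be decoded mechanically, which makes them easier to compare. A minimal sketch, assuming the standard x86 segment descriptor layout; decode_descriptor is a name made up here, not a kernel function.

```python
def decode_descriptor(raw):
    """Split a 64-bit segment descriptor into its architectural fields."""
    limit = (raw & 0xFFFF) | ((raw >> 32) & 0xF0000)
    base = ((raw >> 16) & 0xFFFFFF) | ((raw >> 32) & 0xFF000000)
    access = (raw >> 40) & 0xFF
    flags = (raw >> 52) & 0xF
    return {
        "base": base,
        "limit": limit,
        "present": bool(access & 0x80),
        "type": access & 0x1F,      # S bit plus the 4-bit type field
        "L": bool(flags & 0x2),     # 64-bit code segment
        "D": bool(flags & 0x4),     # 32-bit default operand size
        "G": bool(flags & 0x8),     # limit counted in 4K units
    }


kernel_cs = decode_descriptor(0x00AF9A000000FFFF)
kernel_ds = decode_descriptor(0x00CF92000000FFFF)
print(kernel_cs)
print(kernel_ds)
```

Decoded this way, the visible difference in __KERNEL_CS is the L bit: it is set (with D clear), which marks a 64-bit code segment, whereas a flat 32-bit code descriptor such as 0x00cf9b000000ffff has D set and L clear.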

The key step is filling in the page tables and enabling paging. The page table sits at the very end of head_64.S.

"arch/x86/boot/compressed/head_64.S"

.section ".pgtable","a",@nobits
.balign 4096
pgtable:
.fill BOOT_PGT_SIZE, 1, 0

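Expanded, it looks like this

Segments and pages set up by kernel/head_64.S

Once inside the kernel proper, the kernel rebuilds its own page tables. Interestingly (I went back and checked the code), the GDT is not updated before the page tables are switched.

Now let's see what this page table looks like when expanded.

Interestingly, it actually contains two mappings.

1. At low addresses, an identity map from physical address to physical address

2. At high addresses, a mapping from virtual addresses to physical addresses

Interrupts

Disabling interrupts

Before entering protected mode, interrupts are disabled first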

"arch/x86/boot/pm.c"

static void realmode_switch_hook(void)
{
if (boot_params.hdr.realmode_swtch) {
asm volatile("lcallw *%0"
: : "m" (boot_params.hdr.realmode_swtch)
: "eax", "ebx", "ecx", "edx");
} else {
asm volatile("cli");
outb(0x80, 0x70); /* Disable NMI */
io_delay();
}
}

An empty IDT

Still before entering protected mode, the IDT is set up.

"arch/x86/boot/pm.c"

/*
* Set up the IDT
*/
static void setup_idt(void)
{
static const struct gdt_ptr null_idt = {0, 0};
asm volatile("lidtl %0" : : "m" (null_idt));
}

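This one is empty, though. Perhaps because interrupts are turned off?

Debugging early kernel boot with bochs: a close reading of the code

What you learn on paper always feels shallow; to really understand something you have to do it yourself.

After a lot of reading and thinking, the best way to understand would be to write and modify the code by hand. I am not at that level yet, so for now I will step through it in a debugger and see what the code actually does.

Note: this is the 4.7 kernel.

compressed/head_64.S -> startup_32

This subsection looks at arch/x86/boot/compressed/head_64.S.

Working out where we were loaded

head_64.S is built as if it were located at address 0, and every address offset is relative to address 0. So the very first thing to do is work out where the code was loaded this time; only then can later address calculations, including page table initialization, come out right.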

/*
* Calculate the delta between where we were compiled to run
* at and where we were actually loaded at.  This can only be done
* with a short local call on x86.  Nothing  else will tell us what
* address we are running at.  The reserved chunk of the real-mode
* data at 0x1e4 (defined as a scratch field) are used as the stack
* for this calculation. Only 4 bytes are needed.
*/
leal    (BP_scratch+4)(%esi), %esp
call    1f
1:  popl    %ebp
subl    $1b, %ebp

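The comment explains it well, so let's go straight to the assembly in the debugger.

This is the disassembly of the code above.

After the first instruction, leal (BP_scratch+4)(%esi), %esp, has executed, display the registers and the corresponding memory.

The stack address computed at this point is 0x14828. In other words, we expect that once call 1f executes, the stack holds the actual run-time address of 1:. Only then can subl $1b, %ebp yield the physical address we were actually loaded at.

Now look at the registers and the stack after popl %ebp has executed.

Both ebp and the top-of-stack slot now hold 0x100021, which confirms that this stack really was used to save the return address of the call. Note that the stack grows downward, from high addresses to low, so the memory to display is at 0x14824 rather than 0x14828.

Finally, look at the registers after subl $1b, %ebp.

See, ebp is now 0x100000.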
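The four instructions can be replayed step by step with the numbers from the debugger. In this sketch the value of esi (0x14640) and the link-time address of the label (0x21) are not shown in the text; they are inferred from esp = 0x14828 and the return address 0x100021, so treat them as assumptions. load_delta is a made-up helper name.

```python
BP_SCRATCH = 0x1E4  # offset of the scratch field in the real-mode data


def load_delta(esi, return_addr, link_addr_of_label):
    """Replay leal/call/popl/subl; return (final esp, stack slot used, delta)."""
    esp = esi + BP_SCRATCH + 4      # leal (BP_scratch+4)(%esi), %esp
    esp -= 4                        # call 1f pushes the return address...
    stack = {esp: return_addr}      # ...which is the run-time address of 1:
    slot = esp
    ebp = stack[esp]                # popl %ebp
    esp += 4
    ebp -= link_addr_of_label       # subl $1b, %ebp
    return esp, slot, ebp


esp, slot, delta = load_delta(0x14640, 0x100021, 0x21)
print(hex(esp), hex(slot), hex(delta))
```

The slot that receives the return address is four bytes below the initial esp, matching the 0x14824 versus 0x14828 observation above.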

Loading the GDT

The GDT is defined near the bottom of head_64.S.

.data
gdt:
.word   gdt_end - gdt
.long   gdt
.word   0
.quad   0x0000000000000000  /* NULL descriptor */
.quad   0x00af9a000000ffff  /* __KERNEL_CS */
.quad   0x00cf92000000ffff  /* __KERNEL_DS */
.quad   0x0080890000000000  /* TS descriptor */
.quad   0x0000000000000000  /* TS continued */
gdt_end:

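Look closely: this gdt actually embeds the GDTR contents in its first 8 bytes. The first two bytes are the GDT size, the next four bytes are the address, and the last two bytes are padding for alignment. We just saw that the kernel can be loaded at a different address each time, and the GDTR must hold a physical address, so the address has to be fixed up before loading.

First the code: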

/* Load new GDT with the 64bit segments using 32bit descriptor */
leal    gdt(%ebp), %eax
movl    %eax, gdt+2(%ebp)
lgdt    gdt(%ebp)

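See, it already uses the load address computed a moment ago (%ebp), adding the gdt offset to it. Now let's look at it in the debugger.

First the disassembly:

00100067: ( ): lea eax, dword ptr ss:[ebp+6237152] ; 8d85e02b5f00
0010006d: ( ): mov dword ptr ss:[ebp+6237154], eax ; 8985e22b5f00
00100073: ( ): lgdt ss:[ebp+6237152] ; 0f0195e02b5f00

After the first instruction, eax holds the actual physical address of gdt. Let's look at the registers and memory at this point.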

<bochs:14> r
rax: 0x00000000_006f2be0 rcx: 0x00000000_00002028

eax now holds 0x6f2be0, which is the actual physical address of gdt in memory.

Let's see what is stored there.

<bochs:15> xp /1xh 0x6f2be0
[bochs]:
0x00000000006f2be0 <bogus+       0>:    0x0030
<bochs:16> xp /1xw 0x6f2be2
[bochs]:
0x00000000006f2be2 <bogus+       0>:    0x005f2be0

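The first two bytes are 0x30, and the following four bytes contain 0x5f2be0.

Compare with the gdt definition: the first two bytes are the size, 6 * 8 = 48 = 0x30. It matches.

And the next four? They are the link-time offset of gdt: 0x5f2be0 + 0x100000 = 0x6f2be0, exactly the value in eax.

Now look at the result after the second instruction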

<bochs:18> xp /1xw 0x6f2be2
[bochs]:
0x00000000006f2be2 <bogus+       0>:    0x006f2be0

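Now the contents at gdt+2(%ebp) are the actual physical address. One thing puzzles me, though: why does this address point at gdt itself, when the start of gdt is being used as the GDTR image? Perhaps that slot would otherwise be an unused, empty GDT entry, so it is simply put to use this way? (The CPU never dereferences entry 0 of the GDT, so reusing that slot would indeed be harmless.)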
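The fix-up of the 6-byte GDTR image can be written out with the values observed in bochs. A minimal sketch; fix_gdtr is an invented helper, and the struct format simply mirrors the .word/.long layout at the start of gdt.

```python
import struct


def fix_gdtr(size, link_offset, load_base):
    """Return the run-time gdt address and the 6-byte image that lgdt reads."""
    runtime_gdt = load_base + link_offset           # leal gdt(%ebp), %eax
    image = struct.pack("<HI", size, runtime_gdt)   # movl %eax, gdt+2(%ebp)
    return runtime_gdt, image


addr, gdtr = fix_gdtr(0x30, 0x5F2BE0, 0x100000)
print(hex(addr), gdtr.hex())
```

The packed bytes are the little-endian size word 0x0030 followed by the patched address 0x006f2be0, which is what the two xp dumps above show after the second instruction.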

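Initializing the long-mode page tables

Paging must be enabled before entering long mode, so the page tables have to be ready here. The code is simple: it just sets up the address mappings. First the code.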

/*
* Build early 4G boot pagetable
*/
/* Initialize Page tables to 0 */
leal    pgtable(%ebx), %edi
xorl    %eax, %eax
movl    $(BOOT_INIT_PGT_SIZE/4), %ecx
rep stosl

/* Build Level 4 */
leal    pgtable + 0(%ebx), %edi
leal    0x1007 (%edi), %eax
movl    %eax, 0(%edi)

/* Build Level 3 */
leal    pgtable + 0x1000(%ebx), %edi
leal    0x1007(%edi), %eax
movl    $4, %ecx
1:  movl    %eax, 0x00(%edi)
addl    $0x00001000, %eax
addl    $8, %edi
decl    %ecx
jnz 1b

/* Build Level 2 */
leal    pgtable + 0x2000(%ebx), %edi
movl    $0x00000183, %eax
movl    $2048, %ecx
1:  movl    %eax, 0(%edi)
addl    $0x00200000, %eax
addl    $8, %edi
decl    %ecx
jnz 1b

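Now the debugger results.

After the first instruction has obtained the actual address of pgtable, edi is 0x2305000. After the zeroing, look at the contents of the first page.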

<bochs:20> xp /2xw 0x2305000
[bochs]:
0x0000000002305000 <bogus+       0>:    0x00000000  0x00000000

At this point everything is 0.

Once the Build Level 4 block has finished, display it again:

<bochs:24> xp /2xw 0x2305000
[bochs]:
0x0000000002305000 <bogus+       0>:    0x02306007  0x00000000

Now it points to the next page.

Next, the table one level down gets filled in. Again, first display its current contents.

<bochs:29> xp /8xw 0x2306000
[bochs]:
0x0000000002306000 <bogus+       0>:    0x00000000  0x00000000  0x00000000  0x00000000
0x0000000002306010 <bogus+      16>:    0x00000000  0x00000000  0x00000000  0x00000000

Yes, still empty.

Then, when the Build Level 3 block has finished, look at the memory again.

<bochs:33> xp /8xw 0x2306000
[bochs]:
0x0000000002306000 <bogus+       0>:    0x02307007  0x00000000  0x02308007  0x00000000
0x0000000002306010 <bogus+      16>:    0x02309007  0x00000000  0x0230a007  0x00000000

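See how each entry points to the next 4k page?

We are almost done. Last comes Level 2. At this stage each last-level entry maps a 2M page, and 2048 such entries are filled in, so the total is 2M * 2048 = 4G.

The code already shows how it is done, so let's just look at the result.

Again, first the memory before it is written.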

<bochs:34> xp /8xw 0x02307000
[bochs]:
0x0000000002307000 <bogus+       0>:    0x00000000  0x00000000  0x00000000  0x00000000
0x0000000002307010 <bogus+      16>:    0x00000000  0x00000000  0x00000000  0x00000000

Still all 0.

Then stop at Enable the boot page tables and print again.

<bochs:40> xp /8xw 0x2307000
[bochs]:
0x0000000002307000 <bogus+       0>:    0x00000183  0x00000000  0x00200183  0x00000000
0x0000000002307010 <bogus+      16>:    0x00400183  0x00000000  0x00600183  0x00000000

See, the first entry points at address 0, the second at address 2M, and so on.

With that, we can happily move on into long mode and startup_64.
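The three build loops are easy to mirror in a few lines and to check against the bochs dumps. This is a sketch of the same arithmetic only, with a dict standing in for physical memory; build_boot_pagetable is a made-up name, and the 0x2305000 base is the edi value seen in the debugger.

```python
def build_boot_pagetable(pgtable):
    """Return {physical address: 64-bit entry} for the early 4G identity map."""
    mem = {}
    # Level 4: one entry pointing at the level-3 page (flags 0x7).
    mem[pgtable] = pgtable + 0x1007
    # Level 3: four entries, each pointing at one level-2 page.
    entry = pgtable + 0x1000 + 0x1007
    for i in range(4):
        mem[pgtable + 0x1000 + 8 * i] = entry
        entry += 0x1000
    # Level 2: 2048 entries, each mapping one 2M page (flags 0x183).
    entry = 0x183
    for i in range(2048):
        mem[pgtable + 0x2000 + 8 * i] = entry
        entry += 0x200000
    return mem


pt = build_boot_pagetable(0x2305000)
print(hex(pt[0x2305000]), hex(pt[0x2306000]), hex(pt[0x2307008]))
```

The entries at 0x2305000, 0x2306000 and 0x2307000 come out the same as in the xp dumps, and the last level-2 entry maps the 2M page just below 4G.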

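Computing the address for kernel decompression

bzImage is a compressed kernel, and this is where the code computes where to decompress it to. (There is another such place later on; we will look at it when we get there.)

The code is as follows: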

/* Start with the delta to where the kernel will run at. */
#ifdef CONFIG_RELOCATABLE
leaq    startup_32(%rip) /* - $startup_32 */, %rbp
movl    BP_kernel_alignment(%rsi), %eax
decl    %eax
addq    %rax, %rbp
notq    %rax
andq    %rax, %rbp
cmpq    $LOAD_PHYSICAL_ADDR, %rbp
jge 1f
#endif
movq    $LOAD_PHYSICAL_ADDR, %rbp
1:

/* Target address to relocate to for decompression */
movl    BP_init_size(%rsi), %ebx
subl    $_end, %ebx
addq    %rbp, %rbx

/* Set up the stack */
leaq    boot_stack_end(%rbx), %rsp

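The first instruction is clever: current instruction address + the offset back to startup_32 => rbp. Isn't that just the address startup_32 was loaded at, 0x100000? Let's check in the debugger.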

<bochs:12> r
rax: 0x00000000_00000000 rcx: 0x00000000_c0000080
rdx: 0x00000000_00000000 rbx: 0x00000000_00000000
rsp: 0x00000000_00706d40 rbp: 0x00000000_00100000
(0) [0x00000000001002b3] 0010:00000000001002b3 (unk. ctxt): mov eax, dword ptr ds:[rsi+560] ; 8b8630020000
<bochs:14> r
rax: 0x00000000_00000000 rcx: 0x00000000_c0000080
rdx: 0x00000000_00000000 rbx: 0x00000000_00000000
rsp: 0x00000000_00706d40 rbp: 0x00000000_00100000

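Oh, it is unchanged, because rbp was already set in startup_32.

The next part computes the target address for decompression. First, what the disassembly looks like.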

<bochs:28> u 0x1002d4 0x1002e3
001002d4: (                    ): mov ebx, dword ptr ds:[rsi+608] ; 8b9e60020000
001002da: (                    ): sub ebx, 0x0060d000       ; 81eb00d06000
001002e0: (                    ): add rbx, rbp              ; 4801eb

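From the debugger:

BP_init_size = 0x130b000

By this point rbp is no longer 0x100000: the alignment step and the LOAD_PHYSICAL_ADDR check above leave it at 0x1000000 (16M), the only value consistent with the result below. So finally rbx = rbp + (BP_init_size - _end) = 0x1000000 + (0x130b000 - 0x60d000) = 0x1cfe000

Now check the value of rbx after these three instructions have executed.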

<bochs:9> r
rax: 0xffffffff_ff000000 rcx: 0x00000000_c0000080
rdx: 0x00000000_00000000 rbx: 0x00000000_01cfe000

Exactly as expected.
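The whole calculation, including the alignment step, fits in a short function. In this sketch the kernel alignment of 0x1000000 is inferred from rax = 0xffffffff_ff000000 in the register dump, and LOAD_PHYSICAL_ADDR = 0x1000000 is assumed from the 16M figure used throughout; decompress_target is an invented name.

```python
def decompress_target(load_addr, kernel_alignment, load_physical_addr,
                      init_size, end):
    """Return (rbp, rbx): where the kernel will run, and where the
    compressed image is moved to before decompression."""
    mask = kernel_alignment - 1
    rbp = (load_addr + mask) & ~mask    # addq %rax, %rbp; notq; andq
    if rbp < load_physical_addr:        # cmpq $LOAD_PHYSICAL_ADDR / jge 1f
        rbp = load_physical_addr
    rbx = rbp + (init_size - end)       # movl init_size; subl $_end; addq %rbp
    return rbp, rbx


rbp, rbx = decompress_target(0x100000, 0x1000000, 0x1000000,
                             0x130B000, 0x60D000)
print(hex(rbp), hex(rbx))
```

With the load address of 0x100000 rounded up to 16M, rbx lands on 0x1cfe000, the value seen in the register dump.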

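Finally, the instruction that sets up the stack. After it executes, rsp = 0x2304d40

which means boot_stack_end = rsp - rbx = 0x2304d40 - 0x1cfe000 = 0x606d40

Why compute this value? Let's compute one more.

_end - align(boot_stack_end, 4k) = 0x60d000 - 0x607000 = 0x6000 = 6 * 4K

Then look at the definitions: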

"arch/x86/include/asm/boot.h"

# define BOOT_INIT_PGT_SIZE (6*4096)
# ifdef CONFIG_RANDOMIZE_BASE
/*
* Assuming all cross the 512GB boundary:
* 1 page for level4
* (2+2)*4 pages for kernel, param, cmd_line, and randomized kernel
* 2 pages for first 2M (video RAM: CONFIG_X86_VERBOSE_BOOTUP).
* Total is 19 pages.
*/
#  ifdef CONFIG_X86_VERBOSE_BOOTUP
#   define BOOT_PGT_SIZE    (19*4096)
#  else /* !CONFIG_X86_VERBOSE_BOOTUP */
#   define BOOT_PGT_SIZE    (17*4096)
#  endif
# else /* !CONFIG_RANDOMIZE_BASE */
#  define BOOT_PGT_SIZE     BOOT_INIT_PGT_SIZE
# endif

"arch/x86/boot/compressed/head_64.S"

boot_stack_end:

/*
* Space for page tables (not in .bss so not zeroed)
*/
.section ".pgtable","a",@nobits
.balign 4096
pgtable:
.fill BOOT_PGT_SIZE, 1, 0

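The code above makes it clear that what lies between boot_stack_end and _end is the page table.

In the normal case that is six 4k pages.

Moving the compressed kernel into place, ready for decompression

First the code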

/*
* Copy the compressed kernel to the end of our buffer
* where decompression in place becomes safe.
*/
pushq   %rsi
leaq    (_bss-8)(%rip), %rsi
leaq    (_bss-8)(%rbx), %rdi
movq    $_bss /* - $startup_32 */, %rcx
shrq    $3, %rcx
std
rep movsq
cld
popq    %rsi

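The code is simple: it copies _bss bytes from the current load location to the same offsets relative to rbx, eight bytes at a time and from the end backward (std).

While debugging I found something interesting.

leaq (_bss - 8)(%rip), %rsi

In the debugger _bss = 0x5F2D40, yet after this instruction rsi - rip is not that value.

On reflection that makes sense: rip has already advanced some distance from startup_32 by now. If the raw _bss offset were added to it, the result would be wrong, so the assembler encodes the displacement relative to the current instruction instead. Neat.

Now let's compute one more value.

boot_stack_end - _bss = 0x606d40 - 0x5F2D40 = 0x14000

And look at the code again: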

/*
* stack and heap for uncompression
*/
.bss
.balign 4
boot_heap:
.fill BOOT_HEAP_SIZE, 1, 0
boot_stack:
.fill BOOT_STACK_SIZE, 1, 0
boot_stack_end:

So boot_stack_end - _bss is the combined size of the heap and the stack.

Now the definitions of the heap and stack sizes

# define BOOT_HEAP_SIZE      0x10000
# define BOOT_STACK_SIZE     0x4000

Perfect.
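Both size checks can be bundled into one small script. The three addresses are the values read from the debugger and disassembly above; the constants are the ones quoted from boot.h; align_up is a helper introduced here for illustration.

```python
BOOT_HEAP_SIZE = 0x10000
BOOT_STACK_SIZE = 0x4000
BOOT_INIT_PGT_SIZE = 6 * 4096

BSS = 0x5F2D40              # _bss, from the debugger
BOOT_STACK_END = 0x606D40   # boot_stack_end, computed from rsp - rbx
END = 0x60D000              # _end, from the disassembly


def align_up(value, alignment):
    """Round value up to the next multiple of a power-of-two alignment."""
    return (value + alignment - 1) & ~(alignment - 1)


heap_and_stack = BOOT_STACK_END - BSS
pgtable_size = END - align_up(BOOT_STACK_END, 4096)
print(hex(heap_and_stack), hex(pgtable_size))
```

The first difference equals BOOT_HEAP_SIZE + BOOT_STACK_SIZE and the second equals BOOT_INIT_PGT_SIZE, so the observed layout agrees with the source.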

kernel/head_64.S

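Saving the load offset: phys_base

Now that the kernel can be relocated, the address it will actually run at is not known at build time. So that offset has to be recorded.

The value is kept in phys_base.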

ENTRY(phys_base)
/* This must match the first entry in level2_kernel_pgt */
.quad   0x0000000000000000
EXPORT_SYMBOL(phys_base)

Let's see how it is computed.

/*
* Compute the delta between the address I am compiled to run at and the
* address I am actually running at.
*/
leaq    _text(%rip), %rbp
subq    $_text - __START_KERNEL_map, %rbp

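It simply computes the difference between the load address and the build-time address. When the kernel is not relocated, rbp works out to 0.

Finally rbp is added directly to the contents of phys_base.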

/* Fixup phys_base */
addq    %rbp, phys_base(%rip)

In other words, when relocation is not in effect, the value of phys_base is 0.
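The delta computation can be tried with concrete addresses. A minimal sketch, assuming __START_KERNEL_map = 0xffffffff80000000 and _text linked at 0xffffffff81000000; the relocated case uses a hypothetical load address of 0x2000000, and phys_base_fixup is a made-up name.

```python
START_KERNEL_MAP = 0xFFFFFFFF80000000   # __START_KERNEL_map on x86_64


def phys_base_fixup(runtime_text, linked_text_virt, phys_base=0):
    """Return (delta, new phys_base) as computed at the top of startup_64."""
    linked_text_phys = linked_text_virt - START_KERNEL_MAP  # $_text - __START_KERNEL_map
    delta = runtime_text - linked_text_phys                 # leaq/subq -> %rbp
    return delta, phys_base + delta                         # addq %rbp, phys_base(%rip)


print(phys_base_fixup(0x1000000, 0xFFFFFFFF81000000))
print(phys_base_fixup(0x2000000, 0xFFFFFFFF81000000))
```

Running at the linked address gives a delta of 0, while the hypothetical kernel loaded 16M higher would record 0x1000000 in phys_base.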

Now let's see how to debug this and print the value of phys_base.

First I added three instructions to the code:

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 5df831e..c02877b 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -164,7 +164,10 @@ startup_64:
jne 1b

/* Fixup phys_base */
+   leaq    phys_base(%rip), %rax
+   movq    phys_base(%rip), %rax
addq    %rbp, phys_base(%rip)
+   movq    phys_base(%rip), %rax

movq    $(early_level4_pgt - __START_KERNEL_map), %rax
jmp 1f

They fetch the address of phys_base, fetch its contents, and fetch the contents again after the write.

The disassembly of the modified code is:

<bochs:10> u 0x10000fd 0x100012f
010000fd: (                    ): lea rax, qword ptr ds:[rip+12635916] ; 488d050ccfc000
01000104: (                    ): mov rax, qword ptr ds:[rip+12635909] ; 488b0505cfc000
0100010b: (                    ): add qword ptr ds:[rip+12635902], rbp ; 48012dfecec000
01000112: (                    ): mov rax, qword ptr ds:[rip+12635895] ; 488b05f7cec000

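After the first instruction, which takes the address, the address of phys_base turns out to be 0x1c0d010. Dumping that memory shows it is 0 as well.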

<bochs:16> r
rax: 0x00000000_01c0d010

<bochs:17> xp /2xw 0x1c0d010
[bochs]:
0x0000000001c0d010 <bogus+       0>:    0x00000000  0x00000000

Executing the second instruction, which reads the memory contents, also gives 0.

<bochs:19> n
Next at t=308237806
(0) [0x000000000100010b] 0010:000000000100010b (unk. ctxt): add qword ptr ds:[rip+12635902], rbp ; 48012dfecec000
<bochs:20> r
rax: 0x00000000_00000000

After the add and the final load have executed, both memory and rax still show 0.

<bochs:21> n
Next at t=308237807
(0) [0x0000000001000112] 0010:0000000001000112 (unk. ctxt): mov rax, qword ptr ds:[rip+12635895] ; 488b05f7cec000
<bochs:22> xp /2xw 0x1c0d010
[bochs]:
0x0000000001c0d010 <bogus+       0>:    0x00000000  0x00000000

<bochs:24> r
rax: 0x00000000_00000000

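This shows that when the kernel is not relocated, phys_base holds 0.

Why does this value matter so much? Well, you will find out later.

startup_64

This is the first instruction executed after the kernel has been decompressed. There is not much real content here, but after such a long detour it is still exciting to arrive.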

startup_64:
leaq    (__end_init_task - 8)(%rip), %rsp

/* Sanitize CPU configuration */
call verify_cpu

/*
* Compute the delta between the address I am compiled to run at and the
* address I am actually running at.
*/
leaq    _text(%rip), %rbp
subq    $_text - __START_KERNEL_map, %rbp

The corresponding disassembly in the debugger is:

(0) [0x00000000022e7b2d] 0010:00000000022e7b2d (unk. ctxt): jmp rax                   ; ffe0
<bochs:48> n
Next at t=308258765
(0) [0x0000000001000000] 0010:0000000001000000 (unk. ctxt): lea rsp, qword ptr ds:[rip+12599281] ; 488d25f13fc000
<bochs:49> u 0x1000000 0x100002f
01000000: (                    ): lea rsp, qword ptr ds:[rip+12599281] ; 488d25f13fc000
01000007: (                    ): call .+413                ; e89d010000
0100000c: (                    ): lea rbp, qword ptr ds:[rip-19] ; 488d2dedffffff
01000013: (                    ): sub rbp, 0x0000000001000000 ; 4881ed00000001

The world suddenly became brighter.
