The Linux Boot Process in Detail
2011-09-04 17:06
1. Booting
1.1 Building the Linux Kernel Image
This section explains the steps taken during compilation of the Linux kernel and the output produced at each stage. The build process depends on the architecture, so I would like to emphasize that we only consider building a Linux/x86 kernel.
When the user types 'make zImage' or 'make bzImage', the resulting bootable kernel image is stored as arch/i386/boot/zImage or arch/i386/boot/bzImage respectively. Here is how the image is built:
C and assembly source files are compiled into ELF relocatable object format (.o) and some of them are grouped logically into archives (.a) using ar(1).
Using ld(1), the above .o and .a are linked into vmlinux, which is a statically linked, non-stripped ELF 32-bit LSB 80386 executable file.
System.map is produced by nm vmlinux; irrelevant or uninteresting symbols are grepped out.
Enter directory arch/i386/boot.
Bootsector asm code bootsect.S is preprocessed either with or without -D__BIG_KERNEL__, depending on whether the target is bzImage or zImage, into bbootsect.s or bootsect.s respectively.
bbootsect.s is assembled and then converted into 'raw binary' form called bbootsect (or bootsect.s is assembled and raw-converted into bootsect for zImage).
Setup code setup.S (setup.S includes video.S) is preprocessed into bsetup.s for bzImage or setup.s for zImage. In the same way as the bootsector code, the difference is marked by -D__BIG_KERNEL__ present for bzImage. The result is then converted into 'raw binary' form called bsetup.
Enter directory arch/i386/boot/compressed and convert /usr/src/linux/vmlinux to $tmppiggy (tmp filename) in raw binary format, removing .note and .comment ELF sections.
gzip -9 < $tmppiggy > $tmppiggy.gz
Link $tmppiggy.gz into ELF relocatable (ld -r) piggy.o.
Compile compression routines head.S and misc.c (still in arch/i386/boot/compressed directory) into ELF objects head.o and misc.o.
Link together head.o, misc.o and piggy.o into bvmlinux (or vmlinux for zImage, don't mistake this for /usr/src/linux/vmlinux!). Note the difference between -Ttext 0x1000 used for vmlinux and -Ttext 0x100000 for bvmlinux, i.e. for bzImage the compression loader is high-loaded.
Convert bvmlinux to 'raw binary' bvmlinux.out, removing .note and .comment ELF sections.
Go back to arch/i386/boot directory and, using the program tools/build, cat together bbootsect, bsetup and compressed/bvmlinux.out into bzImage (delete extra 'b' above for zImage). This writes important variables like setup_sects and root_dev at the end of the bootsector.
The size of the bootsector is always 512 bytes. The size of the setup must be at least 4 sectors but is limited above by about 12K - the rule is:
0x4000 bytes >= 512 + setup_sects * 512 + room for stack while running bootsector/setup
We will see later where this limitation comes from.
The upper limit on the bzImage size produced at this step is about 2.5M for booting with LILO and 0xFFFF paragraphs (0xFFFF0 = 1048560 bytes) for booting a raw image, e.g. from floppy disk or CD-ROM (El-Torito emulation mode).
Note that while tools/build does validate the size of the boot sector, the kernel image and the lower bound of the setup size, it does not check the *upper* bound of said setup size. Therefore it is easy to build a broken kernel by just adding some large ".space" at the end of setup.S.
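The size rule above can be written down as a small check. This is only a sketch of a test that tools/build does not perform; the function name, the constants' names and the stack_room parameter are hypothetical, and treating the lower bound as "at least 4 sectors" (the SETUPSECS default) is an assumption.

```c
#include <stdbool.h>

#define SECTOR_SIZE     512u     /* size of the bootsector and of one setup sector */
#define LOW_AREA_LIMIT  0x4000u  /* upper bound taken from the rule in the text */
#define MIN_SETUP_SECTS 4u       /* assumed lower bound (SETUPSECS default) */

/*
 * Returns true if bootsector + setup + stack fit below 0x4000.
 * stack_room is a caller-chosen reserve; the text gives no exact figure.
 */
bool setup_size_ok(unsigned setup_sects, unsigned stack_room)
{
    if (setup_sects < MIN_SETUP_SECTS)
        return false;
    return SECTOR_SIZE + setup_sects * SECTOR_SIZE + stack_room <= LOW_AREA_LIMIT;
}
```

With a 512-byte stack reserve, for example, 30 setup sectors still satisfy the rule and 31 do not.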
1.2 Booting: Overview
The boot process details are architecture-specific, so we shall focus our attention on the IBM PC/IA32 architecture. Due to old design and backward compatibility, the PC firmware boots the operating system in an old-fashioned manner. This process can be separated into the following six logical stages:
BIOS selects the boot device.
BIOS loads the bootsector from the boot device.
Bootsector loads setup, decompression routines and compressed kernel image.
The kernel is uncompressed in protected mode.
Low-level initialisation is performed by asm code.
High-level C initialisation.
1.3 Booting: BIOS POST
The power supply starts the clock generator and asserts the #POWERGOOD signal on the bus.
The CPU #RESET line is asserted (CPU now in real 8086 mode).
%ds=%es=%fs=%gs=%ss=0, %cs=0xFFFF0000, %eip = 0x0000FFF0 (ROM BIOS POST code).
All POST checks are performed with interrupts disabled.
IVT (Interrupt Vector Table) is initialised at address 0.
The BIOS Bootstrap Loader function is invoked via int 0x19, with %dl containing the boot device 'drive number'. This loads track 0, sector 1 at physical address 0x7C00 (0x07C0:0000).
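The notation 0x07C0:0000 is a real-mode segment:offset pair. As a minimal sketch (the function name is made up for illustration), the physical address is the segment shifted left by four bits plus the offset:

```c
#include <stdint.h>

/* Real-mode address translation: physical = segment * 16 + offset. */
uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

The same arithmetic explains the other addresses used in this chapter, e.g. 0x9000:0 is 0x90000 and 0x9020:0 is 0x90200.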
1.4 Booting: bootsector and setup
The bootsector used to boot the Linux kernel could be either:
Linux bootsector (arch/i386/boot/bootsect.S),
LILO (or other bootloader's) bootsector, or
no bootsector (loadlin etc)
We consider here the Linux bootsector in detail. The first few lines initialise the convenience macros to be used for segment values:
    29 SETUPSECS = 4                /* default nr of setup-sectors */
    30 BOOTSEG   = 0x07C0           /* original address of boot-sector */
    31 INITSEG   = DEF_INITSEG      /* we move boot here - out of the way */
    32 SETUPSEG  = DEF_SETUPSEG     /* setup starts here */
    33 SYSSEG    = DEF_SYSSEG       /* system loaded at 0x10000 (65536) */
    34 SYSSIZE   = DEF_SYSSIZE      /* system size: # of 16-byte clicks */
(the numbers on the left are the line numbers of the bootsect.S file)
The values of DEF_INITSEG, DEF_SETUPSEG, DEF_SYSSEG and DEF_SYSSIZE are taken from include/asm/boot.h:
    /* Don't touch these, unless you really know what you're doing. */
    #define DEF_INITSEG     0x9000
    #define DEF_SYSSEG      0x1000
    #define DEF_SETUPSEG    0x9020
    #define DEF_SYSSIZE     0x7F00
Now, let us consider the actual code of bootsect.S:
    54         movw    $BOOTSEG, %ax
    55         movw    %ax, %ds
    56         movw    $INITSEG, %ax
    57         movw    %ax, %es
    58         movw    $256, %cx
    59         subw    %si, %si
    60         subw    %di, %di
    61         cld
    62         rep
    63         movsw
    64         ljmp    $INITSEG, $go
    65 # bde - changed 0xff00 to 0x4000 to use debugger at 0x6400 up (bde). We
    66 # wouldn't have to worry about this if we checked the top of memory. Also
    67 # my BIOS can be configured to put the wini drive tables in high memory
    68 # instead of in the vector table. The old stack might have clobbered the
    69 # drive table.
    70 go:     movw    $0x4000-12, %di         # 0x4000 is an arbitrary value >=
    71                                         # length of bootsect + length of
    72                                         # setup + room for stack;
    73                                         # 12 is disk parm size.
    74         movw    %ax, %ds                # ax and es already contain INITSEG
    75         movw    %ax, %ss
    76         movw    %di, %sp                # put stack at INITSEG:0x4000-12.
Lines 54-63 move the bootsector code from address 0x7C00 to 0x90000. This is achieved by:
set %ds:%si to $BOOTSEG:0 (0x7C0:0 = 0x7C00)
set %es:%di to $INITSEG:0 (0x9000:0 = 0x90000)
set the number of 16bit words in %cx (256 words = 512 bytes = 1 sector)
clear DF (direction) flag in EFLAGS to auto-increment addresses (cld)
go ahead and copy 512 bytes (rep movsw)
This code intentionally does not use rep movsd (hint - .code16).
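The copy performed by lines 54-63 can be modelled in C. This is a hedged sketch rather than kernel code: the array mem stands in for the first megabyte of RAM, and the function and constant names are invented for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

enum { BOOTSEG = 0x07C0, INITSEG = 0x9000, SECTOR_WORDS = 256 };

/*
 * mem models physical memory and must be at least 0x90200 bytes long.
 * Copies one 512-byte sector from 0x7C00 to 0x90000, one 16-bit word
 * per iteration, the way 'rep movsw' does with DF cleared.
 */
void relocate_bootsector(uint8_t *mem)
{
    size_t si = (size_t)BOOTSEG << 4;   /* %ds:%si = 0x07C0:0 = 0x7C00  */
    size_t di = (size_t)INITSEG << 4;   /* %es:%di = 0x9000:0 = 0x90000 */

    for (unsigned cx = SECTOR_WORDS; cx != 0; cx--) {
        mem[di]     = mem[si];          /* low byte of the word  */
        mem[di + 1] = mem[si + 1];      /* high byte of the word */
        si += 2;                        /* DF = 0: both indices increment */
        di += 2;
    }
}
```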
Line 64 jumps to label go: in the newly made copy of the bootsector, i.e. in segment 0x9000. This and the following three instructions (lines 64-76) prepare the stack at $INITSEG:0x4000-0xC, i.e. %ss = $INITSEG (0x9000) and %sp = 0x3FF4 (0x4000-0xC).
This is where the limit on setup size that we mentioned earlier comes from (see Building the Linux Kernel Image).
Lines 77-103 patch the disk parameter table for the first disk to allow multi-sector reads:
    77 # Many BIOS's default disk parameter tables will not recognise
    78 # multi-sector reads beyond the maximum sector number specified
    79 # in the default diskette parameter tables - this may mean 7
    80 # sectors in some cases.
    81 #
    82 # Since single sector reads are slow and out of the question,
    83 # we must take care of this by creating new parameter tables
    84 # (for the first disk) in RAM. We will set the maximum sector
    85 # count to 36 - the most we will encounter on an ED 2.88.
    86 #
    87 # High doesn't hurt. Low does.
    88 #
    89 # Segments are as follows: ds = es = ss = cs - INITSEG, fs = 0,
    90 # and gs is unused.
    91         movw    %cx, %fs                # set fs to 0
    92         movw    $0x78, %bx              # fs:bx is parameter table address
    93         pushw   %ds
    94         ldsw    %fs:(%bx), %si          # ds:si is source
    95         movb    $6, %cl                 # copy 12 bytes
    96         pushw   %di                     # di = 0x4000-12.
    97         rep                             # don't need cld -> done on line 66
    98         movsw
    99         popw    %di
    100        popw    %ds
    101        movb    $36, 0x4(%di)           # patch sector count
    102        movw    %di, %fs:(%bx)
    103        movw    %es, %fs:2(%bx)
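In C terms, the patch above amounts to copying the 12-byte table and overwriting one byte. The sketch below only illustrates that idea; the function and constant names are hypothetical, and the final step of the assembly (pointing the vector at 0:0x78 to the new copy) is described in a comment rather than modelled.

```c
#include <stdint.h>
#include <string.h>

enum {
    DISK_PARM_SIZE      = 12,  /* table size: 6 words, as copied by rep movsw */
    SECTOR_COUNT_OFFSET = 4,   /* byte patched by 'movb $36, 0x4(%di)' */
    MAX_SECTORS         = 36   /* the most seen on an ED 2.88 floppy */
};

/*
 * Copy the BIOS disk parameter table into RAM and raise its sector count.
 * The real code then stores the address of the copy at 0:0x78 so the BIOS
 * uses the patched table; that step is omitted here.
 */
void patch_disk_params(const uint8_t *bios_table, uint8_t *ram_copy)
{
    memcpy(ram_copy, bios_table, DISK_PARM_SIZE);
    ram_copy[SECTOR_COUNT_OFFSET] = MAX_SECTORS;
}
```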
The floppy disk controller is reset using BIOS service int 0x13 function 0 (reset FDC) and setup sectors are loaded immediately after the bootsector, i.e. at physical address 0x90200 ($INITSEG:0x200), again using BIOS service int 0x13, function 2 (read sector(s)). This happens during lines 107-124:
    107 load_setup:
    108        xorb    %ah, %ah                # reset FDC
    109        xorb    %dl, %dl
    110        int     $0x13
    111        xorw    %dx, %dx                # drive 0, head 0
    112        movb    $0x02, %cl              # sector 2, track 0
    113        movw    $0x0200, %bx            # address = 512, in INITSEG
    114        movb    $0x02, %ah              # service 2, "read sector(s)"
    115        movb    setup_sects, %al        # (assume all on head 0, track 0)
    116        int     $0x13                   # read it
    117        jnc     ok_load_setup           # ok - continue
    118        pushw   %ax                     # dump error code
    119        call    print_nl
    120        movw    %sp, %bp
    121        call    print_hex
    122        popw    %ax
    123        jmp     load_setup
    124 ok_load_setup:
If loading failed for some reason (bad floppy or someone pulled the diskette out during the operation), we dump the error code and retry in an endless loop. Unless a retry succeeds, the only way out is to reboot the machine, and a retry usually does not succeed (if something is wrong it will only get worse).
If loading setup_sects sectors of setup code succeeded, we jump to label ok_load_setup:.
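The control flow of load_setup can be sketched as a retry loop. Everything named here is hypothetical: read_sectors_fn stands in for int 0x13 function 2, flaky_reader is a demo stub, and max_tries exists only so the sketch terminates (the real bootsector retries forever).

```c
#include <stdbool.h>

/* Stand-in for BIOS int 0x13, function 2; returns true on success. */
typedef bool (*read_sectors_fn)(unsigned count, void *ctx);

/*
 * Try to read setup_sects sectors until it works.
 * Returns the attempt number that succeeded, or -1 after max_tries failures.
 */
int load_setup(read_sectors_fn read_sectors, unsigned setup_sects,
               void *ctx, int max_tries)
{
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        if (read_sectors(setup_sects, ctx))
            return attempt;            /* corresponds to ok_load_setup */
        /* the bootsector prints the error code here and jumps back */
    }
    return -1;
}

/* Demo reader: fails until the counter that ctx points to reaches zero. */
bool flaky_reader(unsigned count, void *ctx)
{
    int *failures_left = ctx;
    (void)count;
    if (*failures_left > 0) {
        (*failures_left)--;
        return false;
    }
    return true;
}
```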
Then we proceed to load the compressed kernel image at physical address 0x10000. This is done to preserve the firmware data areas in low memory (0-64K). After the kernel is loaded, we jump to $SETUPSEG:0 (arch/i386/boot/setup.S). Once the data is no longer needed (e.g. no more calls to BIOS) it is overwritten by moving the entire (compressed) kernel image from 0x10000 to 0x1000 (physical addresses, of course). This is done by setup.S, which sets things up for protected mode and jumps to 0x1000, which is the head of the compressed kernel, i.e. arch/i386/boot/compressed/{head.S,misc.c}. This sets up the stack and calls decompress_kernel(), which uncompresses the kernel to address 0x100000 and jumps to it.
Note that old bootloaders (old versions of LILO) could only load the first 4 sectors of setup, which is why there is code in setup to load the rest of itself if needed. Also, the code in setup has to take care of various combinations of loader type/version vs zImage/bzImage and is therefore highly complex.
Let us examine the kludge in the bootsector code that allows loading a big kernel, also known as "bzImage". The setup sectors are loaded as usual at 0x90200, but the kernel is loaded one 64K chunk at a time using a special helper routine that calls BIOS to move data from low to high memory. This helper routine is referred to by bootsect_kludge in bootsect.S and is defined as bootsect_helper in setup.S. The bootsect_kludge label in setup.S contains the value of the setup segment and the offset of the bootsect_helper code in it, so that the bootsector can use the lcall instruction to jump to it (inter-segment jump). The reason why it is in setup.S is simply that there is no more space left in bootsect.S (which is strictly not true - there are approximately 4 spare bytes and at least 1 spare byte in bootsect.S, but that is not enough, obviously). This routine uses BIOS service int 0x15 (ax=0x8700) to move data to high memory and resets %es to always point to 0x10000. This ensures that the code in bootsect.S doesn't run out of low memory when copying data from disk.
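The chunked loading scheme can be illustrated with a plain C model. It is a sketch under stated assumptions, not the kernel's code: the buffer low plays the 64K area at 0x10000, high plays memory from 0x100000 upwards, and the second memcpy stands in for the int 0x15 block move done by bootsect_helper. All names are invented.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 0x10000u   /* 64K, the size of the low bounce area */

/*
 * Load image_len bytes into 'high' through the fixed 64K buffer 'low'.
 * Returns the number of chunks moved.
 */
size_t load_big_kernel(const uint8_t *image, size_t image_len,
                       uint8_t *low, uint8_t *high)
{
    size_t copied = 0;
    size_t chunks = 0;

    while (copied < image_len) {
        size_t n = image_len - copied;
        if (n > CHUNK_SIZE)
            n = CHUNK_SIZE;
        memcpy(low, image + copied, n);   /* disk read lands in low memory */
        memcpy(high + copied, low, n);    /* helper moves it to high memory */
        copied += n;                      /* the low buffer is reused next time */
        chunks++;
    }
    return chunks;
}
```

Because the low buffer is reused for every chunk, the loader never needs more than 64K of low memory regardless of the kernel size.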
1.5 Using LILO as a bootloader
There are several advantages in using a specialised bootloader (LILO) over a bare bones Linux bootsector:
Ability to choose between multiple Linux kernels or even multiple OSes.
Ability to pass kernel command line parameters (there is a patch called BCP that adds this ability to bare-bones bootsector+setup).
Ability to load much larger bzImage kernels - up to 2.5M vs 1M.
Old versions of LILO (v17 and earlier) could not load bzImage kernels. The newer versions (as of a couple of years ago or earlier) use the same technique as bootsect+setup of moving data from low into high memory by means of BIOS services. Some people (Peter Anvin notably) argue that zImage support should be removed. The main reason (according to Alan Cox) it stays is that there are apparently some broken BIOSes that make it impossible to boot bzImage kernels while loading zImage ones fine.
The last thing LILO does is to jump to setup.S and things proceed as normal.
1.6 High level initialisation
By "high-level initialisation" we consider anything which is not directly related to bootstrap, even though parts of the code to perform this are written in asm, namely arch/i386/kernel/head.S, which is the head of the uncompressed kernel. The following steps are performed:
Initialise segment values (%ds = %es = %fs = %gs = __KERNEL_DS = 0x18).
Initialise page tables.
Enable paging by setting PG bit in %cr0.
Zero-clean BSS (on SMP, only first CPU does this).
Copy the first 2k of bootup parameters (kernel commandline).
Check CPU type using EFLAGS and, if possible, cpuid, able to detect 386 and higher.
The first CPU calls start_kernel(), all others call arch/i386/kernel/smpboot.c:initialize_secondary() if ready=1, which just reloads esp/eip and doesn't return.
The init/main.c:start_kernel() is written in C and does the following:
Take a global kernel lock (it is needed so that only one CPU goes through initialisation).
Perform arch-specific setup (memory layout analysis, copying boot command line again, etc.).
Print Linux kernel "banner" containing the version, compiler used to build it etc. to the kernel ring buffer for messages. This is taken from the variable linux_banner defined in init/version.c and is the same string as displayed by cat /proc/version.
Initialise traps.
Initialise irqs.
Initialise data required for scheduler.
Initialise time keeping data.
Initialise softirq subsystem.
Parse boot commandline options.
Initialise console.
If module support was compiled into the kernel, initialise the dynamic module loading facility.
If "profile=" command line was supplied, initialise profiling buffers.
kmem_cache_init(), initialise most of slab allocator.
Enable interrupts.
Calculate BogoMips value for this CPU.
Call mem_init(), which calculates max_mapnr, totalram_pages and high_memory and prints out the "Memory: ..." line.
kmem_cache_sizes_init(), finish slab allocator initialisation.
Initialise data structures used by procfs.
fork_init(), create uid_cache, initialise max_threads based on the amount of memory available and configure RLIMIT_NPROC for init_task to be max_threads/2.
Create various slab caches needed for VFS, VM, buffer cache, etc.
If System V IPC support is compiled in, initialise the IPC subsystem. Note that for System V shm, this includes mounting an internal (in-kernel) instance of shmfs filesystem.
If quota support is compiled into the kernel, create and initialise a special slab cache for it.
Perform arch-specific "check for bugs" and, whenever possible, activate workaround for processor/bus/etc bugs. Comparing various architectures reveals that "ia64 has no bugs" and "ia32 has quite a few bugs"; a good example is the "f00f bug", which is only checked if the kernel is compiled for less than 686 and worked around accordingly.
Set a flag to indicate that a schedule should be invoked at "next opportunity" and create a kernel thread init() which execs execute_command if supplied via "init=" boot parameter, or tries to exec /sbin/init, /etc/init, /bin/init, /bin/sh in this order; if all these fail, panic with "suggestion" to use "init=" parameter.
Go into the idle loop; this is an idle thread with pid=0.
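The fallback order used by the init() thread can be sketched as follows. This is an illustration only: exec_fn, run_init and the two demo stubs are hypothetical names, and falling through to the default paths when the "init=" program cannot be run is an assumption of the sketch.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for exec: returns 0 if the program could be started. */
typedef int (*exec_fn)(const char *path);

static const char *const init_fallbacks[] = {
    "/sbin/init", "/etc/init", "/bin/init", "/bin/sh"
};

/*
 * Returns the path that was started, or NULL where the kernel would
 * panic and suggest passing "init=".
 */
const char *run_init(const char *execute_command, exec_fn try_exec)
{
    if (execute_command != NULL && try_exec(execute_command) == 0)
        return execute_command;

    for (size_t i = 0; i < sizeof init_fallbacks / sizeof init_fallbacks[0]; i++)
        if (try_exec(init_fallbacks[i]) == 0)
            return init_fallbacks[i];

    return NULL;
}

/* Demo stubs: a system where only /bin/sh exists, and an empty one. */
int only_shell_exists(const char *path) { return strcmp(path, "/bin/sh") == 0 ? 0 : -1; }
int nothing_exists(const char *path)    { (void)path; return -1; }
```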
An important thing to note here is that the init() kernel thread calls do_basic_setup(), which in turn calls do_initcalls(), which goes through the list of functions registered by means of the __initcall or module_init() macros and invokes them. These functions either do not depend on each other or their dependencies have been manually fixed by the link order in the Makefiles. This means that, depending on the position of directories in the trees and the structure of the Makefiles, the order in which initialisation functions are invoked can change. Sometimes this is important, because you can imagine two subsystems A and B with B depending on some initialisation done by A. If A is compiled statically and B is a module, then B's entry point is guaranteed to be invoked after A has prepared all the necessary environment. If A is a module, then B is also necessarily a module, so there are no problems. But what if both A and B are statically linked into the kernel? The order in which they are invoked depends on the relative entry point offsets in the .initcall.init ELF section of the kernel image. Rogier Wolff proposed to introduce a hierarchical "priority" infrastructure whereby modules could let the linker know in what (relative) order they should be linked, but so far there are no patches available that implement this in a sufficiently elegant manner to be acceptable into the kernel. Therefore, make sure your link order is correct. If, in the example above, A and B work fine when compiled statically once, they will always work, provided they are listed sequentially in the same Makefile. If they don't work, change the order in which their object files are listed.
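The effect of link order on two dependent initcalls can be shown with an ordinary array of function pointers. This is a user-space sketch, not the kernel's do_initcalls(): the arrays stand in for the .initcall.init section, and init_a, init_b and the flags are invented for the example.

```c
#include <stddef.h>

typedef int (*initcall_t)(void);

/* Invoke every entry between start and end in table order. */
void run_initcalls(const initcall_t *start, const initcall_t *end)
{
    for (const initcall_t *call = start; call < end; call++)
        (*call)();
}

/* Demo subsystems: B only works if A has already run. */
int a_ready;
int b_ok;
int init_a(void) { a_ready = 1; return 0; }
int init_b(void) { b_ok = a_ready; return 0; }

/* "Link order" A then B works; B then A runs B too early. */
const initcall_t good_order[] = { init_a, init_b };
const initcall_t bad_order[]  = { init_b, init_a };
```

Running good_order leaves b_ok set, while running bad_order leaves it clear, which is the static A/B problem described above.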
Another thing worth noting is Linux's ability to execute an "alternative init program" by means of passing "init=" boot commandline. This is useful for recovering from accidentally overwritten /sbin/init or debugging the initialisation (rc) scripts and /etc/inittab by hand, executing them one at a time.
1.7 SMP Bootup on x86
On SMP, the BP (boot processor) goes through the normal sequence of bootsector, setup etc until it reaches start_kernel(), and then on to smp_init() and especially arch/i386/kernel/smpboot.c:smp_boot_cpus(). smp_boot_cpus() goes in a loop for each apicid (until NR_CPUS) and calls do_boot_cpu() on it. What do_boot_cpu() does is create (i.e. fork_by_hand) an idle task for the target cpu and write in well-known locations defined by the Intel MP spec (0x467/0x469) the EIP of trampoline code found in trampoline.S. Then it generates STARTUP IPI to the target cpu, which makes this AP (application processor) execute the code in trampoline.S.
The boot CPU creates a copy of trampoline code for each CPU in low memory. The AP code writes a magic number in its own code which is verified by the BP to make sure that AP is executing the trampoline code. The requirement that trampoline code must be in low memory is enforced by the Intel MP specification.
The trampoline code simply sets %bx register to 1, enters protected mode and jumps to startup_32, which is the main entry to arch/i386/kernel/head.S.
Now, the AP starts executing head.S and, discovering that it is not a BP, it skips the code that clears BSS and then enters initialize_secondary(), which just enters the idle task for this CPU - recall that init_tasks[cpu] was already initialised by BP executing do_boot_cpu(cpu).
Note that init_task can be shared but each idle thread must have its own TSS. This is why init_tss[NR_CPUS] is an array.
1.8 Freeing initialisation data and code
When the operating system initialises itself, most of the code and data structures are never needed again. Most operating systems (BSD, FreeBSD etc.) cannot dispose of this unneeded information, thus wasting precious physical kernel memory. The excuse they use (see McKusick's 4.4BSD book) is that "the relevant code is spread around various subsystems and so it is not feasible to free it". Linux, of course, cannot use such excuses because under Linux "if something is possible in principle, then it is already implemented or somebody is working on it".
So, as I said earlier, Linux kernel can only be compiled as an ELF binary, and now we find out the reason (or one of the reasons) for that. The reason related to throwing away initialisation code/data is that Linux provides two macros to be used:
__init - for initialisation code
__initdata - for data
These evaluate to gcc attribute specificators (also known as "gcc magic") as defined in include/linux/init.h:
    #ifndef MODULE
    #define __init        __attribute__ ((__section__ (".text.init")))
    #define __initdata    __attribute__ ((__section__ (".data.init")))
    #else
    #define __init
    #define __initdata
    #endif
What this means is that if the code is compiled statically into the kernel (i.e. MODULE is not defined) then it is placed in the special ELF section .text.init, which is declared in the linker map in arch/i386/vmlinux.lds. Otherwise (i.e. if it is a module) the macros evaluate to nothing.
What happens during boot is that the "init" kernel thread (function init/main.c:init()) calls the arch-specific function free_initmem(), which frees all the pages between addresses __init_begin and __init_end.
On a typical system (my workstation), this results in freeing about 260K of memory.
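To relate the 260K figure to pages, here is a rough sketch of the walk free_initmem() performs over the init region. It only counts pages instead of freeing them; the function name is invented, the addresses are assumed to be page-aligned, and 4096 bytes is the i386 page size.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u   /* i386 page size */

/*
 * Step through [init_begin, init_end) one page at a time and count the
 * pages that would be handed back to the allocator.
 * Assumes both addresses are page-aligned.
 */
unsigned long count_init_pages(uintptr_t init_begin, uintptr_t init_end)
{
    unsigned long freed = 0;

    for (uintptr_t addr = init_begin; addr < init_end; addr += PAGE_SIZE)
        freed++;   /* the kernel would release this page here */
    return freed;
}
```

A 260K init region therefore corresponds to 65 pages of 4K each.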
The functions registered via module_init() are placed in .initcall.init, which is also freed in the static case. The current trend in Linux, when designing a subsystem (not necessarily a module), is to provide init/exit entry points from the early stages of design so that in the future, the subsystem in question can be modularised if needed. An example of this is pipefs, see fs/pipe.c. Even if a given subsystem will never become a module, e.g. bdflush (see fs/buffer.c), it is still nice and tidy to use the module_init() macro against its initialisation function, provided it does not matter when exactly the function is called.
There are two more macros which work in a similar manner, called __exit and __exitdata, but they are more directly connected to the module support and therefore will be explained in a later section.
1.9 Processing kernel command line
Let us recall what happens to the commandline passed to the kernel during boot:
LILO (or BCP) accepts the commandline using BIOS keyboard services and stores it at a well-known location in physical memory, as well as a signature saying that there is a valid commandline there.
arch/i386/kernel/head.S copies the first 2k of it out to the zeropage.
arch/i386/kernel/setup.c:parse_mem_cmdline() (called by setup_arch(), itself called by start_kernel()) copies 256 bytes from zeropage into saved_command_line, which is displayed by /proc/cmdline. This same routine processes the "mem=" option if present and makes appropriate adjustments to VM parameters.
We return to commandline in parse_options() (called by start_kernel()), which processes some "in-kernel" parameters (currently "init=" and environment/arguments for init) and passes each word to checksetup().
checksetup() goes through the code in ELF section .setup.init and invokes each function, passing it the word if it matches. Note that using the return value of 0 from the function registered via __setup(), it is possible to pass the same "variable=value" to more than one function with "value" invalid to one and valid to another. Jeff Garzik commented: "hackers who do that get spanked :)" Why? Because this is clearly ld-order specific, i.e. kernel linked in one order will have functionA invoked before functionB and another will have it in reversed order, with the result depending on the order.
So, how do we write code that processes boot commandline? We use the __setup() macro defined in include/linux/init.h:
    /*
     * Used for kernel command line parameter setup
     */
    struct kernel_param {
            const char *str;
            int (*setup_func)(char *);
    };

    extern struct kernel_param __setup_start, __setup_end;

    #ifndef MODULE
    #define __setup(str, fn) \
            static char __setup_str_##fn[] __initdata = str; \
            static struct kernel_param __setup_##fn __initsetup = \
            { __setup_str_##fn, fn }
    #else
    #define __setup(str,func) /* nothing */
    #endif
So, you would typically use it in your code like this (taken from the code of a real driver, BusLogic HBA drivers/scsi/BusLogic.c):
    static int __init
    BusLogic_Setup(char *str)
    {
            int ints[3];

            (void)get_options(str, ARRAY_SIZE(ints), ints);

            if (ints[0] != 0) {
                    BusLogic_Error("BusLogic: Obsolete Command Line Entry "
                                   "Format Ignored\n", NULL);
                    return 0;
            }
            if (str == NULL || *str == '\0')
                    return 0;
            return BusLogic_ParseDriverOptions(str);
    }

    __setup("BusLogic=", BusLogic_Setup);
Note that __setup() does nothing for modules, so the code that wishes to process boot commandline and can be either a module or statically linked must invoke its parsing function manually in the module initialisation routine. This also means that it is possible to write code that processes parameters when compiled as a module but not when it is static or vice versa.
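The dispatch done by checksetup() can be sketched in user space with an ordinary array in place of the .setup.init section. The struct mirrors kernel_param from the text; checksetup_sketch, the demo handlers and demo_table are hypothetical names, and the prefix-match-then-call logic is a simplified model rather than the kernel's exact code.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

struct kernel_param {
    const char *str;
    int (*setup_func)(char *);
};

/*
 * Offer 'word' to every entry whose prefix matches. A handler returning 0
 * declines the word, so the next matching entry gets a chance.
 * Returns 1 if some handler consumed the word, 0 otherwise.
 */
int checksetup_sketch(char *word, const struct kernel_param *start,
                      const struct kernel_param *end)
{
    for (const struct kernel_param *p = start; p < end; p++) {
        size_t n = strlen(p->str);
        if (strncmp(word, p->str, n) == 0 && p->setup_func(word + n))
            return 1;
    }
    return 0;
}

/* Demo handlers: the first declines every value, the second accepts it. */
int rejected;
int demo_value;
int reject_demo(char *s) { (void)s; rejected++; return 0; }
int accept_demo(char *s) { demo_value = atoi(s); return 1; }

/* Table order plays the role of ld order in the kernel image. */
const struct kernel_param demo_table[] = {
    { "demo=", reject_demo },
    { "demo=", accept_demo },
};
```

Swapping the two table entries changes which handler sees "demo=..." first, which is exactly the ld-order dependence warned about above.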
Reposted from: http://tldp.org/LDP/lki/lki-1.html