Linux Power Management for x86 CPU (1)---- C-State

2012-10-16 19:23
http://blog.sina.com.cn/s/blog_7014a5340100mv7m.html

Modern CPUs are more and more powerful. When a CPU has no job to do, it
enters the idle state. During an idle period we can certainly cut off its
power and put it into a low-power state, provided that we know when a new
job arrives so that we can re-activate the CPU and have it work again. The
process looks like this:

                no job                  cut off power
CPU in active ----------> CPU in idle --------------> low-power state
      ^                                                     |
      |                     re-power up                     |
      +-----------------------------------------------------+

To achieve the above goal, we need to answer the following questions:

1) How to know CPU is idle so that we can cut off power;

2) How to cut off power;

3) When and how to re-power up CPU;

1. When CPU is idle

-----------------

The answer to the first question is in fact very simple: when it is idle,
the CPU runs the swapper process (process ID 0; it should probably be
called the idle thread, but swapper is the legacy name and all textbooks
call it that). So the CPU must be idle whenever it runs the swapper.

Traditionally, the swapper process does nothing. In an endless loop it
checks whether there is another task to run. If not, it delays for a while
and checks again; otherwise it tells the process scheduler to schedule the
other task. The code looks like this:

    while (1) {
        while (no_job_to_do) {
            delay for a while;    <------- halt instruction, in fact
        }
        schedule_other_process;
    }

So, to cut CPU power, we change the above code to:

    while (1) {
        while (no_job_to_do) {
            cut_off_cpu_power;    <------- done in pm_idle() for Linux
            ...
        }
        schedule_other_process;
    }

2. How to Cut Off Power

-----------------------

Note that a CPU consists of many units: besides the core logic, it has
caches, a BIU (Bus Interface Unit) and a Local APIC. When a CPU is idle, we
can cut the clock signal and power of some of these units. The more units
are stopped, the more power is saved.

Cutting CPU power has a side effect we need to consider: each unit takes
some time to power up again. So the more units are stopped, the longer it
takes for the CPU to be re-activated (woken up). We call this time the
entry/exit latency.

2.1 C-State

-------------

To strike a balance between power saving and entry/exit latency, Intel
CPUs provide several low-power states called C-states, or sleeping states.
Depending on the CPU model, Intel CPUs support the C-states C1, C2, C3, C4,
C5, C6, ... (C0 is the active state). While in a sleeping state (C1 or
above), the CPU does not execute any instructions and consumes less power.

C0 - The CPU is fully powered and executes instructions;

C1 - Stops the main internal core clocks;

C2 - Has two sub-modes: Stop-Grant and Stop-Clock;

     While in C1/C2, the CPU still services bus snoops and snoops from
     other cores. That is, the CPU automatically exits C1/C2, handles the
     snoop and then returns to C1/C2.

C3 - Flushes the cache, so the CPU does not have to exit C3 to handle
     snoops.

C4 - For multi-core processors. For example, on a Core 2 Duo, if both
     cores are in C4, the package enters a deeper sleep state.

C5 - I don't know :)

C6 - On Intel Core i7, the package enters an even deeper sleep state if
     all cores are in C6, with some additional power saving from the QPI
     link.

Cn - ... Sigh~

Besides Cx, some Intel CPUs have enhanced CxE states. For example, the
Intel Core 2 Duo introduced the enhanced C-states C1E, C2E, C3E and C4E.
Compared with the plain Cx states, the enhanced states have one additional
feature: they reduce the CPU voltage before entering the Cx state (in fact,
the voltage reduction is implemented on top of EIST/T-States).

2.2 HLT, P_LVLx and MWait

---------------------------

Then how do we enter a particular C-state? Intel provides three methods.

2.2.1 HLT instruction

----------------------

As we know, Intel x86 has a HLT (halt) instruction. Starting with the
486DX4, this instruction causes the CPU to enter the C1 or C1E state: if
the BIOS has enabled the C1E feature, the CPU enters C1E, otherwise it
enters C1. The BIOS enables C1E via an MSR. For example, on the Intel Xeon
7000 the BIOS can set bit 25 of IA32_MISC_ENABLE_MSR (MSR 1A0).

Note that HLT can be used for C1 entry only. That is, you cannot make the
CPU enter C2 or above with HLT.

2.2.2 P_LVLx I/O registers

----------------------------

Intel also defines the P_LVLx I/O registers (x is 2 ~ 5). An I/O read of a
P_LVLx register causes the CPU to enter a C-state. Generally P_LVL2 is for
C2, but which C-state a register maps to depends on the CPU model: P_LVL3
enters C6 on a Core i7, while P_LVL3 enters C3 on a Core 2 Duo.


2.2.3 Monitor/MWait instruction

--------------------------------

Besides the HLT instruction and the P_LVLx registers, Intel provides
another way to make the CPU enter a C-state: MWait. This instruction is
used together with Monitor. Normally, we use the monitor instruction to
watch a range of memory, and then use mwait with some hints to make the CPU
enter a Cx state.

Without this instruction pair, when a CPU is in a sleeping state and other
CPUs want to wake it up, the only way is to send an IPI. However, an IPI is
an expensive operation that takes a long time (compared to Monitor/MWait).
With the Monitor/MWait pair, other CPUs can wake up the sleeping CPU by
modifying the memory watched (monitored) by the sleeping CPU.

    /* The code currently in use: */

    stop_critical_timings();
    if (!need_resched()) {
        __monitor((void *)&current_thread_info()->flags, 0, 0);
        smp_mb();
        if (!need_resched())
            __mwait(eax, ecx);
    }
    start_critical_timings();
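
The eax value passed to mwait is the hint that selects the target C-state.
As a minimal sketch, assuming the hint layout described in Intel's manuals
(bits 7:4 hold the target C-state minus one, bits 3:0 the sub-state), the
hint could be built as follows. The helper name mwait_hint() is
hypothetical and is not a kernel function.

```c
#include <stdint.h>

/*
 * Hypothetical helper: build the EAX hint for MWAIT.
 * Assumed layout: bits 7:4 = target C-state minus 1, bits 3:0 = sub-state.
 * With this encoding C1 maps to 0x00, C2 to 0x10, C3 to 0x20, and a
 * request for C0 wraps around to 0xF in bits 7:4.
 */
static inline uint32_t mwait_hint(unsigned int cstate, unsigned int substate)
{
    return (((cstate - 1u) & 0xFu) << 4) | (substate & 0xFu);
}
```

With such a helper, the call above would read __mwait(mwait_hint(1, 0), ecx)
for a plain C1 request.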

3. Re-activate CPU

-----------------------------

When a CPU runs the swapper process, there may be processes in various
wait queues of this CPU. Once the conditions they wait for change, those
processes become runnable again. Because they have already been assigned
to this CPU, the CPU must, before sleeping, be prepared to run those
waiting processes in the near future.

Then, what conditions can a process wait for? Time and/or interrupts. A
process can wait on a timer, on an interrupt, or on some event that will be
triggered during interrupt handling.

An Intel CPU returns to C0 from a sleeping state as soon as it receives an
interrupt, and timers are implemented via a hardware timer interrupt. So
the processes in the wait queues will be executed once they become runnable
(we skip the tickless kernel and the LAPIC timer stopping in C3 for the
time being).

Besides, other CPUs can assign jobs to an idle CPU and wake it up via an
interrupt or via the method provided by monitor/mwait.

4. ACPI & C-State

-------------------

ACPI (Advanced Configuration and Power Interface) defines two methods
(control interfaces) to control CPU C-states, and the ACPI specification
defines 3 C-states. Note that ACPI C-states are not the same as Intel CPU
C-states. For example, we can map Intel C1/C1E to ACPI C1, Intel C2/C2E to
ACPI C2, and Intel C3, C4, C5 and C6 to ACPI C3.

4.1. P_LVLx registers in P_BLK

-------------------------------

In the DSDT table, each processor can optionally have a P_BLK register
block. For example,

    Processor (
        \_PR.CPU0,    // Namespace name
        1,
        0x120,        // P_BLK system I/O address
        6             // size of P_BLK
    ) {...}

    P_LVL2: P_BLK + 4, 1 byte, system I/O space;
    P_LVL3: P_BLK + 5, 1 byte, system I/O space;

Reading P_LVL2 causes the CPU to enter the C2 state; reading P_LVL3 causes
the CPU to enter the C3 state.

In the FADT table, there are two fields that give the C2 and C3 entry/exit
latency respectively:

    FADT.P_LVL2_LAT  The worst-case hardware latency to enter/exit a C2
                     state. A value > 100 indicates the system does not
                     support a C2 state.

    FADT.P_LVL3_LAT  The worst-case hardware latency to enter/exit a C3
                     state. A value > 1000 indicates the system does not
                     support a C3 state.
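
As a rough illustration of how these values fit together, the sketch below
derives the P_LVL2/P_LVL3 port addresses from P_BLK and applies the two
latency limits. The struct and function names are hypothetical and chosen
for this example only; this is not the kernel's own code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Offsets of the P_LVLx registers inside P_BLK, as listed above. */
#define P_LVL2_OFFSET 4u
#define P_LVL3_OFFSET 5u

/* Latency limits from the FADT field descriptions above. */
#define C2_MAX_LATENCY 100u
#define C3_MAX_LATENCY 1000u

/* Hypothetical container for the derived values. */
struct fadt_cstates {
    uint32_t c2_port;   /* I/O port to read for C2 (meaningful if has_c2) */
    uint32_t c3_port;   /* I/O port to read for C3 (meaningful if has_c3) */
    bool has_c2;
    bool has_c3;
};

/* Derive the C2/C3 ports and availability from P_BLK and the FADT fields. */
static struct fadt_cstates fadt_cstates_init(uint32_t p_blk,
                                             uint32_t p_lvl2_lat,
                                             uint32_t p_lvl3_lat)
{
    struct fadt_cstates cs;

    cs.c2_port = p_blk + P_LVL2_OFFSET;
    cs.c3_port = p_blk + P_LVL3_OFFSET;
    cs.has_c2 = p_lvl2_lat <= C2_MAX_LATENCY;
    cs.has_c3 = p_lvl3_lat <= C3_MAX_LATENCY;
    return cs;
}
```

For the Processor object shown above (P_BLK at 0x120), this yields port
0x124 for C2 and port 0x125 for C3.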

Based on the entry/exit latency, the OS can select which C-state to enter
when the CPU is idle. The OS should select as deep a sleeping state as
possible, so as to save more power. In fact, the hardware entry/exit
latency is only used as a reference point, and the OS adjusts the
entry/exit latency of each C-state at runtime.

When the CPU is idle, the OS looks at the nearest pending timer, compares
the interval until it expires with the C-state latencies, and selects one
of the C-states to enter.
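
The comparison described above can be sketched as follows. This is a
deliberately simplified model, not the logic of any real cpuidle governor:
it assumes the states are ordered from shallowest to deepest with
increasing latency, and the names cstate_info and pick_cstate() are
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-state description used only in this sketch. */
struct cstate_info {
    unsigned int type;      /* 1 = C1, 2 = C2, 3 = C3 */
    uint32_t latency_us;    /* entry/exit latency */
};

/*
 * Return the index of the deepest state whose latency is shorter than the
 * time until the next timer. states[] must be ordered by increasing
 * latency; index 0 (the shallowest state) is the fallback.
 */
static size_t pick_cstate(const struct cstate_info *states, size_t count,
                          uint32_t next_timer_us)
{
    size_t best = 0;

    for (size_t i = 1; i < count; i++) {
        if (states[i].latency_us >= next_timer_us)
            break;          /* this state and all deeper ones are too slow */
        best = i;
    }
    return best;
}
```

A real governor also takes past idle residency and other factors into
account; the point here is only the latency-versus-timer comparison.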

4.2. _CST & _CSD ACPI objects

-----------------------------

4.2.1 _PDC

----------

_PDC: the OS uses it to inform the platform of the level of CPU power
management support provided by the OS.

Note that the OS must use the _PDC/_OSC method to tell the platform which
level of power management it can handle. Based on this information, the
ACPI firmware can return different values (packages) for _CST and _CSD.

4.2.2 _CST

----------

_CST is how the platform declares its supported C-states: the ACPI ASL code
uses it to report to OSPM the C-states supported by the platform's CPUs.
ACPI can define a _CST object for a processor like this:

    Name (_CST, Package() {Count, CState, ..., CState})

where Count is the number of supported C-states and each CState is

    CState: Package (Register, Type, Latency, Power)

Register describes how OSPM enters the C-state, Type is the C-state type
(1 = C1, 2 = C2, 3 = C3), Latency is the worst-case latency of entering the
C-state, and Power is the power consumption while in the C-state, in
milliwatts. In the sample below, C1 is entered via FFixedHW, the other
three states are entered through P_LVL-style I/O reads, and the third and
fourth states are both mapped to ACPI C3.

For example,

    Processor (\_PR.CPU0, 1, 0x120, 6) {
        ...
        Name (_CST, Package() {
            4,    // the number of supported C-States
            Package(){ResourceTemplate(){Register(FFixedHW, 0, 0, 0)},     1,  20, 1000},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x161)}, 2,  40,  750},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x162)}, 3,  60,  500},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x163)}, 3, 100,  250}
        })
        ...
    }

In this example, CPU0 has 4 C-states: C1, C2 and two C3 states with
different latency and average power consumption.

    C1: FFixedHW, which means using the "halt" or "mwait" instruction to
        enter C1;

    C2: SystemIO, 8-bit size, so a byte read from I/O address 0x161 enters
        C2;

If a Cx state uses FFixedHW, we check whether the CPU supports the mwait
instruction. After executing cpuid with eax = 0x05, the value returned in
the edx register tells us which C-states are supported by the mwait
instruction (including the number of sub-states of each C-state).
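
To make the edx layout concrete, here is a small sketch that extracts the
sub-state count of one C-state from that value. It assumes the layout
documented for CPUID leaf 5, where each C-state gets a 4-bit field (bits
3:0 for C0, bits 7:4 for C1, and so on). The function name
mwait_substates() is hypothetical, and executing cpuid itself is left out.

```c
#include <stdint.h>

#define MWAIT_SUBSTATE_BITS 4u
#define MWAIT_SUBSTATE_MASK 0xFu

/*
 * Number of MWAIT sub-states supported for C-state `cstate`
 * (0 = C0, 1 = C1, ...), given the EDX value of CPUID leaf 5.
 * A result of 0 means the C-state is not supported through mwait.
 */
static inline unsigned int mwait_substates(uint32_t edx, unsigned int cstate)
{
    if (cstate > 7)
        return 0;   /* only eight 4-bit fields fit into EDX */
    return (edx >> (cstate * MWAIT_SUBSTATE_BITS)) & MWAIT_SUBSTATE_MASK;
}
```

For an example edx value of 0x00001120, this reports two C1 sub-states and
one sub-state each for C2 and C3.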

4.2.3 _CSD

------------

_CSD (C-State Dependency): the platform uses it to provide the OS (OSPM)
with C-state control dependency information across logical processors. For
example, on a dual-core platform each core may be able to enter C1
independently, but if one core switches to C2 the other core must switch
to C2 as well; this information then has to be provided in _CSD.

    CSDPackage: Package (CStateDep, ..., CStateDep)

where

    CStateDep: Package (NumberOfEntries, Revision, Domain, CoordType,
                        NumProcessors, Index)

For example,

    Processor (\_SB.CPU0, 1, 0x120, 6) {
        Name (_CST, Package() {
            3,
            Package(){ResourceTemplate(){Register(FFixedHW, 0, 0, 0)},     1, 20, 1000},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x161)}, 2, 40,  750},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x162)}, 3, 60,  500}
        })
        Name (_CSD, Package() {
            // 6 entries, Revision 0, Domain 0, OSPM Coordinate
            // Initiate on Any Proc, 2 Procs, Index 1 (C2-type)
            Package(){6, 0, 0, 0xFD, 2, 1},
            // 6 entries, Revision 0, Domain 0, OSPM Coordinate
            // Initiate on Any Proc, 2 Procs, Index 2 (C3-type)
            Package(){6, 0, 0, 0xFD, 2, 2}
        })
    }

    Processor (\_SB.CPU1, 2, 0x130, 6) {
        Name (_CST, Package() {
            3,
            Package(){ResourceTemplate(){Register(FFixedHW, 0, 0, 0)},     1, 20, 1000},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x161)}, 2, 40,  750},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x162)}, 3, 60,  500}
        })
        Name (_CSD, Package() {
            // 6 entries (fields in this package), Revision 0,
            // Domain 0, OSPM Coordinate
            // Initiate on any Proc, 2 Procs, Index 1 (C2-type)
            Package(){6, 0, 0, 0xFD, 2, 1},
            // 6 entries, Revision 0, Domain 0, OSPM Coordinate
            // Initiate on any Proc, 2 Procs, Index 2 (C3-type)
            Package(){6, 0, 0, 0xFD, 2, 2}
        })
    }

I am copying the following words from the ACPI spec:

    OSPM can coordinate the transitions between logical processors,
    choosing to initiate the transition when doing so does not lead to
    incorrect or non-optimal system behavior. This OSPM coordination is
    referred to as Software Coordination. Alternately, it might be
    possible for the underlying hardware to coordinate the state
    transition requests on multiple logical processors, causing the
    processors to transition to the target state when the transition is
    guaranteed to not lead to incorrect or non-optimal system behavior.
    This scenario is referred to as Hardware (HW) coordination.
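
The CoordType field of a _CSD entry tells the OS which of these two schemes
applies. The sketch below classifies the value, assuming the encoding given
in the ACPI specification (0xFC = SW_ALL, 0xFD = SW_ANY, 0xFE = HW_ALL).
The enum and function names are hypothetical and used only for this
illustration.

```c
#include <stdint.h>

/* CoordType values, assumed from the ACPI specification. */
enum csd_coord_type {
    CSD_COORD_SW_ALL = 0xFC,    /* OSPM coordinates, initiate on all procs */
    CSD_COORD_SW_ANY = 0xFD,    /* OSPM coordinates, initiate on any proc */
    CSD_COORD_HW_ALL = 0xFE     /* hardware coordinates the transition */
};

/* Return 1 if the OS (OSPM) has to coordinate the transition itself. */
static int csd_is_software_coordinated(uint8_t coord_type)
{
    return coord_type == CSD_COORD_SW_ALL || coord_type == CSD_COORD_SW_ANY;
}

/* Return a short name for a CoordType value. */
static const char *csd_coord_name(uint8_t coord_type)
{
    switch (coord_type) {
    case CSD_COORD_SW_ALL: return "SW_ALL";
    case CSD_COORD_SW_ANY: return "SW_ANY";
    case CSD_COORD_HW_ALL: return "HW_ALL";
    default:               return "unknown";
    }
}
```

The 0xFD used in the example above therefore means software coordination
with the transition initiated on any processor of the domain.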

5. Linux C-State Related Code

--------------------------

Linux has a global function pointer pm_idle. If nobody changes it, it is
set to default_idle(). The routine default_idle() just executes the HLT
instruction to put the CPU into the halt state. If the CPU supports
C-states, this causes the CPU to enter C1, or C1E if the BIOS enabled the
C1E feature.

In fact, many modules try to make pm_idle point to their own routine. For
example,

    APM                          apm_cpu_idle()        //legacy APM power management
    cpuidle                      cpuidle_idle_call()
    AMD-CPU                      c1e_idle()            //AMD C1E acts like Intel C3
    CPU supporting MWait         mwait_idle()          //C1 only
    idle=poll by kernel-param    poll_idle()           //noop, no power reducing
    idle=halt by kernel-param    default_idle()
    ...

The priority of the swapper process is very low: it executes only when
there is no other runnable process, and any runnable process can preempt
the swapper process. In an endless loop, the swapper process executes
cpu_idle() like this:

    void cpu_idle(void)
    {
        ...
        while (1) {
            while (!need_resched()) {    <---- no runnable process
                local_irq_disable();
                pm_idle();
            }
            ...
            schedule();    <------- select a new process to be executed
            ...
        }
    }

5.1 Architecture Overview

--------------------------

Linux CPU C-State related modules/drivers are organized as follows,

                             ---------
                             | sysfs |
                             ---------
                                 |
      ----------   --------      |
      | ladder |   | menu |      |
      ----------   --------      |
           |           |         |
      -------------------------------
      |   cpuidle infrastructure    |
      -------------------------------
                     |
      -------------------------------
      |     acpi-cpuidle driver     |
      -------------------------------
                     |
      -------------------------------
      |  ACPI processor bus driver  |
      -------------------------------

5.1.1 Driver Register

-----------------------

In acpi_processor_init(), which is a module initialization routine called
by do_initcalls(), two related drivers are registered: the ACPI processor
bus driver and acpi_idle_driver. If you really want to look into it, follow
this path:

    kernel_init()
      ==> do_basic_setup()
        ==> do_initcalls()
          ==> ... acpi_processor_init();
            ==> cpuidle_register_driver(&acpi_idle_driver);
                acpi_bus_register_driver(&acpi_processor_driver);

The driver registration code is in drivers/acpi/processor_core.c.

Notes:

a) The cpuidle infrastructure is NOT a driver; it is initialized by
   core_initcall(). It provides:

   I)   a sysfs interface through which userland apps/users can check and
        switch the cpuidle governor:
        /sys/devices/system/cpu/(cpuX)/cpuidle/

   II)  interfaces for governor registration;

   III) interfaces for cpuidle devices and the cpuidle driver;

   IV)  it sets the global pm_idle pointer to cpuidle_idle_call();

b) acpi_idle_driver is registered with the cpuidle infrastructure, while
   acpi_processor_driver is registered with the ACPI subsystem as an ACPI
   bus driver;

c) The cpuidle infrastructure allows only one driver to register; it keeps
   a global pointer to the registered acpi_idle_driver. Refer to
   cpuidle_register_driver(), provided by the cpuidle infrastructure in
   drivers/cpuidle/driver.c;

d) The ACPI processor driver registers a callback for CPU hotplug, so it
   is notified when a CPU goes online/offline.

5.1.2 Device Discovery & Register

---------------------------------

The ACPI subsystem parses the ACPI tables and, for each ACPI processor
object, calls the ACPI processor bus driver's add entry point,
acpi_processor_add(), to add an ACPI processor device.

After adding an ACPI processor device, the ACPI subsystem calls the
processor driver's start entry point, acpi_processor_start().

In acpi_processor_start(), the routine acpi_processor_power_init() is
called to evaluate _PDC, then read and parse _CST and _CSD (or use the
FADT/MADT information) to initialize the processor's power state
information. It then calls cpuidle_register_device() to register a cpuidle
device with the cpuidle infrastructure.

For hotplug CPUs, acpi_processor_init() calls
acpi_processor_install_hotplug_notify() to register a CPU hotplug
callback. When a CPU comes online, acpi_processor_start() gets executed.

Please note that several drivers operate on the same physical CPUs:
besides the cpuidle driver, there are other processor-related drivers,
such as the T-state driver, the P-state driver, the CPU hotplug
infrastructure, etc. The ACPI processor driver acts as a
bridge/coordinator among those drivers.

5.1.3 Driver/Device attach

-----------------------

The ACPI subsystem registers processors with acpi_processor_driver. If and
when a registered CPU is online, the start entry point,
acpi_processor_start(), is called. This entry function does a lot of
initialization for T-states, P-states and C-states. Here we only look at
C-states, for which it calls

    acpi_processor_power_init();
      ==> acpi_processor_get_power_info();
      ==> acpi_processor_setup_cpuidle();

The first routine evaluates _CST, or reads the FADT if _CST fails, to get
the C-state description from the ACPI tables. Refer to sections 4.1/4.2 to
see how the C-state information is handled.

The second routine sets up some information for each valid C-state. Note
that for most cases (without kernel parameters, bus master handling, etc.):

    C1, state->enter = acpi_idle_enter_c1;
    C2, state->enter = acpi_idle_enter_simple;
    C3, state->enter = acpi_idle_enter_bm;

The enter routine is used to enter the corresponding C-state.

5.1.4 Governor

-----------------

The cpuidle governors are simple to read and understand. Each governor
provides a rating and 3 main callbacks to the cpuidle infrastructure:

    rating       <-- menu is 20, ladder is 10;
    enable()
    select()
    reflect()

Each governor has a rating in its structure. When governors are registered
with the cpuidle infrastructure through cpuidle_register_governor(),
cpuidle selects the one with the highest rating unless the user specified
one via the sysfs interface. The cpuidle_curr_governor pointer points to
the selected one.
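
The rating rule can be illustrated with a small stand-alone sketch. It is
not the kernel's code: the struct below only models the name and rating
fields, and pick_governor() is a hypothetical helper that scans an array
instead of reacting to each registration.

```c
#include <stddef.h>

/* Simplified model of a governor: only the fields needed here. */
struct governor {
    const char *name;
    unsigned int rating;
};

/* Return the governor with the highest rating, or NULL if there is none. */
static const struct governor *pick_governor(const struct governor *govs,
                                            size_t count)
{
    const struct governor *best = NULL;

    for (size_t i = 0; i < count; i++) {
        if (best == NULL || govs[i].rating > best->rating)
            best = &govs[i];
    }
    return best;
}
```

With the ratings quoted above (menu 20, ladder 10), menu wins whenever both
governors are available.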

Only one governor can be used at a time. When the OS decides to put a CPU
into a C-state, it calls the select entry point of the current governor,
and the governor chooses a C-state according to its policy:

    cpuidle_idle_call()
    {
        next_state = cpuidle_curr_governor->select();
        target_state = &dev->states[next_state];

        /* Note: the code here has changed! */
        dev->last_state = target_state;
        dev->last_residency = target_state->enter(dev, target_state);

        cpuidle_curr_governor->reflect();
    }

6. Linux Files related to C-States

----------------------------------

    drivers/acpi/processor_core.c
    drivers/acpi/processor_idle.c
    drivers/cpuidle/cpuidle.c
    drivers/cpuidle/driver.c
    drivers/cpuidle/governor.c
    drivers/cpuidle/sysfs.c
    drivers/cpuidle/governors/ladder.c
    drivers/cpuidle/governors/menu.c

7. Some Kernel Parameters

-------------------------------

    idle=poll            polling, always in C0, almost no power saving;
    idle=halt            use the HLT instruction only, only enter C1;
    idle=nomwait         don't use mwait, the P_LVLx method is used;
    idle=mwait           force the OS to use mwait for C-states;
    max_cstate=n         specify the maximum available C-state, n is a number

Others (which may help locate the issue when C-states don't work),

    nohz=off             don't use dynamic tick/tickless mode
    nolapic_timer        don't use the local APIC timer
    lapic_timer_c2_ok    the local APIC timer is OK in C2
    clocksource=tsc      (or hpet, pit, acpi_pm, jiffies), override the
                         clock source

8. Sysfs & Proc

-----------------

Check C-state statistics & state,

    /proc/acpi/processor/CPUX/

Check governor & driver,

    /sys/devices/system/cpu/cpuidle/         (system-wide)
    /sys/devices/system/cpu/cpuX/cpuidle/    (per CPU)

9. TBD

-----------

9.1 Broadcast Timer

------------------

When a CPU enters a deep C-state (C3 or above), its Local APIC timer stops
as well (Linux uses the LAPIC timer as the tick device in most cases). This
issue is handled by the "broadcast timer" scheme.

9.2 Dynamic Tick /Tickless

--------------------------

Linux supports tickless operation, which makes the C-state code more complex.

9.3 Idle Load balancing

-----------------------

When CPUs enter the idle state, one of the idle CPUs is nominated as the
ILB (Idle Load Balancer). It is responsible for pulling tasks from busy
CPUs, re-assigning them to idle CPUs and waking those idle CPUs up.