您的位置：首页 > 编程语言 > Go语言

CEPH CRUSH 算法源码分析原文CEPH CRUSH algorithm source code analysis

2016-06-05 17:53 766 查看

原文地址 CEPH CRUSH algorithm source code analysis

http://www.shalandis.com/original/2016/05/19/CEPH-CRUSH-algorithm-source-code-analysis/

文章比较深入的写了CRUSH算法的原理和过程.通过调试深入的介绍了CRUSH计算的过程.文章中添加了些内容.

写在前面

读本文前,你需要对ceph的基本操作,pool和CRUSH map非常熟悉.并且较深入的读过源码.

分析的方法

首先,我们写了个c程序调用librados向pool中写入一个对象.然后使用 GDB(CGDB is recommended)来追踪CRUSH的运行过程.同时我们将会关注CRUSH相关的几个重要变量.以便于我们完全知道CRUSH源码是如何运行.

1 如何追踪

1.1 编译CEPH

首先通过源码安装CEPH,然后使用

./configure

.如下添加编译参数

/configure CFLAGS='-g3 –O0' CXXFLAGS='-g3 –O0'

-g3

意味着会产生大量的调试信息.

-O0

非常重要, 它意味着关闭编译器的优化，如果没有，使用GDB追踪程序时，大多数变量被优化,无法显示。配置后，

make and sudo make install

。附：

-O0

只适合用于实验情况，在生产环境中编译器优化是必须进行的。

1.2 得到函数stack

众所周知,CRSUSH的核心函数是

crush_do_rule

(位置 crush/mapper.c line 779).

/**
* crush_do_rule - calculate a mapping with the given input and rule
* @map: the crush_map
* @ruleno: the rule id
* @x: hash input
* @result: pointer to result vector
* @result_max: maximum result size
* @weight: weight vector (for map leaves)
* @weight_max: size of weight vector
* @scratch: scratch vector for private use; must be >= 3 * result_max
*/
int crush_do_rule(const struct crush_map *map,
int ruleno, int x, int *result, int result_max,
const __u32 *weight, int weight_max,
int *scratch)

通过这个函数将crush计算过程分为两部分:

1. input -> PGID

2. PGID -> OSD set.

第一部分,使用GDB来得到函数过程

通过

-g

参数来编译例子程序

gdb来参看

rados_write

然后,在添加断点

b crush_do_rule

前,进入GDB的接口.

在函数

crush_do_rule

处停留

得到函数stack,然后使用GDB log将调试信息输出到文件中 .

下面让我们来深入的研究这个过程.

函数stack和下面的相似

#12 main
#11 rados_write
#10 librados::IoCtxImpl::write
#9 librados::IoCtxImpl::operate
#8 Objecter::op_submit
#7 Objecter::_op_submit_with_budget
#6 Objecter::_op_submit
#5 Objecter::_calc_target
#4 OSDMap::pg_to_up_acting_osds
#3 OSDMap::_pg_to_up_acting_osds
#2 OSDMap::_pg_to_osds
#1 CrushWrapper::do_rule
#0 crush_do_rule

追踪过程

CRUSH 计算过程总结如下:

INPUT(object name & pool name) —> PGID —> OSD set.

本文中主要关注计算过程

2.1 input —> PGID

你可以按顺序阅读源码.最后,通过列出转换

关键

过程

从input 到 pgid

Ceph_hash.h (include):
extern unsigned ceph_str_hash(int type, const char *s, unsigned len);

OSDMap.h (osd):    int ret = object_locator_to_pg(oid, loc, pg);

首先从

rados_ioctx_create

, 和

lookup_pool

通过pool名字得到 poolid. 把poolid封装进

librados::IoCtxImpl

类型变量

ctx

;

然后在

rados_write

对象名被封装进

oid

; 然后在

librados::IoCtxImpl::operate

oid

和

oloc(comprising poolid)

被包装成

Objecter::Op *

类型变量

objecter_op

;

通过各种类型的封装我们到到

_calc_target

这层. 我们得到不断的

oid

和

poolid

. 然后读取目标

pool

的信息.

(in my cluster, pool “neo” id is 29, name of object to write is “neo-obj”)

在

object_locator_to_pg

, 第一次计算从

ceph_str_hash

哈希对象名字成为一个uint32_t 类型变量,也就是所谓的

ps

(placement seed)

unsigned int ceph_str_hash(int type, const char *s, unsigned int len)
{
switch (type) {
case CEPH_STR_HASH_LINUX:
return ceph_str_hash_linux(s, len);
case CEPH_STR_HASH_RJENKINS:
return ceph_str_hash_rjenkins(s, len);
default:
return -1;
}
}

然后得到

PGID

. 以前我认为

pgid

是单一变量,然而不是.

PGID

是个包含

poolid

和

ps

的结构体变量.

//pgid  不仅仅是一个数字,还有好多信息
// placement group id
struct pg_t
{
uint64_t m_pool;
uint32_t m_seed;
int32_t m_preferred;

pg_t() : m_pool(0), m_seed(0), m_preferred(-1) {}
pg_t(ps_t seed, uint64_t pool, int pref=-1) :
...
还有很多信息略

crush_do_rule

的输入参数x
是什么?让我们继续风雨兼程. 然后在

_pg_to_osds

有一行

ps_t pps = pool.raw_pg_to_pps(pg); //placement ps

. pps 就是

.

PPS 如何计算? 在函数

crush_hash32_2(CRUSH_HASH_RJENKINS1,ceph_stable_mod(pg.ps(), pgp_num, pgp_num_mask),pg.pool())

;

__u32 crush_hash32_2(int type, __u32 a, __u32 b)
{
switch (type) {
case CRUSH_HASH_RJENKINS1:
return crush_hash32_rjenkins1_2(a, b);
default:
return 0;
}
}

ps

mod

pgp_num_mask

的结果(例如 a) 和

poolid(例如 b)

进行哈希. 这就是pps,也就是

我们得到第二个过程输入参数的

.第一阶段过程图如下图

第一阶段过程图

P.S. you can find something in PG’s name and object name.

2.2 PGID —> OSD set

首先需要明白几个概念

weight VS reweight

这里,“ceph osd crush reweight” 设置了OSD的权重

weight

这个重量为任意值（通常是磁盘的TB大小,1TB设置为1），并且控制系统尝试分配到OSD的数据量。

reweight

reweight将覆盖了weight量。这个值在0到1的范围，并强制CRUSH重新分配数据。它不改变buckets 的权重,并且是CRUSH不正常的情况下的纠正措施。（例如，如果你的OSD中的一个是在90％以上，其余为50％，可以减少权重，进行补偿。）

primary-affinity

主亲和力默认为1（即，一个OSD可以作为主OSD）。primary-affinity变化范围从0到1.其中0意味着在OSD可能用作主OSD,设置为1则可以被用作主OSD.当其<1时，CRUSH将不太可能选择该OSD作为主守护进程。

PG VS PGP

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001610.html

pg 和 pgp的区别

- PG = Placement Group

- PGP = Placement Group for Placement purpose

- pg_num = number of placement groups mapped to an OSD

当增加每个pool的pg_num数量时,每个PG分裂成半,但他们都保持到它们的父OSD的映射。

直到这个时候，ceph不会启动平衡策略。现在，当你增加同一池中的pgp_num值，PGs启动从父OSD迁移到其他OSD,ceph开始启动平衡策略。这是PGP如何起着重要的作用的过程。

pgp-num是在CRUSH算法中使用的参数,不是 pg-num.例如pg-num = 1024 , pgp-num = 1.所有的1024个PGs都映射到同一个OSD.当您增加PG-NUM是分裂的PG，如果增加PGP-NUM将移动PGs，即改变OSD的map。PG和PGP是很重要的概念。

void do_rule(int rule, int x, vector<int>& out, int maxout,
const vector<__u32>& weight) const {}

在了解了这些概念后开始第二部分

PGID -> OSD set

. 现在我们在

do_rule: void do_rule(int rule, int x, vector<int>& out, int maxout, const vector<__u32>& weight)

do_rule源代码

void do_rule(int rule, int x, vector<int>& out, int maxout,
const vector<__u32>& weight) const {
Mutex::Locker l(mapper_lock);
int rawout[maxout];
int scratch[maxout * 3];
int numrep = crush_do_rule(crush, rule, x, rawout, maxout, &weight[0], weight.size(), scratch);
if (numrep < 0)
numrep = 0;
out.resize(numrep);
for (int i=0; i<numrep; i++)
out[i] = rawout[i];
}

让我们看下输入参数

就是我们已经得到的

pps

rule

就是内存中的

crushrule’s number

(不是 ruleid, 在我的crushrule set中, this rule’s id是 3),

weight

是已经讲过的 reweight 变化范围从1 到 65536. 我们定义了

rawout[maxout]

来存储 OSD set,

scratch[maxout * 3]

为计算使用. 然后我们进入了

crush_do_rule

PGID -> OSDset OUTLINE

下面要仔细研究3个函数,

firstn

意味着副本存储, CRUSH 需要去选择n

个osds存储副本.

indep

是纠删码存储过程.我们只关注副本存储方法

crush_do_rule

: 反复do crushrules

crush_choose_firstn

: 递归选择特定类型的桶或设备

crush_bucket_choose

: 直接选择bucket的子节点

crush_do_rule

首先这是我的crushrule 和集群的层次结构

参数

@map: the crush_map
@ruleno: the rule id
@x: hash input
@result: pointer to result vector
@result_max: maximum result size
@weight: weight vector (for map leaves)
@weight_max: size of weight vector
@scratch: scratch vector for private use; must be >= 3 * result_max
int crush_do_rule(const struct crush_map *map,
int ruleno, int x, int *result, int result_max,
const __u32 *weight, int weight_max,
int *scratch)

值得说的变量

emit

通常用在规则的结束,同时可以被用在在形相同规则下选择不同的树.更多详细信息看官网

http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-map-rules

scratch[3 * result_max]

int *a = scratch;
int *b = scratch + result_max;
int *c = scratch + result_max*2;

a, b, c 分别指向 scratch向量的0, 1/3, 2/3的位置.

w = a; o = b;

被用作一个先入先出队列来在CRUSH map中进行横向优先搜索(BFS traversal).

-

存储

crush_choose_firstn

选择的结果.

-

存储最终的OSD选择结果.

crush_choose_firstn

计算后如果结果不是OSD类型,

交给

.以便于

成为下次

crush_choose_firstn

的输入参数. 如上所述,

crush_do_rule

反复进行

crushrules

迭代. 你可以在内存中发现规则:

过程步骤

step 1 put root rgw1 in

w(enqueue)

;

step 2 would run crush_choose_firstn to choose 1 rack-type bucket from root rgw1

下面分析crush_choose_firstn过程

crush_choose_firstn 函数

这个函数递归的选择特定bucket或者设备,并且可以处理冲突,失败的情况.

如果当前是choose过程,通过调用

crush_bucket_choose

来直接选择.

如果当前是

chooseleaf

选择叶子节点的过程,该函数将递归直到得到叶子节点.

crush_bucket_choose 函数

crush_bucket_choose是CRUSH最重要的函数.应为默认的bucket类型是straw,常见的情况下我们会使用straw类型bucket,然后就会进入

bucket_straw_choose

case进行跳转

case CRUSH_BUCKET_STRAW:
return bucket_straw_choose((struct crush_bucket_straw *)in,

完整代码

static int crush_bucket_choose(struct crush_bucket *in, int x, int r)
{
dprintk(" crush_bucket_choose %d x=%d r=%d\n", in->id, x, r);
BUG_ON(in->size == 0);
switch (in->alg) {
case CRUSH_BUCKET_UNIFORM:
return bucket_uniform_choose((struct crush_bucket_uniform *)in,
x, r);
case CRUSH_BUCKET_LIST:
return bucket_list_choose((struct crush_bucket_list *)in,
x, r);
case CRUSH_BUCKET_TREE:
return bucket_tree_choose((struct crush_bucket_tree *)in,
x, r);
case CRUSH_BUCKET_STRAW:
return bucket_straw_choose((struct crush_bucket_straw *)in,
x, r);
case CRUSH_BUCKET_STRAW2:
return bucket_straw2_choose((struct crush_bucket_straw2 *)in,
x, r);
default:
dprintk("unknown bucket %d alg %d\n", in->id, in->alg);
return in->items[0];
}
}

/* straw */

static int bucket_straw_choose(struct crush_bucket_straw *bucket,
int x, int r)
{
__u32 i;
int high = 0;
__u64 high_draw = 0;
__u64 draw;

for (i = 0; i < bucket->h.size; i++) {
draw = crush_hash32_3(bucket->h.hash, x, bucket->h.items[i], r);
draw &= 0xffff;
draw *= bucket->straws[i];
if (i == 0 || draw > high_draw) {
high = i;
high_draw = draw;
}
}
return bucket->h.items[high];
}

bucket结构体

struct crush_bucket_straw {
struct crush_bucket h;
__u32 *item_weights;   /* 16-bit fixed point */
__u32 *straws;         /* 16-bit fixed point */
};

可以看到

bucket root rgw1’s id

是 -1,

type = 10

意味着根节点

alg = 4

意味着 straw 类型. 这里

weight

是 OSD权重 we set scales up by 65536(i.e. 37 * 65536 = 2424832). 然后看下循环 T: 对每个输入的

son

bucket , 例如升上图

rack1

crush_hash32_3 hashes x

bucket id(rack1’s id)

r(current selection’s order number)

, 这3个变量是

a uint32_t

类型变量, 结果

& 0xffff

, 然后乘

straw(rack1’s straw value, straw calculation seen below)

, 最后得到这个值, 在一次循环中, for one son bucket(rack1 here). 我们在循环中计算每个 son bucket然后选择最大的 . 然后一个

son bucket

被选择 . Nice job! 下面是个计算的例子

调用层次,图表描述

过程是在设备树种的搜索过程

结论

我们已经研究了CRUSH计算中的重要部分,其余的部分就是迭代和递归,直到选择了所有的OSD.

关于 straw 值

详细的代码在 src/crush/builder.c

crush_calc_straw

.

总之,straw 值总是和OSD权重正相关.straw2正在开发.

参考

本文翻译自Jieyu Xue的文章

CEPH CRUSH algorithm source code analysis

on May 19, 2016 under Original

http://www.shalandis.com/original/2016/05/19/CEPH-CRUSH-algorithm-source-code-analysis/

CEPH CRUSH algorithm source code analysis

http://www.shalandis.com/original/2016/05/19/CEPH-CRUSH-algorithm-source-code-analysis/

翻译原文

http://blog.csdn.net/xingkong_678/article/details/51590459

内容来自用户分享和网络整理，不保证内容的准确性，如有侵权内容，可联系管理员处理

标签： ceph 源码 sourcecode analysis crush

相关文章推荐

新的分享

章节导航

CEPH CRUSH 算法源码分析 原文CEPH CRUSH algorithm source code analysis