C Implementations for Counting the 1 Bits in an Integer
2006-07-29 08:08
counting 1 bits C implementations
http://www.everything2.com/index.pl?node_id=1181258
Here are C implementations of all the methods for counting 1 bits mentioned in that node. (Go read that first, if you haven't already.) All of the statistical information is purely anecdotal, but for what it's worth, it's based on my testing the code on a Pentium 3 and a Celeron 2, using the cl compiler of Microsoft Visual C++, and on a Sun Ultra 5, using gcc and Sun's own cc. For testing 64-bit code, I used __int64 on the Intel machines, and long long on the Sparc. It's worth noting that while Sun's compiler outputs faster executables than gcc, it doesn't change the relative performance of the different methods.
Table Lookup
Use a pre-built lookup table of the 1-bit counts for every possible byte value, then index into it for each byte that makes up the word. This is the fastest method (slightly) for 32 bits on both Intel and Sparc, and (even more slightly) the fastest for 64 bits on Sparc, falling to second fastest for 64 bits on Intel. Changing the lookup table's element type to anything but unsigned or int makes it a little slower (what with the extra casting and byte-loading the compiler is forced to add).
    /* 1-bit count for every byte value 0..255, sixteen entries per row */
    unsigned numbits_lookup_table[256] = {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
    };

    unsigned numbits_lookup(unsigned i)
    {
        unsigned n;

        /* one table lookup per byte of the 32-bit word */
        n  = numbits_lookup_table[i & 0xff];
        n += numbits_lookup_table[(i >> 8) & 0xff];
        n += numbits_lookup_table[(i >> 16) & 0xff];
        n += numbits_lookup_table[(i >> 24) & 0xff];

        return n;
    }
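The 256 entries do not have to be typed by hand. As a rough sketch (not part of the original writeup), the table can be filled at startup from the recurrence count(i) = count(i >> 1) + (i & 1). The names bits_table and build_bits_table are made up for this example.

```c
#include <stddef.h>

/* Hypothetical alternative to the hand-written initializer:
   a table that is filled in once at startup. */
static unsigned bits_table[256];

/* Dropping the lowest bit of i gives a smaller index whose count is
   already known, so each entry costs one lookup and one add. */
static void build_bits_table(void)
{
    size_t i;

    bits_table[0] = 0;
    for (i = 1; i < 256; i++)
        bits_table[i] = bits_table[i >> 1] + (unsigned)(i & 1);
}
```

The trade-off is a one-time initialization call in exchange for not having to proofread 256 literals.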
Counters
If you want a full explanation of how this works, read my writeup at counting 1 bits, but suffice it to say that you are essentially partitioning the word into groups, and combining the groups by adding them together in pairs until you are left with only one group, which is the answer. (Performance notes are in the next section.)

    unsigned numbits(unsigned int i)
    {
        unsigned int const MASK1  = 0x55555555;
        unsigned int const MASK2  = 0x33333333;
        unsigned int const MASK4  = 0x0f0f0f0f;
        unsigned int const MASK8  = 0x00ff00ff;
        unsigned int const MASK16 = 0x0000ffff;

        i = (i & MASK1 ) + ((i >> 1 ) & MASK1 );
        i = (i & MASK2 ) + ((i >> 2 ) & MASK2 );
        i = (i & MASK4 ) + ((i >> 4 ) & MASK4 );
        i = (i & MASK8 ) + ((i >> 8 ) & MASK8 );
        i = (i & MASK16) + ((i >> 16) & MASK16);

        return i;
    }
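The grouping idea is easier to follow on a single byte, where only three steps are needed. This is a scaled-down sketch for illustration; numbits8 is a made-up name and uint8_t assumes a C99 compiler.

```c
#include <stdint.h>

/* Counters method on one byte: pairs, then nibbles, then the byte. */
unsigned numbits8(uint8_t b)
{
    unsigned i = b;

    i = (i & 0x55u) + ((i >> 1) & 0x55u);  /* four 2-bit counts */
    i = (i & 0x33u) + ((i >> 2) & 0x33u);  /* two 4-bit counts  */
    i = (i & 0x0fu) + ((i >> 4) & 0x0fu);  /* one 8-bit count   */

    return i;
}
```

For the input 0xB5 (binary 10110101) the intermediate values are 0x65, then 0x32, then 5: the pair counts are 1, 2, 1, 1, the nibble counts are 3 and 2, and their sum is 5.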
Optimized Counters
call pointed out in counting 1 bits that you could optimize the Counters method further if you pay attention to which bits you care about and which you don't, which allows you to skip applying some of the masks.
Some symbols that I'll use to represent what's going on:
0: bits we know are zero from the previous step
o: bits we know are zero due to masking
-: bits we know are zero due to shifting
X: bits that might be 1 and we care about their values
x: bits that might be 1 but we don't care about their values
So a 0 plus a 0 is still a 0, obviously; the tricky ones are the others, but they're not even so bad. 0 plus X is X, since whether the X is a 0 or a 1, added to 0 it will pass through unchanged. However, X plus X is XX, because the sum can range from 0 (0+0) to 10 (1+1). The same holds true with xs, once those show up.
Step 1:
      oXoXoXoXoXoXoXoXoXoXoXoXoXoXoXoX
    + -XoXoXoXoXoXoXoXoXoXoXoXoXoXoXoX
      XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Step 2:
      ooXXooXXooXXooXXooXXooXXooXXooXX
    + --XXooXXooXXooXXooXXooXXooXXooXX
      0XXX0XXX0XXX0XXX0XXX0XXX0XXX0XXX
Step 3:
      oooo0XXXoooo0XXXoooo0XXXoooo0XXX
    + ----0XXXoooo0XXXoooo0XXXoooo0XXX
      0000XXXX0000XXXX0000XXXX0000XXXX
Step 4:
      oooooooo0000XXXXoooooooo0000XXXX
    + --------0000XXXXoooooooo0000XXXX
      00000000000XXXXX00000000000XXXXX
Step 5:
      oooooooooooooooo00000000000XXXXX
    + ----------------00000000000XXXXX
      00000000000000000000000000XXXXXX
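The five steps can also be traced numerically. The sketch below is an illustration, not the author's code: it runs the unoptimized Counters steps and records the word after each one. counter_steps is a made-up name and uint32_t assumes C99.

```c
#include <stdint.h>

/* Runs the five Counters steps on i and stores the word
   after each step in out[0..4]. */
void counter_steps(uint32_t i, uint32_t out[5])
{
    static const uint32_t mask[5] = {
        0x55555555u, 0x33333333u, 0x0f0f0f0fu, 0x00ff00ffu, 0x0000ffffu
    };
    unsigned step;

    for (step = 0; step < 5; step++) {
        unsigned shift = 1u << step;   /* 1, 2, 4, 8, 16 */

        i = (i & mask[step]) + ((i >> shift) & mask[step]);
        out[step] = i;
    }
}
```

Feeding it the worst case 0xffffffff gives 0xaaaaaaaa, 0x44444444, 0x08080808, 0x00100010 and 0x00000020. Even with every bit set, each group keeps leading zeros after step 1, which is what the 0 symbols in the diagrams stand for.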
You'll notice that the higher the step, the more known zeros (0) there are. call's suggestion was to change step 5 to this:
Step 5:
      ooooooooooooxxxx00000000000XXXXX
    + ----------------00000000000XXXXX
      000000000000xxxx0000000000XXXXXX
    (mask)
      ooooooooooooooooooooooooooXXXXXX
(where "(mask)" means "after adding, apply a mask".)
However, you can go back even further and apply the same technique - all the way to step 3, in fact. The best I can think to optimize this changes the last three steps into the following:
Step 3:
      0xxx0XXX0xxx0XXX0xxx0XXX0xxx0XXX
    + ----0XXX0xxx0XXX0xxx0XXX0xxx0XXX
      0xxxXXXX0xxxXXXX0xxxXXXX0xxxXXXX
    (mask)
      0000XXXX0000XXXX0000XXXX0000XXXX
Step 4:
      0000xxxx0000XXXX0000xxxx0000XXXX
    + --------0000XXXX0000xxxx0000XXXX
      0000xxxx000XXXXX000xxxxx000XXXXX
Step 5:
      0000xxxx000xxxxx000xxxxx000XXXXX
    + ----------------000xxxxx000XXXXX
      0000xxxx000xxxxx00xxxxxx00XXXXXX
    (mask)
      ooooooooooooooooooooooooooXXXXXX
Anyway, that's all very lovely, but here's the C to do it:
    unsigned numbits(unsigned int v)
    {
        unsigned int const MASK1 = 0x55555555;
        unsigned int const MASK2 = 0x33333333;
        unsigned int const MASK4 = 0x0f0f0f0f;
        unsigned int const MASK6 = 0x0000003f;

        /* steps 1 and 2: full masking is still required */
        unsigned int const w = (v & MASK1) + ((v >> 1) & MASK1);
        unsigned int const x = (w & MASK2) + ((w >> 2) & MASK2);
        /* step 3: add first, then mask once */
        unsigned int const y = (x + (x >> 4)) & MASK4;
        /* step 4: no mask at all */
        unsigned int const z = y + (y >> 8);
        /* step 5: keep only the low six bits (the count is at most 32) */
        unsigned int const c = (z + (z >> 16)) & MASK6;

        return c;
    }
The performance on this method is marginally worse than the lookup method in the 32 bit cases, slightly better than lookup on 64 bit Intel, and right about the same on 64 bit Sparc. Of note is the fact that loading one of these bitmasks into a register actually takes two instructions on RISC machines, and a longer-than-32-bit instruction on the Intel, because it's impossible to pack an instruction and 32 bits worth of data into a single 32 bit instruction. See the bottom of jamesc's writeup at MIPS for more details on that...
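The 64-bit timings above imply a 64-bit variant that is not shown. One way it might look, as a sketch under assumptions: uint64_t from C99 stands in for the __int64 and long long types used in the tests, and numbits64_opt is a made-up name. One extra fold is needed, and the final mask widens to seven bits because the count can reach 64.

```c
#include <stdint.h>

/* Optimized Counters extended to 64 bits (illustrative sketch). */
unsigned numbits64_opt(uint64_t v)
{
    const uint64_t MASK1 = UINT64_C(0x5555555555555555);
    const uint64_t MASK2 = UINT64_C(0x3333333333333333);
    const uint64_t MASK4 = UINT64_C(0x0f0f0f0f0f0f0f0f);

    uint64_t w = (v & MASK1) + ((v >> 1) & MASK1);
    uint64_t x = (w & MASK2) + ((w >> 2) & MASK2);
    uint64_t y = (x + (x >> 4)) & MASK4;   /* each byte holds 0..8      */
    uint64_t z = y + (y >> 8);             /* low byte of each pair <=16 */
    uint64_t a = z + (z >> 16);            /* low byte <= 32             */
    uint64_t c = (a + (a >> 32)) & 0x7f;   /* low byte <= 64, 7 bits     */

    return (unsigned)c;
}
```

The unmasked folds are safe for the same reason as in the 32-bit code: no byte-sized counter ever exceeds 64, so nothing carries into the low byte from below and the junk above it is discarded by the final mask.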
Mind-bending "best" method (even more optimized counters)
A slightly-modified version of the code on this page: http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel, which in turn stole the code from the "Software Optimization Guide for AMD Athlon™ 64 and Opteron™ Processors":

    unsigned numbits(unsigned int v)
    {
        unsigned int const MASK1 = 0x55555555;
        unsigned int const MASK2 = 0x33333333;
        unsigned int const MASK4 = 0x0f0f0f0f;

        /* subtract each pair's high bit instead of masking and adding */
        unsigned int const w = v - ((v >> 1) & MASK1);
        unsigned int const x = (w & MASK2) + ((w >> 2) & MASK2);
        unsigned int const y = (x + (x >> 4)) & MASK4;
        /* the multiply sums the four byte counters into the top byte */
        unsigned int const c = (y * 0x01010101) >> 24;

        return c;
    }
This method is identical to the "Optimized Counters" method, with two tricks applied:
To get rid of an AND in the first line: instead of adding the two bits of each pair, it subtracts the pair's high bit from the value of the whole pair, because the results are the same:
    00 - 0 = 00
    01 - 0 = 01
    10 - 1 = 01
    11 - 1 = 10
To merge the last two lines into one, it uses a multiply and a shift, which adds the four remaining byte-sized "counters" together in one step.
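Both tricks can be checked in isolation. The helpers below are illustrative only and their names are made up: the first pair compares the add and subtract forms on a single 2-bit value, the second pair compares the multiply against an explicit byte-by-byte sum.

```c
#include <stdint.h>

/* Trick 1 on one 2-bit pair p (0..3): low bit + high bit ... */
unsigned pair_by_add(unsigned p) { return (p & 1u) + ((p >> 1) & 1u); }
/* ... equals the pair minus its high bit. */
unsigned pair_by_sub(unsigned p) { return p - ((p >> 1) & 1u); }

/* Trick 2: y holds four byte-sized counters, each at most 8.
   y * 0x01010101 equals y + (y << 8) + (y << 16) + (y << 24),
   so the top byte of the product is the sum of all four bytes. */
unsigned sum_bytes_mul(uint32_t y)
{
    return (unsigned)((uint32_t)(y * UINT32_C(0x01010101)) >> 24);
}

/* The same sum written out one byte at a time, for comparison. */
unsigned sum_bytes_add(uint32_t y)
{
    return (unsigned)((y & 0xff) + ((y >> 8) & 0xff)
                    + ((y >> 16) & 0xff) + (y >> 24));
}
```

The multiply only works because the partial sums in the lower bytes stay below 256 (at most 8, 16 and 24), so no carry leaks into the top byte.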
Subtract 1 and AND
See counting 1 bits SPOILER for a fuller explanation of this one, but basically the lowest 1-bit gets zeroed out on every iteration, so by the time you run out of 1s to zero, the number of iterations equals the number of 1 bits in the word. Clever. Unfortunately, not that terribly fast; it's roughly two to three times slower than the lookup and counters methods on both architectures.

    unsigned numbits_subtraction(unsigned i)
    {
        unsigned n;

        /* each pass clears the lowest set bit of i */
        for (n = 0; i; n++)
            i &= i - 1;

        return n;
    }
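To watch the lowest 1-bit disappear on each pass, the loop can record its intermediate values. This is a sketch for illustration; clear_lowest_trace and its trace/max parameters are made up here.

```c
/* Repeatedly clears the lowest set bit of i, storing each result in
   trace[] (up to max entries). Returns the number of passes, which is
   the number of 1 bits in the original value. */
unsigned clear_lowest_trace(unsigned i, unsigned trace[], unsigned max)
{
    unsigned n = 0;

    while (i != 0) {
        i &= i - 1;          /* i - 1 flips the lowest 1 and the 0s below it */
        if (n < max)
            trace[n] = i;
        n++;
    }

    return n;
}
```

Starting from 0x2c (binary 101100), the recorded values are 0x28 (101000), 0x20 (100000) and 0, so the function returns 3.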
Straightforwardly Examine Each Bit
The most easily understandable and slowest method: iterate over all the bits in the word; if the current bit is a 1, then increment the counter, otherwise, do nothing. That's actually done here by looking at the least-significant bit on each iteration, then shifting to the right by one, and iterating until there are no more 1 bits in the word. There's a little optimization in the #ifndef here: instead of doing if (i & 1) n++;, which uses a branch instruction, just add the actual value of the least-significant bit to the counter (n += (i & 1);), as it will be a 1 when you want to add 1, and 0 when you don't. (We're just twiddling bits anyway, so why not?) This actually makes the processor do more adds, but adding is fast, and branching is slow, on modern processors, so it turns out to be about twice as fast. However, even "twice as fast" is still four to five times slower than the lookup method, again, on all architectures.
    unsigned numbits(unsigned int i)
    {
        unsigned n;

        for (n = 0; i; i >>= 1)
    #ifndef MORE_OPTIMAL
            if (i & 1) n++;      /* branching version */
    #else
            n += (i & 1);        /* branch-free version */
    #endif

        return n;
    }
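Because this method is so simple, it makes a handy reference for checking the cleverer ones. The sketch below is not from the original writeup: it compares the bit-by-bit loop against the multiply-based version on a stream of pseudo-random words. All names are made up, and the generator is an ordinary linear congruential generator chosen only for convenience.

```c
#include <stdint.h>

/* Reference: examine each bit (branch-free form). */
static unsigned count_loop(uint32_t i)
{
    unsigned n = 0;

    for (; i; i >>= 1)
        n += i & 1u;
    return n;
}

/* Method under test: subtract trick plus multiply trick. */
static unsigned count_parallel(uint32_t v)
{
    uint32_t w = v - ((v >> 1) & 0x55555555u);
    uint32_t x = (w & 0x33333333u) + ((w >> 2) & 0x33333333u);
    uint32_t y = (x + (x >> 4)) & 0x0f0f0f0fu;

    return (unsigned)((uint32_t)(y * UINT32_C(0x01010101)) >> 24);
}

/* Returns how many of the sampled words the two methods disagree on. */
unsigned count_mismatches(unsigned samples)
{
    uint32_t x = 12345u;                 /* arbitrary seed */
    unsigned bad = 0, k;

    /* always include the two extreme inputs */
    bad += count_loop(0u) != count_parallel(0u);
    bad += count_loop(0xffffffffu) != count_parallel(0xffffffffu);

    for (k = 0; k < samples; k++) {
        x = x * 1664525u + 1013904223u;  /* LCG step, wraps mod 2^32 */
        bad += count_loop(x) != count_parallel(x);
    }
    return bad;
}
```

A return value of 0 means the two methods agreed on every sampled word; it is a sanity check, not a proof.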
Now, why does this all matter? It doesn't, really, but it was sure a good way to waste some time, and maybe someone learned some optimizing tricks from it... (Well, I did, actually - so I hope someone else did as well.)