
Processing binary structured data with Python

2017-04-13 01:21


A Chinese version of this post is also available.
Please keep the original address of this post when reprinting: http://blog.csdn.net/ir0nf1st/article/details/70151190

<0x00> Preface

When processing a binary file or receiving a byte stream from the network, the binary structured data in the stream may contain signed numbers. From the pre-defined stream protocol, the developer already knows the alignment, byte order, word length and sign bit position of the binary structured data. When developing with Python, however, there is no obvious way to pass this information to the Python interpreter, which makes it difficult to process binary data, especially signed binary numbers. This article presents a way to process signed binary numbers correctly with Python.

<0x01>  A glimpse into Python numeric type

In many other programming languages, the most significant bit of a fixed-width signed integer is used as the sign bit. The Python numeric type is implemented differently. Here is the CPython definition of the Python long type (all the following code is based on Python 2.7):

include/longobject.h
/* Long (arbitrary precision) integer object interface */

typedef struct _longobject PyLongObject; /* Revealed in longintrepr.h */


include/longintrepr.h
/* Long integer representation.
The absolute value of a number is equal to
SUM(for i=0 through abs(ob_size)-1) ob_digit[i] * 2**(SHIFT*i)
Negative numbers are represented with ob_size < 0;
zero is represented by ob_size == 0.
In a normalized number, ob_digit[abs(ob_size)-1] (the most significant
digit) is never zero.  Also, in all cases, for all valid i,
0 <= ob_digit[i] <= MASK.
The allocation function takes care of allocating extra memory
so that ob_digit[0] ... ob_digit[abs(ob_size)-1] are actually available.

CAUTION:  Generic code manipulating subtypes of PyVarObject has to
aware that longs abuse  ob_size's sign bit.
*/

struct _longobject {
    PyObject_VAR_HEAD
    digit ob_digit[1];
};


include/object.h
/* PyObject_VAR_HEAD defines the initial segment of all variable-size
 * container objects.  These end with a declaration of an array with 1
 * element, but enough space is malloc'ed so that the array actually
 * has room for ob_size elements.  Note that ob_size is an element count,
 * not necessarily a byte count.
 */
#define PyObject_VAR_HEAD               \
    PyObject_HEAD                       \
    Py_ssize_t ob_size; /* Number of items in variable part */


From the above source code and comments, we know that the Python _longobject, namely the long type or long object, uses the sign of the 'ob_size' field in PyObject_VAR_HEAD to represent the sign of a number, while its 'ob_digit' field contains only the magnitude of the number.
Consequently, when we initialize a Python long object (that is, assign a numeric value to it), the Python interpreter does not treat the most significant bit of the value as a sign bit but as an ordinary significant bit.
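The sign-plus-magnitude layout described in the header comment can be modelled in a few lines of Python. This is only a sketch of the idea, not CPython's code: the helper names to_digits and from_digits are made up for illustration, and the digit width of 15 bits is an assumption (the real width depends on how the interpreter was built).

```python
def to_digits(value, shift=15):
    """Model the layout: return (ob_size, ob_digit) for an integer.

    shift=15 is an assumed digit width, chosen only for illustration.
    """
    mask = (1 << shift) - 1
    magnitude = abs(value)
    digits = []
    while magnitude:
        digits.append(magnitude & mask)   # least significant digit first
        magnitude >>= shift
    # The sign lives in ob_size, never in the digits themselves.
    ob_size = -len(digits) if value < 0 else len(digits)
    return ob_size, digits


def from_digits(ob_size, digits, shift=15):
    """Inverse of to_digits: SUM(ob_digit[i] * 2**(shift*i)), then apply the sign."""
    magnitude = sum(d << (shift * i) for i, d in enumerate(digits))
    return -magnitude if ob_size < 0 else magnitude


positive = to_digits(500)     # (1, [500])
negative = to_digits(-500)    # (-1, [500]): same digits, only ob_size differs
zero = to_digits(0)           # (0, [])
```

In this model 500 and -500 share the same digit list, which is why a bit pattern read from a stream always ends up as a non-negative magnitude.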
Let's take the number -500 as an example to show what happens. To keep it simple, I use a 16-bit word length here.
On most computers a negative number is stored in two's complement form, so first convert -500 to its two's complement:

Decimal                                                  -500
Hexadecimal                                              -0x 01 F4
Binary                                                   -0b 0000 0001 1111 0100
Sign-magnitude (put sign into MSB)                        0b 1000 0001 1111 0100
One's complement (invert all bits except the sign bit)    0b 1111 1110 0000 1011
Two's complement (add one to one's complement)            0b 1111 1110 0000 1100
Two's complement in hexadecimal                           0x FE 0C
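The conversion in the table can be reproduced step by step in Python. This is a minimal sketch; the function name twos_complement and its bits parameter are illustrative and not part of the original post. It should run unchanged on Python 2.7 and Python 3.

```python
def twos_complement(value, bits=16):
    """Return the two's complement bit pattern of value as a non-negative int."""
    mask = (1 << bits) - 1           # 0xFFFF for a 16-bit word
    if value >= 0:
        return value & mask
    magnitude = -value               # 500 -> 0x01F4
    # Inverting all bits of the magnitude yields the same pattern as the
    # one's complement row of the table: 0xFE0B.
    ones = magnitude ^ mask
    return (ones + 1) & mask         # add one -> 0xFE0C


pattern = twos_complement(-500)
bits_text = '{:016b}'.format(pattern)   # '1111111000001100'
hex_text = '0x{:04X}'.format(pattern)   # '0xFE0C'
```

Because Python's `&` treats negative integers as infinitely sign-extended two's complement values, `-500 & 0xFFFF` gives the same pattern in a single step.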
The following code simulates receiving the string '\xFE\x0C' from a stream, then processing it and assigning it to a Python integer object.
>>> stream = '\xFE\x0C'
>>> number = (ord(stream[0]) << 8) + ord(stream[1])
>>> '0x{:0X}'.format(number)
'0xFE0C'
>>> print number
65036
>>>

As explained, the result is not what we want, and the code needs to be revised when initializing a Python number object from a signed binary number.

<0x02> Convey sign information using minus sign

Now that we know Python does not take sign information from the sign bit of a number, we need to find another way to convey the sign to the Python interpreter so that it processes our number correctly. The minus sign '-', a.k.a. the negation operator in Python, can be used for this purpose.
Let's take a look at how the negation operator is implemented in CPython:
objects/longobject.c
static PyObject *
long_neg(PyLongObject *v)
{
    PyLongObject *z;
    if (v->ob_size == 0 && PyLong_CheckExact(v)) {
        /* -0 == 0 */
        Py_INCREF(v);
        return (PyObject *) v;
    }
    z = (PyLongObject *)_PyLong_Copy(v);
    if (z != NULL)
        z->ob_size = -(v->ob_size);
    return (PyObject *)z;
}
So the negation operator only negates the 'ob_size' field of a long object and leaves the 'ob_digit' field untouched.
We also know that if a number 'value' is negative, then value = -abs(value). If we pass the negation operator along with the absolute value of a negative number to the Python interpreter, it will handle the negative number correctly.
Continuing with the above example, we have the two's complement of a negative number; the next step is to calculate its absolute value from its two's complement.
The pseudocode looks like this:
if sign_bit_of_value is 1 {
    abs_value = bit_wise_invert(value - 1)
}
Implementation in Python:
>>> number = 0xFE0C
>>> if (number & 0x8000) != 0:
...     number = -((number - 1) ^ 0xFFFF)
...
>>> print number
-500
>>>
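The same steps can be wrapped in a reusable function for any word length. This is a sketch under the assumption that the raw value has already been assembled as an unsigned integer; the name to_signed and the bits parameter are illustrative, not from the original post.

```python
def to_signed(raw, bits=16):
    """Interpret an unsigned bit pattern as a two's complement signed number."""
    sign_bit = 1 << (bits - 1)       # 0x8000 for 16 bits
    mask = (1 << bits) - 1           # 0xFFFF for 16 bits
    if raw & sign_bit:
        # Subtract one, invert within the word, then convey the sign with '-'.
        return -((raw - 1) ^ mask)
    return raw


value = to_signed(0xFE0C)            # -500
smallest = to_signed(0x8000)         # -32768
largest = to_signed(0x7FFF)          # 32767
byte_value = to_signed(0xFF, bits=8) # -1
```

For patterns with the sign bit set, the result equals `raw - (1 << bits)`, which is a common shorter way to write the same conversion.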


You have probably noticed that I didn't use '~', the Python invert operator, but an exclusive or with 0xFFFF to implement bit_wise_invert.
What happens if we use the invert operator instead?
>>> number = 0xFE0C
>>> if (number & 0x8000) != 0:
...     number = -(~(number - 1))
...
>>> print number
65036
>>>


Take a look at the implementation of the invert operator in CPython:
static PyObject *
long_invert(PyLongObject *v)
{
    /* Implement ~x as -(x+1) */
    PyLongObject *x;
    PyLongObject *w;
    w = (PyLongObject *)PyLong_FromLong(1L);
    if (w == NULL)
        return NULL;
    x = (PyLongObject *) long_add(v, w);
    Py_DECREF(w);
    if (x == NULL)
        return NULL;
    Py_SIZE(x) = -(Py_SIZE(x));
    return (PyObject *)x;
}
As stated by the comment in the code, the invert operator computes -(x+1), which is the bitwise inverse of an infinitely wide two's complement number rather than of a fixed 16-bit word, so on its own it can't be used to fulfill our purpose.
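A short check makes the difference concrete. The sketch below is not from the original post, and the helper name invert_bits is made up for illustration. It shows that '~' behaves as -(x+1), and that masking its result to the word length recovers the fixed-width inversion that the exclusive or with 0xFFFF produced.

```python
x = 0xFE0C - 1                        # 0xFE0B, the value before inversion

unbounded = ~x                        # -(x + 1) == -0xFE0C, not a 16-bit pattern
fixed_width = x ^ 0xFFFF              # 0x01F4, inversion within 16 bits
masked = ~x & 0xFFFF                  # masking '~' gives the same 0x01F4


def invert_bits(value, bits=16):
    """Invert value within a fixed word length (illustrative helper)."""
    return ~value & ((1 << bits) - 1)


number = -invert_bits(0xFE0C - 1)     # -500
```

So '~' can still be used, provided the result is masked back to the word length before the minus sign is applied.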

<0x03> A more general module for binary structured data processing in Python

The above sections revealed some of Python's internal implementation of numeric type processing and may help you understand Python a little more deeply. There is also the struct module, which can be used to process binary structured data. It is more developer friendly and has a stronger error handling and reporting mechanism. For real application development, I suggest using struct.
Here is sample code using struct:
>>> import struct
>>> stream = '\xFE\x0C'
>>> number, = struct.unpack('>h', stream)
>>> print number
-500
>>>
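The format string is where the prior knowledge about byte order, word length and signedness is handed to Python. The sketch below unpacks a small record in one call; the 7-byte packet layout (an unsigned 8-bit kind, a signed 16-bit delta, an unsigned 32-bit total) is a hypothetical protocol invented for illustration. It uses bytes literals so that it runs on both Python 2.7 and Python 3.

```python
import struct

# Hypothetical big-endian record: B = unsigned 8-bit, h = signed 16-bit,
# I = unsigned 32-bit. The '>' prefix selects big-endian with no padding.
RECORD = '>BhI'
packet = b'\x01\xFE\x0C\x00\x00\x01\xF4'

record_size = struct.calcsize(RECORD)              # 7 bytes
kind, delta, total = struct.unpack(RECORD, packet)

# The same two bytes in little-endian order need only a different prefix.
little, = struct.unpack('<h', b'\x0C\xFE')

# Packing is the reverse operation.
encoded = struct.pack('>h', -500)

# A buffer of the wrong length is reported instead of silently misread.
try:
    struct.unpack('>h', b'\xFE')
    too_short_detected = False
except struct.error:
    too_short_detected = True
```

Here `delta` comes out as -500 without any manual sign handling, and the truncated buffer raises struct.error.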
For a more detailed introduction, see the documentation of the struct module.