
Python: detecting an encoding with chardet is not always reliable

2017-07-26 18:01, 351 views

The sample URL is:
http://www.angenechemical.com/productshow/AGN-PC-0JCLT7.html
Request this site with Python and have the logger record the body of the response:

def parse(self, response):
    try:
        result = {}
        # Each table row holds a field name (td[1]) and its value (td[2]).
        for tr in response.xpath("//table[@class='pInforstyle']/tr"):
            name = "$".join(tr.xpath("td[1]/span/text()").extract())
            value = "$".join(tr.xpath("td[2]/text()").extract())
            result[name] = value
        result.update({
            "img_url": "http://www.angenechemical.com%s" % "".join(response.xpath("//div[@class='pd_contact']/table/tr[1]/td[1]/img[1]/@src").extract()),
            "url": response.url,
        })
        # Raising here forces the except branch, so the body is always logged.
        raise Exception
        yield result
    except Exception as e:
        self.logger.exception(response.body)

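The logging module then raises an error along the lines of "utf-8 can't decode ...", or "gbk can't decode ...".

Note that before this we had already modified the source of the logging module so that it reads: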

try:
    stream.write(fs % msg.encode("UTF-8"))
except UnicodeError:
    stream.write(fs % msg.decode("gbk").encode("UTF-8"))

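Hitting this error means that decoding as gbk now fails as well,

so the try has to be nested one level deeper.

Decoding as gb2312 did not work either. I then came across

encode("UTF-8").strip(

which did not help. Finally I tried an odd one, windows-1252, and that actually went through: the message could be written.

This suggests the bytes are windows-1252. Keep in mind, though, that windows-1252 is a single-byte codec that maps almost every byte value to some character, so a successful decode shows only that the bytes are decodable, not that the page was really written in that encoding.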
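A small sketch of why a windows-1252 decode "going through" says little about the real encoding. It is written for Python 3 (the post itself uses Python 2), and the sample string is an assumption chosen for illustration, not data from the page above.

```python
# GBK-encoded bytes for a sample string (illustrative, not from the page).
raw = "中文".encode("gbk")

# A strict windows-1252 decode succeeds, but the result is mojibake.
garbled = raw.decode("windows-1252")
print(repr(garbled))


def count_decodable(codec):
    """Count how many of the 256 single-byte values decode under `codec`."""
    ok = 0
    for value in range(256):
        try:
            bytes([value]).decode(codec)
            ok += 1
        except UnicodeDecodeError:
            pass
    return ok


accepted = count_decodable("windows-1252")
print(accepted, "of 256 byte values decode under windows-1252")
```

Because nearly every byte is accepted, windows-1252 works as a last-resort codec for getting something written to the log, but the text it produces may not be what the page author intended.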

However, when I check the message with chardet,

import chardet
print "\n\ndetect charset : "+str(chardet.detect(msg))

the answer it gives is

detect charset : {'confidence': 0.99, 'language': 'Chinese', 'encoding': 'GB2312'}

But GB2312 clearly cannot decode the data at all; the codec that does decode it is windows-1252.
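One way to cross-check a detector's guess is to attempt a strict decode with each candidate codec and see which ones actually succeed. The sketch below does that with the standard library only (Python 3). `strict_decoders` is a hypothetical helper name and the sample bytes are made up for illustration. GBK and GB18030 are supersets of GB2312, so text that contains even one character outside GB2312 can look like GB2312 statistically and still fail a strict gb2312 decode.

```python
def strict_decoders(raw, candidates):
    """Return the candidate codecs that decode `raw` without errors."""
    working = []
    for codec in candidates:
        try:
            raw.decode(codec)  # strict mode: raises on any invalid byte
        except UnicodeDecodeError:
            continue
        working.append(codec)
    return working


# The middle character exists in GBK but not in GB2312
# (sample text for illustration, not taken from the page).
sample = "朱镕基".encode("gbk")
working = strict_decoders(sample, ["gb2312", "gbk", "gb18030"])
print(working)
```

A check like this does not tell you which encoding is correct, only which ones are possible. In the case described here even gbk failed, so the data may simply be mixed or damaged.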

So in the end the logging source was changed to

try:
    stream.write(fs % msg.encode("UTF-8"))
except UnicodeError:
    try:
        stream.write(fs % msg.decode("gbk").encode("UTF-8"))
    except UnicodeDecodeError:
        stream.write(fs % msg.decode("windows-1252").encode("UTF-8"))

If gbk fails, fall back to windows-1252...
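The same fallback chain could also live in a small helper rather than inside the logging module's source. This is only a sketch for Python 3, where a response body is `bytes`. The name `decode_with_fallbacks` and the default codec order are assumptions that mirror the patch above, and the final `errors="replace"` step is an addition so that the helper never raises.

```python
def decode_with_fallbacks(raw, codecs=("utf-8", "gbk", "windows-1252")):
    """Decode bytes by trying each codec in order.

    Returns (text, codec_used). If every strict attempt fails, the last
    codec is applied with errors="replace" and codec_used is None.
    """
    for codec in codecs:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue
    return raw.decode(codecs[-1], errors="replace"), None


print(decode_with_fallbacks("中文".encode("utf-8")))
print(decode_with_fallbacks("中文".encode("gbk")))
print(decode_with_fallbacks(b"\xff\xff"))
```

With a helper like this, the spider could log `decode_with_fallbacks(response.body)[0]` instead of the raw bytes, which leaves the standard library untouched.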

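By the way, I came across a remark on Stack Overflow that really struck a chord: reliably detecting a text's encoding is basically impossible, and any detector will sometimes get it wrong.

The original link is here: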
https://stackoverflow.com/questions/436220/determine-the-encoding-of-text-in-python
