
Fetching Taobao product details with Beautiful Soup

2014-07-02 17:29
Beautiful Soup builds a parse tree of the product detail page.

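Main function: findAll(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs). The program below calls find, which takes the same arguments except limit and returns only the first match.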

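Use findAll to narrow the page down to the tag that holds the value, then apply a regular expression to that tag's markup to extract and print the value.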
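
The regular-expression step can be tried on its own. The following is a minimal sketch, assuming the tag found in the first step has already been serialized with str(); the names extract_price, PRICE_PATT and the sample markup are hypothetical and only illustrate the idea, they are not taken from a real Taobao page.

```python
import re

# Decimal number such as 43.00 or 0.50 (integer part without leading zeros).
PRICE_PATT = r'[1-9][0-9]*(?:\.[0-9]+)?|0\.[0-9]+'

def extract_price(tag_markup):
    """Return the first decimal number in the serialized tag, or None."""
    match = re.search(PRICE_PATT, tag_markup)
    return match.group() if match else None

# Hypothetical markup standing in for str(tag) after the find step.
sample_tag = '<strong class="tb-public-price"><em>&yen;</em>43.00</strong>'
price = extract_price(sample_tag)
```

Because str(None) is the text 'None', which contains no digits, the helper also returns None when the find step found nothing, so no .group() call is made on a failed match.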

Beautiful Soup documentation (Chinese):
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.zh.html#Searching%20the%20Parse%20Tree
Program:

#!/usr/bin/python
# Python 2 script; needs BeautifulSoup 3 and chardet.
import urllib2
import chardet
import re
from BeautifulSoup import BeautifulSoup

def html():
    # Batch mode (disabled): read one URL per line from the file urllist.
    #    rfile = open(urllist,'rb')
    #    buf = rfile.read().split('\n')
    #    rfile.close()
    #    for i in range(len(buf)):
    #        website = buf[i]
    #        print website
    website = raw_input("input link:")
    page = urllib2.urlopen(website).read()
    mychar = chardet.detect(page)  # detected encoding, only used for debugging
    #    print mychar
    soup = BeautifulSoup(page)
    #    print soup.originalEncoding
    #    soup = BeautifulSoup(page, fromEncoding="gbk")
    # Host without its top-level domain, e.g. "item.taobao".
    host = re.match(r'https?://([^/]+?)\.(?:com|cn)(?:[/:?#]|$)', website)
    m = host.group(1) if host else None
    # A decimal number such as 43.00 or 0.50.
    patt = r'[1-9][0-9]*(?:\.[0-9]+)?|0\.[0-9]+'
    img_patt = 'src="(http.*jpg)"'
    if m == 'item.taobao':
        price = soup.find(attrs={"class":"tb-public-price"})
        img = soup.find(attrs={"id":"J_ImgBooth"})
    elif m == 'detail.tmall' or m == 'chaoshi.detail.tmall':
        price = soup.find(attrs={"class":"detail-price tm-clear"})
        img = soup.find(attrs={"id":"J_ImgBooth"})
    elif m == 'detail.ju.taobao':
        price = soup.find(attrs={"class":"currentPrice floatleft"})
        img = soup.find(attrs={"class":"normal-pic "})
        if img is None:
            img = soup.find(attrs={"class":"item-pic-wrap"})
        img_patt = 'src="(http[^\"]*?)"'
    else:
        # Unsupported site: echo the link and stop.
        print website
        return
    # Search the serialized tags; a missing tag or value prints None.
    match1 = re.search(patt, str(price))
    match2 = re.search(img_patt, str(img))
    print "title:", soup.title.text
    print "price:", match1.group() if match1 else None
    print "img:", match2.group(1) if match2 else None

if __name__ == '__main__':
    html()
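
The program picks its price selector by comparing the host part of the link in an if/elif chain. As an alternative sketch, the same decision can be kept in a lookup table, which makes adding a site a one-line change. The names HOST_PATT, PRICE_CLASS, site_key and price_class are hypothetical; the class strings are the ones used in the program, and whether they still match the live pages is an assumption.

```python
import re

# Host without its top-level domain, e.g. "item.taobao" from item.taobao.com.
HOST_PATT = re.compile(r'https?://([^/]+?)\.(?:com|cn)(?:[/:?#]|$)')

# Hypothetical lookup table: site key -> class of the tag holding the price.
PRICE_CLASS = {
    'item.taobao': 'tb-public-price',
    'detail.tmall': 'detail-price tm-clear',
    'chaoshi.detail.tmall': 'detail-price tm-clear',
    'detail.ju.taobao': 'currentPrice floatleft',
}

def site_key(url):
    """Return the host without .com/.cn, or None if the URL does not fit."""
    match = HOST_PATT.match(url)
    return match.group(1) if match else None

def price_class(url):
    """Return the price tag's class for a supported site, else None."""
    return PRICE_CLASS.get(site_key(url))
```

For example, site_key applied to an item.taobao.com link yields 'item.taobao', and price_class returns None for any host that is not in the table, which corresponds to the final else branch of the program.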


Output:

----@ubuntu:~/python$ python html.py
input link:http://item.taobao.com/item.htm?spm=1.7274553.1997522421.1.FKA5Ar&id=38443208410&scm=2004.1.515.0
title: 2014夏装新款欧美风ZARA MICN女装衬衫白底定位印花长袖雪纺衫女-淘宝网
price: 43.00
img: http://img03.taobaocdn.com/bao/uploaded/i3/T1MnJaFJXeXXXXXXXX_!!0-item_pic.jpg_400x400.jpg
-----@ubuntu:~/python$ python html.py
input link:http://detail.ju.taobao.com/home.htm?spm=608.2291429.1.d1.tmDQQs&item_id=39165873670&id=10000002887630
title: 【聚_世界杯】【三只松鼠】爆款坚果组合750g-聚划算团购
price: 42.90
img: http://gju3.alicdn.com/bao/uploaded/i1/T1aV7LFGRcXXb1upjX.jpg_400x400Q90.jpg
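
One detail of the image patterns is easy to miss: the pattern with http.*jpg is greedy and runs to the last jpg" in the tag, while the pattern that excludes double quotes stops at the end of the src attribute. A minimal sketch of the difference follows; the sample tag, its data-src attribute and the example.com URLs are made up for illustration and are not taken from a real product page.

```python
import re

# Made-up <img> tag with a second attribute that also ends in jpg".
sample_img = ('<img id="J_ImgBooth" '
              'src="http://img.example.com/a.jpg_400x400.jpg" '
              'data-src="http://img.example.com/b.jpg" />')

# Greedy: .* extends to the last jpg" in the tag, swallowing data-src as well.
greedy = re.search(r'src="(http.*jpg)"', sample_img).group(1)

# Bounded: [^"]* cannot cross the closing quote of the src attribute.
bounded = re.search(r'src="(http[^"]*)"', sample_img).group(1)
```

With this sample, bounded holds only the src URL, while greedy also contains the text of the data-src attribute. On tags with a single URL both patterns give the same result, which is consistent with the output above.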