
Simple Parsing of Google Scholar Data

2013-11-11 22:18
For now this only runs a simple query for a single article, but it may give some ideas to Python beginners (like me). Here is the code:

import random
import urllib
import urllib2
from bs4 import BeautifulSoup

def GoogleScholarTitle(queryTitle):
    # Ten complete User-Agent strings. Adjacent string literals are joined,
    # so a long entry split over two lines is still a single list element.
    user_agents = [
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0',
        'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
        'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ '
        '(KHTML, like Gecko) Element Browser 5.0',
        'IBM WebExplorer /v0.94',
        'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)',
        'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
        'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
        'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) '
        'Version/6.0 Mobile/10A5355d Safari/8536.25',
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/28.0.1468.0 Safari/537.36',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)']
    # quote_plus turns spaces into '+' and percent-encodes reserved characters.
    # (quote() on a string that already contains '+' would encode it as %2B.)
    queryTitle = urllib.quote_plus(queryTitle)
    url = 'http://scholar.google.com.hk/scholar?hl=zh-CN&q=%s' % queryTitle
    request = urllib2.Request(url)
    # random.choice covers the whole list, whatever its length.
    user_agent = random.choice(user_agents)
    request.add_header('User-agent', user_agent)
    response = urllib2.urlopen(request)
    html = response.read()
    # Name the parser explicitly so the result does not depend on what is installed.
    result = BeautifulSoup(html, 'html.parser')
    print result

title = 'A Coarse-to-fine approach for fast deformable object detection'
GoogleScholarTitle(title)
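The URL-building step is where stray spaces tend to slip in, so here is a minimal sketch of that step alone. It is written for Python 3 (`urllib.parse` and `urllib.request` instead of `urllib2`), which is my assumption and not the version the script above targets. The names `build_scholar_url`, `build_request`, `BASE_URL` and the shortened `USER_AGENTS` list are hypothetical helpers for illustration. Nothing is sent over the network.

```python
import random
import urllib.parse
import urllib.request

# Hypothetical, shortened pool; any complete User-Agent strings would do.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
    'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
]

BASE_URL = 'http://scholar.google.com.hk/scholar'


def build_scholar_url(title, lang='zh-CN'):
    """Return the search URL for a title, with the query string encoded."""
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    query = urllib.parse.urlencode({'hl': lang, 'q': title})
    return BASE_URL + '?' + query


def build_request(title):
    """Return an unsent Request carrying a randomly chosen User-Agent."""
    return urllib.request.Request(
        build_scholar_url(title),
        headers={'User-Agent': random.choice(USER_AGENTS)},
    )


if __name__ == '__main__':
    print(build_scholar_url(
        'A Coarse-to-fine approach for fast deformable object detection'))
```

Letting `urlencode` assemble the query string means the title never has to be patched by hand with `replace(' ', '+')`, so the URL cannot end up containing a raw space.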




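[1] is well written. One thing to watch out for: the URL must not contain any stray spaces. I don't understand what hl=zh-CN is for, though, and searching for this article directly in the browser adds a lot more to the URL, such as &btnG=&lr=, which I don't understand either.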
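One way to look at what a search URL actually carries is to split it into its query parameters. The sketch below uses Python 3's `urllib.parse`. The example URL is hypothetical, put together from the parameters mentioned above, and `query_params` is a helper name of my own. As far as I know, `hl` selects the interface language (`zh-CN` being Simplified Chinese). `btnG` and `lr` are empty in this example, and the script above does not send them at all.

```python
from urllib.parse import urlsplit, parse_qs


def query_params(url):
    """Return the query parameters of a URL as a dict of value lists."""
    # keep_blank_values=True keeps parameters such as 'btnG=' that have no value.
    return parse_qs(urlsplit(url).query, keep_blank_values=True)


# Hypothetical URL assembled from the parameters mentioned above.
example = ('http://scholar.google.com.hk/scholar'
           '?hl=zh-CN&q=object+detection&btnG=&lr=')
params = query_params(example)
print(params)
```

Each parameter comes back as a list because a query string may repeat a key. The `+` in `q` is decoded back to a space.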

References:

[1] Some issues with scraping Google search results in Python. http://www.cnblogs.com/meibenjin/archive/2013/05/01/3053262.html