
Python 3 crawler: finding which usernames are still unregistered on GitHub

2015-10-29 18:33
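I wanted to switch to an ID that was short, meaningful, and not widely used, but every one I came up with was already registered. So I found a file of six thousand-odd words on Baidu Wenku and used a crawler to try them on GitHub one by one.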

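When I wrote this I did not yet know how to use multithreading, so it is single-threaded and pauses a few seconds after every request; without the pause, access is refused very quickly. Fortunately the refusal is not an IP ban.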
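
For comparison, the same pacing could be shared across a few worker threads. The following is only a minimal sketch, not the script used in this post: `Pacer`, `check_all`, and the `is_free` callback are hypothetical names introduced here, and `is_free` stands in for whatever request actually checks a name. Whether GitHub tolerates a given interval is an assumption to verify.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Pacer:
    """Let at most one call start every `interval` seconds, across all threads."""

    def __init__(self, interval):
        self.interval = interval
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def wait(self):
        # Reserve the next free start time under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def check_all(words, is_free, workers=4, interval=3.0):
    """Return the words for which is_free(word) is true, in input order."""
    words = list(words)
    pacer = Pacer(interval)

    def task(word):
        pacer.wait()  # shared pacing: threads overlap waiting on I/O, not request starts
        return is_free(word)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(task, words))
    return [word for word, free in zip(words, results) if free]
```

With this shape the threads only overlap the time spent waiting for responses; the request rate itself stays capped by `interval`, which is the part that matters for not being refused.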

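Once the crawl finished, picking a name this way no longer seemed very interesting. I will just call it a crawler exercise.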

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Author: LostInNight
# @Date:   2015-10-27 13:26:45
# @Last Modified by:   LostInNight
# @Last Modified time: 2015-10-28 08:33:26
# Check on GitHub whether the given usernames exist

import requests
import sys
import os
import time

# Make the script's own directory the working directory, for easier file reads and writes
# os.chdir(sys.path[0])
os.chdir(r'F:\PythonWorkspace\Github-Rename')

def trans_time(sec):
    # Convert a duration in seconds to hours, minutes, and seconds
    hour = int(sec / 3600)
    sec = sec % 3600
    minute = int(sec / 60)
    sec = sec % 60
    return "%s h %s min %.2f s" % (hour, minute, sec)

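def get_html(url):
    # Retry in a loop rather than by recursion, so a long outage
    # cannot run into Python's recursion limit
    while True:
        try:
            time.sleep(3)
            print('Requesting URL... ', url)
            html = requests.get(url, headers=headers, timeout=10).text
        except Exception as e:
            print('Request failed; sleeping ten seconds before retrying')
            print(e)
            time.sleep(10)
            continue
        print('Page fetched successfully!')
        return html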

start = time.time()
url = r'https://github.com/search?utf8=%E2%9C%93&q={0}&type=Users&ref=searchresults'

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'github.com',
    'Referer': 'https://github.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.71 Safari/537.36'
}
count = 0
found = 0

# Unregistered words are rare, so the output file is only opened when one is found
with open('words.txt', 'r', encoding='utf-8') as infile:
    while True:
        print('-' * 40)
        print('Reading word file...')
        line = infile.readline()
        if not line:
            break
        # Skip blank lines so an empty query is never sent
        if not line.strip():
            continue
        word = line.split(None, 1)[0]
        print('Read word:', word)
        used_time = trans_time(time.time() - start)
        count += 1

        print('Checking %s; %s words checked so far, %s results found, running for %s' % (word, count, found, used_time))

        html = get_html(url.format(word))
        # If nobody uses the name, the page shows 'We couldn’t find any users matching xxxxx'
        if 'We couldn’t find any users matching' in html:
            found += 1
            print('Writing to file...')
            with open('uniquea.txt', 'a', encoding='utf-8') as output:
                output.write(line)
            print('Written!')

used_time = trans_time(time.time() - start)
print('Crawl finished!\nElapsed: %s\n%s words checked, %s of them not registered!' % (used_time, count, found))
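
The `get_html` function above retries indefinitely with a fixed ten-second pause. A bounded alternative is sketched below, assuming it is acceptable to give up after a few failures. `fetch_with_retry` and its parameters are hypothetical names introduced for illustration, and `fetch` is any callable that takes a URL and returns the page text.

```python
import time


def fetch_with_retry(fetch, url, attempts=5, base_delay=10.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds, at most `attempts` times.

    The pause doubles after each failure (base_delay, 2 * base_delay, ...).
    If every attempt fails, the last exception is re-raised.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * 2 ** attempt
            print('Request failed (%s); retrying in %.0f s' % (exc, delay))
            sleep(delay)
```

In the script it could be called with something like `lambda u: requests.get(u, headers=headers, timeout=10).text` as `fetch`. The `sleep` parameter exists only so the waiting can be replaced in a test.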


words.txt is a free document I found on Baidu Wenku. Its format is as follows:

abacus   n.算盘
abandon   v.n.放弃,放纵
abase   v.贬抑,使卑下
abate   v.减轻,降低
abbreviation   n.缩短,缩写
abdicate   v.让位,辞职,放弃
abdomen   n.腹,下腹(胸部到腿部的部份)
abduct   v.绑架,拐走
aberrant   adj.越轨的,异常的
abet   v.教唆,协助(罪犯)
abeyance   n.中止,暂搁
abhor   v.憎恨,嫌恶
abhorrent   adj.可恨的,可厌的
abide   v.容忍,忍受
............
The results file uses the same format.
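
Each line is therefore an English word, some whitespace, and then a gloss. A minimal sketch of splitting such a line is shown below; `parse_entry` is a hypothetical helper that is not part of the script, and it assumes the word itself never contains whitespace.

```python
def parse_entry(line):
    """Split a words.txt line into (word, gloss); return None for a blank line."""
    # split(None, 1) splits on any run of whitespace, at most once,
    # so spaces and tabs between word and gloss are both handled.
    parts = line.strip().split(None, 1)
    if not parts:
        return None
    word = parts[0]
    gloss = parts[1] if len(parts) > 1 else ''
    return word, gloss
```

Only the word is needed for the GitHub query. The script writes the whole original line to the results file, which is why the gloss shows up there as well.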