
Using Timer in Python to scrape (with BeautifulSoup) a CSDN blog's view count at fixed intervals

2018-03-25 23:26, 316 views

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date    : 2018-03-25 22:11:35
# @Author  : awakeljw (liujw15@mails.tsinghua.edu.cn)
# @Link    : http://blog.csdn.net/awakeljw/
# @Version : $Id$

import os
import time
import urllib.request
from threading import Timer

from bs4 import BeautifulSoup

# filename = r'F:\wargame\title.txt'
# if not os.path.exists(filename):
#     os.system(r"touch %s" % filename)


def get_n_title():
    url = 'https://blog.csdn.net/awakeljw'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
    headers = {"User-Agent": user_agent}
    # Send a browser-like User-Agent so the request looks like a normal visit
    req = urllib.request.Request(url, headers=headers)
    page = urllib.request.urlopen(req).read()  # download the page
    soup = BeautifulSoup(page, 'lxml')  # parse the HTML

    # Locate the element that holds the view count
    forumlist = soup.find('div', class_="gradeAndbadge gradewidths")
    n_title = forumlist.get("title")  # read its title attribute

    # Append a timestamped record to the log file
    with open('title.txt', 'a+') as f:
        # %H is the hour and %M is the minute
        string = time.strftime("%Y-%m-%d-%H-%M", time.localtime(time.time())) + '  ' + str(n_title)
        f.write(string)
        f.write('\n')
    print(n_title)

    # Schedule the next run 60 seconds from now
    t = Timer(60, get_n_title)
    t.start()


if __name__ == "__main__":
    get_n_title()

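1. Scraping the view count with BeautifulSoup

Once BS4 has parsed the page, look up the element that holds the view count directly and read its title attribute.

forumlist = soup.find('div', class_="gradeAndbadge gradewidths")  # locate the view-count element
n_title = forumlist.get("title")  # read its title attribute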
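The lookup above can be tried offline. This is a minimal sketch, assuming markup shaped like the element described in this section; `SAMPLE_HTML` and `extract_title` are hypothetical names, and the real CSDN page may be structured differently. It uses the standard-library `html.parser` backend so that lxml is not required, and it guards against `find` returning `None` when the element is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the element described above;
# the real CSDN page may look different.
SAMPLE_HTML = """
<div class="gradeAndbadge gradewidths" title="12345">
  <span>Views</span>
</div>
"""


def extract_title(html, css_class="gradeAndbadge gradewidths"):
    """Return the title attribute of the first matching div, or None."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find("div", class_=css_class)
    if node is None:
        # find() returns None when nothing matches, so check before .get()
        return None
    return node.get("title")


found = extract_title(SAMPLE_HTML)
missing = extract_title("<p>no such element</p>")
print(found, missing)
```

Without the `None` check, a change in the page layout would raise `AttributeError` inside the timer callback and the periodic scraping would stop.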

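2. The Timer class (in the threading module) runs an operation after a given delay, which is enough to build recurring and scheduled tasks. The sched module can be used as well.

def get_n_title():
    t = Timer(60, get_n_title)
    t.start()

A Timer fires only once, so get_n_title creates a new Timer on every call; the result is that get_n_title runs again every 60 s.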
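The self-rescheduling pattern above runs forever. The following is a small sketch of the same idea with a stop condition, not the author's code: `RepeatingTask`, `max_runs`, and the short 0.05 s interval are assumptions chosen so the example finishes quickly.

```python
import threading


class RepeatingTask:
    """Re-arm a threading.Timer after each run until max_runs is reached."""

    def __init__(self, interval, func, max_runs):
        self.interval = interval
        self.func = func
        self.max_runs = max_runs
        self.runs = 0
        self.done = threading.Event()

    def _tick(self):
        self.func()
        self.runs += 1
        if self.runs < self.max_runs:
            self.start()      # schedule the next run
        else:
            self.done.set()   # signal that the last run has finished

    def start(self):
        timer = threading.Timer(self.interval, self._tick)
        timer.daemon = True   # do not keep the interpreter alive on exit
        timer.start()


calls = []
task = RepeatingTask(0.05, lambda: calls.append(len(calls)), max_runs=3)
task.start()
finished = task.done.wait(timeout=5)  # block until the third run completes
print(calls)
```

Each Timer runs in its own thread, so the interval is measured from the end of one run to the start of the next; the time spent inside the function adds to the period.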

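The same kind of recurring task written with sched:

import sched
import time

total = 0
s = sched.scheduler(time.time, time.sleep)

def worker2(msg, starttime):
    global total
    total += 1
    print('Current time:', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())), 'message:', msg, ' scheduled at:', starttime)
    # Keep re-scheduling this task until it has run 3 times
    if total < 3:
        # A different delay can be given each time the task is re-entered
        s.enter(5, 2, worker2, ('perfect world %d' % (total), time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))))

# Queue the first run, then start the scheduler; run() returns once the queue is empty
s.enter(5, 2, worker2, ('perfect world 0', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))))
s.run()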

The usual pattern for sched is:

s = sched.scheduler(time.time, time.sleep)
s.enter(delay, priority, func1, (arg1, arg2, ...))
s.enter(delay, priority, func2, (arg1, arg2, arg3, ...))
s.run()

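The arguments to enter in the second step are:

delay: the delay in seconds, measured from the moment the task is added to the scheduler;

priority: the priority; a smaller number means a higher priority, and it only matters for tasks scheduled for the same time;

func1: the task function;

(arg1, arg2, ...): the arguments passed to the task function.

For details, see: https://blog.csdn.net/sunhuaqiang1/article/details/69391188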
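The parameters above can be seen in action in the short sketch below. It is an illustration under assumptions rather than code from the referenced article: `tick`, `log`, and the sub-second delays are made up for the example. It uses `enterabs` to give two events exactly the same time, so that priority decides the order, and then repeats a task three times by re-entering it from inside the task.

```python
import sched
import time

scheduler = sched.scheduler(time.time, time.sleep)
log = []


def tick(remaining):
    """Record one run and re-enter itself until no runs remain."""
    log.append("tick")
    if remaining > 1:
        scheduler.enter(0.02, 1, tick, (remaining - 1,))


# Two events at the same absolute time: the lower priority number runs first.
start = time.time() + 0.05
scheduler.enterabs(start, 2, log.append, ("low priority",))
scheduler.enterabs(start, 1, log.append, ("high priority",))

# A recurring task that starts later and runs three times.
scheduler.enter(0.2, 1, tick, (3,))

scheduler.run()  # blocks until the queue is empty
print(log)
```

Unlike `threading.Timer`, `scheduler.run()` executes every task in the calling thread, so a slow task delays the ones queued after it.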

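3. Time format rules

time.strftime("%Y-%m-%d-%H-%M", time.localtime(time.time()))

Here %H is the hour (00-23) and %M is the minute (00-59). The lowercase %m is the month, so it cannot be used for minutes.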
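To see the effect of the directive's case, the sketch below formats a fixed moment instead of the current time, so the output is reproducible. The timestamp is taken from the script header; the variable names are illustrative.

```python
import time

# Build a struct_time for a fixed moment so the result does not depend on the clock.
moment = time.strptime("2018-03-25 22:11:35", "%Y-%m-%d %H:%M:%S")

stamp = time.strftime("%Y-%m-%d-%H-%M", moment)      # hour and minute
month_again = time.strftime("%Y-%m-%d-%m", moment)   # %m repeats the month

print(stamp)        # 2018-03-25-22-11
print(month_again)  # 2018-03-25-03
```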

4. On Windows, the built-in scheduled-task feature can also run the script file at set times. Suggested references: https://blog.csdn.net/wwy11/article/details/51100432

http://www.jb51.net/article/104926.htm

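5. On Linux, edit the crontab file and save it

Open it with vim /etc/crontab and add this as the last line:

* * 1 * * root /home/temp/bak.sh

Save the file and the change takes effect. Because the third field (day of month) is 1 and the other time fields are *, this entry runs bak.sh every minute on the 1st of each month.

Each entry in the crontab file starts with five time fields (each can be a number or *). Their meanings and ranges are:

Minute 0-59

Hour 0-23

Day of month 1-31

Month 1-12

Day of week 0-6 (0 is Sunday)

The two items that follow are the user and the command.

For details, see https://blog.csdn.net/menglei8625/article/details/7660114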
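As a quick way to check which value lands in which field, the sketch below splits an /etc/crontab line into the five time fields, the user, and the command. `CronEntry` and `parse_system_crontab_line` are hypothetical helpers written for this illustration; they do not validate ranges and are not part of cron itself.

```python
from typing import NamedTuple


class CronEntry(NamedTuple):
    minute: str
    hour: str
    day_of_month: str
    month: str
    day_of_week: str
    user: str
    command: str


def parse_system_crontab_line(line):
    """Split an /etc/crontab line into its seven parts (no range validation)."""
    # At most 6 splits, so a command containing spaces stays in one piece.
    parts = line.strip().split(None, 6)
    if len(parts) != 7:
        raise ValueError("expected 5 time fields, a user, and a command")
    return CronEntry(*parts)


entry = parse_system_crontab_line("* * 1 * * root /home/temp/bak.sh")
print(entry.day_of_month, entry.user, entry.command)

with_args = parse_system_crontab_line("0 3 * * * root /bin/sh /home/temp/bak.sh")
print(with_args.command)
```

The user column exists only in the system-wide /etc/crontab; per-user crontabs edited with `crontab -e` omit it.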
