Download arXiv papers
2017-02-05 15:24
1. Code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
'''
###########
Usage:
    python download.py site.txt(containing https://...)
'''
import sys
import time

from pymouse import PyMouse
from selenium import webdriver

m = PyMouse()


def pause(length=1):
    time.sleep(length)


def download(url):
    b = webdriver.Firefox()
    #b.set_page_load_timeout(60) # useless
    b.maximize_window()
    pause(1)
    b.get(url)
    pause(2)
    loading_time = 60  # seconds to wait for each PDF to load
    # Each <dt> holds a paper's links; the <dd> at the same index holds its metadata.
    # find_elements_by_* is the Selenium 3 API the script was written against.
    dt = b.find_elements_by_tag_name('dt')
    dd = b.find_elements_by_tag_name('dd')
    assert len(dt) == len(dd)
    dst_type = "Computer Vision"
    print(b.get_window_size())
    bias = [254, 171]  # pixel offset from the window center to the point to click
    screenIsVertical = False
    if screenIsVertical:
        print("Not implemented when the screen is vertical")
        b.close()
        return
    pos = [b.get_window_size()['width'] // 2 + bias[0],
           b.get_window_size()['height'] // 2 + bias[1]]
    for i in range(4, len(dt)):
        # skip papers whose primary subject is not Computer Vision
        if dst_type not in dd[i].find_element_by_class_name('primary-subject').text:
            continue
        # skip entries without a 'pdf' link
        try:
            dt[i].find_element_by_link_text('pdf').click()
        except Exception:
            continue
        pause(loading_time)
        b.find_element_by_id('download').click()
        pause(2)
        m.click(pos[0], pos[1], 1, 1)
        time.sleep(1)
        b.back()
        time.sleep(1)
        # re-query after going back; the old element references are stale
        dt = b.find_elements_by_tag_name('dt')
        dd = b.find_elements_by_tag_name('dd')
    b.close()


def main():
    if len(sys.argv) != 2:
        print(__doc__)
        return
    with open(sys.argv[1], 'r') as fid:
        urls = [x.split('\n')[0] for x in fid.readlines()]
    for url in urls:
        # lines starting with '#' are treated as commented out
        if url.startswith('#'):
            continue
        download(url)


if __name__ == "__main__":
    main()
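The mouse click in the script lands at the window center shifted by a fixed bias, so the bias values almost certainly need adjusting for a different screen or window size. The arithmetic can be sketched on its own as below; `click_position` is a hypothetical helper name, not part of the original script, and the 1920x1080 window size is only an assumed example.

```python
def click_position(window_size, bias=(254, 171)):
    """Return the (x, y) point to click: window center plus a pixel bias.

    `window_size` is a dict with 'width' and 'height' keys, the shape
    returned by Selenium's get_window_size(). The default bias is the
    one hard-coded in the script; it is specific to the author's screen.
    """
    x = window_size['width'] // 2 + bias[0]
    y = window_size['height'] // 2 + bias[1]
    return (x, y)


# Assumed example: a maximized 1920x1080 window.
print(click_position({'width': 1920, 'height': 1080}))
```

Keeping the calculation in one small function makes it easier to try a different bias without touching the browser-driving code.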
2. Usage
python download.py site.txt
site.txt (example)
https://arxiv.org/find/all/1/ti:+AND+object+detection/0/1/0/all/0/1
https://arxiv.org/find/all/1/ti:+AND+object+detection/0/1/0/all/0/1?skip=25&query_id=a6b6ed358647ff57
#https://arxiv.org/find/all/1/ti:+AND+object+detection/0/1/0/all/0/1?skip=50&query_id=a6b6ed358647ff57
https://arxiv.org/find/all/1/ti:+AND+object+detection/0/1/0/all/0/1?skip=75&query_id=a6b6ed358647ff57
Put one URL on each line. Start a line with # to skip that URL.
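The URL-list handling can be tried without launching a browser. The following is a minimal sketch of that step; `read_urls` is a hypothetical helper name, and unlike the script's `main()` it also drops blank lines, which is an added assumption.

```python
def read_urls(text):
    """Return the URLs in `text`, one per line.

    Lines starting with '#' are skipped, as in the script.
    Blank lines are skipped too (an assumption; the script does not do this).
    """
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        urls.append(line)
    return urls


sample = """https://arxiv.org/find/all/1/ti:+AND+object+detection/0/1/0/all/0/1
#https://arxiv.org/find/all/1/ti:+AND+object+detection/0/1/0/all/0/1?skip=50&query_id=a6b6ed358647ff57
"""
# Only the first, uncommented URL is kept.
print(read_urls(sample))
```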
Refer to this post for installing the requirements (the script imports selenium and pymouse and drives Firefox).