[Pinned] How to make a crawler run every day at a fixed time (using stock data as an example)
2017-01-21 16:21
Analyzing the data to scrape
Packet capture
Framework
model
main
util
parse
db
The problems
The solution
job
jobmain
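Recently, someone copied posts from my blog and uploaded them directly to Baidu Wenku and similar platforms.
This is an original post, intended for technical study only. Copying it and uploading it to Baidu Wenku or similar platforms without permission is prohibited. If you repost it, please cite the address (link) of this post.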
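Analyzing the data to scrape
This post uses data from Eastmoney (东方财富网) as its example. It is for technical study only; please do not abuse it. The data to scrape is Eastmoney's automobile-sector and petroleum-sector stock data, found at the following addresses:
http://quote.eastmoney.com/center/list.html#28002481_0_2
http://quote.eastmoney.com/center/list.html#28002464_0_2
The screenshot below shows the data format.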
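Packet capture
The first step in writing a crawler is to capture the network traffic, which I have covered in earlier posts; the point is to find the real address the data is requested from. For why this post is designed the way it is, see my column on crawler principles and fundamentals: http://blog.csdn.net/column/details/14269.html. The screenshot above shows the real request address and the request method. What comes back is a JSON-style array wrapped in a JavaScript variable assignment, as shown below: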
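Framework
The framework used in this post is shown in the figure below:
db: holds the database files: MyDataSource (registers the database driver and sets the username and password for the connection) and MYSQLControl (connects to the database and performs inserts, updates, table creation, and so on).
model: wraps the data as objects. Put plainly, it holds the property names of the data I work with. If this is unclear, see a simple web crawler I wrote earlier (http://blog.csdn.net/qy20115549/article/details/52203722).
parse: parses the content fetched by util. HTML is usually parsed with Jsoup; JSON data can be parsed with regular expressions or with fastjson. I recommend fastjson because it is simple and quick to use.
main: the program entry point and the core of the crawler: it fetches the data, runs the database statements, and stores the data.
job: the job to be executed.
jobmain: the controller, which decides when the job runs; in this post, once a day. The stock market closes at 3 p.m. each day, so the crawl is set to start at some point after 3 p.m.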
model
model wraps the data I want to crawl, such as the day's date, the stock id, the stock name, the stock price, and so on, as in the following program:
package model;

/**
 * @author Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email 1563178220@qq.com
 */
public class ExtMarketOilStockModel {
    private String date;                       // crawl date, yyyy-MM-dd
    private String stock_id;                   // stock code
    private String stock_name;                 // stock name
    private float stock_price;                 // latest price
    private float stock_change;                // price change (amount)
    private float stock_range;                 // price change (percentage, stored as a fraction)
    private float stock_amplitude;             // amplitude, stored as a fraction
    private int stock_trading_number;          // trading volume
    private int stock_trading_value;           // trading value
    private float stock_yesterdayfinish_price; // previous close
    private float stock_todaystart_price;      // today's open
    private float stock_max_price;             // day high
    private float stock_min_price;             // day low
    private float stock_fiveminuate_change;    // five-minute change, stored as a fraction
    private String craw_time;                  // crawl timestamp, yyyy-MM-dd HH:mm:ss

    public String getDate() { return date; }
    public void setDate(String date) { this.date = date; }
    public String getStock_id() { return stock_id; }
    public void setStock_id(String stock_id) { this.stock_id = stock_id; }
    public String getStock_name() { return stock_name; }
    public void setStock_name(String stock_name) { this.stock_name = stock_name; }
    public float getStock_price() { return stock_price; }
    public void setStock_price(float stock_price) { this.stock_price = stock_price; }
    public float getStock_change() { return stock_change; }
    public void setStock_change(float stock_change) { this.stock_change = stock_change; }
    public float getStock_range() { return stock_range; }
    public void setStock_range(float stock_range) { this.stock_range = stock_range; }
    public float getStock_amplitude() { return stock_amplitude; }
    public void setStock_amplitude(float stock_amplitude) { this.stock_amplitude = stock_amplitude; }
    public int getStock_trading_number() { return stock_trading_number; }
    public void setStock_trading_number(int stock_trading_number) { this.stock_trading_number = stock_trading_number; }
    public int getStock_trading_value() { return stock_trading_value; }
    public void setStock_trading_value(int stock_trading_value) { this.stock_trading_value = stock_trading_value; }
    public float getStock_yesterdayfinish_price() { return stock_yesterdayfinish_price; }
    public void setStock_yesterdayfinish_price(float stock_yesterdayfinish_price) { this.stock_yesterdayfinish_price = stock_yesterdayfinish_price; }
    public float getStock_todaystart_price() { return stock_todaystart_price; }
    public void setStock_todaystart_price(float stock_todaystart_price) { this.stock_todaystart_price = stock_todaystart_price; }
    public float getStock_max_price() { return stock_max_price; }
    public void setStock_max_price(float stock_max_price) { this.stock_max_price = stock_max_price; }
    public float getStock_min_price() { return stock_min_price; }
    public void setStock_min_price(float stock_min_price) { this.stock_min_price = stock_min_price; }
    public float getStock_fiveminuate_change() { return stock_fiveminuate_change; }
    public void setStock_fiveminuate_change(float stock_fiveminuate_change) { this.stock_fiveminuate_change = stock_fiveminuate_change; }
    public String getCraw_time() { return craw_time; }
    public void setCraw_time(String craw_time) { this.craw_time = craw_time; }
}
main
The main method should be as simple as possible, so this is how I wrote it. The comments make it easy to follow.
package navi.main;

/**
 * @author Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email 1563178220@qq.com
 */
import java.util.ArrayList;
import java.util.List;
import db.MYSQLControl;
import model.ExtMarketOilStockModel;
import parse.ExtMarketOilStockParse;

public class ExtMarketOilStockMain {
    public static void main(String[] args) throws Exception {
        List<String> urloillist = new ArrayList<String>();
        List<String> urlcarlist = new ArrayList<String>();
        List<ExtMarketOilStockModel> oilstocks = new ArrayList<ExtMarketOilStockModel>();
        List<ExtMarketOilStockModel> carstocks = new ArrayList<ExtMarketOilStockModel>();
        // The oil-related stocks span only two pages, i.e. two addresses
        String url1 = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=1&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.13204790262127375";
        String url2 = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=2&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.6972178580603532";
        urloillist.add(url1);
        urloillist.add(url2);
        for (int i = 0; i < urloillist.size(); i++) {
            // Parse the URL
            oilstocks = ExtMarketOilStockParse.parseurl(urloillist.get(i));
            // Store this page's data
            MYSQLControl.insertoilStocks(oilstocks);
        }
        // The car-related stocks span 6 pages, i.e. 6 addresses (pages 1 to 6 inclusive)
        for (int i = 1; i <= 6; i++) {
            String urli = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04811&sty=FCOIATA&sortType=C&sortRule=-1&page=" + i + "&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.23492960370783944";
            urlcarlist.add(urli);
        }
        for (int i = 0; i < urlcarlist.size(); i++) {
            // Parse the URL
            carstocks = ExtMarketOilStockParse.parseurl(urlcarlist.get(i));
            // Store the data
            MYSQLControl.insertcarStocks(carstocks);
        }
    }
}
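The page URLs above differ only in the board code (the cmd parameter) and the page number, so they can be generated rather than written out one by one. A minimal sketch, assuming the endpoint and token from the listing above; PageUrlSketch, pageUrl and pageUrls are hypothetical names that are not part of the original project, and the _g value is simply copied from the first oil URL:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original project.
final class PageUrlSketch {
    private static final String BASE =
            "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx";

    // Builds the request URL for one page of one board (e.g. "C.BK04641" for oil).
    static String pageUrl(String boardCode, int page) {
        return BASE + "?type=CT&cmd=" + boardCode
                + "&sty=FCOIATA&sortType=C&sortRule=-1&page=" + page
                + "&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}"
                + "&token=7bc05d0d4c3c22ef9fca8c2a912d779c"
                + "&jsName=quote_123&_g=0.13204790262127375"; // _g copied from the post
    }

    // Builds the URLs for pages 1..pages inclusive.
    static List<String> pageUrls(String boardCode, int pages) {
        List<String> urls = new ArrayList<String>();
        for (int page = 1; page <= pages; page++) {
            urls.add(pageUrl(boardCode, page));
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : pageUrls("C.BK04811", 6)) {
            System.out.println(url);
        }
    }
}
```

Generating the list from a page count keeps the loop bound and the number of pages in one place.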
util
There are three files here: HTTPUtils, TimeUtils (a class I use often, mainly for date conversions such as String to Date and getting the current time), and UumericalUtil (a class that rounds a float to a given number of decimal places).
package util;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

/**
 * @author Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email 1563178220@qq.com
 */
public abstract class HTTPUtils {
    // Requests data from the server and returns the HTML, JSON, etc. as a String
    public static String getRawHtml(String personalUrl) throws InterruptedException, IOException {
        URL url = new URL(personalUrl);
        URLConnection conn = url.openConnection();
        InputStream in = null;
        try {
            conn.setConnectTimeout(3000);
            in = conn.getInputStream();
        } catch (Exception e) {
            // Report the failure instead of hiding it; an empty string is returned below
            e.printStackTrace();
        }
        // Convert the fetched data to a String
        return convertStreamToString(in);
    }

    // Converts an InputStream to a String
    public static String convertStreamToString(InputStream is) throws IOException {
        if (is == null) return "";
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(is, "utf-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return sb.toString();
    }
}
The following class converts between various time formats; feel free to reuse it.
package util;

import java.text.DateFormat;
import java.text.DecimalFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.List;

/**
 * @author Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email 1563178220@qq.com
 */
public class TimeUtils {
    public static void main(String[] args) throws ParseException {
        String time = getMonth("2002-1-08 14:50:38");
        System.out.println(time);
        System.out.println(getDay("2002-1-08 14:50:38"));
        System.out.println(TimeUtils.parseTime("2016-05-19 19:17", "yyyy-MM-dd HH:mm"));
    }

    // Current time in the given format
    public static String GetNowDate(String formate) {
        Date dt = new Date();
        SimpleDateFormat sdf = new SimpleDateFormat(formate);
        return sdf.format(dt);
    }

    // Year and month ("yyyy-MM") of a time string, or null if it cannot be parsed
    public static String getMonth(String time) {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM");
        try {
            return sdf.format(sdf.parse(time));
        } catch (ParseException e) {
            e.printStackTrace();
            return null;
        }
    }

    // Date part ("yyyy-MM-dd") of a time string, or null if it cannot be parsed
    public static String getDay(String time) {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        try {
            return sdf.format(sdf.parse(time));
        } catch (ParseException e) {
            e.printStackTrace();
            return null;
        }
    }

    // Parses "yyyy-MM-dd HH:mm:ss" into a Date
    public static Date parseTime(String inputTime) throws ParseException {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        return sdf.parse(inputTime);
    }

    // Formats a Date with the given pattern
    public static String dateToString(Date date, String type) {
        DateFormat df = new SimpleDateFormat(type);
        return df.format(date);
    }

    // Parses a time string with the given pattern into a Date
    public static Date parseTime(String inputTime, String timeFormat) throws ParseException {
        SimpleDateFormat sdf = new SimpleDateFormat(timeFormat);
        return sdf.parse(inputTime);
    }

    // Parses a time string with the given pattern into a Calendar
    public static Calendar parseTimeToCal(String inputTime, String timeFormat) throws ParseException {
        SimpleDateFormat sdf = new SimpleDateFormat(timeFormat);
        Date date = sdf.parse(inputTime);
        Calendar calendar = Calendar.getInstance();
        calendar.setTime(date);
        return calendar;
    }

    // Whole days from cal1 to cal2
    public static int getDaysBetweenCals(Calendar cal1, Calendar cal2) throws ParseException {
        return (int) ((cal2.getTimeInMillis() - cal1.getTimeInMillis()) / (1000 * 24 * 3600));
    }

    // Epoch milliseconds to Date
    public static Date parseTime(long inputTime) {
        return new Date(inputTime);
    }

    // Epoch milliseconds to "yyyy-MM-dd HH:mm:ss"
    public static String parseTimeString(long inputTime) {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        return sdf.format(new Date(inputTime));
    }

    // "yyyyMMddHHmmss" to "yyyy-MM-dd HH:mm:ss", or null if it cannot be parsed
    public static String parseStringTime(String inputTime) {
        String date = null;
        try {
            Date date1 = new SimpleDateFormat("yyyyMMddHHmmss").parse(inputTime);
            date = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(date1);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return date;
    }

    // The twelve months of a year as "yyyyMM"
    public static List<String> YearMonth(int year) {
        List<String> yearmouthlist = new ArrayList<String>();
        DecimalFormat dfInt = new DecimalFormat("00");
        for (int i = 1; i < 13; i++) {
            yearmouthlist.add(year + dfInt.format(i));
        }
        return yearmouthlist;
    }

    // Every month from startyear to finistyear (inclusive) as "yyyy-MM"
    public static List<String> YearMonth(int startyear, int finistyear) {
        List<String> yearmouthlist = new ArrayList<String>();
        DecimalFormat dfInt = new DecimalFormat("00");
        for (int i = startyear; i < finistyear + 1; i++) {
            for (int j = 1; j < 13; j++) {
                yearmouthlist.add(i + "-" + dfInt.format(j));
            }
        }
        return yearmouthlist;
    }

    // Every day of the given year as "yyyy-MM-dd", including the 1st of each month
    public static List<String> TOAllDay(int year) {
        List<String> daylist = new ArrayList<String>();
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        for (int month = 1; month <= 12; month++) {
            Calendar cal = Calendar.getInstance();
            cal.clear();
            cal.set(Calendar.YEAR, year);
            cal.set(Calendar.MONTH, month - 1); // Calendar months start at 0
            cal.set(Calendar.DAY_OF_MONTH, 1);
            int count = cal.getActualMaximum(Calendar.DAY_OF_MONTH);
            for (int day = 1; day <= count; day++) {
                cal.set(Calendar.DAY_OF_MONTH, day);
                daylist.add(sdf.format(cal.getTime()));
            }
        }
        return daylist;
    }

    // Yesterday's date as "yyyy-MM-dd"
    public static String getyesterday() {
        Calendar cal = Calendar.getInstance();
        cal.add(Calendar.DATE, -1);
        return new SimpleDateFormat("yyyy-MM-dd").format(cal.getTime());
    }
}
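For comparison, a few of the same conversions can be written with java.time (Java 8 and later), whose formatters are immutable and safe to share between threads. This is only a sketch of an alternative, not the TimeUtils class above; JavaTimeSketch and its method names are made up for illustration:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original project.
final class JavaTimeSketch {
    // ISO_LOCAL_DATE formats and parses dates as yyyy-MM-dd
    private static final DateTimeFormatter DAY = DateTimeFormatter.ISO_LOCAL_DATE;

    // The day before the given date, as yyyy-MM-dd
    static String yesterday(LocalDate today) {
        return today.minusDays(1).format(DAY);
    }

    // Whole days from one yyyy-MM-dd date to another
    static long daysBetween(String from, String to) {
        return ChronoUnit.DAYS.between(LocalDate.parse(from, DAY), LocalDate.parse(to, DAY));
    }

    // Every day of the given year, as yyyy-MM-dd
    static List<String> allDaysOfYear(int year) {
        List<String> days = new ArrayList<String>();
        for (LocalDate d = LocalDate.of(year, 1, 1); d.getYear() == year; d = d.plusDays(1)) {
            days.add(d.format(DAY));
        }
        return days;
    }

    public static void main(String[] args) {
        System.out.println(yesterday(LocalDate.now()));
        System.out.println(daysBetween("2017-01-20", "2017-01-23"));
        System.out.println(allDaysOfYear(2017).size());
    }
}
```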
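This class rounds to a given number of decimal places; stock prices, for example, are kept to two decimal places.
package util;

/**
 * @author Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email 1563178220@qq.com
 */
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.text.DecimalFormat;

public class UumericalUtil {
    // Rounds f half-up to the given number of decimal places
    public static float FloatTO(float f, int number) {
        // Build the BigDecimal from the float's decimal text so that rounding
        // works on the digits as written rather than on the binary approximation
        BigDecimal b = new BigDecimal(Float.toString(f));
        return b.setScale(number, RoundingMode.HALF_UP).floatValue();
    }

    // Formats an int with at least two digits, e.g. 1 becomes "01"
    public static String NumberTO(int number) {
        DecimalFormat dfInt = new DecimalFormat("00");
        String sInt = dfInt.format(number);
        System.out.println(sInt);
        return sInt;
    }
}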
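When a float is rounded through BigDecimal, the result depends on how the BigDecimal is built: new BigDecimal(f) uses the exact binary value of the float, while new BigDecimal(Float.toString(f)) uses its decimal text. A small standalone sketch of the difference; RoundingSketch and its method names are hypothetical, and 2.675f is just an illustrative value:

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical demo class, not part of the original project.
final class RoundingSketch {
    // Rounds using the exact binary value stored in the float
    static float roundBinary(float f, int scale) {
        return new BigDecimal(f).setScale(scale, RoundingMode.HALF_UP).floatValue();
    }

    // Rounds using the decimal text of the float
    static float roundDecimal(float f, int scale) {
        return new BigDecimal(Float.toString(f)).setScale(scale, RoundingMode.HALF_UP).floatValue();
    }

    public static void main(String[] args) {
        // 2.675f is stored as a value slightly below 2.675,
        // so the two approaches disagree on the second decimal place
        System.out.println(roundBinary(2.675f, 2));
        System.out.println(roundDecimal(2.675f, 2));
    }
}
```

For values that are read from text such as "15.62", skipping float altogether and building the BigDecimal from the original string avoids the question entirely.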
parse
parse mainly uses Jsoup or other tools to parse the HTML. The parsed data is wrapped in a List and passed back, layer by layer, to the main method. Here I simply use the most basic approach, string parsing. Below is one page of data; this is the kind of data being parsed:
var quote_123={rank:["2,002662,京威股份,15.62,0.38,2.49%,2.95,10294,15948185,15.24,15.28,15.65,15.20,-,-,-,-,-,-,-,-,0.00%,0.62,0.17,33.47","2,002536,西泵股份,13.15,0.32,2.49%,3.74,26558,34710121,12.83,12.88,13.27,12.79,-,-,-,-,-,-,-,-,0.00%,0.99,0.87,41.09","1,600741,华域汽车,16.22,0.39,2.46%,2.59,215140,346480560,15.83,15.85,16.26,15.85,-,-,-,-,-,-,-,-,0.12%,1.23,0.75,8.59","1,601689,拓普集团,29.74,0.68,2.34%,3.20,36329,107964394,29.06,29.06,29.94,29.01,-,-,-,-,-,-,-,-,-0.20%,1.34,2.13,34.32","1,603306,华懋科技,33.87,0.74,2.23%,4.50,9251,31242113,33.13,33.14,34.20,32.71,-,-,-,-,-,-,-,-,-0.03%,0.72,1.25,29.60","1,601799,星宇股份,37.40,0.80,2.19%,3.80,5522,20477010,36.60,36.40,37.50,36.11,-,-,-,-,-,-,-,-,0.03%,0.86,0.23,28.43","1,603166,福达股份,14.02,0.29,2.11%,2.91,47265,66170428,13.73,13.80,14.14,13.74,-,-,-,-,-,-,-,-,0.21%,0.96,3.15,95.59","2,002190,成飞集成,32.44,0.66,2.08%,2.99,25213,81219488,31.78,31.63,32.58,31.63,-,-,-,-,-,-,-,-,0.03%,0.86,0.73,93.58","1,600213,亚星客车,14.77,0.30,2.07%,3.46,18878,27820060,14.47,14.52,14.88,14.38,-,-,-,-,-,-,-,-,-0.07%,0.64,0.86,55.39","2,300432,富临精工,21.28,0.43,2.06%,4.70,28707,60945368,20.85,20.60,21.58,20.60,-,-,-,-,-,-,-,-,-0.14%,1.29,2.07,50.58","2,300375,鹏翎股份,21.25,0.42,2.02%,3.94,11367,24164157,20.83,20.83,21.45,20.63,-,-,-,-,-,-,-,-,-0.14%,0.83,1.44,30.27","2,002363,隆基机械,11.47,0.22,1.96%,2.49,33946,38796837,11.25,11.27,11.55,11.27,-,-,-,-,-,-,-,-,0.00%,0.80,0.88,61.45","1,600469,风神股份,11.55,0.22,1.94%,3.09,38444,44305565,11.33,11.33,11.63,11.28,-,-,-,-,-,-,-,-,0.09%,0.67,0.68,27.07","2,002454,松芝股份,12.98,0.24,1.88%,2.83,27839,36056020,12.74,12.70,13.06,12.70,-,-,-,-,-,-,-,-,0.00%,1.17,0.87,25.84","2,002488,金固股份,14.79,0.27,1.86%,2.48,29002,42872475,14.52,14.52,14.88,14.52,-,-,-,-,-,-,-,-,0.00%,0.72,0.75,-","2,002284,亚太股份,13.18,0.24,1.85%,3.32,61756,81198133,12.94,12.87,13.30,12.87,-,-,-,-,-,-,-,-,0.30%,1.10,0.90,58.15","1,603788,宁波高发,35.97,0.64,1.81%,3.40,6719,24160418,35.33,35.21,36.33,35.13,-,-,-,-,-,-,-,-,0.03%,0.59,1.37,34.10","2,000957,中通客车,14.36,0.25,1.77%,2.69,59696,85581415,14.11,14.07,14.45,14.07,-,-,-,-,-,-,-,-,0.00%,0.79,1.25,13.99","2,300304,云意电气,52.12,0.90,1.76%,5.70,179330,922614032,51.22,50.38,52.83,49.91,-,-,-,-,-,-,-,-,-0.04%,1.12,9.35,108.58","2,002607,亚夏汽车,10.03,0.17,1.72%,4.16,27760,27878904,9.86,9.89,10.19,9.78,-,-,-,-,-,-,-,-,-0.30%,0.97,1.03,57.87"],pages:6}
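A response of this shape can also be pulled apart with regular expressions, which additionally gives access to the page count at the end. A minimal sketch, assuming every record is a double-quoted, comma-separated string and the page count follows `pages:` as in the sample above; RankResponseSketch and its methods are hypothetical names, and the sample body is cut down to the first two records:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser, not part of the original project.
final class RankResponseSketch {
    // One record: everything between a pair of double quotes
    private static final Pattern RECORD = Pattern.compile("\"([^\"]*)\"");
    // The page count that follows "pages:"
    private static final Pattern PAGES = Pattern.compile("pages:(\\d+)");

    // Returns one String[] of comma-separated fields per quoted record
    static List<String[]> parseRecords(String body) {
        List<String[]> records = new ArrayList<String[]>();
        Matcher m = RECORD.matcher(body);
        while (m.find()) {
            records.add(m.group(1).split(","));
        }
        return records;
    }

    // Returns the page count, or 0 when the body has none
    static int parsePages(String body) {
        Matcher m = PAGES.matcher(body);
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }

    public static void main(String[] args) {
        String body = "var quote_123={rank:["
                + "\"2,002662,京威股份,15.62,0.38,2.49%,2.95,10294,15948185,15.24,15.28,15.65,15.20,-,-,-,-,-,-,-,-,0.00%,0.62,0.17,33.47\","
                + "\"2,002536,西泵股份,13.15,0.32,2.49%,3.74,26558,34710121,12.83,12.88,13.27,12.79,-,-,-,-,-,-,-,-,0.00%,0.99,0.87,41.09\""
                + "],pages:6}";
        for (String[] fields : parseRecords(body)) {
            System.out.println(fields[1] + " " + fields[2] + " " + fields[3]);
        }
        System.out.println("pages: " + parsePages(body));
    }
}
```

Reading the page count from the response is one way to avoid hard-coding how many pages a board has.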
package parse;

/**
 * @author Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email 1563178220@qq.com
 */
import java.util.ArrayList;
import java.util.List;
import model.ExtMarketOilStockModel;
import util.HTTPUtils;
import util.TimeUtils;
import util.UumericalUtil;

public class ExtMarketOilStockParse {
    public static List<ExtMarketOilStockModel> parseurl(String url) throws Exception {
        List<ExtMarketOilStockModel> list = new ArrayList<ExtMarketOilStockModel>();
        String html = HTTPUtils.getRawHtml(url);
        // Keep only the array between "rank:" and ",pages"
        String jsonarra = html.split("rank:")[1].split(",pages")[0];
        // One quoted string per stock
        String stocks[] = jsonarra.split("\",");
        List<String> stocklist = new ArrayList<String>();
        for (int i = 0; i < stocks.length; i++) {
            // Strip the brackets and quotes around each record
            String stock = stocks[i].replace("[\"", "").replace("\"", "").replace("]", "");
            stocklist.add(stock);
            System.out.println(stock);
        }
        for (int i = 0; i < stocklist.size(); i++) {
            // Split each record once and read the fields by position
            String[] fields = stocklist.get(i).split(",");
            String date = TimeUtils.GetNowDate("yyyy-MM-dd");
            String stock_id = fields[1];
            String stock_name = fields[2];
            float stock_price = 0;
            float stock_change = 0;
            float stock_range = 0;
            float stock_amplitude = 0;
            int stock_trading_number = 0;
            int stock_trading_value = 0;
            float stock_yesterdayfinish_price = 0;
            float stock_todaystart_price = 0;
            float stock_max_price = 0;
            float stock_min_price = 0;
            float stock_fiveminuate_change = 0;
            // When the price field is "-", the numeric fields keep their default of 0
            if (!fields[3].equals("-")) {
                // Price
                stock_price = Float.parseFloat(fields[3]);
                // Price change (amount)
                stock_change = Float.parseFloat(fields[4]);
                // Price change (percentage), stored as a fraction with 4 decimals
                stock_range = UumericalUtil.FloatTO((float) (Float.parseFloat(fields[5].replace("%", "")) * 0.01), 4);
                stock_amplitude = UumericalUtil.FloatTO((float) (Float.parseFloat(fields[6].replace("%", "")) * 0.01), 4);
                stock_trading_number = Integer.parseInt(fields[7].replace("%", ""));
                stock_trading_value = Integer.parseInt(fields[8].replace("%", ""));
                stock_yesterdayfinish_price = Float.parseFloat(fields[9]);
                stock_todaystart_price = Float.parseFloat(fields[10]);
                stock_max_price = Float.parseFloat(fields[11]);
                stock_min_price = Float.parseFloat(fields[12]);
                stock_fiveminuate_change = UumericalUtil.FloatTO((float) (Float.parseFloat(fields[21].replace("%", "")) * 0.01), 4);
            }
            String craw_time = TimeUtils.GetNowDate("yyyy-MM-dd HH:mm:ss");
            ExtMarketOilStockModel model = new ExtMarketOilStockModel();
            model.setDate(date);
            model.setStock_id(stock_id);
            model.setStock_name(stock_name);
            model.setStock_price(stock_price);
            model.setStock_change(stock_change);
            model.setStock_range(stock_range);
            model.setStock_amplitude(stock_amplitude);
            model.setStock_trading_number(stock_trading_number);
            model.setStock_trading_value(stock_trading_value);
            model.setStock_yesterdayfinish_price(stock_yesterdayfinish_price);
            model.setStock_todaystart_price(stock_todaystart_price);
            model.setStock_max_price(stock_max_price);
            model.setStock_min_price(stock_min_price);
            model.setStock_fiveminuate_change(stock_fiveminuate_change);
            model.setCraw_time(craw_time);
            list.add(model);
        }
        return list;
    }
}
db
db contains two Java files, MyDataSource and MYSQLControl. Their roles were explained above.
package db;

/**
 * @author Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email 1563178220@qq.com
 */
import javax.sql.DataSource;
import org.apache.commons.dbcp2.BasicDataSource;

public class MyDataSource {
    public static DataSource getDataSource(String connectURI) {
        BasicDataSource ds = new BasicDataSource();
        // MySQL JDBC driver
        ds.setDriverClassName("com.mysql.jdbc.Driver");
        // MySQL login username
        ds.setUsername("root");
        // MySQL login password
        ds.setPassword("112233");
        // JDBC URL, which also names the database to connect to
        ds.setUrl(connectURI);
        return ds;
    }
}
package db;

import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;
import javax.sql.DataSource;
import org.apache.commons.dbutils.QueryRunner;
import org.apache.commons.dbutils.ResultSetHandler;
import org.apache.commons.dbutils.handlers.BeanListHandler;
import org.apache.commons.dbutils.handlers.ColumnListHandler;
import org.apache.commons.dbutils.handlers.ScalarHandler;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import model.ExtMarketOilStockModel;

/**
 * @author Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email 1563178220@qq.com
 */
public class MYSQLControl {
    static final Log logger = LogFactory.getLog(MYSQLControl.class);
    static DataSource ds = MyDataSource.getDataSource("jdbc:mysql://127.0.0.1:3306/datacollection");
    static QueryRunner qr = new QueryRunner(ds);

    // First kind of method: run an update statement
    public static void executeUpdate(String sql) {
        try {
            qr.update(sql);
        } catch (SQLException e) {
            logger.error(e);
        }
    }

    // Query a single value with SQL
    public static Object getScalaBySQL(String sql) {
        ResultSetHandler<Object> h = new ScalarHandler<Object>(1);
        Object obj = null;
        try {
            obj = qr.query(sql, h);
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return obj;
    }

    // Query multiple rows with SQL, mapped to beans of the given type
    public static <T> List<T> getListInfoBySQL(String sql, Class<T> type) {
        List<T> list = null;
        try {
            list = qr.query(sql, new BeanListHandler<T>(type));
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return list;
    }

    // Query one column
    public static List<Object> getListOneBySQL(String sql, String id) {
        List<Object> list = null;
        try {
            list = qr.query(sql, new ColumnListHandler<Object>(id));
        } catch (SQLException e) {
            e.printStackTrace();
        }
        return list;
    }

    // This way of writing to the database still needs optimizing
    public static int insertoilStocks(List<ExtMarketOilStockModel> oilstocks) {
        int c = insertStocks("ext_market_oil_stock", oilstocks);
        System.out.println("Oil data stored");
        return c;
    }

    // This way of writing to the database still needs optimizing
    public static int insertcarStocks(List<ExtMarketOilStockModel> carstocks) {
        int c = insertStocks("ext_market_car_stock", carstocks);
        System.out.println("Car data stored");
        return c;
    }

    // Batch-inserts one row per stock into the given table and
    // returns the number of rows that were not reported as failed
    private static int insertStocks(String table, List<ExtMarketOilStockModel> stocks) {
        Object[][] params = new Object[stocks.size()][17];
        for (int i = 0; i < stocks.size(); i++) {
            ExtMarketOilStockModel s = stocks.get(i);
            params[i][0] = s.getDate();
            params[i][1] = s.getStock_id();
            params[i][2] = s.getStock_name();
            params[i][3] = s.getStock_price();
            params[i][4] = s.getStock_change();
            params[i][5] = s.getStock_range();
            params[i][6] = s.getStock_amplitude();
            params[i][7] = s.getStock_trading_number();
            params[i][8] = s.getStock_trading_value();
            params[i][9] = s.getStock_yesterdayfinish_price();
            params[i][10] = s.getStock_todaystart_price();
            params[i][11] = s.getStock_max_price();
            params[i][12] = s.getStock_min_price();
            params[i][13] = s.getStock_fiveminuate_change();
            params[i][14] = s.getCraw_time();
            // The table's last two columns are left null
            params[i][15] = null;
            params[i][16] = null;
        }
        int c = 0; // number of successful rows
        try {
            // Target table and the values to insert
            int[] sum = qr.batch("INSERT INTO `datacollection`.`" + table + "` VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", params);
            for (int r : sum) {
                if (r != Statement.EXECUTE_FAILED) {
                    c++;
                }
            }
        } catch (SQLException e) {
            System.out.println(e);
        }
        return c;
    }
}
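In principle the crawler is now complete, and running the main method is all it takes. The figure below shows part of the data fetched by the main method.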
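The problems
Problem 1: stock data like this is published every Monday to Friday. How can the program fetch it automatically at a fixed time each day, instead of being run by hand every day?
Problem 2: the market does not open on holidays, yet the page still shows data on those days, and the page carries no time label. How should this be handled?
First, let me show you my database design.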
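The solution
Quartz is used here to run the program on a schedule, which solves the first problem (http://blog.csdn.net/qy20115549/article/details/52723907). For the second problem, deciding whether the market was closed on a given day, the approach is to draw three stocks at random from the database, taken from the most recent stored date. For example, if today is Saturday, January 21, three stocks dated January 20 are drawn at random and their prices are compared with today's prices for the same stock ids. If all three prices are identical, the day is judged to be a holiday: prices have not moved, so nothing needs to be inserted into the database.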
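The comparison described above can be sketched as a small pure function. This is an illustration under assumptions rather than the job's actual code: prices are passed as the strings read from the page or the database and compared as BigDecimal to avoid float equality, and TradingDayCheckSketch and looksLikeClosedMarket are hypothetical names:

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original project.
final class TradingDayCheckSketch {
    // True when every sampled stock has the same price today as on the last stored day.
    // Both maps go from stock id to price text; a missing id counts as "changed".
    static boolean looksLikeClosedMarket(Map<String, String> lastStored,
                                         Map<String, String> today,
                                         List<String> sampleIds) {
        if (sampleIds.isEmpty()) {
            return false; // nothing to compare, so treat the day as a trading day
        }
        for (String id : sampleIds) {
            String before = lastStored.get(id);
            String now = today.get(id);
            if (before == null || now == null
                    || new BigDecimal(before).compareTo(new BigDecimal(now)) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> friday = new HashMap<String, String>();
        friday.put("002662", "15.62");
        friday.put("002536", "13.15");
        friday.put("600741", "16.22");
        // On a day without trading the page still shows the previous prices
        Map<String, String> saturday = new HashMap<String, String>(friday);
        List<String> sample = Arrays.asList("002662", "002536", "600741");
        System.out.println(looksLikeClosedMarket(friday, saturday, sample));
    }
}
```

Note the limits of the heuristic itself: three stocks that all happen to close unchanged on a real trading day would be misread as a holiday, so a larger sample lowers that risk.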
job
package job;

import java.util.ArrayList;
import java.util.List;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import db.MYSQLControl;
import model.ExtMarketOilStockModel;
import parse.ExtMarketOilStockParse;

/**
 * @author Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email 1563178220@qq.com
 */
public class ExtMarketOilStockJob implements Job {
    @Override
    public void execute(JobExecutionContext arg0) throws JobExecutionException {
        // Three random stocks from the most recent stored date, used to decide whether today is a holiday
        List<ExtMarketOilStockModel> randomlist = MYSQLControl.getListInfoBySQL("select stock_id,stock_price,stock_change from ext_market_oil_stock where date = (select date from ext_market_oil_stock order by date desc limit 1) order by rand() limit 3", ExtMarketOilStockModel.class);
        if (randomlist == null) {
            randomlist = new ArrayList<ExtMarketOilStockModel>();
        }
        List<String> urloillist = new ArrayList<String>();
        List<String> urlcarlist = new ArrayList<String>();
        List<ExtMarketOilStockModel> oilstocks = new ArrayList<ExtMarketOilStockModel>();
        List<ExtMarketOilStockModel> carstocks = new ArrayList<ExtMarketOilStockModel>();
        String url1 = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=1&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.13204790262127375";
        String url2 = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=2&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.6972178580603532";
        urloillist.add(url1);
        urloillist.add(url2);
        // Fetch every oil page first, so the comparison sees all stocks before anything is stored
        for (int i = 0; i < urloillist.size(); i++) {
            try {
                oilstocks.addAll(ExtMarketOilStockParse.parseurl(urloillist.get(i)));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        // Count the sampled stocks whose price is unchanged today
        int judge = 0;
        for (ExtMarketOilStockModel sample : randomlist) {
            for (ExtMarketOilStockModel stock : oilstocks) {
                if (stock.getStock_id().equals(sample.getStock_id())
                        && Float.compare(stock.getStock_price(), sample.getStock_price()) == 0) {
                    judge++;
                    break;
                }
            }
        }
        // All three samples unchanged means the market was closed;
        // with fewer than three samples (for example an empty table) the data is always stored
        boolean closed = randomlist.size() == 3 && judge == 3;
        if (!closed) {
            MYSQLControl.insertoilStocks(oilstocks);
            // The car-related stocks span 6 pages (pages 1 to 6 inclusive)
            for (int i = 1; i <= 6; i++) {
                String urli = "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04811&sty=FCOIATA&sortType=C&sortRule=-1&page=" + i + "&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.23492960370783944";
                urlcarlist.add(urli);
            }
            for (int i = 0; i < urlcarlist.size(); i++) {
                try {
                    carstocks = ExtMarketOilStockParse.parseurl(urlcarlist.get(i));
                    MYSQLControl.insertcarStocks(carstocks);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
jobmain
As shown below, the job is scheduled for 20:39 (8:39 p.m.) every Monday to Friday, so the data is fetched on every weekday.
package jobmain;

import static org.quartz.CronScheduleBuilder.cronSchedule;
import static org.quartz.JobBuilder.newJob;
import static org.quartz.TriggerBuilder.newTrigger;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.quartz.CronTrigger;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SchedulerFactory;
import org.quartz.impl.StdSchedulerFactory;
import job.ExtMarketOilStockJob;

/**
 * @author Qian Yang (钱洋), School of Management, Hefei University of Technology
 * @email 1563178220@qq.com
 */
public class ExtMarketOilStockJobMain {
    public void go() throws Exception {
        // First, obtain a reference to a Scheduler
        SchedulerFactory sf = new StdSchedulerFactory();
        Scheduler sched = sf.getScheduler();
        // Jobs can be scheduled before sched.start() is called
        JobDetail job = newJob(ExtMarketOilStockJob.class).withIdentity("stockjob", "stockgroup").build();
        // Run the job at 20:39 every Monday to Friday
        CronTrigger trigger = newTrigger().withIdentity("stocktrigger", "stockgroup").withSchedule(cronSchedule("0 39 20 ? * MON-FRI")).build();
        Date ft = sched.scheduleJob(job, trigger);
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss SSS");
        System.out.println(job.getKey() + " has been scheduled to run at: " + sdf.format(ft) + ", and repeats according to: " + trigger.getCronExpression());
        sched.start();
    }

    public static void main(String[] args) throws Exception {
        ExtMarketOilStockJobMain maingo = new ExtMarketOilStockJobMain();
        maingo.go();
    }
}
Run the class in jobmain, and the data will be crawled at the set time every day.
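To make the trigger concrete: the cron expression 0 39 20 ? * MON-FRI used in jobmain fires at 20:39 on Monday to Friday. The sketch below computes the next such moment using java.time only. It is not Quartz and does not replace it; WeekdayScheduleSketch and nextRun are hypothetical names used for illustration:

```java
import java.time.DayOfWeek;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Hypothetical helper, not part of the original project and not a Quartz API.
final class WeekdayScheduleSketch {
    // Next Monday-to-Friday occurrence of runAt that is strictly after now
    static LocalDateTime nextRun(LocalDateTime now, LocalTime runAt) {
        LocalDateTime candidate = now.toLocalDate().atTime(runAt);
        if (!candidate.isAfter(now)) {
            candidate = candidate.plusDays(1); // today's slot has already passed
        }
        while (candidate.getDayOfWeek() == DayOfWeek.SATURDAY
                || candidate.getDayOfWeek() == DayOfWeek.SUNDAY) {
            candidate = candidate.plusDays(1); // skip the weekend
        }
        return candidate;
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.now();
        LocalDateTime next = nextRun(now, LocalTime.of(20, 39));
        System.out.println("next run: " + next
                + " (in " + Duration.between(now, next).toMinutes() + " minutes)");
    }
}
```

A weekday-only schedule still fires on public holidays that fall between Monday and Friday, which is why the job also needs the price comparison described in the solution section.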