
How to make a crawler run every day at a fixed time (using stock data as an example)

2017-01-21 16:21
Analyzing the data to scrape

Packet capture

Framework

model

main

util

parse

db

The problems

Solution
job

jobmain

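Recently, someone copied posts from my blog and uploaded them directly to Baidu Wenku and similar platforms.

This is an original post, intended for technical study only. Copying it and uploading it to Baidu Wenku or similar platforms without permission is prohibited. If you repost it, please cite the address (link) of this post.

Analyzing the data to scrape

This post uses data from East Money (东方财富网) as its example. It is for technical study only, so please do not abuse it. The data to scrape are the automobile and petroleum sector listings on East Money, at the following addresses: http://quote.eastmoney.com/center/list.html#28002481_0_2

http://quote.eastmoney.com/center/list.html#28002464_0_2

The screenshot below shows the data format.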



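Packet capture

The first step in writing a crawler is to capture the network traffic and find the real address that serves the data, as covered in my earlier posts. For the reasoning behind the design used here, see my column on crawler principles and fundamentals: http://blog.csdn.net/column/details/14269.html



The capture above shows the real request address and the request method. The response is a JSON-style array, as shown below: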



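Framework

The project is organized as shown below:



db: database code. It contains MyDataSource (driver registration plus the database user name and password) and MYSQLControl (connecting to the database; insert, update, and create-table operations).

model: wraps the data as objects. Put plainly, it holds the attribute names of the data I work with. If this is unclear, see my earlier post on a simple web crawler (http://blog.csdn.net/qy20115549/article/details/52203722).

parse: parses the content fetched by util. HTML is usually parsed with Jsoup; JSON can be parsed with regular expressions or with fastjson. I recommend fastjson because it is simple and quick to use.

main: the program entry point and the core of the flow: fetch the data, execute the SQL statements, store the data.

job: the job to be executed.

jobmain: the controller, which decides when the job runs; in this post, once a day. The stock market closes at 3 p.m., so the crawl is scheduled for some time after that.

model

The model wraps the data to be scraped, such as the current date, stock id, stock name, and stock price: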

package model;
/**
* @author: Qian Yang, School of Management, Hefei University of Technology
* @email: 1563178220@qq.com
*/
public class ExtMarketOilStockModel {
private String date;
private String stock_id;
private String stock_name;
private float stock_price;
private float stock_change;
private float stock_range;
private float stock_amplitude;
private int stock_trading_number;
private int stock_trading_value;
private float stock_yesterdayfinish_price;
private float stock_todaystart_price;
private float stock_max_price;
private float stock_min_price;
private float stock_fiveminuate_change;
private String craw_time;
public String getDate() {
return date;
}
public void setDate(String date) {
this.date = date;
}

public String getStock_id() {
return stock_id;
}
public void setStock_id(String stock_id) {
this.stock_id = stock_id;
}
public String getStock_name() {
return stock_name;
}
public void setStock_name(String stock_name) {
this.stock_name = stock_name;
}
public float getStock_price() {
return stock_price;
}
public void setStock_price(float stock_price) {
this.stock_price = stock_price;
}
public float getStock_change() {
return stock_change;
}
public void setStock_change(float stock_change) {
this.stock_change = stock_change;
}
public float getStock_range() {
return stock_range;
}
public void setStock_range(float stock_range) {
this.stock_range = stock_range;
}
public float getStock_amplitude() {
return stock_amplitude;
}
public void setStock_amplitude(float stock_amplitude) {
this.stock_amplitude = stock_amplitude;
}

public int getStock_trading_number() {
return stock_trading_number;
}
public void setStock_trading_number(int stock_trading_number) {
this.stock_trading_number = stock_trading_number;
}
public int getStock_trading_value() {
return stock_trading_value;
}
public void setStock_trading_value(int stock_trading_value) {
this.stock_trading_value = stock_trading_value;
}
public float getStock_yesterdayfinish_price() {
return stock_yesterdayfinish_price;
}
public void setStock_yesterdayfinish_price(float stock_yesterdayfinish_price) {
this.stock_yesterdayfinish_price = stock_yesterdayfinish_price;
}
public float getStock_todaystart_price() {
return stock_todaystart_price;
}
public void setStock_todaystart_price(float stock_todaystart_price) {
this.stock_todaystart_price = stock_todaystart_price;
}
public float getStock_max_price() {
return stock_max_price;
}
public void setStock_max_price(float stock_max_price) {
this.stock_max_price = stock_max_price;
}
public float getStock_min_price() {
return stock_min_price;
}
public void setStock_min_price(float stock_min_price) {
this.stock_min_price = stock_min_price;
}
public float getStock_fiveminuate_change() {
return stock_fiveminuate_change;
}
public void setStock_fiveminuate_change(float stock_fiveminuate_change) {
this.stock_fiveminuate_change = stock_fiveminuate_change;
}
public String getCraw_time() {
return craw_time;
}
public void setCraw_time(String craw_time) {
this.craw_time = craw_time;
}
}


main

The main method should stay as simple as possible, so I wrote it as follows. The comments explain each step.

package navi.main;
/**
* @author: Qian Yang, School of Management, Hefei University of Technology
* @email: 1563178220@qq.com
*/
import java.util.ArrayList;
import java.util.List;

import db.MYSQLControl;
import model.ExtMarketOilStockModel;
import parse.ExtMarketOilStockParse;

public class ExtMarketOilStockMain {

public static void main(String[] args) throws Exception {
List<String> urloillist=new ArrayList<String>();
List<String> urlcarlist=new ArrayList<String>();
List<ExtMarketOilStockModel> oilstocks=new ArrayList<ExtMarketOilStockModel>();
List<ExtMarketOilStockModel> carstocks=new ArrayList<ExtMarketOilStockModel>();
// The petroleum sector has only two pages, hence two URLs
String url1="http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=1&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.13204790262127375";
String url2="http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=2&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.6972178580603532";
urloillist.add(url1);
urloillist.add(url2);
for (int i = 0; i < urloillist.size(); i++) {
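// fetch and parse the page
oilstocks=ExtMarketOilStockParse.parseurl(urloillist.get(i));
// store this page's data
MYSQLControl.insertoilStocks(oilstocks);
}
// The automobile sector has 6 pages, hence 6 URLs (pages 1 through 6 inclusive)
for (int i = 1; i <= 6; i++) {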
String urli="http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04811&sty=FCOIATA&sortType=C&sortRule=-1&page="+i+"&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.23492960370783944";
urlcarlist.add(urli);
}
for (int i = 0; i < urlcarlist.size(); i++) {
// fetch and parse the page
carstocks=ExtMarketOilStockParse.parseurl(urlcarlist.get(i));
// store the data
MYSQLControl.insertcarStocks(carstocks);
}

}

}
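
The page URLs in the main method differ only in the sector code and the page number, so they can be generated rather than written out. The following is a minimal sketch, not the author's code: the class name `BoardUrls` and the method `pageUrls` are hypothetical, and the query string (including the `token` and `_g` values) is copied from the listing above on the assumption that the endpoint still accepts it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds one URL per page for a given sector code.
final class BoardUrls {
    private static final String BASE =
        "http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx";

    // boardCode is the value of the cmd parameter, e.g. "C.BK04811"; pages is the page count.
    static List<String> pageUrls(String boardCode, int pages) {
        List<String> urls = new ArrayList<>();
        for (int page = 1; page <= pages; page++) { // inclusive upper bound
            urls.add(BASE
                + "?type=CT&cmd=" + boardCode
                + "&sty=FCOIATA&sortType=C&sortRule=-1&page=" + page
                + "&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}"
                + "&token=7bc05d0d4c3c22ef9fca8c2a912d779c"
                + "&jsName=quote_123&_g=0.23492960370783944");
        }
        return urls;
    }

    public static void main(String[] args) {
        for (String url : pageUrls("C.BK04811", 6)) {
            System.out.println(url);
        }
    }
}
```

Because the response itself reports `pages`, the page count could also be read from the first response instead of being hard-coded.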


util

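There are three files here: HTTPUtils, TimeUtils (a class I use often, mostly for date conversions such as String to Date and getting the current time), and UumericalUtil (rounds a float to a given number of decimal places).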

package util;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

/**
* @author: Qian Yang, School of Management, Hefei University of Technology
* @email: 1563178220@qq.com
*/
public abstract class HTTPUtils {
// Requests the given URL and returns the raw response body (HTML, JSON, etc.)
public static String  getRawHtml(String personalUrl) throws InterruptedException,IOException {
URL url = new URL(personalUrl);
URLConnection conn = url.openConnection();
InputStream in=null;
try {
conn.setConnectTimeout(3000);
in = conn.getInputStream();
} catch (Exception e) {
// Report the failure instead of swallowing it silently; an empty string is returned below
e.printStackTrace();
}
// Convert the response stream to a String
String html = convertStreamToString(in);
return html;
}
// Converts an InputStream to a String
public static String convertStreamToString(InputStream is) throws IOException {
if (is == null)
return "";
BufferedReader reader = new BufferedReader(new InputStreamReader(is,"utf-8"));
StringBuilder sb = new StringBuilder();
String line = null;
try {
while ((line = reader.readLine()) != null) {
sb.append(line);
}
} catch (IOException e) {
e.printStackTrace();
} finally {
try {
is.close();
} catch (IOException e) {
e.printStackTrace();
}
}
reader.close();
return sb.toString();

}
}
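
The stream-to-string step can also be written with try-with-resources, which closes the reader (and the wrapped stream) on every path. This is a sketch of that alternative, not a replacement the article prescribes; the class name `StreamText` and the choice to rethrow as `UncheckedIOException` are assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads a stream fully into a String, joining lines without separators.
final class StreamText {
    static String readAll(InputStream in, Charset charset) {
        if (in == null) {
            return ""; // same convention as convertStreamToString: no stream, empty result
        }
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader and the underlying stream automatically
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        InputStream in = new java.io.ByteArrayInputStream(
            "var quote_123={rank:[],\npages:0}".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in, StandardCharsets.UTF_8));
    }
}
```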


The following class converts between time formats; feel free to reuse it.

package util;

import java.text.DateFormat;
import java.text.DecimalFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.List;
/**
* @author: Qian Yang, School of Management, Hefei University of Technology
* @email: 1563178220@qq.com
*/
public class TimeUtils {

public static void main( String[] args ) throws ParseException{

String time = getMonth("2002-1-08 14:50:38");
System.out.println(time);
System.out.println(getDay("2002-1-08 14:50:38"));
System.out.println(TimeUtils.parseTime("2016-05-19 19:17","yyyy-MM-dd HH:mm"));

}
//get current time
public static String GetNowDate(String formate){
String temp_str="";
Date dt = new Date();
SimpleDateFormat sdf = new SimpleDateFormat(formate);
temp_str=sdf.format(dt);
return temp_str;
}
public static String getMonth( String time ){

SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM");
Date date = null;
try {

date = sdf.parse(time);
Calendar cal = Calendar.getInstance();
cal.setTime(date);

} catch (ParseException e) {
e.printStackTrace();
}

return sdf.format(date);

}

public static String getDay( String time ){

SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
Date date = null;
try {

date = sdf.parse(time);
Calendar cal = Calendar.getInstance();
cal.setTime(date);

} catch (ParseException e) {
e.printStackTrace();
}

return sdf.format(date);

}

public static Date parseTime(String inputTime) throws ParseException{

SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date date = sdf.parse(inputTime);

return date;

}
public static String dateToString(Date date, String type) {
DateFormat df = new SimpleDateFormat(type);
return df.format(date);
}
public static Date parseTime(String inputTime, String timeFormat) throws ParseException{

SimpleDateFormat sdf = new SimpleDateFormat(timeFormat);
Date date = sdf.parse(inputTime);

return date;

}

public static Calendar parseTimeToCal(String inputTime, String timeFormat) throws ParseException{

SimpleDateFormat sdf = new SimpleDateFormat(timeFormat);
Date date = sdf.parse(inputTime);
Calendar calendar = Calendar.getInstance();
calendar.setTime(date);

return calendar;

}

public static int getDaysBetweenCals(Calendar cal1, Calendar cal2) throws ParseException{

return (int) ((cal2.getTimeInMillis()-cal1.getTimeInMillis())/(1000*24*3600));

}

public static Date parseTime(long inputTime){

//  SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date date= new Date(inputTime);
return date;

}

public static String parseTimeString(long inputTime){

SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
Date date= new Date(inputTime);
return sdf.format(date);

}
public static String parseStringTime(String inputTime){

String date=null;
try {
Date date1 = new SimpleDateFormat("yyyyMMddHHmmss").parse(inputTime);
date=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(date1);
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}

return date;
}
public static List<String> YearMonth(int year) {
List<String> yearmouthlist=new ArrayList<String>();
for (int i = 1; i < 13; i++) {
DecimalFormat dfInt=new DecimalFormat("00");
String sInt = dfInt.format(i);
yearmouthlist.add(year+sInt);
}

return yearmouthlist;
}
public static List<String> YearMonth(int startyear,int finistyear) {
List<String> yearmouthlist=new ArrayList<String>();
for (int i = startyear; i < finistyear+1; i++) {
for (int j = 1; j < 13; j++) {
DecimalFormat dfInt=new DecimalFormat("00");
String sInt = dfInt.format(j);
yearmouthlist.add(i +"-"+sInt);
}
}
return yearmouthlist;
}
public static List<String> TOAllDay(int year){
List<String> daylist=new ArrayList<String>();
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
int m=1;// month counter
while (m<13)
{
int month=m;
Calendar cal=Calendar.getInstance();// get a calendar instance
cal.clear();// clear all fields
cal.set(Calendar.YEAR,year);
cal.set(Calendar.MONTH,month-1);// Calendar months are zero-based
cal.set(Calendar.DAY_OF_MONTH,1);// start at the first day of the month

int count=cal.getActualMaximum(Calendar.DAY_OF_MONTH);

// Add the 1st itself, then step through the remaining days of the month
daylist.add(sdf.format(cal.getTime()));
for (int j=0;j<=(count - 2);)
{
cal.add(Calendar.DAY_OF_MONTH,+1);
j++;
daylist.add(sdf.format(cal.getTime()));
}
m++;
}
return daylist;
}
// Returns yesterday's date as yyyy-MM-dd
public static String getyesterday(){
Calendar   cal   =   Calendar.getInstance();
cal.add(Calendar.DATE,   -1);
String yesterday = new SimpleDateFormat( "yyyy-MM-dd").format(cal.getTime());
return yesterday;
}
}
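
For the date arithmetic a daily stock crawler needs, the `java.time` API (Java 8 and later) is a compact alternative to `Calendar`. The sketch below is an illustration rather than part of the original project: the class `TradingDays` and both method names are hypothetical, and it only knows about weekends, not exchange holidays.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;

// Hypothetical helper: weekday checks for a Monday-to-Friday crawl schedule.
final class TradingDays {
    // True for Monday through Friday; public holidays are NOT considered.
    static boolean isWeekday(LocalDate date) {
        DayOfWeek day = date.getDayOfWeek();
        return day != DayOfWeek.SATURDAY && day != DayOfWeek.SUNDAY;
    }

    // The closest weekday strictly before the given date.
    static LocalDate previousWeekday(LocalDate date) {
        LocalDate candidate = date.minusDays(1);
        while (!isWeekday(candidate)) {
            candidate = candidate.minusDays(1);
        }
        return candidate;
    }

    public static void main(String[] args) {
        LocalDate saturday = LocalDate.of(2017, 1, 21);
        System.out.println(isWeekday(saturday));
        System.out.println(previousWeekday(saturday));
    }
}
```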


This class rounds values to a given number of decimal places, for example stock prices to two decimals.

package util;
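/**
* @author: Qian Yang, School of Management, Hefei University of Technology
* @email: 1563178220@qq.com
*/
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.text.DecimalFormat;

public class UumericalUtil {

// Rounds f to the given number of decimal places (half up)
public static float FloatTO(float f, int number) {
// Build the BigDecimal from the decimal text so rounding sees e.g. 1.005, not its binary approximation
BigDecimal   b  =   new BigDecimal(Float.toString(f));
float   f1   =  b.setScale(number, RoundingMode.HALF_UP).floatValue();
return f1;
}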
public static String NumberTO(int number) {
DecimalFormat dfInt=new DecimalFormat("00");
String sInt = dfInt.format(number);
System.out.println(sInt);
return sInt;
}

}
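
One subtlety when rounding a float through `BigDecimal`: `new BigDecimal(f)` sees the exact binary value stored in the float, while `new BigDecimal(Float.toString(f))` sees its shortest decimal text. For values that sit on a rounding boundary the two can disagree. The sketch below contrasts them; `FloatRounding`, `roundExact`, and `roundDecimal` are names invented for this illustration.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical demo: two ways of rounding a float with BigDecimal.
final class FloatRounding {
    // Rounds the exact binary value held by the float.
    static float roundExact(float f, int scale) {
        return new BigDecimal(f).setScale(scale, RoundingMode.HALF_UP).floatValue();
    }

    // Rounds the decimal text of the float, which matches what a reader sees printed.
    static float roundDecimal(float f, int scale) {
        return new BigDecimal(Float.toString(f)).setScale(scale, RoundingMode.HALF_UP).floatValue();
    }

    public static void main(String[] args) {
        float boundary = 1.005f; // stored slightly below 1.005 in binary
        System.out.println(roundExact(boundary, 2));
        System.out.println(roundDecimal(boundary, 2));
    }
}
```

Which behavior is wanted depends on the data; for prices that arrive as decimal text, rounding the decimal form is usually the less surprising choice.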


parse

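The parse package generally uses Jsoup or another tool to parse HTML, wraps the parsed data in a List, and returns it up through the layers to the main method. Here only the simplest approach, string splitting, is used. The following is one page of data, the kind of payload the parser handles: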

var quote_123={rank:["2,002662,京威股份,15.62,0.38,2.49%,2.95,10294,15948185,15.24,15.28,15.65,15.20,-,-,-,-,-,-,-,-,0.00%,0.62,0.17,33.47","2,002536,西泵股份,13.15,0.32,2.49%,3.74,26558,34710121,12.83,12.88,13.27,12.79,-,-,-,-,-,-,-,-,0.00%,0.99,0.87,41.09","1,600741,华域汽车,16.22,0.39,2.46%,2.59,215140,346480560,15.83,15.85,16.26,15.85,-,-,-,-,-,-,-,-,0.12%,1.23,0.75,8.59","1,601689,拓普集团,29.74,0.68,2.34%,3.20,36329,107964394,29.06,29.06,29.94,29.01,-,-,-,-,-,-,-,-,-0.20%,1.34,2.13,34.32","1,603306,华懋科技,33.87,0.74,2.23%,4.50,9251,31242113,33.13,33.14,34.20,32.71,-,-,-,-,-,-,-,-,-0.03%,0.72,1.25,29.60","1,601799,星宇股份,37.40,0.80,2.19%,3.80,5522,20477010,36.60,36.40,37.50,36.11,-,-,-,-,-,-,-,-,0.03%,0.86,0.23,28.43","1,603166,福达股份,14.02,0.29,2.11%,2.91,47265,66170428,13.73,13.80,14.14,13.74,-,-,-,-,-,-,-,-,0.21%,0.96,3.15,95.59","2,002190,成飞集成,32.44,0.66,2.08%,2.99,25213,81219488,31.78,31.63,32.58,31.63,-,-,-,-,-,-,-,-,0.03%,0.86,0.73,93.58","1,600213,亚星客车,14.77,0.30,2.07%,3.46,18878,27820060,14.47,14.52,14.88,14.38,-,-,-,-,-,-,-,-,-0.07%,0.64,0.86,55.39","2,300432,富临精工,21.28,0.43,2.06%,4.70,28707,60945368,20.85,20.60,21.58,20.60,-,-,-,-,-,-,-,-,-0.14%,1.29,2.07,50.58","2,300375,鹏翎股份,21.25,0.42,2.02%,3.94,11367,24164157,20.83,20.83,21.45,20.63,-,-,-,-,-,-,-,-,-0.14%,0.83,1.44,30.27","2,002363,隆基机械,11.47,0.22,1.96%,2.49,33946,38796837,11.25,11.27,11.55,11.27,-,-,-,-,-,-,-,-,0.00%,0.80,0.88,61.45","1,600469,风神股份,11.55,0.22,1.94%,3.09,38444,44305565,11.33,11.33,11.63,11.28,-,-,-,-,-,-,-,-,0.09%,0.67,0.68,27.07","2,002454,松芝股份,12.98,0.24,1.88%,2.83,27839,36056020,12.74,12.70,13.06,12.70,-,-,-,-,-,-,-,-,0.00%,1.17,0.87,25.84","2,002488,金固股份,14.79,0.27,1.86%,2.48,29002,42872475,14.52,14.52,14.88,14.52,-,-,-,-,-,-,-,-,0.00%,0.72,0.75,-","2,002284,亚太股份,13.18,0.24,1.85%,3.32,61756,81198133,12.94,12.87,13.30,12.87,-,-,-,-,-,-,-,-,0.30%,1.10,0.90,58.15","1,603788,宁波高发,35.97,0.64,1.81%,3.40,6719,24160418,35.33,35.21,36.33,35.13,-,-,-,-,-,-,-,-,0.03%,0.59,1.37,34.10","2,000957,中通客车,14.36,0.25,1.77%,2.69,59696,85581415,14.11,14.07,14.45,14.07,-,-,-,-,-,-,-,-,0.00%,0.79,1.25,13.99","2,300304,云意电气,52.12,0.90,1.76%,5.70,179330,922614032,51.22,50.38,52.83,49.91,-,-,-,-,-,-,-,-,-0.04%,1.12,9.35,108.58","2,002607,亚夏汽车,10.03,0.17,1.72%,4.16,27760,27878904,9.86,9.89,10.19,9.78,-,-,-,-,-,-,-,-,-0.30%,0.97,1.03,57.87"],pages:6}
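
The payload is a JavaScript assignment with unquoted keys, so it is not strict JSON. One way to pull the records out without chained `split` and `replace` calls is a pair of regular expressions: one for each double-quoted record, one for the `pages` count. This is a sketch under that reading of the sample above; `QuotePayload`, `records`, and `pageCount` are hypothetical names, and the stock names in the demo are ASCII stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts records and the page count from the quote_123 payload.
final class QuotePayload {
    // Each record is one double-quoted, comma-separated string inside rank:[...]
    private static final Pattern RECORD = Pattern.compile("\"([^\"]*)\"");
    private static final Pattern PAGES = Pattern.compile("pages:(\\d+)");

    // Returns one String[] of fields per record, in payload order.
    static List<String[]> records(String payload) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = RECORD.matcher(payload);
        while (m.find()) {
            rows.add(m.group(1).split(",", -1)); // -1 keeps trailing empty fields
        }
        return rows;
    }

    // Returns the total page count, or 0 if the payload does not contain one.
    static int pageCount(String payload) {
        Matcher m = PAGES.matcher(payload);
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }

    public static void main(String[] args) {
        String sample = "var quote_123={rank:[\"2,002662,NAME_A,15.62,0.38,2.49%\","
            + "\"1,600741,NAME_B,16.22,0.39,2.46%\"],pages:6}";
        for (String[] row : records(sample)) {
            System.out.println(row[1] + " " + row[3]);
        }
        System.out.println(pageCount(sample));
    }
}
```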


package parse;
/**
* @author: Qian Yang, School of Management, Hefei University of Technology
* @email: 1563178220@qq.com
*/
import java.util.ArrayList;
import java.util.List;
import model.ExtMarketOilStockModel;
import util.HTTPUtils;
import util.TimeUtils;
import util.UumericalUtil;
public class ExtMarketOilStockParse {
public static List<ExtMarketOilStockModel> parseurl(String url) throws Exception {
List<ExtMarketOilStockModel> list=new ArrayList<ExtMarketOilStockModel>();
String response=HTTPUtils.getRawHtml(url);
String html = response.toString();
String jsonarra=html.split("rank:")[1].split(",pages")[0];
String stocks[]=jsonarra.split("\",");
List<String> stocklist=new ArrayList<String>();
for (int i = 0; i < stocks.length; i++) {
stocklist.add(stocks[i].replace("[\"", "").replace("\"", "").replace("]", ""));
System.out.println(stocks[i].replace("[\"", "").replace("\"", "").replace("]", ""));
}
for (int i = 0; i < stocklist.size(); i++) {
String date=TimeUtils.GetNowDate("yyyy-MM-dd");
String stock_id=stocklist.get(i).split(",")[1];
String stock_name=stocklist.get(i).split(",")[2];
float stock_price=0;
float stock_change=0;
float stock_range=0;
float stock_amplitude=0;
int stock_trading_number=0;
int stock_trading_value=0;
float stock_yesterdayfinish_price=0;
float stock_todaystart_price=0;
float stock_max_price=0;
float stock_min_price=0;
float stock_fiveminuate_change=0;
// Split the record once and reuse the fields; "-" in the price field means no quote is available
String[] fields=stocklist.get(i).split(",");
if (!fields[3].equals("-")) {
// latest price
stock_price=Float.parseFloat(fields[3]);
// price change (amount)
stock_change=Float.parseFloat(fields[4]);
System.out.println(stock_change);
// price change (percent), stored as a fraction
stock_range=UumericalUtil.FloatTO((float) (Float.parseFloat(fields[5].replace("%", ""))*0.01),4);
// amplitude, stored as a fraction
stock_amplitude=UumericalUtil.FloatTO((float) (Float.parseFloat(fields[6].replace("%", ""))*0.01),4);
// trading volume and trading value
stock_trading_number=Integer.parseInt(fields[7]);
stock_trading_value=Integer.parseInt(fields[8]);
// previous close, open, high, low
stock_yesterdayfinish_price=Float.parseFloat(fields[9]);
stock_todaystart_price=Float.parseFloat(fields[10]);
stock_max_price=Float.parseFloat(fields[11]);
stock_min_price=Float.parseFloat(fields[12]);
// five-minute change, stored as a fraction
stock_fiveminuate_change=UumericalUtil.FloatTO((float) (Float.parseFloat(fields[21].replace("%", ""))*0.01),4);
System.out.println(stock_fiveminuate_change);
}
String craw_time=TimeUtils.GetNowDate("yyyy-MM-dd HH:mm:ss");
ExtMarketOilStockModel model=new ExtMarketOilStockModel();
model.setDate(date);
model.setStock_id(stock_id);
model.setStock_name(stock_name);
model.setStock_price(stock_price);
model.setStock_change(stock_change);
model.setStock_range(stock_range);
model.setStock_amplitude(stock_amplitude);
model.setStock_trading_number(stock_trading_number);
model.setStock_trading_value(stock_trading_value);
model.setStock_yesterdayfinish_price(stock_yesterdayfinish_price);
model.setStock_todaystart_price(stock_todaystart_price);
model.setStock_max_price(stock_max_price);
model.setStock_min_price(stock_min_price);
model.setStock_fiveminuate_change(stock_fiveminuate_change);
model.setCraw_time(craw_time);
list.add(model);
}
return list;
}
}
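
Individual fields in a record can be `-` even when the price is present (the sample data contains several), and a trading value larger than about 2.1 billion would not fit in an `int`. A defensive variant of the field parsing might look like the sketch below. It is not the author's code: `SafeFields` and its methods are hypothetical, and switching the value to `long` would also require changing the model and the table column.

```java
// Hypothetical helper: parses fields that may be "-" or empty, with a fallback value.
final class SafeFields {
    static float parseFloatOrDefault(String field, float fallback) {
        if (field == null || field.isEmpty() || field.equals("-")) {
            return fallback;
        }
        // Percent fields such as "2.49%" are accepted; the sign is kept, the % is dropped
        return Float.parseFloat(field.replace("%", ""));
    }

    static long parseLongOrDefault(String field, long fallback) {
        if (field == null || field.isEmpty() || field.equals("-")) {
            return fallback;
        }
        return Long.parseLong(field); // long avoids overflow for large trading values
    }

    public static void main(String[] args) {
        System.out.println(parseFloatOrDefault("15.62", 0f));
        System.out.println(parseFloatOrDefault("-", 0f));
        System.out.println(parseLongOrDefault("922614032", 0L));
    }
}
```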


db

The db package contains two Java files, MyDataSource and MYSQLControl, whose roles were described above.

package db;
/**
* @author: Qian Yang, School of Management, Hefei University of Technology
* @email: 1563178220@qq.com
*/
import javax.sql.DataSource;
import org.apache.commons.dbcp2.BasicDataSource;

public class MyDataSource {

public static DataSource getDataSource(String connectURI){

BasicDataSource ds = new BasicDataSource();
// JDBC driver for MySQL
ds.setDriverClassName("com.mysql.jdbc.Driver");
ds.setUsername("root");              // MySQL user name
ds.setPassword("112233");                // MySQL login password
ds.setUrl(connectURI);

return ds;

}

}


package db;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;
import org.apache.commons.dbutils.QueryRunner;
import org.apache.commons.dbutils.ResultSetHandler;
import org.apache.commons.dbutils.handlers.BeanListHandler;
import org.apache.commons.dbutils.handlers.ColumnListHandler;
import org.apache.commons.dbutils.handlers.ScalarHandler;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import model.ExtMarketOilStockModel;
/**
* @author: Qian Yang, School of Management, Hefei University of Technology
* @email: 1563178220@qq.com
*/
public class MYSQLControl {
static final Log logger = LogFactory.getLog(MYSQLControl.class);
static DataSource ds = MyDataSource.getDataSource("jdbc:mysql://127.0.0.1:3306/datacollection");
static QueryRunner qr = new QueryRunner(ds);
// Executes a single update statement (INSERT, UPDATE, DDL, etc.)
public static void executeUpdate(String sql){
try {
qr.update(sql);
} catch (SQLException e) {
logger.error(e);
}
}
// Runs a query and returns a single value
public static Object getScalaBySQL ( String sql ){

ResultSetHandler<Object> h = new ScalarHandler<Object>(1);
Object obj = null;
try {
obj = qr.query(sql, h);
} catch (SQLException e) {
e.printStackTrace();
}
return obj;

}
// Runs a query and maps every row to a bean of the given type
public static <T> List<T> getListInfoBySQL (String sql, Class<T> type ){
List<T> list = null;
try {
list = qr.query(sql,new BeanListHandler<T>(type));
} catch (SQLException e) {
e.printStackTrace();
}
return list;
}
// Runs a query and returns the values of one column
public static List<Object> getListOneBySQL (String sql,String id){
List<Object> list=null;

try {
list = (List<Object>) qr.query(sql, new ColumnListHandler(id));
} catch (SQLException e) {
e.printStackTrace();
}
return list;
}
// Batch-inserts one page of petroleum stocks (this method could still be optimized).
// Returns the number of rows the driver did not report as failed.
public static int insertoilStocks ( List<ExtMarketOilStockModel> oilstocks ) {

Object[][] params = new Object[oilstocks.size()][17];
int c = 0;  // number of successful inserts
for ( int i = 0; i < oilstocks.size(); i++ ){
params[i][0] = oilstocks.get(i).getDate();
params[i][1] = oilstocks.get(i).getStock_id();
params[i][2] = oilstocks.get(i).getStock_name();
params[i][3] = oilstocks.get(i).getStock_price();
params[i][4] = oilstocks.get(i).getStock_change();
params[i][5] = oilstocks.get(i).getStock_range();
params[i][6] = oilstocks.get(i).getStock_amplitude();
params[i][7] = oilstocks.get(i).getStock_trading_number();
params[i][8] = oilstocks.get(i).getStock_trading_value();
params[i][9] = oilstocks.get(i).getStock_yesterdayfinish_price();
params[i][10] = oilstocks.get(i).getStock_todaystart_price();
params[i][11] = oilstocks.get(i).getStock_max_price();
params[i][12] = oilstocks.get(i).getStock_min_price();
params[i][13] = oilstocks.get(i).getStock_fiveminuate_change();
params[i][14] = oilstocks.get(i).getCraw_time();
params[i][15] = null;
params[i][16] = null;
}

try {
// Uses the shared QueryRunner; the 17 placeholders must match the table's column order
int[] sum = qr.batch("INSERT INTO `datacollection`.`ext_market_oil_stock` VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", params);
for (int n : sum) {
if (n != java.sql.Statement.EXECUTE_FAILED) {
c++;
}
}
System.out.println("Petroleum data stored");
} catch (SQLException e) {
System.out.println(e);
}

return c;

}
// Batch-inserts one page of automobile stocks (this method could still be optimized).
// Returns the number of rows the driver did not report as failed.
public static int insertcarStocks ( List<ExtMarketOilStockModel> carstocks ) {

int c = 0;  // number of successful inserts
Object[][] params1 = new Object[carstocks.size()][17];
for ( int i = 0; i < carstocks.size(); i++ ){
params1[i][0] = carstocks.get(i).getDate();
params1[i][1] = carstocks.get(i).getStock_id();
params1[i][2] = carstocks.get(i).getStock_name();
params1[i][3] = carstocks.get(i).getStock_price();
params1[i][4] = carstocks.get(i).getStock_change();
params1[i][5] = carstocks.get(i).getStock_range();
params1[i][6] = carstocks.get(i).getStock_amplitude();
params1[i][7] = carstocks.get(i).getStock_trading_number();
params1[i][8] = carstocks.get(i).getStock_trading_value();
params1[i][9] = carstocks.get(i).getStock_yesterdayfinish_price();
params1[i][10] = carstocks.get(i).getStock_todaystart_price();
params1[i][11] = carstocks.get(i).getStock_max_price();
params1[i][12] = carstocks.get(i).getStock_min_price();
params1[i][13] = carstocks.get(i).getStock_fiveminuate_change();
params1[i][14] = carstocks.get(i).getCraw_time();
params1[i][15] = null;
params1[i][16] = null;
}
try {
// Target table and data; the 17 placeholders must match the table's column order
int[] sum = qr.batch("INSERT INTO `datacollection`.`ext_market_car_stock` VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", params1);
for (int n : sum) {
if (n != java.sql.Statement.EXECUTE_FAILED) {
c++;
}
}
System.out.println("Automobile data stored");
} catch (SQLException e) {
System.out.println(e);
}

return c;

}

}
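
An `INSERT ... VALUES (?,?,...)` without a column list depends on the table's column order, which is easy to break when the table changes. A small helper can generate the statement with named columns instead. The sketch below is illustrative only: `InsertSql` is a hypothetical class, and the column names in the demo are assumed to match the model's field names, since the actual table definition is not shown in the text.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: builds a parameterized INSERT with an explicit column list.
final class InsertSql {
    static String build(String table, List<String> columns) {
        String names = String.join(", ", columns);
        // One "?" placeholder per column
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table + " (" + names + ") VALUES (" + marks + ")";
    }

    public static void main(String[] args) {
        // Column names here are assumptions based on the model fields
        List<String> columns = Arrays.asList("date", "stock_id", "stock_name", "stock_price");
        System.out.println(build("ext_market_oil_stock", columns));
    }
}
```

The resulting string can be passed to `QueryRunner.batch` in place of the positional statement, with the parameter array trimmed to the listed columns.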


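In principle the crawler is now complete; just run the main method. The figure below shows part of the data the main method retrieves.



The problems

Problem 1: stock data of this kind is published every Monday through Friday. How can the program fetch it automatically at a fixed time each day, instead of being run by hand?

Problem 2: the market does not open on holidays, yet the page still shows data, and the page carries no timestamp. How should this case be handled?

First, let me show my database design.



Solution

Quartz is used to run the program on a schedule, which solves the first problem (http://blog.csdn.net/qy20115549/article/details/52723907).

For the second problem, deciding whether the market was closed on a given day, the approach is to draw three stocks at random from the most recent date in the database. For example, if today is Saturday, January 21, three stocks stored for January 20 are drawn and their prices are compared with today's prices for the same ids. If all three prices are identical, the day is treated as a holiday: prices have not moved, and nothing needs to be inserted into the database.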
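
The comparison rule can be isolated as a small pure function, which makes it easy to test without a database. The sketch below assumes the sampled rows have already been loaded into a map from stock id to stored price; `HolidayCheck` and `looksLikeHoliday` are hypothetical names, and the ids and prices in the demo are made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: decides whether today's prices match the sampled stored prices.
final class HolidayCheck {
    // sample: stock id -> price stored on the last recorded date (e.g. three random stocks)
    // today:  stock id -> price scraped today
    static boolean looksLikeHoliday(Map<String, Float> sample, Map<String, Float> today) {
        if (sample.isEmpty()) {
            return false; // nothing stored yet, so there is nothing to compare against
        }
        for (Map.Entry<String, Float> entry : sample.entrySet()) {
            Float now = today.get(entry.getKey());
            if (now == null || Float.compare(now, entry.getValue()) != 0) {
                return false; // a missing or changed price means the market traded
            }
        }
        return true; // every sampled price is unchanged
    }

    public static void main(String[] args) {
        Map<String, Float> sample = new HashMap<>();
        sample.put("000001", 10.5f); // made-up ids and prices
        sample.put("000002", 20.25f);
        sample.put("000003", 7.0f);
        Map<String, Float> today = new HashMap<>(sample);
        System.out.println(looksLikeHoliday(sample, today));
        today.put("000002", 20.5f);
        System.out.println(looksLikeHoliday(sample, today));
    }
}
```

The heuristic can misfire in the rare case that all sampled stocks close unchanged on a real trading day; sampling more stocks lowers that risk.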

job

package job;

import java.util.ArrayList;
import java.util.List;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import db.MYSQLControl;
import model.ExtMarketOilStockModel;
import parse.ExtMarketOilStockParse;
/**
* @author: Qian Yang, School of Management, Hefei University of Technology
* @email: 1563178220@qq.com
*/
public class ExtMarketOilStockJob implements Job {

@Override
public void execute(JobExecutionContext arg0) throws JobExecutionException {
// Draw three random stocks from the most recently stored date; they are used to detect holidays
List<ExtMarketOilStockModel> randomlist = MYSQLControl.getListInfoBySQL("select stock_id,stock_price,stock_change from ext_market_oil_stock where date = (select date from ext_market_oil_stock order by date desc limit 1) order by rand() limit 3",ExtMarketOilStockModel.class);

List<String> urloillist=new ArrayList<String>();
List<String> urlcarlist=new ArrayList<String>();
List<ExtMarketOilStockModel> oilstocks=new ArrayList<ExtMarketOilStockModel>();
List<ExtMarketOilStockModel> carstocks=new ArrayList<ExtMarketOilStockModel>();
String url1="http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=1&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.13204790262127375";
String url2="http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04641&sty=FCOIATA&sortType=C&sortRule=-1&page=2&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.6972178580603532";
urloillist.add(url1);
urloillist.add(url2);
// Fetch every petroleum page first, so the holiday check sees all stocks before anything is stored
List<List<ExtMarketOilStockModel>> oilpages=new ArrayList<List<ExtMarketOilStockModel>>();
for (int i = 0; i < urloillist.size(); i++) {
try {
oilpages.add(ExtMarketOilStockParse.parseurl(urloillist.get(i)));
} catch (Exception e) {
e.printStackTrace();
}
}
// Compare up to three previously stored stocks with today's prices for the same ids
int samples=(randomlist==null)?0:Math.min(3, randomlist.size());
int judge=0;
for (int k = 0; k < samples; k++) {
for (List<ExtMarketOilStockModel> page : oilpages) {
for (ExtMarketOilStockModel stock : page) {
if (stock.getStock_id().equals(randomlist.get(k).getStock_id())
&& Float.compare(stock.getStock_price(), randomlist.get(k).getStock_price())==0) {
judge++;
}
}
}
}
// Treat the day as a holiday only when three stored prices were available and all three are unchanged
boolean holiday=(samples==3 && judge==3);
if (!holiday) {
for (List<ExtMarketOilStockModel> page : oilpages) {
MYSQLControl.insertoilStocks(page);
}
}
if (!holiday) {
// The automobile sector has 6 pages, so the loop must include page 6
for (int i = 1; i <= 6; i++) {
String urli="http://nufm.dfcfw.com/EM_Finance2014NumericApplication/JS.aspx?type=CT&cmd=C.BK04811&sty=FCOIATA&sortType=C&sortRule=-1&page="+i+"&pageSize=20&js=var%20quote_123%3d{rank:[(x)],pages:(pc)}&token=7bc05d0d4c3c22ef9fca8c2a912d779c&jsName=quote_123&_g=0.23492960370783944";
urlcarlist.add(urli);
}
for (int i = 0; i < urlcarlist.size(); i++) {
try {
carstocks=ExtMarketOilStockParse.parseurl(urlcarlist.get(i));
} catch (Exception e) {
e.printStackTrace();
}
MYSQLControl.insertcarStocks(carstocks);
}
}

}

}


jobmain

As shown below, the job runs Monday through Friday at 20:39 (8:39 p.m.), so data is fetched every weekday after the market has closed.

package jobmain;
import static org.quartz.CronScheduleBuilder.cronSchedule;
import static org.quartz.JobBuilder.newJob;
import static org.quartz.TriggerBuilder.newTrigger;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.quartz.CronTrigger;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SchedulerFactory;
import org.quartz.impl.StdSchedulerFactory;
import job.ExtMarketOilStockJob;
/**
* @author: Qian Yang, School of Management, Hefei University of Technology
* @email: 1563178220@qq.com
*/
public class ExtMarketOilStockJobMain {

public void go() throws Exception {
// First, obtain a reference to a Scheduler
SchedulerFactory sf = new StdSchedulerFactory();
Scheduler sched = sf.getScheduler();
// Jobs can be scheduled before sched.start() is called
JobDetail job = newJob(ExtMarketOilStockJob.class).withIdentity("stockjob", "stockgroup").build();
// Run the job Monday through Friday at 20:39
CronTrigger trigger = newTrigger().withIdentity("stocktrigger", "stockgroup").withSchedule(cronSchedule("0 39 20 ? * MON-FRI")).build();
Date ft = sched.scheduleJob(job, trigger);
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss SSS");
System.out.println(job.getKey() + " is scheduled to run at: " + sdf.format(ft) + " and repeats according to the cron expression: " + trigger.getCronExpression());
sched.start();
}
public static void main(String[] args) throws Exception {
ExtMarketOilStockJobMain maingo = new ExtMarketOilStockJobMain();
maingo.go();
}

}


Running the class in jobmain makes the crawler fetch data at the scheduled time every weekday.
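
To sanity-check what the cron expression `0 39 20 ? * MON-FRI` means, the next fire time can be worked out by hand with `java.time`. The sketch below is not how Quartz computes it internally; it is a simplified stand-in with hypothetical names (`NextRun`, `nextWeekdayRun`) that handles only the "fixed time on weekdays" case.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.time.LocalTime;

// Hypothetical helper: next Monday-to-Friday occurrence of a fixed time of day.
final class NextRun {
    static LocalDateTime nextWeekdayRun(LocalDateTime now, LocalTime at) {
        LocalDateTime candidate = now.toLocalDate().atTime(at);
        if (!candidate.isAfter(now)) {
            candidate = candidate.plusDays(1); // today's slot has already passed
        }
        while (candidate.getDayOfWeek() == DayOfWeek.SATURDAY
                || candidate.getDayOfWeek() == DayOfWeek.SUNDAY) {
            candidate = candidate.plusDays(1); // skip the weekend
        }
        return candidate;
    }

    public static void main(String[] args) {
        LocalDateTime fridayNight = LocalDateTime.of(2017, 1, 20, 21, 0);
        System.out.println(nextWeekdayRun(fridayNight, LocalTime.of(20, 39)));
    }
}
```

The same value could drive a plain `ScheduledExecutorService` when pulling in Quartz is not desired, at the cost of rescheduling after each run.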