Simulating Login with HttpURLConnection to Scrape Web Page Content
2015-05-21 18:46
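If you need to scrape pages from a site that requires login, the program has to perform the login itself before it can fetch the content. Every login page ultimately just sends a request to the server, so if we can reproduce that request, logging in is straightforward. Fiddler can capture the request details we need, which makes the request easy to reproduce. Below, HttpURLConnection is used to simulate the login and fetch the page content.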
Here is a fairly complete example:
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map.Entry;

public class INotesPost {
    public static void main(String[] args) throws Exception {
        String surl = "***?login";
        URL url = new URL(surl);
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setDoOutput(true);
        connection.setDoInput(true);
        connection.setRequestMethod("POST");
        connection.setUseCaches(false);
        connection.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        connection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.1.4322; .NET4.0C; .NET4.0E)");
        connection.setRequestProperty("Accept-Language", "zh-CN");
        // Accept-Encoding is deliberately not set: HttpURLConnection does not
        // decompress gzip/deflate responses by itself, so asking for them would
        // make the text read below unreadable.

        // The form field names and values (username and password here) can be captured with Fiddler.
        OutputStreamWriter out = new OutputStreamWriter(connection.getOutputStream(), "UTF-8");
        out.write("username=***&password=***");
        out.flush();
        out.close();

        // getInputStream() sends the request and opens the response body.
        InputStream in = connection.getInputStream();
        StringBuilder retStr = new StringBuilder();
        // Assumes the page is UTF-8; use the charset the page actually declares.
        BufferedReader br = new BufferedReader(new InputStreamReader(in, "UTF-8"));
        String temp = br.readLine();
        while (temp != null) {
            retStr.append(temp);
            temp = br.readLine();
        }
        br.close();
        in.close();
        System.out.println(retStr);

        // Print the response headers, including any Set-Cookie values issued by the login.
        for (Entry<String, List<String>> header : connection.getHeaderFields().entrySet()) {
            System.out.println(header.getKey() + " " + header.getValue());
        }
    }
}
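The login body in the example is written as a literal string, which only works while the values contain no reserved characters such as `&`, `=` or spaces. As a minimal sketch, the body could be built from a map and URL-encoded first. The class name `FormBody`, the method `encode` and the sample values are hypothetical and not part of the original program.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class FormBody {
    // Joins the parameters as name=value pairs separated by '&',
    // URL-encoding every name and value as UTF-8.
    public static String encode(Map<String, String> params) throws UnsupportedEncodingException {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        // LinkedHashMap keeps the fields in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("username", "alice");         // sample value, not a real account
        params.put("password", "p&ss word=1");   // contains characters that must be escaped
        System.out.println(encode(params));      // prints username=alice&password=p%26ss+word%3D1
    }
}
```

The resulting string can be passed to `out.write(...)` in place of the literal body.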
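When simulating a login, we can also use Fiddler to capture the parameters the page submits and write the Cookie directly into the connection's request properties.
[Figure: capturing the login parameters with Fiddler]
Fill the captured parameters into the connection with setRequestProperty, and the page can then be fetched easily. If the page content is Chinese, pay attention to its charset, and decode the returned stream with the matching charset when reading it: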
BufferedReader bufferedReader = new BufferedReader( new InputStreamReader(urlStream,"utf-8"));
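Hard-coding "utf-8" works only when the page really is UTF-8. A rough sketch of picking the charset from the response's Content-Type header instead, falling back to UTF-8 when none is declared; `CharsetSniffer` and `charsetFrom` are hypothetical names introduced for illustration, and pages that declare their charset only in an HTML meta tag are not handled here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetSniffer {
    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=GBK", or UTF-8 if none is present or it is unknown.
    public static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: use the fallback below.
                        break;
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("text/html; charset=ISO-8859-1")); // prints ISO-8859-1
        System.out.println(charsetFrom("text/html"));                     // prints UTF-8
    }
}
```

With a live connection, the reader could then be created as `new InputStreamReader(urlStream, CharsetSniffer.charsetFrom(connection.getContentType()))`.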
Here is a fairly complete example:
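// URL of the page to fetch after login (masked here).
String s = "****";
URL resumeUrl = new URL(s);
HttpURLConnection resumeConnection = (HttpURLConnection) resumeUrl.openConnection();
resumeConnection.setRequestProperty("Accept-Charset", "utf-8");
// A GET request carries no body, so this header is optional; the charset parameter must be written as "charset=...".
resumeConnection.setRequestProperty("Content-Type", "text/html; charset=utf-8");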
// setRequestProperty replaces any earlier value set for the same header, so both
// cookies captured with Fiddler go into a single Cookie header, separated by "; ".
resumeConnection.setRequestProperty("Cookie",
"AttachmentAuth=77u/PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiPz48U1A+MCMuZnxpaWNwfDAwMDgyMzkwNSwwIy5mfGlpY3B8MDAwODIzOTA1LDEzMDc2NjgxMDQzODc3NDA0OCxUcnVlLEV0eHBYWVlYVHNYQ0hYR3hjRmZjdWowOXV6ekRXc01Hd0FLUzVkaFNmcEErcWo4S3pGTUYvYVRYZFJnWitSRW1pVmR4N0xKVzdoOUhzMitUamY5Z0E2VHY4a2hxeHNTQXlVRmhmQ1pwelBUOFBWQmc0NXI2cHo4eGZxZkEyNzAyOUo0eFBrcU9MM0dWNm1IVGdVNEZFT3E1OVIzSHA3dmZrS0tHR1YxNVJpTllKcXF1dUVCMmhlU1lGT0VLUjlBMitEQ00rMVlwdXBVTEJ0UGdWYk5lODBobEtydUttc1MyWWkrSmpXMFozTVVyRHJzN1VkU1VxNmdrYmo0dTB4OWNrTXRFZXJ1cUlZbDROb3N2UWhpSmNRTlVGcm9kNkVXaWhBL0tjUVpaZlY1UFJBREtjalZIYmx3dnRXMkIwZ1VPMVM3REJFa0VzOS9GQUViVzM2bnhJQT09LGh0dHA6Ly9vYS5zZGMuaWNiYzo4Mi9zdG9yYWdlL2F0dGFjaG1lbnQyLzIwMTUtMDUvZTY1Yjc3ZjUtNGZkMC00NDI2LWE1OWYtMjQxNTAxYWE0MjI1L+mZhOS7tjIu6L2v5Lu25byA5Y+R5Lit5b+D6K665paH5L2T5L6L6KaB5rGCLmRvYzwvU1A+"
+ "; "
+ "PortalAuth=77u/PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiPz48U1A+MCMuZnxpaWNwfDAwMDgyMzkwNSwwIy5mfGlpY3B8MDAwODIzOTA1LDEzMDc2NzA1NzM3NTI3MDY4NCxUcnVlLFFldU1Fa2xDelI0bEZaTTJkbVVtZGxPVmhsUVdwQWMzQlk2TCtWdlVOb1ZsRjVHZ1BMRVhMTTAwcHBKWW5WTGZLYzFPTTh2aGRydmRIVWVLR3JOb255dWpTS2lMeEhyQUlBbmtYZTVBTWlFVGpFMlF4bzRjWVRKeEhjNU5ScEhMSWJOWHdWckFTWHhuNUd5bURST0xTK2d3cUFWbThFUllPM3J1enR4aGgwT1VrTDJGMGkrUDdWcHViRm84blFrTXp4MFNyMXdtQzE3UEJkcGpGVU1nOW8xRkJoeHhzWElDdHhLVEpVSHRGMmpDNmNKS285bGJtTXZJZnlwR0k1VGpLd29TTUpaenhyb1BkQ3VOVW13Wk01T0ZEUExSK1lqajVCRitJSFc1enV0UlpXM08wWHhNaldIWk1nWHhncjF0dUc1b3E3RlRwOGhCMFVCWjAydDlGQT09LGh0dHA6Ly9vYS5zZGMuaWNiYy88L1NQPg==");
resumeConnection.connect();
InputStream urlStream = resumeConnection.getInputStream();
BufferedReader bufferedReader = new BufferedReader( new InputStreamReader(urlStream,"utf-8"));
String ss = null;
StringBuilder total = new StringBuilder();
while ((ss = bufferedReader.readLine()) != null) {
total.append(ss);
}
bufferedReader.close();
resumeConnection.disconnect();
// System.out.println(total.toString());
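Instead of pasting cookies captured with Fiddler, the cookies could in principle be taken from the Set-Cookie headers of the login response and replayed on the next request. The following is only a sketch under that assumption: `CookieHeader` and `fromSetCookie` are hypothetical names, the sample cookie values are made up, and cookie attributes such as expiry, domain and path are simply ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CookieHeader {
    // Builds a Cookie request header value from Set-Cookie response header values.
    public static String fromSetCookie(List<String> setCookieValues) {
        List<String> pairs = new ArrayList<>();
        if (setCookieValues != null) {
            for (String value : setCookieValues) {
                // Keep only the leading "name=value"; attributes such as
                // Path or HttpOnly follow the first ';' and are dropped.
                int semi = value.indexOf(';');
                String pair = (semi >= 0 ? value.substring(0, semi) : value).trim();
                if (!pair.isEmpty()) {
                    pairs.add(pair);
                }
            }
        }
        return String.join("; ", pairs);
    }

    public static void main(String[] args) {
        // Made-up header values standing in for a real login response.
        List<String> headers = Arrays.asList(
                "JSESSIONID=abc123; Path=/; HttpOnly",
                "PortalAuth=xyz; Path=/");
        System.out.println(fromSetCookie(headers)); // prints JSESSIONID=abc123; PortalAuth=xyz
    }
}
```

In the login program, the input list would come from `connection.getHeaderFields().get("Set-Cookie")`, and the result would be passed to `setRequestProperty("Cookie", ...)` on the second connection. Installing a `java.net.CookieManager` through `CookieHandler.setDefault` is another option that lets HttpURLConnection carry cookies between requests automatically.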