A small original tool: download autohome forum threads for offline browsing

2013-04-10 10:54

【The problem】

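autohome is a car portal, and its forum sometimes has threads that are well worth reading. One example is "一家四口环中国行" (a family of four driving around China): the main posts run to more than 100 pages and the replies to more than 4,000 pages, which makes for a very satisfying read.

However, the forum's JavaScript is not well written. When a page contains a lot of images, some of them often fail to display, which is frustrating.

So I had the idea of downloading the thread for offline browsing. Some will say there are plenty of ready-made offline browsers. True, but they cannot download the images here, because the site plays a small trick with its image URLs.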

【Analysis】

1. URL pattern

Page 1 is http://club.autohome.com.cn/bbs/thread-o-200042-19582947-1.html
Page 2 is http://club.autohome.com.cn/bbs/thread-o-200042-19582947-2.html
So page N is http://club.autohome.com.cn/bbs/thread-o-200042-19582947-N.html
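
The pattern can be turned into a small helper. This is only a sketch under the assumption that every page URL ends in "-<number>.html", as in the three URLs above; the names ThreadUrls and PageUrl are made up for illustration and are not part of the tool.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: builds the URL of page N from the URL of any page of the thread.
public static class ThreadUrls
{
    public static string PageUrl(string anyPageUrl, int page)
    {
        // Replace only the trailing "-<digits>.html"; the forum id and thread id,
        // which also follow a dash, are not at the end of the string and stay untouched.
        return Regex.Replace(anyPageUrl, @"-\d+\.html$", "-" + page + ".html");
    }
}
```

For example, ThreadUrls.PageUrl(firstPageUrl, 7) gives the same URL ending in -7.html. Anchoring on the end of the string avoids accidentally rewriting another "-1." that might appear earlier in the URL.
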
2. Image analysis

Looking at the page source, the HTML for an image is:

<img id="img-0-8" name="lazypic" onload="tz.picLoaded(this)" onerror="tz.picNotFind(this)" style="width:700px;height:464px" src="http://x.autoimg.cn/club/lazyload.png" src9="http://club1.autoimg.cn/album/userphotos/2013/2/25/500_9bed_79b1f6c8_79b1f6c8.jpg" />


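The default src is a placeholder image; the real address is in the src9 attribute. The image is displayed by replacing src through the onload event, and the onerror handler runs on a timeout or an error.

So scraping src9 is enough to download the images.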
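
As a rough sketch of that idea, the src9 values can be pulled out with a regular expression. The class and method names (LazyImages, ExtractSrc9) are hypothetical, and the pattern assumes the attribute is always written as src9="http://..." with double quotes, as in the sample tag above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: collects the real image URLs stored in src9 attributes.
public static class LazyImages
{
    // Group 1 captures everything between the quotes of src9="...".
    // The plain src="..." attribute does not match, because the pattern requires "src9=".
    private static readonly Regex Src9 = new Regex("src9=\"(http://[^\"]+)\"");

    public static List<string> ExtractSrc9(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Src9.Matches(html))
        {
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

Capturing the URL in a group avoids having to strip the quotes and the attribute name from the match afterwards.
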

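3. Trying to fetch an image

Take the image URL above, http://club1.autoimg.cn/album/userphotos/2013/2/25/500_9bed_79b1f6c8_79b1f6c8.jpg . If you download it directly, the server refuses, because it treats the request as hotlinking.

So it is best to send a proper HTTP/1.1 request that carries the Referer header (the request.Referer property). Fiddler can be used to monitor what the browser sends: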

GET http://club1.autoimg.cn/album/userphotos/2013/4/3/500_43be_cde0b51b_cde0b51b.jpg HTTP/1.1
Accept: */*
Referer: http://club.autohome.com.cn/bbs/thread-o-200042-19582947-1.html
Accept-Language: zh-CN
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; KB974487)
Accept-Encoding: gzip, deflate
Connection: Keep-Alive
DNT: 1
Host: club1.autoimg.cn


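To make the images easy to display, I suggest saving each image file under its original path. For example, the image http://club1.autoimg.cn/album/userphotos/2013/4/3/500_43be_cde0b51b_cde0b51b.jpg is saved in the folder structure club1.autoimg.cn/album/userphotos/2013/4/3/500_43be_cde0b51b_cde0b51b.jpg .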
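
One possible way to compute that local path is sketched below. ImagePaths and LocalPathFor are hypothetical names, and the sketch assumes the image URLs are absolute http:// URLs with plain file names like the one above (Uri.AbsolutePath returns the path percent-encoded, so unusual characters would need extra handling).

```csharp
using System;
using System.IO;

// Hypothetical helper: maps an image URL to a file path that mirrors host + path.
public static class ImagePaths
{
    public static string LocalPathFor(string saveDir, string imageUrl)
    {
        var uri = new Uri(imageUrl);
        // Host is "club1.autoimg.cn"; AbsolutePath starts with "/album/...".
        var relative = uri.Host + uri.AbsolutePath;
        // Use the platform's separator so the path works with Directory and File APIs.
        relative = relative.Replace('/', Path.DirectorySeparatorChar);
        return Path.Combine(saveDir, relative);
    }
}
```

Because the saved HTML refers to the images by the same host/path string once the http:// prefix is removed, the pages find their images as long as the HTML files sit directly in saveDir.
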

4. Pagination links

For convenient browsing, the pagination links in the downloaded pages must still work.

<div class="pages fs">
<a href="forum-o-200042-1.html">返回列表</a></div>
<div class="pages" id="x-pages1" maxindex="4927"><span class="cur">1</span><a target="_self" href="thread-o-200042-19582947-2.html">2</a><a target="_self" href="thread-o-200042-19582947-3.html">3</a><a target="_self" href="thread-o-200042-19582947-4.html">4</a><a target="_self" href="thread-o-200042-19582947-5.html">5</a><span>...</span><a target="_self" href="thread-o-200042-19582947-4927.html">4927</a><span class="gopage"><input type="text" value="1" title="输入页码,按回车快速跳转" onkeydown="if(event.keyCode==13){tz.goPage(this)}" /><span class="fs" title="共 4927 页"> / 4927 页</span></span><a target="_self" class="afpage" href="thread-o-200042-19582947-2.html" title="支持键盘 ← → 键翻页">下一页</a></div>
<div class="jfwen">
到第<span><input type="text" value="" class="topinp txtcenter" id="txtGoFloor1" maxlength="7"
title="输入楼层数,按回车快速定位" onkeydown="if(event.keyCode==13){tz.goFloor(null,'txtGoFloor1')}" /></span>楼</div>


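It turns out the original links are simply the file names of the HTML pages, so it is enough to save each page under its original file name.

5. Filtering out useless code

Remove the onload and onerror events, replace the opening and closing script tags with DIV, and replace the leading http:// with a local ./ so that local browsing does not waste resources.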

【Solution】

1. HTTP

private static readonly string DefaultUserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)";

/// <summary>
/// Creates an HTTP GET request and returns its response (borrowed code, slightly modified).
/// </summary>
/// <param name="url">URL to request</param>
/// <param name="timeout">Request timeout in milliseconds; null keeps the default</param>
/// <param name="userAgent">Client browser string; may be null or empty, in which case DefaultUserAgent is used</param>
/// <param name="referer">Referring URL, sent in the Referer header</param>
/// <param name="cookies">Cookies sent with the request; may be null if no authentication is needed</param>
/// <returns>The response; the caller is responsible for closing it</returns>
public static HttpWebResponse CreateGetHttpResponse(string url, int? timeout, string userAgent, string referer, CookieCollection cookies)
{
    if (string.IsNullOrEmpty(url))
    {
        throw new ArgumentNullException("url");
    }
    HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
    if (request == null)
    {
        // WebRequest.Create returns another request type for non-HTTP schemes such as file://
        throw new ArgumentException("Not an HTTP URL", "url");
    }
    request.Method = "GET";
    request.UserAgent = DefaultUserAgent;
    if (!string.IsNullOrEmpty(userAgent))
    {
        request.UserAgent = userAgent;
    }
    if (timeout.HasValue)
    {
        request.Timeout = timeout.Value;
    }
    if (referer != null)
    {
        // Without a Referer the image server rejects the request as hotlinking
        request.Referer = referer;
    }
    if (cookies != null)
    {
        request.CookieContainer = new CookieContainer();
        request.CookieContainer.Add(cookies);
    }
    return request.GetResponse() as HttpWebResponse;
}


2. The main action button

private void btnStart_Click(object sender, EventArgs e)
{
    btnStart.Enabled = false;
    btnStop.Enabled = true;
    timer1.Enabled = true;

    var dt1 = DateTime.Now;
    var dir = tbSaveDir.Text;
    // Turn the page number of the first-page URL into a placeholder: "...-1.html" becomes "...-#.html"
    var baseUrl = tbURL.Text.Replace("-1.", "-#.");
    var totalPage = (int) tbTotalPage.Value;
    var fromPage = (int) tbFromPage.Value;
    var isDefault = radioButton1.Checked;
    var html = "";
    var imgurl = "";
    var htmlFile = "";
    var imgNum = 0;

    // Progress bars: start at 0 so that one step per page ends exactly at Maximum
    progressBar1.Maximum = totalPage - fromPage + 1;
    progressBar1.Value = 0;
    progressBar2.Value = 0;

    for (int i = fromPage; i <= totalPage; i++)
    {
        // Advance the page progress bar
        progressBar1.Step = 1;
        progressBar1.PerformStep();

        // Download and process one page
        var url = baseUrl.Replace("#", i.ToString());
        try
        {
            // The page URL is also sent as its own Referer
            using (var response = HttpWebResponseUtility.CreateGetHttpResponse(url, null, null, url, null))
            {
                if (response == null)
                {
                    continue;
                }
                // The forum pages are encoded as GB2312
                using (var sr = new StreamReader(response.GetResponseStream(), Encoding.GetEncoding("gb2312"), true))
                {
                    html = sr.ReadToEnd();
                }
            }

            // Collect the real image URLs before the HTML is rewritten
            string pattern = "src9=\"http://[a-zA-Z0-9_./]+\"";
            var gc = Regex.Matches(html, pattern);

            // Rewrite the HTML for local viewing
            html = html.Replace("src=\"http://x.autoimg.cn/club/lazyload.png\" src9=\"", "src=\"");
            html = html.Replace("http://", "");
            html = html.Replace("onload=", "x1=");   // renamed attributes never fire
            html = html.Replace("onerror=", "x2=");
            html = html.Replace("<script", "<DIV style=\"display:none\" ");
            html = html.Replace("</script", "</DIV");
            htmlFile = dir + Path.DirectorySeparatorChar + Path.GetFileName(url.Replace("http://", ""));
            // append: false, so running the download again overwrites the page instead of appending a second copy
            using (var sw = new StreamWriter(htmlFile, false, Encoding.GetEncoding("gb2312")))
            {
                sw.Write(html);
            }
            tbLog.AppendText(htmlFile + " ok");

            // Queue every image of this page for the download threads
            imgNum = 0;
            foreach (var match in gc)
            {
                imgurl = match.ToString().Replace("\"", "").Replace("src9=", "");
                _myQue.Enqueue(new ParamEntity(dir, imgurl, url, isDefault));
                imgNum++;
            }

            tbLog.AppendText(", " + imgNum + " image(s)" + Environment.NewLine);
            _totalNum += imgNum;
        }
        catch (Exception exception)
        {
            MessageBox.Show(exception.Message);
        }
    }
    btnStart.Enabled = true;
    var dt2 = DateTime.Now;
    var timeUse = dt2 - dt1;
    MessageBox.Show(string.Format("Page download finished in {0} minutes. Please wait for the image downloads to finish, then open the folder {1} to view the result.", timeUse.TotalMinutes.ToString("F2"), dir));
}


3. Downloading images

private void DownloadImage(object obj)
{
    var pe = obj as ParamEntity;
    if (pe == null)
    {
        return;
    }
    // Mirror the URL on disk: host/path/file.jpg below the save directory
    var tmp = pe.ImgUrl.Replace("http://", "");
    var dir = pe.SaveDir + Path.DirectorySeparatorChar + Path.GetDirectoryName(tmp);
    var filename = pe.SaveDir + Path.DirectorySeparatorChar + tmp;

    try
    {
        if (!Directory.Exists(dir))
        {
            Directory.CreateDirectory(dir);
        }
        // Several threads run this method at once (assumes _runNum is an int or long field)
        Interlocked.Increment(ref _runNum);
        if (pe.IsType1)
        {
            // Plain download without a Referer header
            using (var wc = new WebClient())
            {
                wc.DownloadFile(pe.ImgUrl, filename);
            }
        }
        else
        {
            // Download with the page URL as Referer so the server does not treat it as hotlinking
            using (var imgres = HttpWebResponseUtility.CreateGetHttpResponse(pe.ImgUrl, null, null, pe.PageUrl, null))
            {
                if (imgres != null)
                {
                    using (var reader = imgres.GetResponseStream())
                    // FileMode.Create truncates an existing file; OpenOrCreate would keep stale trailing bytes
                    using (var writer = new FileStream(filename, FileMode.Create, FileAccess.Write))
                    {
                        var buff = new byte[512];
                        var c = 0; // number of bytes actually read
                        while ((c = reader.Read(buff, 0, buff.Length)) > 0)
                        {
                            writer.Write(buff, 0, c);
                        }
                    }
                }
            }
        }
    }
    catch (Exception e)
    {
        // Queue is not thread-safe, so serialize access from the download threads
        lock (_logQue)
        {
            _logQue.Enqueue(Path.GetFileName(tmp) + " fail. " + e.Message);
        }
    }
}


4. The timer trigger

private void timer1_Tick(object sender, EventArgs e)
{
    // Image progress bar
    progressBar2.Maximum = _totalNum;

    // Start up to five download threads per tick
    var started = 0;
    for (var i = 1; i <= 5; i++)
    {
        if (_myQue.Count > 0)
        {
            var pe = (ParamEntity) _myQue.Dequeue();
            var thread = new Thread(new ParameterizedThreadStart(DownloadImage));
            thread.Start(pe);
            started++;
        }
    }
    // Advance by the number of downloads started on this tick
    progressBar2.Increment(started);

    // Show at most one pending log message per tick; the download threads write to this queue, hence the lock
    object msg = null;
    lock (_logQue)
    {
        if (_logQue.Count > 0)
        {
            msg = _logQue.Dequeue();
        }
    }
    if (msg != null)
    {
        tbLog.AppendText(msg.ToString() + Environment.NewLine);
    }

    Application.DoEvents();
}


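Originally I did not want to use a timer. I wanted a queue that, after finishing one item, would carry on with the next by itself, but I did not manage to write it.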
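
For what it is worth, such a self-feeding queue could look roughly like the sketch below. It is not what the tool does: it swaps the timer and Queue for BlockingCollection<T> from System.Collections.Concurrent (available from .NET 4.0), and the names DownloadQueue and StartWorkers are made up for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical helper, not part of the tool above.
public static class DownloadQueue
{
    // Starts workerCount threads. Each thread takes the next item as soon as it
    // has finished the previous one, and exits once the queue has been marked
    // complete and is empty.
    public static Thread[] StartWorkers<T>(BlockingCollection<T> queue, int workerCount, Action<T> handler)
    {
        var workers = new Thread[workerCount];
        for (var i = 0; i < workerCount; i++)
        {
            workers[i] = new Thread(() =>
            {
                // Blocks while the queue is empty; the loop ends after CompleteAdding().
                foreach (var item in queue.GetConsumingEnumerable())
                {
                    handler(item); // the handler should catch its own errors, as DownloadImage does
                }
            });
            workers[i].IsBackground = true; // do not keep the process alive after the form closes
            workers[i].Start();
        }
        return workers;
    }
}
```

Under these assumptions the button handler would call queue.Add(...) where it now calls _myQue.Enqueue(...), call queue.CompleteAdding() after the last page, and pass pe => DownloadImage(pe) as the handler. A fixed number of workers also caps the number of simultaneous downloads, whereas the timer starts up to five new threads on every tick.
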

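【Technical points that may be of interest】

1. HTTP requests that carry a Referer

2. Multithreading without blocking the UI (BackgroundWorker; I have not switched to it yet)

3. ProgressBar

4. Queue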

【The finished tool】

Click to download

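【Closing thoughts】

I am not aiming for the best, only for peace of mind. Beginners, have a look; experts, please give me some pointers.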