Automatically Extracting Information from Web Pages and Analyzing It

2009-05-24 15:52

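This article is based on a post from 摩诘's blog.

Today I ran into the following problem: given a key value, KeyData, extract the related data from a government website.

The problem can be solved in three parts:

1) Find out how the browser interacts with the government website;
2) Using that information, fetch the relevant data with HttpWebResponse;
3) Analyze the data that comes back.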

Part 1: Capture the website's interaction details with the tool ieHTTPHeadersSetup.exe

The captured request looks like this:

GET /search.asp?key=2006002995&ys_type=hy&imageField2.x=32&imageField2.y=20 HTTP/1.1

Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/vnd.ms-excel, application/vnd.ms-powerpoint, application/msword, */*

Accept-Language: zh-cn

Accept-Encoding: gzip, deflate

User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

Host: www.suzhou-logistics.com

Connection: Keep-Alive

From this we can see:

url: http://www.suzhou-logistics.com/search.asp?

Data: key=2006002995&ys_type=hy&imageField2.x=32&imageField2.y=20

Because this is a GET request, the two can also be combined into a single URL: http://www.suzhou-logistics.com/search.asp?key=2006002995&ys_type=hy&imageField2.x=32&imageField2.y=20
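To look up a different KeyData value, the same URL can be assembled from its parts instead of being typed by hand. The following is a minimal sketch; the class name `QueryUrl` and the method `Build` are hypothetical helpers introduced here for illustration, not part of the original post.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: not from the original article.
public static class QueryUrl
{
    // Joins a base URL and name/value pairs into a GET URL,
    // percent-encoding each name and value.
    public static string Build(string baseUrl,
                               IEnumerable<KeyValuePair<string, string>> fields)
    {
        var sb = new StringBuilder(baseUrl);
        char separator = '?';
        foreach (var field in fields)
        {
            sb.Append(separator)
              .Append(Uri.EscapeDataString(field.Key))
              .Append('=')
              .Append(Uri.EscapeDataString(field.Value));
            separator = '&'; // every field after the first is joined with '&'
        }
        return sb.ToString();
    }
}
```

Called with the base URL `http://www.suzhou-logistics.com/search.asp` and the four fields from the captured request, in the same order, this should reproduce the combined URL shown above.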

Part 2: Based on the analysis in Part 1, fetch the HTML with HttpWebResponse

Here is a set of general-purpose functions (signatures only):

public static string GetPage(string url, string postData,string encodeType,out string err)

public static DataSet ParsePage(string pageContent, string xclpath,string xrpath,out string err)

private void Button1_Click(object sender, System.EventArgs e)
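Only the signature of GetPage is given, so here is a minimal sketch of what such a function could look like. Everything in the body is an assumption: that an empty `postData` means GET and anything else is sent as a form POST, that `encodeType` is an encoding name such as "utf-8" or "gb2312", and that failures are reported through `err` with a null return value. The wrapper class `PageFetcher` is a hypothetical name.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical wrapper class; the body below is an assumed implementation.
public static class PageFetcher
{
    // Returns the page HTML, or null on failure (err then holds the message).
    public static string GetPage(string url, string postData,
                                 string encodeType, out string err)
    {
        err = null;
        try
        {
            Encoding encoding = Encoding.GetEncoding(encodeType);
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.UserAgent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)";
            // Let the framework decode gzip/deflate response bodies.
            request.AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate;

            if (string.IsNullOrEmpty(postData))
            {
                request.Method = "GET";
            }
            else
            {
                // Assumption: non-empty postData is sent as a URL-encoded form body.
                byte[] body = encoding.GetBytes(postData);
                request.Method = "POST";
                request.ContentType = "application/x-www-form-urlencoded";
                request.ContentLength = body.Length;
                using (Stream stream = request.GetRequestStream())
                {
                    stream.Write(body, 0, body.Length);
                }
            }

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (StreamReader reader =
                       new StreamReader(response.GetResponseStream(), encoding))
            {
                return reader.ReadToEnd();
            }
        }
        catch (Exception e)
        {
            err = e.Message;
            return null;
        }
    }
}
```

For the request captured in Part 1, the call would pass the combined URL, a null `postData`, and the page's encoding. On .NET Core and later, legacy code pages such as gb2312 are only available after registering `CodePagesEncodingProvider.Instance` with `Encoding.RegisterProvider`.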

string SgmlReaderTest(Uri baseUri, string url, TextWriter log, bool upper, bool formatted)
{
    try
    {
        // Configure SgmlReader to parse the target page as HTML.
        SgmlReader r = new SgmlReader();
        r.SetBaseUri(Server.MapPath("."));
        r.DocType = "HTML";
        r.Href = url;
        if (upper) r.CaseFolding = CaseFolding.ToUpper;

        // Write the parsed document out as well-formed XML.
        StringWriter sw = new StringWriter();
        XmlTextWriter w = new XmlTextWriter(sw);
        if (formatted)
        {
            w.Formatting = Formatting.Indented;
            r.WhitespaceHandling = WhitespaceHandling.None;
        }

        // Copy every node from the reader to the writer.
        r.Read();
        while (!r.EOF)
        {
            w.WriteNode(r, true);
        }
        w.Flush();
        w.Close();
        return sw.ToString();
    }
    catch (Exception e)
    {
        // On failure, return the exception text instead of the XML.
        return e.ToString();
    }
}
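Once SgmlReaderTest has turned the page into well-formed XML, the data can be pulled out with XPath. The following is a minimal sketch of that step, not the original ParsePage. The class `TableExtractor`, its `Extract` method, and the idea of one XPath for rows plus one for the cells inside each row are all assumptions made for illustration. It also assumes the XML has no default namespace.

```csharp
using System;
using System.Data;
using System.Xml;

// Hypothetical helper: not the ParsePage from the original article.
public static class TableExtractor
{
    // rowXPath selects the row nodes; cellXPath is evaluated relative to each row.
    public static DataTable Extract(string xml, string rowXPath, string cellXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var table = new DataTable("Result");
        foreach (XmlNode row in doc.SelectNodes(rowXPath))
        {
            XmlNodeList cells = row.SelectNodes(cellXPath);

            // Grow the column set to fit the widest row seen so far.
            while (table.Columns.Count < cells.Count)
            {
                table.Columns.Add("Col" + table.Columns.Count, typeof(string));
            }

            DataRow dataRow = table.NewRow();
            for (int i = 0; i < cells.Count; i++)
            {
                dataRow[i] = cells[i].InnerText.Trim();
            }
            table.Rows.Add(dataRow);
        }
        return table;
    }
}
```

With lowercase element names, a call such as `TableExtractor.Extract(xml, "//tr", "td")` would yield one DataRow per table row. If `upper` was set when converting the page, the expressions would need to be `//TR` and `TD` instead, since XPath names are case-sensitive.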
