
Fixing garbled text when a web crawler fetches page content

2007-01-21 21:15, 399 views

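Problem:
When the crawler collects page content automatically, some pages come back as garbled text.
Cause:
The page was read with the wrong encoding. The encoding information that the existing C#/.NET classes report is sometimes wrong; in my opinion, for sites that are not ASP.NET applications, the encoding they report is always wrong.
Solution:
Idea: determine the page's encoding at run time first, then read the page content with that encoding, so that the content is not garbled.
Method:
1: Read the page content using ASCII encoding.
2: Use a regular expression to extract the encoding declaration from the content just read. The text from step 1 may be garbled, but the HTML tags are still correct because they are plain ASCII, so the encoding can be taken from the tags.
3: Read the page content again with the correct encoding.
If anyone has a better method, please share it!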

The code is attached below:

using System;
using System.Collections.Generic;
using System.Text;
using System.Net;
using System.Web;
using System.IO;
using System.Text.RegularExpressions;
namespace charset
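
The listing above breaks off after the namespace declaration. As a stand-in, here is a minimal sketch of the three steps described under "Method". It is not the original code: the class name PageReader, the regular expression, and the UTF-8 fallback are all assumptions made for illustration. It targets the .NET Framework; on .NET Core and later, legacy code pages such as gb2312 are only available after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance), otherwise this sketch silently falls back to UTF-8.

```csharp
using System;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper class, not the author's original code.
public static class PageReader
{
    // Matches charset=... inside a <meta> tag, with or without quotes.
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*?charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    // Steps 1 and 2: decode as ASCII (the tags survive), then extract the charset name.
    // Returns null when the page declares no charset in a <meta> tag.
    public static string DetectCharset(byte[] raw)
    {
        string asciiText = Encoding.ASCII.GetString(raw);
        Match m = MetaCharset.Match(asciiText);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Step 3: decode the same bytes again with the declared encoding.
    public static string Decode(byte[] raw)
    {
        string name = DetectCharset(raw);
        Encoding encoding;
        try
        {
            encoding = name == null ? Encoding.UTF8 : Encoding.GetEncoding(name);
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name: fall back to UTF-8 (an assumption).
            encoding = Encoding.UTF8;
        }
        return encoding.GetString(raw);
    }

    // Downloads the raw bytes once, so the page does not have to be requested twice.
    public static string ReadPage(string url)
    {
        using (WebClient client = new WebClient())
        {
            return Decode(client.DownloadData(url));
        }
    }
}
```

A call such as PageReader.ReadPage("http://www.gdqy.edu.cn") would then decode the page with whatever charset its own meta tag declares. Keeping the raw bytes and decoding them twice avoids a second HTTP request, which is a design choice of this sketch rather than something the original post states.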

The page at http://www.gdqy.edu.cn is encoded as gb2312.
The first method prints:
context type:text/html
charset:ISO-8859-1
content encoding:
The second method prints:
charset:gb2312

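So the information returned by the first method is wrong, and the second method is right.
Why does the first method report the encoding as ISO-8859-1?
I used Reflector to decompile the CharacterSet property, and the reason is easy to see from its source: when the Content-Type header starts with "text/" and carries no charset parameter, the property falls back to "ISO-8859-1". The source of the ContentType property would make the cause of the error clearer still, but I searched for a long time without finding it. If anyone can supply it, I would be very grateful.
Below is the source of the CharacterSet property as decompiled by Reflector, for anyone who is interested.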

public string CharacterSet
{
    get
    {
        this.CheckDisposed();
        string text1 = this.m_HttpResponseHeaders.ContentType;
        if ((this.m_CharacterSet == null) && !ValidationHelper.IsBlankString(text1))
        {
            this.m_CharacterSet = string.Empty;
            string text2 = text1.ToLower(CultureInfo.InvariantCulture);
            if (text2.Trim().StartsWith("text/"))
            {
                this.m_CharacterSet = "ISO-8859-1";
            }
            int num1 = text2.IndexOf(";");
            if (num1 > 0)
            {
                while ((num1 = text2.IndexOf("charset", num1)) >= 0)
                {
                    num1 += 7;
                    if ((text2[num1 - 8] == ';') || (text2[num1 - 8] == ' '))
                    {
                        while ((num1 < text2.Length) && (text2[num1] == ' '))
                        {
                            num1++;
                        }
                        if ((num1 < (text2.Length - 1)) && (text2[num1] == '='))
                        {
                            num1++;
                            int num2 = text2.IndexOf(';', num1);
                            if (num2 > num1)
                            {
                                this.m_CharacterSet = text1.Substring(num1, num2).Trim();
                                break;
                            }
                            this.m_CharacterSet = text1.Substring(num1).Trim();
                            break;
                        }
                    }
                }
            }
        }
        return this.m_CharacterSet;
    }
}
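
To see the effect of this logic without making a network request, the following is a simplified paraphrase of the getter as a pure function. It is a sketch only, not the framework's actual code: the class name CharacterSetDemo and the method FromContentType are hypothetical, and the charset parameter is found by splitting the header on ';' rather than by the index arithmetic of the decompiled source.

```csharp
using System;

// Hypothetical demo class; a paraphrase of the decompiled logic, not the real property.
public static class CharacterSetDemo
{
    // Takes a Content-Type header value and returns the charset the getter would report.
    public static string FromContentType(string contentType)
    {
        // A blank header leaves the character set unset.
        if (contentType == null || contentType.Trim().Length == 0)
        {
            return null;
        }

        string charset = string.Empty;
        string lower = contentType.ToLowerInvariant();

        // Default applied to every text/* response before any charset parameter is read.
        if (lower.Trim().StartsWith("text/", StringComparison.Ordinal))
        {
            charset = "ISO-8859-1";
        }

        // Parameters follow the media type, separated by ';'. Index 0 is the media type itself.
        string[] parts = contentType.Split(';');
        for (int i = 1; i < parts.Length; i++)
        {
            string p = parts[i].Trim();
            if (p.StartsWith("charset", StringComparison.OrdinalIgnoreCase))
            {
                int eq = p.IndexOf('=');
                if (eq >= 0 && eq < p.Length - 1)
                {
                    // An explicit charset parameter overrides the default.
                    charset = p.Substring(eq + 1).Trim();
                    break;
                }
            }
        }
        return charset;
    }
}
```

With this sketch, FromContentType("text/html") returns "ISO-8859-1", while FromContentType("text/html; charset=gb2312") returns "gb2312". That matches the output above: the server sent only "text/html" in the header, so the property reported the default instead of the gb2312 declared inside the page.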

The end!
