[Lessons Learned] Extracting Web Page Information with Perl
2010-07-10 23:44
#!/usr/bin/perl -w
# Gist: https://gist.github.com/2928006
# Usage: perl script.pl <page-url> <output-file>
use strict;
use LWP::Simple;

my $url      = $ARGV[0];
my $filename = $ARGV[1];
my $content  = get($url) or die "Couldn't get $url";

#$content =~ s#^.*?(<div.*?</div>).*$##m;
if ($content =~ m#(<div id="enText" style="display:block">.*?</div>)#s) {
    my $text = $1;

    # Open the template file
    open(my $template, '<', 'template.html')
        or die "Couldn't open template.html for reading: $!";
    # Read the template up to and including </html>
    my $reads = do { local $/ = "</html>"; <$template> };
    close($template);

    # Substitute the listening transcript into the template
    $reads =~ s/==TEXT_CONTENT==/$text/g;
    #print $reads;

    # Write the HTML output file
    open(my $out, '>', $filename)
        or die "Couldn't open $filename for writing: $!";
    print $out $reads;
    close($out);

    # Download the listening audio
    my $baseUrl = $url;
    $baseUrl =~ s/(.*)(\/.*\.html)/$1/;    # strip the trailing /page.html
    # The Chinese literals below match the target page's own markup
    my ($reslink) = $content =~ m/<a href="([^"]*)" title="进入下载资料页面">下载听力<\/a>/
        or die "Couldn't find the download-page link";
    $reslink = $baseUrl . "/" . $reslink;
    print "\nreslink:", $reslink, "\n";

    my $respage = get($reslink) or die "Couldn't get $reslink";
    my $mp3link = $respage;
    print $mp3link;
    #$mp3link =~ s#.*<a href="(.*?)" target="_blank"><img src="/images/downloadurl1\.jpg"></a>.*#$1#sg;

    # Without logging in, the download link is not in the page. What to do?
    if ($mp3link =~ m/downloadurl1/) {
        print "match\n";
    } else {
        print "no match\n";
    }
    print "\ndownload:" . $reslink . "\n";
} else {
    print "no match\n";
}
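The extraction and template steps of the script can be tried offline, without fetching anything. The following is a minimal sketch under stated assumptions: the helper names `extract_transcript` and `fill_template` and the sample strings are hypothetical, and the pattern is relaxed to accept any attributes after `id="enText"` rather than the exact `style` value used above. It also assumes the transcript div contains no nested `<div>`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first <div id="enText" ...>...</div> block in $html, or undef.
# Assumption: no nested <div> inside it, because the non-greedy match
# stops at the first closing </div>.
sub extract_transcript {
    my ($html) = @_;
    return $html =~ m#(<div id="enText"[^>]*>.*?</div>)#s ? $1 : undef;
}

# Replace every ==TEXT_CONTENT== placeholder in $template with $text.
sub fill_template {
    my ($template, $text) = @_;
    (my $filled = $template) =~ s/==TEXT_CONTENT==/$text/g;
    return $filled;
}

# Hypothetical sample data standing in for the fetched page and template.html.
my $page = qq{<html><body><div id="enText" style="display:block">Hello,\nworld.</div>}
         . qq{<div>other</div></body></html>};
my $template = "<html><body>==TEXT_CONTENT==</body></html>";

my $text = extract_transcript($page);
print defined $text ? fill_template($template, $text) : "no match", "\n";
```

Keeping the two steps in small functions makes it possible to check the regular expressions against a saved copy of the page before running the full download.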