Some examples of defining information sources (XML files)
2015-10-12 11:54
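Contents
1. Subscribing to a blog (a simple example)
2. Fetching information from a web page (a simple example)
3. Making full use of callback code
4. Multiple blocks in html_re
5. Parsing JSON data with the html_json worker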
1. Subscribing to a blog (a simple example):
<!-- Fan Zhihong's blog on Sohu Blog: original nutrition articles -->
<source>
    <name>范志红博客</name>
    <comment>搜狐博客。原创营养信息。</comment>
    <link>http://snowheart19.blog.sohu.com/</link>
    <worker>rss_atom</worker>
    <data>
        <url>http://snowheart19.blog.sohu.com/rss</url>
    </data>
</source>
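The post does not show how the program loads these definitions. As a rough sketch, assuming only that each definition is a well-formed `<source>` element, the fields above could be read with Python's standard `xml.etree.ElementTree`. The function name `load_source` and the returned dict layout are illustrative, not part of the original program.

```python
import xml.etree.ElementTree as ET

SOURCE_XML = """
<source>
    <name>范志红博客</name>
    <comment>搜狐博客。原创营养信息。</comment>
    <link>http://snowheart19.blog.sohu.com/</link>
    <worker>rss_atom</worker>
    <data>
        <url>http://snowheart19.blog.sohu.com/rss</url>
    </data>
</source>
"""


def load_source(xml_text):
    """Return the basic fields of one <source> definition as a dict.

    Hypothetical helper: the real loader is not shown in the post.
    """
    root = ET.fromstring(xml_text.strip())
    return {
        "name": root.findtext("name"),
        "comment": root.findtext("comment"),
        "link": root.findtext("link"),
        "worker": root.findtext("worker"),   # which worker handles this source
        "url": root.findtext("data/url"),    # the address the worker fetches
    }


source = load_source(SOURCE_XML)
print(source["worker"], source["url"])
```

For an `rss_atom` source nothing else is needed: the worker name and the feed URL are the whole definition.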
2. Fetching information from a web page (a simple example):
<!-- ybk168: new stamp issue announcements -->
<source>
    <name>ybk168新邮预告</name>
    <comment>ybk168新邮预告</comment>
    <link>http://www.ybk168.com/newslist/00040051.html</link>
    <worker>html_re</worker>
    <data>
        <url>http://www.ybk168.com/newslist/00040051.html</url>
        <block>
            <blockre flags='DOTALL'>
                <![CDATA[
                <div class="list">(.*?)<div class="page">
                ]]>
            </blockre>
            <itemre flags='DOTALL'>
                <![CDATA[
                <li><span.*?href="([^"]+)".*?title="([^"]+)".*? class="list_lr">([^<]+)<
                ]]>
            </itemre>
            <maprules>
                <title>2</title>
                <url>'http://www.ybk168.com', 1</url>
                <pub_date>3</pub_date>
            </maprules>
        </block>
    </data>
</source>
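The definition above reads as a two-stage extraction: `blockre` cuts out the part of the page that holds the list, `itemre` is applied repeatedly inside it, and `maprules` turns the captured groups into fields. The sketch below imitates that flow with Python's `re` module. It assumes that a bare number in a map rule means a capture group, a quoted string means a literal, and a comma means concatenation; that reading, the helper names, and the sample HTML are all assumptions made for illustration.

```python
import re

BLOCK_RE = re.compile(r'<div class="list">(.*?)<div class="page">', re.DOTALL)
ITEM_RE = re.compile(
    r'<li><span.*?href="([^"]+)".*?title="([^"]+)".*? class="list_lr">([^<]+)<',
    re.DOTALL,
)

# Map rules as tuples: ints are capture-group numbers, strings are literals.
MAPRULES = {
    "title": (2,),
    "url": ("http://www.ybk168.com", 1),
    "pub_date": (3,),
}

# Invented page fragment, shaped to fit the two expressions above.
SAMPLE_HTML = (
    '<div class="list"><ul>'
    '<li><span><a href="/news/1.html" title="First notice">First notice</a></span>'
    '<span class="list_lr">2015-10-12</span></li>'
    '<li><span><a href="/news/2.html" title="Second notice">Second notice</a></span>'
    '<span class="list_lr">2015-10-11</span></li>'
    '</ul></div><div class="page">1 2 3</div>'
)


def apply_rule(rule, match):
    """Concatenate literals and capture groups in the order the rule lists them."""
    return "".join(match.group(p) if isinstance(p, int) else p for p in rule)


def extract(html):
    """Stage 1: isolate the block. Stage 2: find every item inside it."""
    block = BLOCK_RE.search(html)
    if block is None:
        return []
    return [
        {field: apply_rule(rule, m) for field, rule in MAPRULES.items()}
        for m in ITEM_RE.finditer(block.group(1))
    ]


items = extract(SAMPLE_HTML)
for item in items:
    print(item["pub_date"], item["title"], item["url"])
```

Narrowing the page to a block first keeps the item expression short and stops it from matching navigation links elsewhere on the page.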
3. Making full use of callback code:
<!-- Beijing air quality, taken from the Weibo account of Beijing environmental monitoring -->
<source>
    <name>北京空气质量</name>
    <comment>北京环境监测的微博。',利有散染预【8时' in s or '浓度】' not in s</comment>
    <link>http://weibo.cn/u/2516831703</link>
    <worker>html_re</worker>
    <data>
        <url>http://weibo.cn/u/2516831703</url>
        <block>
            <blockre flags='DOTALL'>
                <![CDATA[
                <div class="b">(.*)$
                ]]>
            </blockre>
            <itemre flags='DOTALL'>
                <![CDATA[
                weibo\.cn\[([\d-]+)
                ]]>
            </itemre>
            <maprules>
                <title>'notitle'</title>
                <pub_date>1</pub_date>
                <suid>1</suid>
            </maprules>
        </block>
        <block>
            <blockre flags='DOTALL'>
                <![CDATA[
                ^(?:.*?\[<span class="kt">置顶</span>\]|.*?<span class="pms">)
                (.*?)
                <input type="submit" value="查看更多内容"
                ]]>
            </blockre>
            <itemre flags='DOTALL'>
                <![CDATA[
                <div class="c" id="([^"]+)">
                (?:<div><span class="ctt">|.*?<span class="cmt">转发理由:</span>)
                (.*?)
                (?:</span>|<a [^>]+>赞\[\d+\]).*?
                <span class="ct">([^& ]+)
                ]]>
            </itemre>
            <maprules>
                <title>'notitle'</title>
                <summary>2</summary>
                <pub_date>3</pub_date>
                <suid>1</suid>
            </maprules>
        </block>
    </data>
    <callback>
        <![CDATA[
        if posi == 0:
            # The first item comes from the first block and only carries the date.
            temp_date = info.pub_date
            info.temp = 'del'
        elif '日' in info.pub_date:
            # A pub_date containing '日' is treated as an older post.
            info.temp = 'del'
        else:
            s = info.summary
            if ',' in s or \
               '利' in s or \
               '有' in s or \
               '散' in s or \
               '染' in s or \
               '预' in s or \
               '【8时' in s or \
               '浓度】' not in s:
                info.url = 'http://weibo.cn/u/2516831703'
                info.pub_date = ''
                info.title = '[' + temp_date + '] ' + s[:16] + '…'
            else:
                info.temp = 'del'
        ]]>
    </callback>
</source>
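The callback relies on three things the post does not spell out: `posi` looks like the item's position in the result list, `info` is the item itself, and `temp_date` set on the first item is still available for later items. One way such a callback could be executed is sketched below, under those assumptions plus the assumption that `info.temp == 'del'` marks an item to be dropped. `run_callback` and the shortened `CALLBACK` text are illustrative and not taken from the original program.

```python
from types import SimpleNamespace

# Shortened stand-in for the <callback> body, without the keyword filter.
CALLBACK = """
if posi == 0:
    temp_date = info.pub_date      # remember the date carried by the first item
    info.temp = 'del'
elif '日' in info.pub_date:
    info.temp = 'del'
else:
    info.title = '[' + temp_date + '] ' + info.summary[:16]
"""


def run_callback(code_text, infos):
    """Run the callback once per item and drop items flagged with temp == 'del'."""
    code = compile(code_text, "<callback>", "exec")
    namespace = {}  # shared by all items, so temp_date survives between calls
    kept = []
    for posi, info in enumerate(infos):
        namespace["posi"] = posi
        namespace["info"] = info
        exec(code, namespace)
        if getattr(info, "temp", None) != "del":
            kept.append(info)
    return kept


# Invented items: a date-only item, an older post, and a current post.
infos = [
    SimpleNamespace(title="notitle", pub_date="10-12", summary=""),
    SimpleNamespace(title="notitle", pub_date="10月11日 08:00", summary="old post"),
    SimpleNamespace(title="notitle", pub_date="08:05", summary="Air quality report for 8:00"),
]
kept = run_callback(CALLBACK, infos)
for info in kept:
    print(info.title)
```

Sharing one namespace across calls is what lets a value captured from one item (here the date) be reused when building the titles of the others.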
4. Multiple blocks in html_re:
<!-- Chinese National Geography; the summary '景观图片' in the last block means "landscape pictures" -->
<source>
    <name>中国国家地理</name>
    <comment>中国国家地理</comment>
    <link>http://www.dili360.com/</link>
    <worker>html_re</worker>
    <data>
        <url>http://www.dili360.com/</url>
        <block>
            <blockre flags='DOTALL'>
                <![CDATA[
                <div class="community-item" id="community-items" >
                (.*?)<!--end-->
                ]]>
            </blockre>
            <itemre flags='DOTALL'>
                <![CDATA[
                <li class="img-block".*?
                <a target="_blank" href="([^"]+)">.*?
                <h4>(.*?)</h4>
                ]]>
            </itemre>
            <maprules>
                <title>2</title>
                <url>'http://www.dili360.com', 1</url>
            </maprules>
        </block>
        <block>
            <blockre flags='DOTALL'>
                <![CDATA[
                <div class="community-item" id="community-items" >
                (.*?)<!--end-->
                ]]>
            </blockre>
            <itemre flags='DOTALL'>
                <![CDATA[
                <dt><a href="([^"]+)" target="_blank">(.*?)</a></dt>
                ]]>
            </itemre>
            <maprules>
                <title>2</title>
                <url>'http://www.dili360.com', 1</url>
            </maprules>
        </block>
        <block>
            <blockre flags='DOTALL'>
                <![CDATA[
                <ul class="style-1" id="replace">(.*?)</ul>
                ]]>
            </blockre>
            <itemre flags='DOTALL'>
                <![CDATA[
                <div class="detail">.*?
                <a href="([^"]+)" target="_blank"><h4>(.*?)</h4>
                ]]>
            </itemre>
            <maprules>
                <title>2</title>
                <url>'http://www.dili360.com', 1</url>
                <summary>'景观图片'</summary>
            </maprules>
        </block>
    </data>
</source>
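With several `<block>` elements, each block brings its own pair of expressions and its own map rules, and one page yields items from all of them. A minimal sketch of that loop follows. It assumes the results are simply concatenated in block order, which is a guess rather than documented behavior; the expressions and the sample page are simplified, invented stand-ins for the ones above.

```python
import re

BASE = "http://www.dili360.com"

# Simplified, invented stand-ins for the <block> elements in the definition.
BLOCKS = [
    {
        "blockre": r'<ul id="articles">(.*?)</ul>',
        "itemre": r'<a href="([^"]+)">(.*?)</a>',
        "summary": None,
    },
    {
        "blockre": r'<ul id="pictures">(.*?)</ul>',
        "itemre": r'<a href="([^"]+)"><h4>(.*?)</h4>',
        "summary": "景观图片",  # fixed literal, as in the third block's maprules
    },
]

SAMPLE_HTML = (
    '<ul id="articles"><li><a href="/a/1">Article one</a></li>'
    '<li><a href="/a/2">Article two</a></li></ul>'
    '<ul id="pictures"><li><a href="/p/1"><h4>Picture one</h4></a></li></ul>'
)


def extract_all(html, blocks):
    """Apply every block definition to the same page and collect all items."""
    items = []
    for block in blocks:
        block_match = re.search(block["blockre"], html, re.DOTALL)
        if block_match is None:
            continue  # a block that is missing from the page contributes nothing
        for m in re.finditer(block["itemre"], block_match.group(1), re.DOTALL):
            item = {"title": m.group(2), "url": BASE + m.group(1)}
            if block["summary"]:
                item["summary"] = block["summary"]
            items.append(item)
    return items


items = extract_all(SAMPLE_HTML, BLOCKS)
for item in items:
    print(item)
```

Keeping each block independent means a layout change in one region of the page only breaks that block's items, not the whole source.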
5. Parsing JSON data with the html_json worker:
<!-- Sina book news, from Sina Books -->
<source>
    <name>新浪书讯</name>
    <comment>新浪图书,书讯。</comment>
    <link>http://book.sina.com.cn/</link>
    <worker>html_json</worker>
    <data>
        <url>http://feed.mix.sina.com.cn/api/roll/get?callback=jsonp1436772833418&amp;pageid=8&amp;lid=156&amp;num=20</url>
        <re flags='DOTALL'>
            <![CDATA[
            ^try\{\w+\(
            (.*)
            \);\}catch\(e\)\{\};$
            ]]>
        </re>
        <block>
            <block_path>'result', 'data'</block_path>
            <title>'title'</title>
            <url>'url'</url>
            <summary>'summary'</summary>
            <temp>'intro'</temp>
            <pub_date>'ctime'</pub_date>
        </block>
    </data>
    <callback>
        <![CDATA[
        info.pub_date = unixtime(info.pub_date)
        info.summary = info.summary or info.temp
        info.temp = 0
        ]]>
    </callback>
</source>
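Read step by step, this definition strips the JSONP wrapper with `<re>`, parses what remains as JSON, walks down `block_path` to the list of entries, and copies the named keys into item fields before the callback tidies them up. The sketch below follows those steps on an invented response. The `unixtime` helper here is a guess at what the program's function of the same name does (its real output format is not shown), and `parse`, `FIELDS`, and the sample data are illustrative.

```python
import json
import re
from datetime import datetime, timezone

JSONP_RE = re.compile(r'^try\{\w+\((.*)\);\}catch\(e\)\{\};$', re.DOTALL)
BLOCK_PATH = ("result", "data")
# item field -> key in each JSON entry, mirroring the <block> element
FIELDS = {"title": "title", "url": "url", "summary": "summary",
          "temp": "intro", "pub_date": "ctime"}

# Invented response in the shape the <re> expression expects.
SAMPLE_RESPONSE = (
    'try{jsonp1436772833418({"result": {"data": [{"title": "A new book", '
    '"url": "http://book.sina.com.cn/example", "summary": "", '
    '"intro": "Short introduction", "ctime": "1436772833"}]}});}catch(e){};'
)


def unixtime(value):
    """Assumed behavior: turn a Unix timestamp into a UTC date string."""
    moment = datetime.fromtimestamp(int(value), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def parse(response_text):
    """Unwrap the JSONP call, follow BLOCK_PATH, and build one dict per entry."""
    match = JSONP_RE.match(response_text.strip())
    if match is None:
        raise ValueError("response does not look like the expected JSONP wrapper")
    node = json.loads(match.group(1))
    for key in BLOCK_PATH:
        node = node[key]
    items = []
    for entry in node:
        item = {field: entry.get(key) for field, key in FIELDS.items()}
        # Same three steps as the <callback> above.
        item["pub_date"] = unixtime(item["pub_date"])
        item["summary"] = item["summary"] or item["temp"]
        item["temp"] = 0
        items.append(item)
    return items


items = parse(SAMPLE_RESPONSE)
for item in items:
    print(item["pub_date"], item["title"], item["summary"])
```

The `temp` field acts as scratch space: it carries `intro` only long enough to serve as a fallback when `summary` is empty, then is reset.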