
Legal Website Recommendation (Part 2): Data Preprocessing with Pig

2016-12-24 21:06
Continued from Legal Website Recommendation (Part 1): Exploratory Data Analysis with Hive

3) Data preprocessing

   1. Data cleaning

   2. Data transformation

   3. Attribute reduction

Based on the URL-type distribution analysis above, the two categories with the largest share (consultation content pages and knowledge content pages) are selected for model analysis in the subsequent steps. From this analysis we can also derive several cleaning rules for removing data irrelevant to the analysis goal:

Experiment content: data cleaning, data transformation, and attribute reduction.

Experiment steps:

1. Delete records of users with no .html click behavior; count the deleted and remaining records.

2. Based on the result of step 1, delete intermediate-type pages (URLs containing the midques_ keyword); count the deleted and remaining records.

3. Based on the result of step 2, delete records whose URL contains "?"; count the deleted and remaining records.

4. Based on the result of step 3, delete 法律快车-律师助手 (lawyer-assistant) records, i.e. pages whose title contains the "法律快车-律师助手" keyword; count the deleted and remaining records.

5. Based on the result of step 4, keep only the data required by the model (consultation and knowledge page data); count the deleted and remaining records.

6. Based on the result of step 5, delete duplicate records (the same user visiting the same page at the same time); count the deleted and remaining records.

7. Write the processing results of steps 1-5 to HDFS.
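The filtering rules of steps 1-5 can be sketched as plain pattern tests. The following is a minimal illustration in Python, not the original Pig job: the field layout mirrors the columns used in the Pig scripts below ($10 = visited URL, $11 = page-category code, $13 = page title), and the sample records are invented for demonstration.

```python
import re

def keep_record(url, category_code, title):
    """Return True if a record survives the cleaning rules of steps 1-5."""
    if not re.fullmatch(r'.*\.html', url):           # step 1: keep only .html clicks
        return False
    if 'midques_' in url:                            # step 2: drop intermediate-type pages
        return False
    if '?' in url:                                   # step 3: drop URLs containing "?"
        return False
    if '法律快车-律师助手' in title:                  # step 4: drop lawyer-assistant pages
        return False
    if category_code[:3] not in ('101', '107'):      # step 5: keep consultation/knowledge pages
        return False
    return True

# invented sample records: (url, category_code, title)
records = [
    ('/info/a1.html', '101003', 'a consultation page'),
    ('/info/midques_2.html', '101003', 'an intermediate page'),
    ('/ask/q3.html?page=2', '107001', 'a paginated page'),
    ('/tools/calc', '199000', 'a non-.html click'),
    ('/info/a4.html', '107002', '法律快车-律师助手'),
    ('/know/k5.html', '107002', 'a knowledge article'),
]
cleaned = [r for r in records if keep_record(*r)]
print(len(cleaned))  # → 2
```

Each rule removes a different kind of noise before modeling; the Pig scripts below apply the same predicates cumulatively, one step at a time, so that the deleted count can be reported per rule.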

Step 1 ----- Count user records with no .html click behavior (delete them and count the deleted and remaining records)

-- load the raw CSV (837450 records in total); field $10 is the visited URL
law = load '/root/law_utf8.csv' using PigStorage(',');
-- records to delete: URL contains no .html click
law_filter = filter law by not ($10 matches '.*\\.html');
law_grp = group law_filter all;
count_num = foreach law_grp generate COUNT(law_filter) as delete_num, (837450 - COUNT(law_filter)) as remain_num;
set job.name 'law_filter';

dump count_num;

Result: (165681, 671769) (inferred: step 2 below starts from a total of 671769 records)


Step 2:

-------- Count of intermediate-type pages
law_filter = filter law by
    $10 matches '.*\\.html'
    and not ($10 matches '.*midques_.*');
law_grp = group law_filter all;
count_num = foreach law_grp generate
    (671769 - COUNT(law_filter)) as delete_num,
    COUNT(law_filter) as remain_num;
set job.name 'law_mid';
dump count_num;

Result: (2036, 669733)

Step 3:

-------- Count of records whose URL contains "?"
law_filter = filter law by
    $10 matches '.*\\.html'
    and not ($10 matches '.*midques_.*')
    and not ($10 matches '.*\\?.*');
law_grp = group law_filter all;
count_num = foreach law_grp generate
    (669733 - COUNT(law_filter)) as delete_num,
    COUNT(law_filter) as remain_num;
set job.name 'law_mark';
dump count_num;

Result: (52, 669681)

Step 4:

------- Count of pages whose title contains 法律快车-律师助手
law_filter = filter law by $10 matches '.*\\.html'
    and not ($10 matches '.*midques_.*')
    and not ($10 matches '.*\\?.*')
    and not ($13 matches '.*法律快车-律师助手.*');
law_grp = group law_filter all;
count_num = foreach law_grp generate
    (669681 - COUNT(law_filter)) as delete_num,
    COUNT(law_filter) as remain_num;
set job.name 'law_kuaiche';
dump count_num;

Result: (11, 669670)

Step 5:

------- Select the data required by the model (consultation and knowledge page data)
Note: to use the SUBSTRING() function, register the piggybank.jar package first.

register <Pig installation directory>/lib/piggybank.jar
law_filter = filter law by $10 matches '.*\\.html'
    and not ($10 matches '.*midques_.*')
    and not ($10 matches '.*\\?.*')
    and not ($13 matches '.*法律快车-律师助手.*')
    -- keep pages whose category code ($11) starts with 101 or 107 (consultation and knowledge pages)
    and (SUBSTRING($11, 0, 3) == '101' or SUBSTRING($11, 0, 3) == '107');
law_grp = group law_filter all;
count_num = foreach law_grp generate
    (669670 - COUNT(law_filter)) as delete_num,
    COUNT(law_filter) as remain_num;
set job.name 'law_model';
dump count_num;

Result: (100460, 569210)

Step 6:

------- Count of duplicate records
Note: to use the SUBSTRING() function, register the piggybank.jar package first.

register <Pig installation directory>/lib/piggybank.jar
law_filter = filter law by $10 matches '.*\\.html'
    and not ($10 matches '.*midques_.*')
    and not ($10 matches '.*\\?.*')
    and not ($13 matches '.*法律快车-律师助手.*')
    and (SUBSTRING($11, 0, 3) == '101' or SUBSTRING($11, 0, 3) == '107');

-- project the user, time and URL fields used to detect duplicates
law_dist_field = foreach law_filter generate $4, $7, $10;

law_distinct = distinct law_dist_field;

law_grp = group law_distinct all;
count_num = foreach law_grp generate
    (569210 - COUNT(law_distinct)) as delete_num,
    COUNT(law_distinct) as remain_num;
set job.name 'store_extract';
dump count_num;

Result: (14057, 555153)
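Step 6's deduplication keeps one record per (user, time, URL) triple, which is exactly what `distinct` over the projected fields computes. A minimal Python equivalent, with invented sample rows, might look like:

```python
# invented sample rows: (user_id, access_time, url) — the three fields
# ($4, $7, $10) projected before 'distinct' in the Pig script
rows = [
    ('u1', '2015-02-01 10:00:01', '/info/a1.html'),
    ('u1', '2015-02-01 10:00:01', '/info/a1.html'),  # exact duplicate: dropped
    ('u1', '2015-02-01 10:05:00', '/info/a1.html'),  # same page, different time: kept
    ('u2', '2015-02-01 10:00:01', '/info/a1.html'),  # different user: kept
]

# 'distinct' over the projected fields behaves like a set of tuples
deduped = sorted(set(rows))
deleted = len(rows) - len(deduped)
print(deleted, len(deduped))  # → 1 3
```

Note that because only three fields are projected, any other columns are discarded at this point; the Pig script therefore uses this relation only for counting, while step 7 stores the pre-deduplication filter result.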

Step 7:

------- Write the cleaning results of steps 1-5 to HDFS
Note: to use the SUBSTRING() function, register the piggybank.jar package first.

register <Pig installation directory>/lib/piggybank.jar
law_filter = filter law by $10 matches '.*\\.html'
    and not ($10 matches '.*midques_.*')
    and not ($10 matches '.*\\?.*')
    and not ($13 matches '.*法律快车-律师助手.*')
    and (SUBSTRING($11, 0, 3) == '101' or SUBSTRING($11, 0, 3) == '107');
set job.name 'store_result';
store law_filter into '/data/out' using PigStorage(',');