Hive 在多维统计分析中的应用 & 技巧总结
2017-04-18 10:21
435 查看
本文原地址:https://my.oschina.net/leejun2005/blog/121945多维统计一般分两种,我们看看 Hive 中如何解决:
1、同属性的多维组合统计(1)问题:
有如下数据,字段内容分别为:url, catePath0, catePath1, catePath2, unitparamshttps://cwiki.apache.org/confluence 0 1 8 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} http://my.oschina.net/leejun2005/blog/83058 0 1 23 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} http://www.hao123.com/indexnt.html?sto 0 1 25 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} https://cwiki.apache.org/confluence 0 5 18 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} http://my.oschina.net/leejun2005/blog/83058 0 5 118 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} http://www.hao123.com/indexnt.html?sto 0 3 98 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} http://www.hao123.com/indexnt.html?sto 0 3 8 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} http://my.oschina.net/leejun2005/blog/83058 0 5 81 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} http://www.hao123.com/indexnt.html?sto 0 9 8 {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
(2)需求:
计算 catePath0, catePath1, catePath2 这三种维度组合下,各个 url 对应的 pv、uv,如:0 1 23 1 1
0 1 25 1 1
0 1 8 1 1
0 1 ALL 3 3
0 3 8 1 1
0 3 98 1 1
0 3 ALL 2 1
0 5 118 1 1
0 5 18 1 1
0 5 81 1 1
0 5 ALL 3 2
0 ALL ALL 8 3
ALL ALL ALL 8 3
(3)解决思路:
hive 中同属性多维统计问题通常用 union all 组合出各种维度然后 group by 进行求解:
2、不同属性的多维组合统计这种场景下我们一般选择 Multi Table/File Inserts,下面选自《programming hive》P124Making Multiple Passes over the Same Data
Hive has a special syntax for producing multiple aggregations from a single pass
through a source of data, rather than rescanning it for each aggregation. This change
can save considerable processing time for large input data sets. We discussed the details
previously in Chapter 5.
For example, each of the following two queries creates a table from the same source
table, history:
hive> INSERT OVERWRITE TABLE sales
> SELECT * FROM history WHERE action='purchased';
hive> INSERT OVERWRITE TABLE credits
> SELECT * FROM history WHERE action='returned';
This syntax is correct, but inefficient. The following rewrite achieves the same thing,
but using a single pass through the source history table:
hive> FROM history
> INSERT OVERWRITE sales SELECT * WHERE action='purchased'
> INSERT OVERWRITE credits SELECT * WHERE action='returned';
注意事项以及一些小技巧:1、hive union all 的用法:不支持 top level,以及各个select字段名称、属性必须严格一致2、结果的顺序问题,可以自己加字符控制排序3、多重insert和union all一样也只扫描一次,但因为要insert到多个分区,所以做了很多其他的事情,导致消耗的时间非常长,其会产生多个job,union all 本身只有一个job关于 insert overwrite 产生多 job 并行执行的问题:set hive.exec.parallel=true; //打开任务并行执行
set hive.exec.parallel.thread.number=16; //同一个sql允许最大并行度,默认为8。
http://superlxw1234.iteye.com/blog/1703713
4、当前HIVE 不支持 not in 中包含查询子句的语法,形如如下的HQ语句是不被支持的:
查询在key字段在a表中,但不在b表中的数据
select a.key from a where key not in(select key from b) 该语句在hive中不支持
可以通过left outer join进行查询,(假设B表中包含另外的一个字段 key1
select a.key from a left outer join b on a.key=b.key where b.key1 is null5、left out join 不能连续3个以上使用,必须2个一组,2个一组包装起来使用。
8、hive中巧用正则表达式的贪婪匹配http://superlxw1234.iteye.com/blog/17512169、hive匹配全中文字段用java中匹配中文的正则即可:name rlike '^[\\u4e00-\\u9fa5]+$'判断一个字段是否全数字:
select mobile from woa_login_log_his where pt = '2012-01-10' and mobile rlike '^\\d+$' limit 50;
10、hive中使用sql window函数 LAG/LEAD/FIRST/LASThttp://superlxw1234.iteye.com/blog/1600323http://www.shaoqun.com/a/18839.aspx11、hive优化之------控制hive任务中的map数和reduce数http://superlxw1234.iteye.com/blog/158288012、hive中转义$等特殊字符http://superlxw1234.iteye.com/blog/156873913、日期处理:查看N天前的日期:select from_unixtime(unix_timestamp('20111102','yyyyMMdd') - N*86400,'yyyyMMdd') from t_lxw_test1 limit 1; 获取两个日期之间的天数/秒数/分钟数等等:select ( unix_timestamp('2011-11-02','yyyy-MM-dd')-unix_timestamp('2011-11-01','yyyy-MM-dd') ) / 86400 from t_lxw_test limit 1;
14、删除 Hive 临时文件 hive.exec.scratchdirhttp://hi.baidu.com/youziguo/item/1dd7e6315dcc0f28b2c0c576
REF:http://superlxw1234.iteye.com/blog/1536440 http://liubingwwww.blog.163.com/blog/static/3048510720125201749323/ http://blog.csdn.net/azhao_dn/article/details/6921429http://superlxw1234.iteye.com/category/228899
1、同属性的多维组合统计(1)问题:
有如下数据,字段内容分别为:url, catePath0, catePath1, catePath2, unitparamshttps://cwiki.apache.org/confluence 0 1 8 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} http://my.oschina.net/leejun2005/blog/83058 0 1 23 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} http://www.hao123.com/indexnt.html?sto 0 1 25 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} https://cwiki.apache.org/confluence 0 5 18 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} http://my.oschina.net/leejun2005/blog/83058 0 5 118 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} http://www.hao123.com/indexnt.html?sto 0 3 98 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} http://www.hao123.com/indexnt.html?sto 0 3 8 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} http://my.oschina.net/leejun2005/blog/83058 0 5 81 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} http://www.hao123.com/indexnt.html?sto 0 9 8 {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"}
(2)需求:
计算 catePath0, catePath1, catePath2 这三种维度组合下,各个 url 对应的 pv、uv,如:0 1 23 1 1
0 1 25 1 1
0 1 8 1 1
0 1 ALL 3 3
0 3 8 1 1
0 3 98 1 1
0 3 ALL 2 1
0 5 118 1 1
0 5 18 1 1
0 5 81 1 1
0 5 ALL 3 2
0 ALL ALL 8 3
ALL ALL ALL 8 3
(3)解决思路:
hive 中同属性多维统计问题通常用 union all 组合出各种维度然后 group by 进行求解:
create EXTERNAL table IF NOT EXISTS t_log ( url string, c0 string, c1 string, c2 string, unitparams string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location '/tmp/decli/1';select * from ( select host, c0, c1, c2 from t_log t0 LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9 union all select host, c0, c1, 'ALL' c2 from t_log t0 LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9 union all select host, c0, 'ALL' c1, 'ALL' c2 from t_log t0 LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9 union all select host, 'ALL' c0, 'ALL' c1, 'ALL' c2 from t_log t0 LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9) test;select c0, c1, c2, count(host) PV, count(distinct(host)) UV from ( select host, c0, c1, c2 from t_log t0 LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9 union all select host, c0, c1, 'ALL' c2 from t_log t0 LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9 union all select host, c0, 'ALL' c1, 'ALL' c2 from t_log t0 LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9 union all select host, 'ALL' c0, 'ALL' c1, 'ALL' c2 from t_log t0 LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9) test group by c0, c1, c2;
2、不同属性的多维组合统计这种场景下我们一般选择 Multi Table/File Inserts,下面选自《programming hive》P124Making Multiple Passes over the Same Data
Hive has a special syntax for producing multiple aggregations from a single pass
through a source of data, rather than rescanning it for each aggregation. This change
can save considerable processing time for large input data sets. We discussed the details
previously in Chapter 5.
For example, each of the following two queries creates a table from the same source
table, history:
hive> INSERT OVERWRITE TABLE sales
> SELECT * FROM history WHERE action='purchased';
hive> INSERT OVERWRITE TABLE credits
> SELECT * FROM history WHERE action='returned';
This syntax is correct, but inefficient. The following rewrite achieves the same thing,
but using a single pass through the source history table:
hive> FROM history
> INSERT OVERWRITE sales SELECT * WHERE action='purchased'
> INSERT OVERWRITE credits SELECT * WHERE action='returned';
FROM pv_users INSERT OVERWRITE TABLE pv_gender_sum SELECT pv_users.gender, count_distinct(pv_users.userid) GROUP BY pv_users.gender INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum' SELECT pv_users.age, count_distinct(pv_users.userid) GROUP BY pv_users.age;https://cwiki.apache.org/confluence/display/Hive/Tutorial
注意事项以及一些小技巧:1、hive union all 的用法:不支持 top level,以及各个select字段名称、属性必须严格一致2、结果的顺序问题,可以自己加字符控制排序3、多重insert和union all一样也只扫描一次,但因为要insert到多个分区,所以做了很多其他的事情,导致消耗的时间非常长,其会产生多个job,union all 本身只有一个job关于 insert overwrite 产生多 job 并行执行的问题:set hive.exec.parallel=true; //打开任务并行执行
set hive.exec.parallel.thread.number=16; //同一个sql允许最大并行度,默认为8。
http://superlxw1234.iteye.com/blog/1703713
4、当前HIVE 不支持 not in 中包含查询子句的语法,形如如下的HQ语句是不被支持的:
查询在key字段在a表中,但不在b表中的数据
select a.key from a where key not in(select key from b) 该语句在hive中不支持
可以通过left outer join进行查询,(假设B表中包含另外的一个字段 key1
select a.key from a left outer join b on a.key=b.key where b.key1 is null5、left out join 不能连续3个以上使用,必须2个一组,2个一组包装起来使用。
select p.ssi,p.pv,p.uv,p.nuv,p.visits,'2012-06-19 17:00:00' from ( select * from ( select * from (select ssi,count(1) pv,sum(visits) visits from FactClickAnalysis where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi ) p1 left outer join ( select ssi,count(1) uv from (select ssi,cookieid from FactClickAnalysis where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi,cookieid ) t1 group by ssi ) p2 on p1.ssi=p2.ssi ) p3 left outer join ( select ssi, count(1) nuv from FactClickAnalysis where logTime = insertTime and logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi ) p4 on p3.ssi=p4.ssi ) p6、hive本地执行mrhttp://superlxw1234.iteye.com/blog/17035467、hive动态分区创建过多遇到的一个错误http://superlxw1234.iteye.com/blog/1677938
8、hive中巧用正则表达式的贪婪匹配http://superlxw1234.iteye.com/blog/17512169、hive匹配全中文字段用java中匹配中文的正则即可:name rlike '^[\\u4e00-\\u9fa5]+$'判断一个字段是否全数字:
select mobile from woa_login_log_his where pt = '2012-01-10' and mobile rlike '^\\d+$' limit 50;
10、hive中使用sql window函数 LAG/LEAD/FIRST/LASThttp://superlxw1234.iteye.com/blog/1600323http://www.shaoqun.com/a/18839.aspx11、hive优化之------控制hive任务中的map数和reduce数http://superlxw1234.iteye.com/blog/158288012、hive中转义$等特殊字符http://superlxw1234.iteye.com/blog/156873913、日期处理:查看N天前的日期:select from_unixtime(unix_timestamp('20111102','yyyyMMdd') - N*86400,'yyyyMMdd') from t_lxw_test1 limit 1; 获取两个日期之间的天数/秒数/分钟数等等:select ( unix_timestamp('2011-11-02','yyyy-MM-dd')-unix_timestamp('2011-11-01','yyyy-MM-dd') ) / 86400 from t_lxw_test limit 1;
14、删除 Hive 临时文件 hive.exec.scratchdirhttp://hi.baidu.com/youziguo/item/1dd7e6315dcc0f28b2c0c576
REF:http://superlxw1234.iteye.com/blog/1536440 http://liubingwwww.blog.163.com/blog/static/3048510720125201749323/ http://blog.csdn.net/azhao_dn/article/details/6921429http://superlxw1234.iteye.com/category/228899
相关文章推荐
- Hive 在多维统计分析中的应用 & 技巧总结
- Hive 在多维统计分析中的应用 & 技巧总结
- Hive 在多维统计分析中的应用
- asp获取URL参数的几种方法分析总结[原创]_应用技巧_脚本之家
- Android应用技巧总结
- 中移动开发者社区应用测试统计分析报告
- ORACLE数据库对象统计分析技术应用(原创)
- C#特性Attribute的实际应用之:代码统计分析
- WindowsXP日常应用技巧及经验总结
- 对象统计分析技术应用
- WatiN的一些应用技巧总结
- 项目开发技巧(一):将Web应用打包成war文件的方法总结
- 统计分析与数据挖掘所涉及的应用领域探讨
- jQuery应用(三)--jQuery技巧总结
- windowsXP日常应用技巧及经验总结(转载)【实用】
- 数据库分析与设计中总结出来的14个技巧
- 统计分析与数据挖掘所涉及的应用领域探讨
- SQL进行排序、分组、统计的10个新技巧(个人总结)-------Mondify By LiFuyun
- jQuery应用(三)--jQuery技巧总结
- 技巧总结:div中class与id的区别及应用