
Pig Knowledge Gaps

2015-06-03 20:38

1. Word Count example

inputfile = load 'file' as (line);

Contents:

{line: bytearray}

(Look at the stars, )

(Look how they shine for you, )

(And everything you do, )

(Yeah they were all yellow,)

wordsLine = foreach inputfile generate flatten(TOKENIZE(line)) as eachWord;

Contents:

{eachWord: chararray}

(Look)

(at)

(the)

(stars)

(Look)

(how)

(they)

(shine)

(for)

(you)

(And)

(everything)

(you)

(do)

(Yeah)

(they)

(were)

(all)

(yellow)

groupedWords = group wordsLine by eachWord;

Contents:

{group: chararray,wordsLine: {(eachWord: chararray)}}

(at,{(at)})

(do,{(do)})

(And,{(And)})

(all,{(all)})

(for,{(for)})

(how,{(how)})

(the,{(the)})

(you,{(you),(you)})

(Look,{(Look),(Look)})

(Yeah,{(Yeah)})

(they,{(they),(they)})

(were,{(were)})

(shine,{(shine)})

(stars,{(stars)})

(yellow,{(yellow)})

(everything,{(everything)})

wordcount = foreach groupedWords generate group, COUNT(wordsLine) as cnt;

Contents:

{group: chararray, cnt:long}

(at,1)

(do,1)

(And,1)

(all,1)

(for,1)

(how,1)

(the,1)

(you,2)

(Look,2)

(Yeah,1)

(they,2)

(were,1)

(shine,1)

(stars,1)

(yellow,1)

(everything,1)
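The schemas and rows listed at each step look like the output of describe and dump. A minimal sketch of how to reproduce them, assuming the same input path 'file' used above:

```pig
-- Load each line of the input as a single untyped field.
inputfile = LOAD 'file' AS (line);

-- DESCRIBE prints the schema of a relation.
DESCRIBE inputfile;

-- DUMP runs the pipeline and prints the rows to the terminal.
DUMP inputfile;
```

The same two statements can be applied to wordsLine, groupedWords and wordcount to inspect the intermediate relations.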

2. Pig return codes

0: success; 1: failure, retriable; 2: failure; 3: partial failure; ... the remaining codes indicate various exceptions.

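3. Complex data types (rarely used)

map: key/value pairs; keys are chararray

tuple: fixed-length, ordered collection of fields

bag: unordered collection of tuples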
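As a rough illustration of the three complex types, the sketch below declares one field of each kind and projects from it. The file name and all field names are hypothetical.

```pig
-- Hypothetical input with one map, one tuple and one bag per record.
players = LOAD 'players' AS (
    name:chararray,
    stats:map[],
    pos:tuple(x:int, y:int),
    teams:bag{t:(team:chararray)}
);

-- stats#'games' looks up a map key, pos.x projects a tuple field,
-- and FLATTEN(teams) emits one output row per tuple in the bag.
projected = FOREACH players GENERATE name, stats#'games', pos.x, FLATTEN(teams);
```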

4. NULL values

null propagates through every operator; an expression with a null operand evaluates to null:

x + null = null

null == 1 ? 1 : 0 evaluates to null
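Because null propagates, it usually has to be tested explicitly with IS NULL or IS NOT NULL. A minimal sketch, assuming a hypothetical input 'data' with two int fields:

```pig
data = LOAD 'data' AS (x:int, y:int);

-- total is null whenever x or y is null.
sums = FOREACH data GENERATE x + y AS total;

-- Substitute 0 for null before adding.
safe = FOREACH data GENERATE (x IS NULL ? 0 : x) + (y IS NULL ? 0 : y) AS total;

-- Keep only the rows where x has a value.
nonnull = FILTER data BY x IS NOT NULL;
```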

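5. Loading and storing

Load functions:

PigStorage(','): HDFS path

HBaseStorage(): HBase table

TextLoader: HDFS path; each line is loaded as one tuple

Store functions:

PigStorage(','): HDFS path

HBaseStorage(): HBase table

TextLoader is a load function only and cannot be used for storing.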
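The load and store functions above are passed with the USING clause. The sketch below assumes hypothetical paths, an HBase table named users and a column family named info.

```pig
-- Comma-separated file on HDFS.
users = LOAD '/data/users.csv' USING PigStorage(',') AS (id:int, name:chararray);

-- Each line becomes a tuple with a single chararray field.
lines = LOAD '/data/raw.log' USING TextLoader() AS (line:chararray);

-- HBase table; -loadKey true returns the row key as the first field.
hb = LOAD 'hbase://users'
     USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:name info:age', '-loadKey true')
     AS (id:bytearray, name:chararray, age:int);

-- Write tab-separated output to HDFS.
STORE users INTO '/out/users' USING PigStorage('\t');
```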

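6. Case sensitivity:
Keywords are case-insensitive: load == LOAD
Aliases are case-sensitive: tablea != tableA
Function names (including UDFs) are case-sensitive: count != COUNT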

7. Parallel

Operators that trigger a reduce phase: group, order, distinct, join, cogroup, cross

Any of them can be followed by a parallel clause that sets the number of reducers.
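A short sketch of both ways to set reducer parallelism, with a hypothetical input 'words'; the numbers are arbitrary.

```pig
words = LOAD 'words' AS (word:chararray);

-- Use 10 reducers for this GROUP only.
grouped = GROUP words BY word PARALLEL 10;

-- Script-wide default for reduce-side operators that have no PARALLEL clause.
SET default_parallel 20;
```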

8. Registering UDFs

Use the register command, or the properties -Dudf.import.list (packages searched for UDF class names) or -Dpig.additional.jars (extra jars to load).
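The register route might look like the sketch below. The jar name and the class com.example.pig.Upper are hypothetical placeholders for your own UDF.

```pig
-- Make the classes in the jar available to the script.
REGISTER 'myudfs.jar';

-- Optional short alias for the fully qualified class name.
DEFINE Upper com.example.pig.Upper();

names = LOAD 'names' AS (name:chararray);
upper = FOREACH names GENERATE Upper(name);
```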

9. Java static functions

These are invoked through reflection.
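Pig ships built-in invokers such as InvokeForString for this purpose. A minimal sketch that wraps java.lang.Integer.toHexString, with a hypothetical input 'numbers':

```pig
-- Bind a static Java method; the second argument lists its parameter types.
DEFINE hex InvokeForString('java.lang.Integer.toHexString', 'int');

nums = LOAD 'numbers' AS (n:int);
hexed = FOREACH nums GENERATE hex(n);
```

Since the call goes through reflection on every record, a dedicated UDF is likely the better choice for heavy use.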

10. flatten

Applied to a bag, it turns one row into multiple rows, one per tuple in the bag.
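A small sketch of flattening a bag, assuming a hypothetical input in which each player has a bag of teams:

```pig
players = LOAD 'players' AS (name:chararray, teams:bag{t:(team:chararray)});

-- A record (alice,{(a),(b)}) yields two rows: (alice,a) and (alice,b).
-- A record with an empty bag yields no rows.
flat = FOREACH players GENERATE name, FLATTEN(teams) AS team;
```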

11. replicated join

A map-side join: the smaller input is held in memory on every map task, so no reduce phase is needed.
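A sketch of the syntax, with hypothetical relations; the in-memory input is listed last.

```pig
big = LOAD 'big' AS (id:int, payload:chararray);
small = LOAD 'small' AS (id:int, label:chararray);

-- 'small' must fit in memory; it is replicated to every map task.
joined = JOIN big BY id, small BY id USING 'replicated';
```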

12. skew join

Samples the input first to determine the key distribution, then uses a custom partitioner to balance the load across reducers.
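A sketch of the syntax, assuming hypothetical relations where a few user ids account for most of the clicks:

```pig
clicks = LOAD 'clicks' AS (user:int, url:chararray);
users = LOAD 'users' AS (user:int, name:chararray);

-- Records for heavily repeated keys are spread over several reducers.
joined = JOIN clicks BY user, users BY user USING 'skewed';
```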

13. merge join

For inputs that are already sorted on the join key; more efficient than the default join.
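A sketch of the syntax, assuming both hypothetical inputs are already stored sorted by id:

```pig
left_sorted = LOAD 'left_sorted' AS (id:int, a:chararray);
right_sorted = LOAD 'right_sorted' AS (id:int, b:chararray);

-- Both inputs must already be sorted on the join key.
joined = JOIN left_sorted BY id, right_sorted BY id USING 'merge';
```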

14. cogroup

Groups several inputs by key in one step, e.g. C = cogroup A by id, B by id;

It is equivalent to the first half of a join.
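To make the "first half of a join" remark concrete, the sketch below completes the join by flattening both bags. The inputs and field names are hypothetical.

```pig
A = LOAD 'a' AS (id:int, x:chararray);
B = LOAD 'b' AS (id:int, y:chararray);

-- One row per key, with one bag of matching tuples per input.
C = COGROUP A BY id, B BY id;

-- Flattening both bags gives the cross product per key. Keys missing
-- from either input have an empty bag and drop out, as in an inner join.
D = FOREACH C GENERATE FLATTEN(A), FLATTEN(B);
```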

15. stream

Runs external scripts written in Perl, Python, and so on.
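A sketch of streaming records through an external script. The script clean.py is hypothetical; it is expected to read tab-separated records on stdin and write them to stdout.

```pig
-- SHIP copies the script to the cluster nodes.
DEFINE cleaner `clean.py` SHIP('clean.py');

raw = LOAD 'raw' AS (line:chararray);
cleaned = STREAM raw THROUGH cleaner AS (line:chararray);
```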

16. Running MapReduce directly

Use the mapreduce command.
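A sketch of the mapreduce command. The jar, the main class com.example.mr.UrlChecker, its -i and -o flags and the paths are all hypothetical.

```pig
crawl = LOAD 'webcrawl' AS (url:chararray, pageid:chararray);

-- Pig stores crawl into 'mr_input', runs the job from the jar,
-- then loads whatever the job wrote to 'mr_output'.
checked = MAPREDUCE 'urlchecker.jar'
    STORE crawl INTO 'mr_input'
    LOAD 'mr_output' AS (url:chararray, pageid:chararray)
    `com.example.mr.UrlChecker -i mr_input -o mr_output`;
```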

17. Directed acyclic graph (DAG)

18. Partitioner

A custom partitioner can be supplied by registering its jar.
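A sketch of plugging in a custom partitioner with the PARTITION BY clause. The jar and the class com.example.pig.IdPartitioner are hypothetical; the class would need to extend Hadoop's Partitioner.

```pig
REGISTER 'mypartitioner.jar';

users = LOAD 'users' AS (id:int, name:chararray);

-- The partitioner decides which of the 4 reducers receives each key.
grouped = GROUP users BY id PARTITION BY com.example.pig.IdPartitioner PARALLEL 4;
```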

19. Macros

Use the define keyword.
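A sketch of a macro that wraps the group-and-count pattern from the word count example. The macro name and its parameter names are made up for illustration.

```pig
-- Parameters are referenced with a $ prefix inside the macro body.
DEFINE group_and_count(rel, key) RETURNS counts {
    grouped = GROUP $rel BY $key;
    $counts = FOREACH grouped GENERATE group, COUNT($rel) AS cnt;
};

words = LOAD 'words' AS (word:chararray);
wc = group_and_count(words, word);
```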

20. Nested Pig scripts

Use the import keyword.
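A sketch of importing another script, assuming a hypothetical file macros.pig that defines a macro named top_n taking a relation, a sort field and a row count:

```pig
-- Pull in the macro definitions from the other file.
IMPORT 'macros.pig';

scores = LOAD 'scores' AS (name:chararray, score:int);
best = top_n(scores, score, 10);
```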

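21. Execution plan: explain

Used for debugging; prints the plans Pig builds for a relation.

22. illustrate

Used for debugging; runs the script on a small sample of the data and shows the rows at each step.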
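Both commands take a relation alias. A sketch using the relation names from the word count example:

```pig
inputfile = LOAD 'file' AS (line);
wordsLine = FOREACH inputfile GENERATE FLATTEN(TOKENIZE(line)) AS eachWord;
groupedWords = GROUP wordsLine BY eachWord;
wordcount = FOREACH groupedWords GENERATE group, COUNT(wordsLine) AS cnt;

-- Print the execution plans for wordcount without running the job.
EXPLAIN wordcount;

-- Show sample rows flowing through each step that leads to wordcount.
ILLUSTRATE wordcount;
```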

23. Pig statistics

Written to the log or printed to the terminal.

24. PigUnit

Integrates with JUnit.

25. Interacting with Python

26. Eval functions

Extend org.apache.pig.EvalFunc<V> and implement exec(Tuple input).

They can also be written in Python.
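On the Pig side, a Python eval function is registered through Jython. In the sketch below the file myudfs.py and its function name_length are hypothetical.

```pig
-- Load the Python file and expose its functions under the myudfs namespace.
REGISTER 'myudfs.py' USING jython AS myudfs;

names = LOAD 'names' AS (name:chararray);
lengths = FOREACH names GENERATE name, myudfs.name_length(name);
```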

27. Filter functions

Extend org.apache.pig.FilterFunc.

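28. Load functions

29. Store functions

30. Piggybank

Provides commonly used aggregate, math, and string functions.

In particular, there are regular expression functions: REGEX_EXTRACT and REGEX_EXTRACT_ALL.

Statistical functions: correlation (COR), covariance (COV), and so on.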
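A sketch of the two regular expression functions. The log file and the patterns are made up for illustration, and recent Pig releases expose both functions as built-ins, so no register statement is shown.

```pig
logs = LOAD 'access.log' USING TextLoader() AS (line:chararray);

-- REGEX_EXTRACT returns the capture group with the given index.
ips = FOREACH logs GENERATE
    REGEX_EXTRACT(line, '^(\\d+\\.\\d+\\.\\d+\\.\\d+)', 1) AS ip;

-- REGEX_EXTRACT_ALL returns all capture groups as one tuple.
parts = FOREACH logs GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(line, '(\\S+) (\\S+) (\\S+).*')) AS (a, b, c);
```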
