
[ Hadoop | MapReduce ] Using CompositeInputSplit to Improve Join Efficiency

2015-04-01 11:09
A map-side join is the most efficient way to join datasets. On Hadoop, when both datasets are large, we can use the Composite Join to achieve this.

The Use Case

First, run each of the two inputs through a job with the identity mapper and identity reducer to sort and partition it, so that both inputs end up with the same number of partitions.

use -Dmapred.reduce.tasks=2 (the same value for both inputs)
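
A minimal sketch of such a preprocessing job is shown below, assuming the old mapred API and inputs where the join key is the first tab-separated field of each line; the class name SortPartitionJob and the argument layout are illustrative, not from the original post.

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sorts and partitions one dataset by key so it can later be fed to a composite join.
// Run it once per input with the same -Dmapred.reduce.tasks value.
public class SortPartitionJob extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), SortPartitionJob.class);
    conf.setJobName("sort-partition");

    // Each line is split into (key, value) on the first tab character.
    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // Identity mapper/reducer: records pass through unchanged, but the shuffle
    // sorts them by key and the default HashPartitioner splits them into
    // mapred.reduce.tasks partitions (part-00000, part-00001, ...).
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses -Dmapred.reduce.tasks=2 and puts it into the job conf.
    System.exit(ToolRunner.run(new SortPartitionJob(), args));
  }
}

Running this class once per dataset with the same -Dmapred.reduce.tasks value yields two output directories whose part-* files line up partition for partition.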

Secondly, run the composite join over the two presorted outputs (a driver sketch appears in the source-code section below)…

Note: if the two inputs have different numbers of partitions (i.e. different counts of part-* files), an exception will be thrown: java.io.IOException: Inconsistent split cardinality from child 1 (1/2)

The simplest way to use the composite join is to set the number of reduce tasks to 1 in the preprocessing jobs, so that each input has exactly one partition, provided the resulting performance is acceptable.

The Source Code for the application
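
A minimal sketch of the join driver, under the same assumptions as above (old mapred API, key/value text inputs) and with the illustrative class name CompositeJoinJob, might look like this:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Map-side composite join over two presorted, identically partitioned inputs.
// args[0] and args[1] are the output directories of the two sort-partition jobs;
// args[2] is the join output directory.
public class CompositeJoinJob extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Path left = new Path(args[0]);
    Path right = new Path(args[1]);
    Path output = new Path(args[2]);

    JobConf conf = new JobConf(getConf(), CompositeJoinJob.class);
    conf.setJobName("composite-join");

    // The join is done entirely in the mappers, so no reduce phase is needed.
    conf.setNumReduceTasks(0);

    // Build the join expression over the two inputs. "inner" can be swapped
    // for "outer" or "override" depending on the join semantics you want.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr",
        CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class, left, right));

    // Each map record is (join key, TupleWritable of the matching values).
    // The identity mapper simply writes the joined tuples straight out.
    conf.setMapperClass(IdentityMapper.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(TupleWritable.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileOutputFormat.setOutputPath(conf, output);
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new CompositeJoinJob(), args));
  }
}

If the joined output needs reformatting, a custom mapper that unpacks each TupleWritable and emits a flattened record can replace IdentityMapper.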