[ Hadoop | MapReduce ] Using CompositeInputSplit to Improve Join Efficiency
2015-04-01 11:09
A map-side join is generally the most efficient way to join datasets in MapReduce, because it avoids the shuffle and reduce phases. On Hadoop, a composite join lets us do this between two large datasets.
The Use Case
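First, run each input through a job with the identity mapper and identity reducer, so that both inputs are sorted and partitioned by key into the same number of partitions.
Set the same number of reducers for both jobs, for example: -Dmapred.reduce.tasks=2
Second, use the composite join…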
Note: if the two inputs have different numbers of partitions (i.e. part* files), an exception is thrown: java.io.IOException: Inconsistent split cardinality from child 1 (1/2)
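To see why the partition counts have to match, here is a minimal plain-Java sketch (no Hadoop dependency). It assumes String keys and reuses the formula of Hadoop's default HashPartitioner; the class name PartitionAlignment and the sample keys are made up for illustration. With the same partition count on both sides, a given key always ends up at the same partition index, so partition i of one input only needs to be joined with partition i of the other.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration: how keys are spread over part files. */
final class PartitionAlignment {

    /** Same formula as Hadoop's default HashPartitioner, applied to a String key. */
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Distributes keys into numPartitions buckets, mimicking the part* files of one job. */
    static List<List<String>> partition(List<String> keys, int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (String key : keys) {
            parts.get(partitionFor(key, numPartitions)).add(key);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("u1", "u2", "u3", "u4");
        List<String> right = Arrays.asList("u2", "u4", "u5");

        // Same partition count on both sides: partition i lines up with partition i.
        System.out.println("left  (2 partitions): " + partition(left, 2));
        System.out.println("right (2 partitions): " + partition(right, 2));

        // Different counts: 1 part file vs 2 part files cannot be paired up.
        System.out.println("right (1 partition):  " + partition(right, 1));
    }
}
```

The last line of main shows the mismatched case: one input with a single part file and the other with two, which is the kind of situation the "Inconsistent split cardinality" exception reports.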
The simplest way to use a composite join is to set the number of reducers to 1, so that each input has exactly one partition, provided the performance is acceptable.
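Once both inputs are sorted and their partitions line up, joining a pair of partitions amounts to merging two key-sorted streams. The following is a plain-Java sketch of that idea for an inner join. It is not Hadoop's implementation and not the application's source code; the class MergeJoinSketch, the helper kv, and the tab-separated output format are assumptions chosen for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map.Entry;

/** Hypothetical illustration: inner join of two inputs already sorted by key. */
final class MergeJoinSketch {

    /** Small helper to build a key/value record. */
    static Entry<String, String> kv(String key, String value) {
        return new SimpleEntry<>(key, value);
    }

    /** Both lists must be sorted by key in ascending order. */
    static List<String> innerJoin(List<Entry<String, String>> left,
                                  List<Entry<String, String>> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            String key = left.get(i).getKey();
            int cmp = key.compareTo(right.get(j).getKey());
            if (cmp < 0) {
                i++;            // left key has no match yet, advance left
            } else if (cmp > 0) {
                j++;            // right key has no match yet, advance right
            } else {
                // Find the run of records sharing this key on each side.
                int iEnd = i;
                while (iEnd < left.size() && left.get(iEnd).getKey().equals(key)) iEnd++;
                int jEnd = j;
                while (jEnd < right.size() && right.get(jEnd).getKey().equals(key)) jEnd++;
                // Emit the cross product of the two runs.
                for (int a = i; a < iEnd; a++) {
                    for (int b = j; b < jEnd; b++) {
                        out.add(key + "\t" + left.get(a).getValue()
                                + "\t" + right.get(b).getValue());
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry<String, String>> left = Arrays.asList(kv("a", "1"), kv("b", "2"), kv("d", "4"));
        List<Entry<String, String>> right = Arrays.asList(kv("b", "x"), kv("c", "y"), kv("d", "z"));
        for (String line : innerJoin(left, right)) {
            System.out.println(line);
        }
    }
}
```

Because each side is read once in key order, no shuffle or reduce phase is needed, which is where the efficiency of the map-side join comes from.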
The Source Code for the application