[Hive - LanguageManual] Joins
2015-01-26 11:01
Hive Joins
Join Syntax
Examples
MapJoin Restrictions
Join Optimization
Predicate Pushdown in Outer Joins
Enhancements in Hive Version 0.11
Join Syntax
Hive supports SQL-style syntax for joining tables. For now, Hive supports only equality joins.
See Select Syntax for the context of this join syntax.
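As a rough illustration of the join forms covered on this page, the queries below sketch one example of each. The tables a and b and the columns key and val are hypothetical placeholders, not objects defined by Hive.

```sql
-- Hypothetical tables a(key, val) and b(key, val).

-- Inner join: only rows whose keys match in both tables.
SELECT a.val, b.val FROM a JOIN b ON (a.key = b.key);

-- Outer joins: also keep unmatched rows from the left, right, or both sides.
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key);
SELECT a.val, b.val FROM a RIGHT OUTER JOIN b ON (a.key = b.key);
SELECT a.val, b.val FROM a FULL OUTER JOIN b ON (a.key = b.key);

-- Left semi join: rows of a that have at least one match in b.
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON (a.key = b.key);
```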
Version 0.13.0+: Implicit join notation
Implicit join notation is supported starting with Hive 0.13.0 (see HIVE-5558). This allows the FROM clause to join a comma-separated list of tables, omitting the JOIN keyword. For example:
SELECT * FROM table1 t1, table2 t2, table3 t3 WHERE t1.id = t2.id AND t2.id = t3.id AND t1.zipcode = '02535';
Version 0.13.0+: Unqualified column references
Unqualified column references are supported in join conditions, starting with Hive 0.13.0 (see HIVE-6393). Hive attempts to resolve these against the inputs to a Join. If an unqualified column reference resolves to more than one table, Hive will flag it as an ambiguous reference.
For example:
CREATE TABLE a (k1 string, v1 string);
CREATE TABLE b (k2 string, v2 string);
SELECT k1, v1, k2, v2
FROM a JOIN b ON k1 = k2;
Examples
Some salient points to consider when writing join queries are as follows:
Only equality joins are allowed.
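A minimal sketch of the equality-only rule, assuming hypothetical tables a and b with columns id and department. The non-equality form is left commented out because the Hive versions discussed here reject it.

```sql
-- Valid: one equality condition.
SELECT a.* FROM a JOIN b ON (a.id = b.id);

-- Valid: several equality conditions combined with AND.
SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department);

-- Not valid here: the join condition is not an equality.
-- SELECT a.* FROM a JOIN b ON (a.id <> b.id);
```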
More than two tables can be joined in the same query.
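For instance, a three-table join might look like the following sketch. The tables a, b, c and the columns key, key1, key2, val are placeholders chosen for illustration.

```sql
-- b is joined to a on key1 and to c on key2.
SELECT a.val, b.val, c.val
FROM a
JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key2);
```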
Hive converts joins over multiple tables into a single map/reduce job if, for every table, the same column is used in the join clauses.
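The difference can be sketched with two near-identical queries over hypothetical tables a, b, and c. In the first, only b.key1 is involved in both join clauses; in the second, b contributes a different column to each clause.

```sql
-- Same column of b (key1) in both join clauses:
-- eligible to run as a single map/reduce job.
SELECT a.val, b.val, c.val
FROM a
JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key1);

-- Different columns of b (key1, then key2):
-- a is joined with b in one job, and the result is joined with c in a second job.
SELECT a.val, b.val, c.val
FROM a
JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key2);
```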
In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers whereas the others are buffered. Therefore, organizing the tables so that the largest table appears last in the sequence reduces the memory the reducer needs to buffer the rows for a particular value of the join key.
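A sketch of this ordering advice, assuming three hypothetical tables where small_a and small_b are modest in size and large_c is by far the largest. All names are illustrative.

```sql
-- All three tables join on the same key, so this is one map/reduce stage.
-- small_a and small_b are buffered in the reducers;
-- large_c comes last in the sequence and is streamed.
SELECT small_a.val, small_b.val, large_c.val
FROM small_a
JOIN small_b ON (small_a.key = small_b.key)
JOIN large_c ON (large_c.key = small_b.key);
```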
In every map/reduce stage of the join, the table to be streamed can be specified via a hint.
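The hint in question is STREAMTABLE. The sketch below uses hypothetical tables a, b, and c and asks Hive to stream a instead of the last table in the sequence.

```sql
-- With the hint, a is streamed through the reducers and b and c are buffered.
-- Without it, c (the last table) would be the streamed one.
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val
FROM a
JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key1);
```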
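LEFT, RIGHT, and FULL OUTER joins exist in order to provide more control over ON clauses for which there is no match. For example, a LEFT OUTER JOIN of a to b returns a row for every row in a: the columns from b are filled in when a matching row exists and are NULL when it does not, while rows of b with no match in a are dropped.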
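A small sketch of the three outer join variants over hypothetical tables a(key, val) and b(key, val), with comments describing which side is preserved.

```sql
-- Every row of a appears; b.val is NULL where no b.key equals a.key.
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key);

-- Every row of b appears; a.val is NULL where no a.key equals b.key.
SELECT a.val, b.val FROM a RIGHT OUTER JOIN b ON (a.key = b.key);

-- Every row of both tables appears; the missing side is NULL.
SELECT a.val, b.val FROM a FULL OUTER JOIN b ON (a.key = b.key);
```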
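Joins occur BEFORE WHERE clauses. So, to restrict the OUTPUT of a join, put the requirement in the WHERE clause; otherwise put it in the JOIN clause. A big point of confusion for this issue is partitioned tables: in an outer join, a WHERE-clause filter on a partition column of the table that may have no match is applied after the join, so it also removes the NULL-padded rows that the outer join produced for unmatched keys.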
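The partitioned-table pitfall can be sketched as follows. It assumes hypothetical tables a and b that are both partitioned by a string column ds; the table names, column names, and date are illustrative only.

```sql
-- Filter on b.ds in WHERE: applied after the join.
-- Rows of a with no match in b have b.ds = NULL and are filtered out,
-- so the LEFT OUTER JOIN behaves like an inner join.
SELECT a.val, b.val
FROM a LEFT OUTER JOIN b ON (a.key = b.key)
WHERE a.ds = '2009-07-07' AND b.ds = '2009-07-07';

-- Filter on b.ds in the ON clause: applied as part of the join.
-- Unmatched rows of a are kept, with NULL for b.val.
SELECT a.val, b.val
FROM a LEFT OUTER JOIN b ON (a.key = b.key AND b.ds = '2009-07-07')
WHERE a.ds = '2009-07-07';
```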
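Joins are NOT commutative! Joins are left-associative regardless of whether they are LEFT or RIGHT joins. In a chain such as FROM a JOIN b ON (a.key = b.key) LEFT OUTER JOIN c ON (a.key = c.key), a is joined to b first, which throws away every row of a or b without a corresponding key in the other table, and only the reduced result is then joined to c. This gives unintuitive results when a key exists in both a and c but not in b: the row from a is already gone before c is considered.
To achieve the more intuitive effect, we should instead do FROM c LEFT OUTER JOIN a ON (c.key = a.key) LEFT OUTER JOIN b ON (c.key = b.key).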
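Written out in full, the two orderings look like the sketch below. The tables a, b, c and the columns val1, val2, val, key are hypothetical.

```sql
-- a JOIN b runs first. A key present in a and c but missing from b
-- is dropped at that step, so its c.val never reaches the output.
SELECT a.val1, a.val2, b.val, c.val
FROM a
JOIN b ON (a.key = b.key)
LEFT OUTER JOIN c ON (a.key = c.key);

-- Starting from c keeps every row of c and attaches a and b where they match.
SELECT a.val1, a.val2, b.val, c.val
FROM c
LEFT OUTER JOIN a ON (c.key = a.key)
LEFT OUTER JOIN b ON (c.key = b.key);
```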
LEFT SEMI JOIN implements the uncorrelated IN/EXISTS subquery semantics in an efficient way. As of Hive 0.13 the IN/NOT IN/EXISTS/NOT EXISTS operators are supported using subqueries, so most of these JOINs don't have to be performed manually anymore. The restriction on using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition (ON clause), and not in WHERE or SELECT clauses etc.
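The correspondence between the subquery form and the semi join can be sketched like this, using hypothetical tables a(key, value) and b(key).

```sql
-- Subquery form, available as of Hive 0.13.
SELECT a.key, a.value
FROM a
WHERE a.key IN (SELECT b.key FROM b);

-- LEFT SEMI JOIN form. b is referenced only in the ON clause,
-- never in SELECT or WHERE.
SELECT a.key, a.value
FROM a LEFT SEMI JOIN b ON (a.key = b.key);
```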
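If the tables being joined are bucketized on the join columns, and the number of buckets in one table is a multiple of the number of buckets in the other table, the buckets can be joined with each other. If table A has 4 buckets and table B has 4 buckets, a join of A and B on the bucketing column can be done in the mappers only: instead of fetching B completely for each mapper of A, only the required buckets are fetched.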
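A sketch of such a bucketed map-side join. It assumes hypothetical tables a and b that were each created with CLUSTERED BY (key) INTO 4 BUCKETS; the behavior is not on by default and is governed by the hive.optimize.bucketmapjoin setting. Whether the MAPJOIN hint is honored can depend on the Hive version and configuration.

```sql
-- Enable bucket map joins for this session.
set hive.optimize.bucketmapjoin = true;

-- The mapper handling bucket 1 of a only needs to fetch bucket 1 of b.
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key;
```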
MapJoin Restrictions
If all but one of the tables being joined are small, the join can be performed as a map-only job: each mapper reads the small tables completely, so no reducer is needed.
The following are not supported:
Union Followed by a MapJoin
Lateral View Followed by a MapJoin
Reduce Sink (Group By/Join/Sort By/Cluster By/Distribute By) Followed by MapJoin
MapJoin Followed by Union
MapJoin Followed by Join
MapJoin Followed by MapJoin
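For reference, the basic map-only join that these restrictions apply to can be sketched as follows, with a hypothetical large table a and small table b.

```sql
-- b is named in the hint, so it is loaded in full by every mapper of a
-- and the join needs no reduce phase.
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key;
```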
The configuration variable hive.auto.convert.join (if set to true) automatically converts joins to mapjoins at runtime if possible, and it should be used instead of the mapjoin hint. The mapjoin hint should only be used in the following case:
All the inputs are bucketed or sorted, and the join should be converted to a bucketized map-side join or bucketized sort-merge join.
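A sketch of the sort-merge case, assuming hypothetical tables a and b that are bucketed and sorted on key with the same number of buckets. The three settings shown are the ones commonly used to enable a bucketized sort-merge join; treat the exact combination as something to verify against your Hive version.

```sql
-- Session settings for a bucketized sort-merge join.
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;

-- Corresponding buckets of a and b are merged on the map side.
SELECT /*+ MAPJOIN(b) */ a.key, a.value
FROM a JOIN b ON a.key = b.key;
```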
Consider the possibility of multiple mapjoins on different keys, for example a query that joins a large table to one small table on one key and then to a second small table on another key. If the inputs are known in advance to be small enough to fit in memory, the following configuration parameters control whether such joins are converted directly to mapjoins:
hive.auto.convert.join.noconditionaltask - Whether Hive enables the optimization that converts a common join into a mapjoin based on the input file size. If this parameter is on, and the sum of the sizes of n-1 of the tables/partitions for an n-way join is smaller than the specified size, the join is directly converted to a mapjoin (there is no conditional task).
hive.auto.convert.join.noconditionaltask.size - If hive.auto.convert.join.noconditionaltask is off, this parameter does not take effect. However, if it is on, and the sum of the sizes of n-1 of the tables/partitions for an n-way join is smaller than this size, the join is directly converted to a mapjoin (there is no conditional task). The default is 10MB.
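Put together, a session might look like the sketch below. The tables bigTable, smallTableOne, and smallTableTwo and their columns are hypothetical, and the size value is simply the 10MB default expressed in bytes.

```sql
-- Let Hive convert joins to mapjoins automatically, without a conditional task.
set hive.auto.convert.join = true;
set hive.auto.convert.join.noconditionaltask = true;
-- Threshold in bytes for the combined size of the n-1 small inputs.
set hive.auto.convert.join.noconditionaltask.size = 10000000;

-- Two joins on different keys, written without any mapjoin hints.
SELECT firstjoin.idOne, firstjoin.idTwo, firstjoin.value
FROM (
  SELECT bigTable.idOne, bigTable.idTwo, bigTable.value
  FROM bigTable
  JOIN smallTableOne ON (bigTable.idOne = smallTableOne.idOne)
) firstjoin
JOIN smallTableTwo ON (firstjoin.idTwo = smallTableTwo.idTwo);
```

Leaving the hints out lets the optimizer decide, based on input sizes, whether both joins can be handled on the map side.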
Join Optimization
Predicate Pushdown in Outer Joins
See Hive Outer Join Behavior for information about predicate pushdown in outer joins.
Enhancements in Hive Version 0.11
See Join Optimization for information about enhancements to join optimization introduced in Hive version 0.11.0. The use of hints is de-emphasized in the enhanced optimizations (HIVE-3784 and related JIRAs).