您的位置:首页 > 其它

hive sort by,order by ,distribute by,cluster by

2014-05-04 16:07 423 查看
ORDER BY and SORT BY
The ORDER BY clause is familiar from other SQLdialects. It performs a
total ordering of
thequery result set. This means that all the data is passed through a single reducer,
whichmay take an unacceptably long time to execute for larger data sets.
Hiveadds an alternative, SORT BY, that orders the data only within each reducer, thereby
performinga localordering, where eachreducer’s output will be sorted. Better performance
istraded for total ordering.
Inboth cases, the syntax differs only by the use of the ORDER or SORT keyword. You can
specifyany columns you wish and specify whether or not the columns are ascending
usingthe ASC keyword (thedefault) or descending using the DESC keyword.
Hereis an example using ORDER BY:
SELECT s.ymd,
s.symbol,
s.price_close
FROM stocks s
ORDER BY s.ymd
ASC,
s.symbol DESC;
Hereis the same example using SORT BY instead:
SELECT s.ymd,
s.symbol,
s.price_close
FROM stocks s
SORT BY
s.ymd
ASC,
s.symbol DESC;
Thetwo queries look almost identical, but if more than one reducer is invoked, the
outputwill be sorted differently. While each reducer’s output files will be sorted,the
datawill probably overlap with the output of other reducers.
BecauseORDER BY can result in excessively long runtimes, Hive will require a LIMIT
clausewith ORDER BY if the propertyhive.mapred.mode is set to strict. By default, it is
set tononstrict.
DISTRIBUTE BY with SORT BY
DISTRIBUTE BY controls how map output is divided among reducers. Alldata that flows
througha MapReduce job is organized into key-value pairs. Hive must use this feature
internallywhen it converts your queries to MapReduce jobs.
Usually,you won’t need to worry about this feature. The exceptions are queries that
usethe Streamingfeature (see Chapter 14) and some stateful UDAFs (User-Defined
Aggregate Functions; see “Aggregate Functions” on page 164). There is one other scenario
wherethese clauses are useful.
Bydefault, MapReduce computes a hash on the keys output by mappers and tries to
evenlydistribute the key-value pairs among the available reducers using the hashvalues.
Unfortunately,this means that when we use SORT BY, the contents of one reducer’s
outputwill overlap significantly with the output of the other reducers, as far assorted
orderis concerned, even though the data is sorted within
each reducer’s output.
DISTRIBUTE BY with SORT BY | 107
Say wewant the data for each stock symbol to be captured together. We can use
DISTRIBUTE BY to ensure that the records for each stock symbol go tothe same reducer,
thenuse SORT BY to order thedata the way we want. The following query demonstrates
thistechnique:
hive>
SELECT s.ymd,
s.symbol,
s.price_close
> FROM
stocks s
> DISTRIBUTE
BY s.symbol
> SORT
BY s.symbol
ASC,
s.ymd ASC;
1984-09-07
AAPL 26.5
1984-09-10
AAPL 26.37
1984-09-11
AAPL 26.87
1984-09-12
AAPL 26.12
1984-09-13
AAPL 27.5
1984-09-14
AAPL 27.87
1984-09-17
AAPL 28.62
1984-09-18
AAPL 27.62
1984-09-19
AAPL 27.0
1984-09-20
AAPL 27.12
...
Ofcourse, the ASC keywords couldhave been omitted as they are the defaults. The
ASC keyword is placed here for reasons that will be described shortly.
DISTRIBUTE BY works similar to GROUP BY in the sense that it controls how reducers
receiverows for processing, while SORT BY controls the sorting of data inside thereducer.
Notethat Hive requires that the DISTRIBUTE BY clause come before the SORT BY clause.
CLUSTER BY
In theprevious example, the s.symbol column was used in the DISTRIBUTE BY clause,
andthe s.symbol and the s.ymd columns in the SORT BY clause. Suppose that the same
columnsare used in both clauses and all columns are sorted by ascending order (the
default).In this case, the CLUSTER BY clause is a shor-hand way of expressing the same
query.
Forexample, let’s modify the previous query to drop sorting by s.ymd and use CLUSTER
BY on s.symbol:
hive>
SELECT s.ymd,
s.symbol,
s.price_close
> FROM
stocks s
> CLUSTER BY
s.symbol;
2010-02-08
AAPL 194.12
2010-02-05
AAPL 195.46
2010-02-04
AAPL 192.05
2010-02-03
AAPL 199.23
2010-02-02
AAPL 195.86
2010-02-01
AAPL 194.73
2010-01-29
AAPL 192.06
2010-01-28
AAPL 199.29
2010-01-27
AAPL 207.88

Because the sort requirements are removed for the s.ymd, the output reflects the originalorder of the stock data, which is sorted descending.Using DISTRIBUTE BY ... SORT BY or the shorthand CLUSTER BY
clauses is a way to exploit the parallelism of SORT BY, yet achieve a total ordering across the output files
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: