Hive Programming (6): HiveQL Queries
2017-11-30 23:15
6.1 SELECT ... FROM Statements
Assume the following table structure:

CREATE TABLE employees (
  name STRING,
  salary FLOAT,
  subordinates ARRAY<STRING>,
  deductions MAP<STRING, FLOAT>,
  address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);
hive> SELECT name, salary FROM employees;
John Doe  100000.0
Mary Smith  80000.0
Todd Jones  70000.0
Bill King  60000.0

Specifying an alias for the table:
hive> SELECT e.name, e.salary FROM employees e;
Columns of type ARRAY are printed using JSON syntax: a comma-separated list enclosed in []. String elements inside the collection are quoted, while values of the primitive STRING type are not. For example:

hive> SELECT name, subordinates FROM employees;
John Doe  ["Mary Smith","Todd Jones"]
Mary Smith  ["Bill King"]
Todd Jones  []
Bill King  []
Columns of type MAP are also printed as JSON: a comma-separated list of key-value pairs enclosed in {}. For example:

hive> SELECT name, deductions FROM employees;
John Doe  {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Mary Smith  {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Todd Jones  {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Bill King  {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Columns of type STRUCT are likewise printed as JSON. For example:

hive> SELECT name, address FROM employees;
John Doe  {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
Mary Smith  {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
Todd Jones  {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
Bill King  {"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}

Array indexes are 0-based. Here is a query for the first element of each array:
hive> SELECT name, subordinates[0] FROM employees;
John Doe  Mary Smith
Mary Smith  Bill King
Todd Jones  NULL
Bill King  NULL

Note: referencing a nonexistent array element returns NULL.
To reference a MAP element, use the same ARRAY[...] syntax, but with a key instead of an integer index. For example:
hive> SELECT name, deductions["State Taxes"] FROM employees;
John Doe  0.05
Mary Smith  0.05
Todd Jones  0.03
Bill King  0.03

To reference a STRUCT element, use "dot" notation, analogous to the table_alias.column syntax. For example:
hive> SELECT name, address.city FROM employees;
John Doe  Chicago
Mary Smith  Chicago
Todd Jones  Oak Park
Bill King  Obscuria
6.1.1 Specifying Columns with Regular Expressions
The following query selects the symbol column and all columns whose names start with price:

hive> SELECT symbol, `price.*` FROM stocks;
AAPL  195.69  197.88  194.0  194.12  194.12
AAPL  192.63  196.0  190.85  195.46  195.46
AAPL  196.73  198.37  191.57  192.05  192.05
AAPL  195.17  200.2  194.42  199.23  199.23
AAPL  195.91  196.32  193.38  195.86  195.86
6.1.2 Computing with Column Values
hive> SELECT upper(name), salary, deductions["Federal Taxes"],
    > round(salary * (1 - deductions["Federal Taxes"])) FROM employees;
JOHN DOE  100000.0  0.2  80000
MARY SMITH  80000.0  0.2  64000
TODD JONES  70000.0  0.15  59500
BILL KING  60000.0  0.15  51000
6.1.3 Arithmetic Operators
A + B  Numbers
A - B  Numbers
A * B  Numbers
A / B  Numbers
A % B  Numbers
A & B  Numbers
A | B  Numbers
A ^ B  Numbers
~A  Numbers
6.1.4 Using Functions
Mathematical functions:

BIGINT round(d)
DOUBLE round(d, N)
BIGINT floor(d)
BIGINT ceil(d), ceiling(DOUBLE d)
DOUBLE rand(), rand(seed)
DOUBLE exp(d)
DOUBLE ln(d)
DOUBLE log10(d)
DOUBLE log2(d)
DOUBLE log(base, d)
DOUBLE pow(d, p), power(d, p)
DOUBLE sqrt(d)
STRING bin(i)
STRING hex(i)
STRING hex(str)
STRING unhex(i)
STRING conv(i, from_base, to_base)
STRING conv(str, from_base, to_base)
DOUBLE abs(d)
INT pmod(i1, i2)
DOUBLE pmod(d1, d2)
DOUBLE sin(d)
DOUBLE asin(d)
DOUBLE cos(d)
DOUBLE acos(d)
DOUBLE tan(d)
DOUBLE atan(d)
DOUBLE degrees(d)
DOUBLE radians(d)
INT positive(i)
DOUBLE positive(d)
INT negative(i)
DOUBLE negative(d)
FLOAT sign(d)
DOUBLE e()
DOUBLE pi()
Aggregate functions:

BIGINT count(*)
BIGINT count(expr)
BIGINT count(DISTINCT expr[, expr...])
DOUBLE sum(col)
DOUBLE sum(DISTINCT col)
DOUBLE avg(col)
DOUBLE avg(DISTINCT col)
DOUBLE min(col)
DOUBLE max(col)
DOUBLE variance(col), var_pop(col)
DOUBLE var_samp(col)
DOUBLE stddev_pop(col)
DOUBLE stddev_samp(col)
DOUBLE covar_pop(col1, col2)
DOUBLE covar_samp(col1, col2)
DOUBLE corr(col1, col2)
DOUBLE percentile(int_expr, p)
ARRAY<DOUBLE> percentile(int_expr, [p1, ...])
DOUBLE percentile_approx(int_expr, p, NB)
ARRAY<DOUBLE> percentile_approx(int_expr, [p1, ...], NB)
ARRAY<STRUCT{'x','y'}> histogram_numeric(col, NB)
ARRAY collect_set(col)

Setting hive.map.aggr to true improves aggregation performance by enabling map-side partial aggregation. For example:
hive> SET hive.map.aggr=true;
hive> SELECT count(*), avg(salary) FROM employees;
Table-generating functions:
hive> SELECT explode(subordinates) AS sub FROM employees;
Mary Smith
Todd Jones
Bill King
N rows  explode(array)
N rows  explode(map)
tuple  json_tuple(jsonStr, p1, p2, ..., pn)
tuple  parse_url_tuple(url, partname1, partname2, ..., partnameN), where N >= 1
N rows  stack(n, col1, ..., colM)
SELECT parse_url_tuple(url, 'HOST', 'PATH', 'QUERY') as (host, path, query) FROM url_table;
Other built-in functions:
BOOLEAN test in(val1, val2, ...)
INT length(s)
STRING reverse(s)
STRING concat(s1, s2, ...)
STRING concat_ws(separator, s1, s2, ...)
STRING substr(s, start_index)
STRING substr(s, int start, int length)
STRING upper(s)
STRING ucase(s)
STRING lower(s)
STRING lcase(s)
STRING trim(s)
STRING ltrim(s)
STRING rtrim(s)
STRING regexp_replace(s, regex, replacement)
STRING regexp_extract(subject, regex_pattern, index)
STRING parse_url(url, partname, key)
INT size(map<K,V>)
INT size(array<T>)
value of <type>  cast(<expr> as <type>)
STRING from_unixtime(int unixtime)
STRING to_date(timestamp)
INT year(timestamp)
INT month(timestamp)
INT day(timestamp)
STRING get_json_object(json_string, path)
STRING space(n)
STRING repeat(s, n)
STRING ascii(s)
STRING lpad(s, len, pad)
STRING rpad(s, len, pad)
ARRAY<STRING> split(s, pattern)
INT find_in_set(s, commaSeparatedString)
INT locate(substr, str, pos)
INT instr(str, substr)
MAP<STRING,STRING> str_to_map(s, delim1, delim2)
ARRAY<ARRAY<STRING>> sentences(s, lang, locale)
ARRAY<STRUCT<STRING,DOUBLE>> ngrams(array<array<string>>, N, K, pf)
ARRAY<STRUCT<STRING,DOUBLE>> context_ngrams(array<array<string>>, array<string>, int K, int pf)
BOOLEAN in_file(s, filename)
6.1.5 The LIMIT Clause
The LIMIT clause restricts the number of rows returned:

hive> SELECT upper(name), salary, deductions["Federal Taxes"],
    > round(salary * (1 - deductions["Federal Taxes"])) FROM employees
    > LIMIT 2;
JOHN DOE  100000.0  0.2  80000
MARY SMITH  80000.0  0.2  64000
6.1.6 Column Aliases
hive> SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
    > round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
    > FROM employees LIMIT 2;
JOHN DOE  100000.0  0.2  80000
MARY SMITH  80000.0  0.2  64000
6.1.7 Nested SELECT Statements
hive> FROM (
    >   SELECT upper(name), salary, deductions["Federal Taxes"] as fed_taxes,
    >   round(salary * (1 - deductions["Federal Taxes"])) as salary_minus_fed_taxes
    >   FROM employees
    > ) e
    > SELECT e.name, e.salary_minus_fed_taxes
    > WHERE e.salary_minus_fed_taxes > 70000;
JOHN DOE  100000.0  0.2  80000
6.1.8 CASE ... WHEN ... THEN
CASE ... WHEN ... THEN is similar to an IF statement; it transforms a single column of the query result. For example:
hive> SELECT name, salary,
    >   CASE
    >     WHEN salary < 50000.0 THEN 'low'
    >     WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
    >     WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
    >     ELSE 'very high'
    >   END AS bracket FROM employees;
John Doe  100000.0  very high
Mary Smith  80000.0  high
Todd Jones  70000.0  high
Bill King  60000.0  middle
Boss Man  200000.0  very high
Fred Finance  150000.0  very high
Stacy Accountant  60000.0  middle
6.1.9 When Hive Can Avoid MapReduce
In Hive, the following query does not trigger a MapReduce job:

SELECT * FROM employees;

When the WHERE clause filters only on partition columns, MapReduce is also unnecessary:

SELECT * FROM employees WHERE country = 'US' AND state = 'CA' LIMIT 100;

If the property hive.exec.mode.local.auto is set to true, Hive will also try to run other queries in local mode:

set hive.exec.mode.local.auto=true;

Otherwise, Hive executes the query with MapReduce. It is best to add set hive.exec.mode.local.auto=true; to your $HOME/.hiverc file.
6.2 WHERE Clauses
SELECT * FROM employees WHERE country = 'US' AND state = 'CA';
hive> SELECT name, salary, deductions["Federal Taxes"],
    > salary * (1 - deductions["Federal Taxes"])
    > FROM employees
    > WHERE round(salary * (1 - deductions["Federal Taxes"])) > 70000;
John Doe  100000.0  0.2  80000.0
hive> SELECT name, salary, deductions["Federal Taxes"],
    > salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
    > FROM employees
    > WHERE round(salary_minus_fed_taxes) > 70000;
FAILED: Error in semantic analysis: Line 4:13 Invalid table alias or column reference 'salary_minus_fed_taxes': (possible column names are: name, salary, subordinates, deductions, address)

Column aliases cannot be referenced in a WHERE clause. A nested SELECT works instead:
hive> SELECT e.* FROM
    > (SELECT name, salary, deductions["Federal Taxes"] as ded,
    > salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
    > FROM employees) e
    > WHERE round(e.salary_minus_fed_taxes) > 70000;
John Doe  100000.0  0.2  80000.0
Boss Man  200000.0  0.3  140000.0
Fred Finance  150000.0  0.3  105000.0
6.2.1 Predicate Operators
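Hive's WHERE clauses accept the usual comparison predicates: =, <> (or !=), <, <=, >, >=, [NOT] BETWEEN ... AND ..., IS [NOT] NULL, plus the pattern operators LIKE and RLIKE covered below. A quick sketch against the employees table defined above (the salary thresholds here are made up for illustration):

```sql
-- Rows whose salary falls in a range and whose city is populated
SELECT name, salary
FROM employees
WHERE salary BETWEEN 60000.0 AND 80000.0
  AND address.city IS NOT NULL;
```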
6.2.2 A Note on Floating-Point Comparisons
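A well-known gotcha: comparing a FLOAT column against a literal such as 0.2 can behave unexpectedly, because Hive treats the literal as a DOUBLE, and a stored FLOAT 0.2 promoted to DOUBLE is slightly larger than the DOUBLE literal 0.2. A sketch of the problem and the usual fix, assuming the deductions values are FLOATs as in the employees schema above:

```sql
-- May surprisingly return rows whose deduction is exactly 0.2,
-- because float(0.2) widened to double is ~0.20000000298, which is > 0.2
SELECT name, salary FROM employees
WHERE deductions['Federal Taxes'] > 0.2;

-- Fix: force the comparison to happen at the same precision
SELECT name, salary FROM employees
WHERE deductions['Federal Taxes'] > cast(0.2 AS FLOAT);
```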
6.2.3 LIKE and RLIKE
LIKE:

hive> SELECT name, address.street FROM employees WHERE address.street LIKE '%Ave.';
John Doe  1 Michigan Ave.
Todd Jones  200 Chicago Ave.

hive> SELECT name, address.city FROM employees WHERE address.city LIKE 'O%';
Todd Jones  Oak Park
Bill King  Obscuria

hive> SELECT name, address.street FROM employees WHERE address.street LIKE '%Chi%';
Todd Jones  200 Chicago Ave.

RLIKE takes a full regular expression:
hive> SELECT name, address.street
    > FROM employees WHERE address.street RLIKE '.*(Chicago|Ontario).*';
Mary Smith  100 Ontario St.
Todd Jones  200 Chicago Ave.
6.3 GROUP BY Clauses
GROUP BY is usually used together with aggregate functions; it groups rows by one or more columns.
hive> SELECT year(ymd), avg(price_close) FROM stocks
    > WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
    > GROUP BY year(ymd);
1984  25.578625440597534
1985  20.193676221040867
1986  32.46102808021274
1987  53.88968399108163
1988  41.540079275138766
1989  41.65976212516664
1990  37.56268799823263
1991  52.49553383386182
1992  54.80338610251119
1993  41.02671956450572
1994  34.0813495847914
A HAVING clause, used together with GROUP BY, filters on the grouped results:
hive> SELECT year(ymd), avg(price_close) FROM stocks
    > WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
    > GROUP BY year(ymd)
    > HAVING avg(price_close) > 50.0;
1987  53.88968399108163
1991  52.49553383386182
1992  54.80338610251119
1999  57.77071460844979
2000  71.74892876261757
2005  52.401745992993554

Without HAVING, the same filtering requires a nested SELECT:
hive> SELECT s2.year, s2.avg FROM
    > (SELECT year(ymd) AS year, avg(price_close) AS avg FROM stocks
    > WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
    > GROUP BY year(ymd)) s2
    > WHERE s2.avg > 50.0;
1987  53.88968399108163
6.4 JOIN Statements
6.4.1 INNER JOIN
An inner join returns only those rows in both tables that satisfy the join criteria:

hive> SELECT a.ymd, a.price_close, b.price_close
    > FROM stocks a JOIN stocks b ON a.ymd = b.ymd
    > WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';
2010-01-04  214.01  132.45
2010-01-05  214.38  130.85
2010-01-06  210.97  130.0
2010-01-07  210.58  129.55
2010-01-08  211.98  130.85
2010-01-11  210.11  129.48

Hive does not support non-equality join predicates, such as the following:

SELECT a.ymd, a.price_close, b.price_close FROM stocks a JOIN stocks b ON a.ymd <= b.ymd WHERE a.symbol = 'AAPL' AND b.symbol = 'IBM';

Nor does Hive support OR in ON clauses.
6.4.2 JOIN Optimization
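One rule of thumb from Hive's join implementation: in a multi-table join, Hive buffers the earlier tables and streams the last one through the reducers, so queries run best when the largest table appears last in the FROM clause. When that ordering is inconvenient, a STREAMTABLE hint can tell the optimizer which table to stream instead. A sketch (the relative table sizes are assumed, not taken from this article):

```sql
-- Ask Hive to stream the (presumably much larger) stocks table s,
-- regardless of its position in the join
SELECT /*+ STREAMTABLE(s) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d
  ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
```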
6.4.3 LEFT OUTER JOIN
In a left outer join (LEFT OUTER JOIN), every row from the left table that satisfies the WHERE clause is returned; when the right table has no matching row, its columns are returned as NULL.
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
    > WHERE s.symbol = 'AAPL';
1987-05-01  AAPL  80.0  NULL
1987-05-04  AAPL  79.75  NULL
1987-05-05  AAPL  80.25  NULL
1987-05-06  AAPL  80.0  NULL
1987-05-07  AAPL  80.25  NULL
1987-05-08  AAPL  79.0  NULL
1987-05-11  AAPL  77.0  0.015
1987-05-12  AAPL  75.5  NULL
1987-05-13  AAPL  78.5  NULL
1987-05-14  AAPL  79.25  NULL
1987-05-15  AAPL  78.25  NULL
1987-05-18  AAPL  75.75  NULL
1987-05-19  AAPL  73.25  NULL
1987-05-20  AAPL  74.5  NULL
6.4.4 OUTER JOIN
Outer joins (OUTER JOIN) ignore partition filtering conditions, whereas inner joins (INNER JOIN) do not.
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s LEFT OUTER JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
    > WHERE s.symbol = 'AAPL'
    > AND s.exchange = 'NASDAQ' AND d.exchange = 'NASDAQ';
1987-05-11  AAPL  77.0  0.015
1987-08-10  AAPL  48.25  0.015
1987-11-17  AAPL  35.0  0.02
1988-02-12  AAPL  41.0  0.02
1988-05-16  AAPL  41.25  0.02
6.4.5 RIGHT OUTER JOIN
A right outer join (RIGHT OUTER JOIN) returns every row from the right table that satisfies the WHERE clause; when the left table has no matching row, its columns are returned as NULL.
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM dividends d RIGHT OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
    > WHERE s.symbol = 'AAPL';
1987-05-07  AAPL  80.25  NULL
1987-05-08  AAPL  79.0  NULL
1987-05-11  AAPL  77.0  0.015
1987-05-12  AAPL  75.5  NULL
1987-05-13  AAPL  78.5  NULL
6.4.6 FULL OUTER JOIN
A full outer join (FULL OUTER JOIN) returns all rows from both tables that satisfy the WHERE clause, filling unmatched columns with NULL.
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM dividends d FULL OUTER JOIN stocks s ON d.ymd = s.ymd AND d.symbol = s.symbol
    > WHERE s.symbol = 'AAPL';
1987-05-07  AAPL  80.25  NULL
1987-05-08  AAPL  79.0  NULL
1987-05-11  AAPL  77.0  0.015
1987-05-12  AAPL  75.5  NULL
1987-05-13  AAPL  78.5  NULL
6.4.7 LEFT SEMI-JOIN
A left semi-join (LEFT SEMI JOIN) returns rows from the left table for which the right table satisfies the predicates in the ON clause. Hive does not support the IN-subquery form shown below:

SELECT s.ymd, s.symbol, s.price_close FROM stocks s WHERE s.ymd, s.symbol IN (SELECT d.ymd, d.symbol FROM dividends d);

Use the following query instead:
hive> SELECT s.ymd, s.symbol, s.price_close
    > FROM stocks s LEFT SEMI JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;
1962-11-05  IBM  361.5
1962-08-07  IBM  373.25
1962-05-08  IBM  459.5
1962-02-06  IBM  551.5

Note: the SELECT and WHERE clauses cannot reference columns from the right table. Hive does not support right semi-joins.
6.4.8 Cartesian Product JOINs
A Cartesian product join pairs every row of the left table with every row of the right table, returning (left rows × right rows) results:

SELECT * FROM stocks JOIN dividends;
6.4.9 Map-side Joins
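When one side of a join is small enough to fit in memory, Hive can load it into an in-memory hash table on each mapper and perform the join entirely on the map side, skipping the reduce step. In older Hive versions this was requested with a MAPJOIN hint; newer versions can convert the join automatically when hive.auto.convert.join is enabled. A sketch, assuming dividends is the small table:

```sql
-- Explicit hint (older syntax):
SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d
  ON s.ymd = d.ymd AND s.symbol = d.symbol;

-- Or let Hive decide automatically:
set hive.auto.convert.join=true;
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d
  ON s.ymd = d.ymd AND s.symbol = d.symbol;
```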
6.5 ORDER BY and SORT BY
Hive provides ORDER BY and SORT BY for sorting data. The difference: ORDER BY performs a total ordering of the result set, while SORT BY orders the data only within each reducer, producing a locally sorted output (each reducer's output is sorted, but not the result as a whole). Because ORDER BY must pass all data through a single reducer, it can take a long time to run; if the property hive.mapred.mode is set to strict (the default is nonstrict), Hive requires a LIMIT clause with ORDER BY.
ORDER BY:

SELECT s.ymd, s.symbol, s.price_close FROM stocks s ORDER BY s.ymd ASC, s.symbol DESC;

SORT BY:

SELECT s.ymd, s.symbol, s.price_close FROM stocks s SORT BY s.ymd ASC, s.symbol DESC;
6.6 DISTRIBUTE BY with SORT BY
DISTRIBUTE BY controls how map output is divided among reducers. All data in MapReduce flows as key-value pairs; by default, the framework hashes the map output keys and distributes the key-value pairs to reducers according to those hash values. DISTRIBUTE BY resembles GROUP BY in that both control which reducer receives which rows for processing, while SORT BY controls how the rows are ordered within each reducer. The DISTRIBUTE BY clause must appear before the SORT BY clause.
hive> SELECT s.ymd, s.symbol, s.price_close
    > FROM stocks s
    > DISTRIBUTE BY s.symbol
    > SORT BY s.symbol ASC, s.ymd ASC;
1984-09-07  AAPL  26.5
1984-09-10  AAPL  26.37
1984-09-11  AAPL  26.87
1984-09-12  AAPL  26.12
1984-09-13  AAPL  27.5
1984-09-14  AAPL  27.87
1984-09-17  AAPL  28.62
1984-09-18  AAPL  27.62
1984-09-19  AAPL  27.0
1984-09-20  AAPL  27.12
6.7 CLUSTER BY
hive> SELECT s.ymd, s.symbol, s.price_close
    > FROM stocks s
    > CLUSTER BY s.symbol;
2010-02-08  AAPL  194.12
2010-02-05  AAPL  195.46
2010-02-04  AAPL  192.05
2010-02-03  AAPL  199.23
2010-02-02  AAPL  195.86
2010-02-01  AAPL  194.73
2010-01-29  AAPL  192.06
2010-01-28  AAPL  199.29
2010-01-27  AAPL  207.88
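CLUSTER BY is shorthand for distributing and sorting on the same column: the query above is equivalent to combining DISTRIBUTE BY with an ascending SORT BY on s.symbol:

```sql
-- Equivalent to CLUSTER BY s.symbol
SELECT s.ymd, s.symbol, s.price_close
FROM stocks s
DISTRIBUTE BY s.symbol
SORT BY s.symbol;
```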
6.8 Casting
Hive uses cast for explicit type conversion:

SELECT name, salary FROM employees WHERE cast(salary AS FLOAT) < 100000.0;

If the value cannot be converted to the target type, cast returns NULL. Note that round() or floor() is preferred over a bare cast when converting a floating-point number to an integer.
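Why round() and floor() are preferred: cast() simply discards the fractional part. A sketch (assuming a Hive version that supports SELECT without a FROM clause):

```sql
-- cast() truncates toward zero, while round() rounds to the nearest integer
SELECT
  cast(9.7 AS INT),  -- 9  (fractional part discarded)
  round(9.7),        -- 10
  floor(9.7);        -- 9
```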
Casting a BINARY value b requires converting it to STRING first, then to the target type:

SELECT (2.0 * cast(cast(b as string) as double)) from src;
6.9 Sampling Queries
Sampling with the rand() function returns a different random subset on each run:

hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;
2
4

hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;
7
10
Sampling on an actual column instead of rand() returns the same rows on every run:

hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON number) s;
2

hive> SELECT * from numbers TABLESAMPLE(BUCKET 5 OUT OF 10 ON number) s;
4

hive> SELECT * from numbers TABLESAMPLE(BUCKET 3 OUT OF 10 ON number) s;
2
6.9.1 Block Sampling
Hive also supports sampling a percentage of the data blocks:

hive> SELECT * FROM numbersflat TABLESAMPLE(0.1 PERCENT) s;
6.9.2 Input Pruning for Bucketed Tables
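If a table was created with CLUSTERED BY on the same column used in the TABLESAMPLE clause, Hive can prune the input and scan only the buckets needed for the sample rather than the whole table. A sketch using a hypothetical bucketed copy of the numbers table (table name and bucket count assumed for illustration):

```sql
-- Bucketed table: rows are hashed into 10 buckets by number
CREATE TABLE numbers_bucketed (number INT)
CLUSTERED BY (number) INTO 10 BUCKETS;

-- Sampling on the bucketing column reads only one bucket's file,
-- not all 10, when the sample denominator matches the bucket count
SELECT * FROM numbers_bucketed TABLESAMPLE(BUCKET 3 OUT OF 10 ON number) s;
```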
6.10 UNION ALL
UNION ALL combines two or more tables or subqueries. Every union subquery must produce the same number of columns, and the types of corresponding columns must match.

SELECT log.ymd, log.level, log.message
FROM (
  SELECT l1.ymd, l1.level, l1.message, 'Log1' AS source FROM log1 l1
  UNION ALL
  SELECT l2.ymd, l2.level, l2.message, 'Log2' AS source FROM log2 l2
) log
SORT BY log.ymd ASC;

UNION ALL can also merge data from a single source table. For example:
FROM (
  FROM src SELECT src.key, src.value WHERE src.key < 100
  UNION ALL
  FROM src SELECT src.* WHERE src.key > 110
) unioninput
INSERT OVERWRITE DIRECTORY '/tmp/union.out' SELECT unioninput.*