Big Data Spark "蘑菇云" (Mushroom Cloud) Program, Lesson 90: Hive Join Case Study on a Movie System, Performance Optimization, and Index/Bucket Hands-On Practice
2016-12-01 21:42
Movie dataset download:
http://grouplens.org/datasets/movielens/
http://grouplens.org/datasets/movielens/1m/
http://files.grouplens.org/datasets/movielens/ml-1m-README.txt
http://files.grouplens.org/datasets/movielens/ml-1m.zip
1. Hive map join: the small table is built into a hash table, broadcast to the map side, and matched row by row against the big table.
2. Semi join: only the join keys of the table are shipped across (a sketch follows this list).
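The semi join is only mentioned in passing above; as a minimal hedged sketch (using the MovieLens tables created below), a LEFT SEMI JOIN returns the rows of the left table whose key also appears in the right table, and only the join key of the right table is shipped:
SELECT ratings.UserID, ratings.MovieID, ratings.Rating
FROM ratings LEFT SEMI JOIN users ON (ratings.UserID = users.UserID);
Note that columns of the right-hand table (users here) may not appear in the SELECT list of a LEFT SEMI JOIN.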
Layout of the three data files:
RATINGS FILE DESCRIPTION UserID::MovieID::Rating::Timestamp
USERS FILE DESCRIPTION UserID::Gender::Age::Occupation::Zip-code
MOVIES FILE DESCRIPTION MovieID::Title::Genres
hive> create database hivestudy;
show databases;
use hivestudy;
show tables;
Create the users table (Hive has no LONG type, so BIGINT is used here):
CREATE TABLE users(UserID BigInt, Gender String, Age Int, Occupation String, Zip-code String) PARTITIONED BY (dt String) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '::';
LOAD DATA LOCAL INPATH '/root/Documents/data/moviesData/users.dat' INTO TABLE users PARTITION(dt='20161101');
Create the ratings table:
CREATE TABLE ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '::';
LOAD DATA LOCAL INPATH '/root/Documents/data/moviesData/ratings.dat' INTO TABLE ratings;
Create the movies table:
CREATE TABLE movies(MovieID BigInt, Title String, Genres String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '::';
LOAD DATA LOCAL INPATH '/root/Documents/data/moviesData/movies.dat' INTO TABLE movies;
Zip-code is not a valid column identifier because of the hyphen, so drop the users table and recreate it with Zipcode:
DROP TABLE users;
CREATE TABLE users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String) PARTITIONED BY (dt String) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '::';
LOAD DATA LOCAL INPATH '/root/Documents/data/moviesData/users.dat' INTO TABLE users PARTITION(dt='20161101');
hive>select count(1) from ratings;
RATINGS UserID::MovieID::Rating::Timestamp
USERS UserID::Gender::Age::Occupation::Zip-code
MOVIES MovieID::Title::Genres
hive> SELECT users.UserID, users.Age, users.Gender FROM ratings JOIN users ON (ratings.UserID = users.UserID) WHERE ratings.MovieID = 2116;
The result is empty because the data did not actually get parsed into the tables:
select * from users limit 10; // all columns come back NULL
Drop the table and start over:
drop table users;
Trying to escape the delimiter does not work (Hive's FIELDS TERMINATED BY effectively takes a single character, so '::' cannot be expressed this way):
CREATE TABLE users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String) PARTITIONED BY (dt String) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\:\:';
Trying the ASCII code instead:
CREATE TABLE users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String) PARTITIONED BY (dt String) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\u3A\u3A';
CREATE TABLE users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String) PARTITIONED BY (dt String) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\\58';
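None of these variants work. As an alternative not used in the lecture (and assuming it is available in your Hive build), the MultiDelimitSerDe from hive-contrib handles multi-character delimiters such as '::'; the table name users_md and the jar path below are hypothetical:
ADD JAR /usr/lib/hive/lib/hive-contrib.jar;   -- path is an assumption, adjust to your installation
CREATE TABLE users_md(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String)
PARTITIONED BY (dt String)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim"="::");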
Take a different route and clean the data instead: replace every '::' in the .dat files with a tab character ('\t').
Now it works:
CREATE TABLE users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String) PARTITIONED BY (dt String) ROW FORMAT DELIMITED FIELDS
TERMINATED BY '\t';
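Only the users DDL is shown above; for the joins below to return rows, the ratings and movies tables presumably need the same rebuild against the tab-delimited files. A hedged sketch, assuming the cleaned files keep their original names and paths:
LOAD DATA LOCAL INPATH '/root/Documents/data/moviesData/users.dat' INTO TABLE users PARTITION(dt='20161101');
DROP TABLE ratings;
CREATE TABLE ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/root/Documents/data/moviesData/ratings.dat' INTO TABLE ratings;
DROP TABLE movies;
CREATE TABLE movies(MovieID BigInt, Title String, Genres String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
LOAD DATA LOCAL INPATH '/root/Documents/data/moviesData/movies.dat' INTO TABLE movies;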
Re-test the joins:
RATINGS UserID::MovieID::Rating::Timestamp
USERS UserID::Gender::Age::Occupation::Zip-code
MOVIES MovieID::Title::Genres
hive> SELECT users.UserID, users.Age, users.Gender FROM ratings JOIN users ON (ratings.UserID = users.UserID) WHERE ratings.MovieID = 2116;
hive> SELECT users.UserID, users.Age, users.Gender, movies.Title, movies.Genres FROM ratings JOIN users ON (ratings.UserID = users.UserID) JOIN movies
ON (ratings.MovieID = movies.MovieID) WHERE ratings.MovieID = 2116;
A somewhat quirky but useful Hive feature: the MAPJOIN hint, which runs the join on the map side and makes it a bit faster.
hive> SELECT /*+ MAPJOIN(movies) */ users.UserID, users.Age, users.Gender, movies.Title, movies.Genres FROM ratings JOIN users ON (ratings.UserID = users.UserID) JOIN movies ON (ratings.MovieID = movies.MovieID) WHERE ratings.MovieID = 2116;
MAPJOIN reads the small table entirely into memory; in the map phase each row of the other table is matched directly against that in-memory table. Because the join is done on the map side, the reduce phase is skipped and the query runs much faster.
It also avoids failures from data skew, where a single reducer receives far too much data. So the original SQL can request a map join simply by adding the MAPJOIN hint.
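Beyond the explicit hint, Hive can also convert eligible joins to map joins automatically. A minimal sketch of the relevant session settings (the property names are standard Hive configuration; the size threshold shown is only an illustrative value):
set hive.auto.convert.join=true;                -- let the optimizer rewrite joins whose small side fits in memory
set hive.mapjoin.smalltable.filesize=25000000;  -- tables below this size in bytes are treated as "small"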
![](http://img.blog.csdn.net/20161201213810328?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center)
![](http://img.blog.csdn.net/20161201213822007?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center)
![](http://img.blog.csdn.net/20161201213829453?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center)
![](http://img.blog.csdn.net/20161201213834726?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center)
![](http://img.blog.csdn.net/20161201213844797?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center)
![](http://img.blog.csdn.net/20161201213850430?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center)
![](http://img.blog.csdn.net/20161201213856211?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center)
![](http://img.blog.csdn.net/20161201213901368?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center)
![](http://img.blog.csdn.net/20161201213906883?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center)
![](http://img.blog.csdn.net/20161201213915134?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center)
![](http://img.blog.csdn.net/20161201213920602?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center)
![](http://img.blog.csdn.net/20161201213926024?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQv/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center)