
Hive log processing: a website PV/UV statistics example, combined with a Python data-cleaning example

2018-04-12 11:00

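Part 1: Processing logs in Hive to compute PV and UV traffic

Part 2: Cleaning Hive data with Python

Part 1: Log processing

Count the site's page views for each hour of the day:

1.1 Create the table structure in Hive:

Note that the log cannot be imported directly when the table is created; the table below parses each line with RegexSerDe, and the data is loaded in a separate step: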
create table db_bflog.bf_log_src (
remote_addr string,
remote_user string,
time_local string,
request string,
status string,
body_bytes_sent string,
request_body string,
http_referer string,
http_user_agent string,
http_x_forwarded_for string,
host string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\]]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")"
)
STORED AS TEXTFILE;
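
To make the column mapping easier to see, here is a minimal Python sketch of how the "input.regex" above splits one log line into the eleven columns. The pattern is a transcription of the Hive string with the string escapes removed, which is an assumption about how Hive unescapes it. The sample line, the helper name parse_line and the COLUMNS list are made up for illustration and are not taken from the real log file.

```python
import re

# Python transcription of the table's "input.regex" (Hive string escapes removed).
LOG_PATTERN = re.compile(
    r'("[^ ]*") ("-|[^ ]*") ("[^\]]*") ("[^"]*") ("[0-9]*") ("[0-9]*") '
    r'(-|[^ ]*) ("[^ ]*") ("[^"]*") (-|[^ ]*) ("[^ ]*")'
)

# Column names in the same order as the CREATE TABLE statement.
COLUMNS = [
    "remote_addr", "remote_user", "time_local", "request", "status",
    "body_bytes_sent", "request_body", "http_referer", "http_user_agent",
    "http_x_forwarded_for", "host",
]


def parse_line(line):
    """Return a dict of column -> raw field, or None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Made-up sample line (hypothetical, not from moodle.ibeifeng.access.log).
SAMPLE = (
    '"192.0.2.10" "-" "31/Aug/2015:00:04:37 +0800" '
    '"GET /course/view.php?id=27 HTTP/1.1" "200" "7877" - '
    '"http://example.com/index.php" "Mozilla/5.0 (X11; Linux x86_64)" "-" "example.com"'
)

row = parse_line(SAMPLE)
print(row["time_local"])
print(row["request"])
```

Each captured field still carries its surrounding double quotes, which is why section 1.3.1 defines a UDF to strip them.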

1.2 Load the data into the Hive table:

load data local inpath '/home/hadoop/moodle.ibeifeng.access.log' into table db_bflog.bf_log_src ;

1.3 Custom UDFs

1.3.1: UDF that strips the surrounding quotes

package org.apache.hadoop.udf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * Hive UDF that removes every double quote from a string.
 * New UDF classes need to inherit from the UDF class.
 *
 * @author zhangyy
 */
public class RemoveQuotesUDF extends UDF {

    /*
     * 1. Implement one or more methods named "evaluate", which Hive will call.
     * 2. "evaluate" should never be a void method. However, it can return null if needed.
     */
    public Text evaluate(Text str) {
        if (null == str) {
            return null;
        }

        // Treat empty or whitespace-only input as null.
        if (StringUtils.isBlank(str.toString())) {
            return null;
        }

        // Remove every double quote from the value.
        return new Text(str.toString().replaceAll("\"", ""));
    }

    // Quick local check outside Hive.
    public static void main(String[] args) {
        System.out.println(new RemoveQuotesUDF().evaluate(new Text("\"GET /course/view.php?id=27 HTTP/1.1\"")));
    }
}
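
For readers who want to check the behaviour without building a jar, the same logic can be sketched in Python. This is only an illustration of what RemoveQuotesUDF does; the function name remove_quotes is made up, and Python's strip() is used as an approximation of StringUtils.isBlank.

```python
def remove_quotes(value):
    """Mirror RemoveQuotesUDF.evaluate: None or blank input gives None."""
    if value is None or value.strip() == "":
        return None
    # Drop every double quote, wherever it appears in the string.
    return value.replace('"', "")


print(remove_quotes('"GET /course/view.php?id=27 HTTP/1.1"'))
# prints: GET /course/view.php?id=27 HTTP/1.1
```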

1.3.2: UDF that converts the time format

package org.apache.hadoop.udf;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * Hive UDF that converts an access-log timestamp to "yyyy-MM-dd HH:mm:ss".
 * New UDF classes need to inherit from the UDF class.
 *
 * @author zhangyy
 */
public class DateTransformUDF extends UDF {

    // Locale.ENGLISH is needed so that month names such as "Aug" are parsed correctly.
    private final SimpleDateFormat inputFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    private final SimpleDateFormat outputFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    /*
     * 1. Implement one or more methods named "evaluate", which Hive will call.
     * 2. "evaluate" should never be a void method. However, it can return null if needed.
     */
    /**
     * input:
     *      31/Aug/2015:00:04:37 +0800
     * output:
     *      2015-08-31 00:04:37
     */
    public Text evaluate(Text str) {
        Text output = new Text();

        if (null == str) {
            return null;
        }

        // Treat empty or whitespace-only input as null.
        if (StringUtils.isBlank(str.toString())) {
            return null;
        }

        try {
            // 1) parse (the trailing " +0800" offset is ignored by parse)
            Date parseDate = inputFormat.parse(str.toString().trim());
            // 2) transform
            String outputDate = outputFormat.format(parseDate);
            // 3) set
            output.set(outputDate);
        } catch (Exception e) {
            // On a parse failure the stack trace is printed and an empty Text is returned.
            e.printStackTrace();
        }

        return output;
    }

    // Quick local check outside Hive.
    public static void main(String[] args) {
        System.out.println(new DateTransformUDF().evaluate(new Text("31/Aug/2015:00:04:37 +0800")));
    }
}
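
The conversion performed by DateTransformUDF can also be sketched in Python for a quick sanity check. This is an illustration under a few assumptions: the function name date_transform is made up, the "%b" month parsing assumes an English or C locale, and unlike the Java class this sketch returns None (not an empty string) when the timestamp cannot be parsed.

```python
from datetime import datetime


def date_transform(value):
    """Convert '31/Aug/2015:00:04:37 +0800' to '2015-08-31 00:04:37'."""
    if value is None or value.strip() == "":
        return None
    # Keep only the date-time part; the "+0800" offset is dropped,
    # matching the output documented for the UDF.
    stamp = value.strip().split(" ")[0]
    try:
        parsed = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
    except ValueError:
        # Assumption: unparseable input maps to None in this sketch.
        return None
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


print(date_transform("31/Aug/2015:00:04:37 +0800"))
# prints: 2015-08-31 00:04:37
```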

Export RemoveQuotesUDF and DateTransformUDF as jar files and place them in the /home/hadoop/jars directory.

1.4 Register the UDFs in Hive

Register RemoveQuotesUDF as a UDF:

add jar /home/hadoop/jars/RemoveQuotesUDF.jar ;

create temporary function My_RemoveQuotes as "org.apache.hadoop.udf.RemoveQuotesUDF" ;

Register DateTransformUDF as a UDF:

add jar /home/hadoop/jars/DateTransformUDF.jar ;

create temporary function My_DateTransform as "org.apache.hadoop.udf.DateTransformUDF" ;

1.5 Create the table that holds the required columns:

create table db_bflog.bf_log_comm(
remote_addr string,
time_local string,
request string,
http_referer string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC tblproperties ("orc.compress"="SNAPPY");

Extract the relevant columns from the source table:

insert into table db_bflog.bf_log_comm select remote_addr, time_local, request, http_referer from db_bflog.bf_log_src ;

Run the SQL that counts the PV (page views) for each hour:

select t.hour,count(*) cnt
from
(select substring(my_datetransform(my_removequotes(time_local)),12,2) hour from bf_log_comm) t
group by t.hour order by cnt desc ;
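
The query above strips the quotes, converts the timestamp, takes the two hour digits and counts rows per hour. The sketch below reproduces that logic in Python on a few made-up rows so the expected shape of the result is visible. The UV part is an assumption that goes beyond the query: it treats each distinct remote_addr within an hour as one visitor. The names hour_of, hourly_pv, hourly_uv and ROWS are hypothetical, and "%b" assumes an English or C locale.

```python
from collections import Counter
from datetime import datetime


def hour_of(time_local):
    """Strip the quotes, parse the timestamp and return the two-digit hour."""
    stamp = time_local.replace('"', "").split(" ")[0]
    return datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S").strftime("%H")


def hourly_pv(rows):
    """rows: (remote_addr, time_local) pairs. Returns [(hour, pv), ...], largest first."""
    counts = Counter(hour_of(time_local) for _, time_local in rows)
    return counts.most_common()


def hourly_uv(rows):
    """Assumption: one distinct remote_addr per hour counts as one visitor."""
    visitors = {}
    for remote_addr, time_local in rows:
        visitors.setdefault(hour_of(time_local), set()).add(remote_addr)
    return {hour: len(addrs) for hour, addrs in visitors.items()}


# Made-up rows in the shape of (remote_addr, time_local) from bf_log_comm.
ROWS = [
    ('"192.0.2.10"', '"31/Aug/2015:00:04:37 +0800"'),
    ('"192.0.2.10"', '"31/Aug/2015:00:15:02 +0800"'),
    ('"192.0.2.11"', '"31/Aug/2015:00:59:59 +0800"'),
    ('"192.0.2.12"', '"31/Aug/2015:13:20:00 +0800"'),
]

print(hourly_pv(ROWS))  # [('00', 3), ('13', 1)]
print(hourly_uv(ROWS))  # {'00': 2, '13': 1}
```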

Part 2: Cleaning Hive data with Python

Count the number of movie viewings per day of the week for an overseas cinema.
Test data download:

wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip

2.1 Create the Hive table

CREATE TABLE u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

2.2 Load the data:

LOAD DATA LOCAL INPATH '/home/hadoop/ml-100k/u.data'
OVERWRITE INTO TABLE u_data;

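2.3 Create the weekday_mapper.py script

import sys
import datetime

# Hive streams each row to stdin as one tab-separated line.
for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    # Convert the Unix timestamp to an ISO weekday (1 = Monday ... 7 = Sunday).
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    # print() with a single argument behaves the same on Python 2 and Python 3.
    print('\t'.join([userid, movieid, rating, str(weekday)]))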
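
Before handing the script to Hive, the per-line logic can be checked locally. The sketch below wraps the same conversion in a function; the name map_line and the sample row (written in the four-column u.data layout) are illustrative. One deliberate difference: it pins the time zone to UTC so the result is reproducible, whereas the script above uses the local time zone of the machine it runs on, which can shift the weekday for timestamps near midnight.

```python
from datetime import datetime, timezone


def map_line(line, tz=timezone.utc):
    """Turn 'userid<TAB>movieid<TAB>rating<TAB>unixtime' into the same row with a weekday."""
    userid, movieid, rating, unixtime = line.strip().split("\t")
    # isoweekday(): 1 = Monday ... 7 = Sunday.
    weekday = datetime.fromtimestamp(float(unixtime), tz=tz).isoweekday()
    return "\t".join([userid, movieid, rating, str(weekday)])


# 881250949 is 1997-12-04 15:55:49 UTC, a Thursday, so the weekday is 4.
print(map_line("196\t242\t3\t881250949\n"))
```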

2.4 Create a temporary Hive table to receive the extracted data:

CREATE TABLE u_data_new (
userid INT,
movieid INT,
rating INT,
weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

Add the Python script to the Hive session:

add FILE /home/hadoop/weekday_mapper.py;

2.5 Extract the data from the old table

INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;

2.6 Query the required data:

SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;
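
The final GROUP BY simply counts rows per weekday value. A minimal Python sketch of the same aggregation over mapper output, using made-up rows and a hypothetical helper name count_by_weekday, looks like this:

```python
from collections import Counter


def count_by_weekday(mapped_lines):
    """Count rows per weekday (the 4th tab-separated column), sorted by weekday."""
    counts = Counter(int(line.rstrip("\n").split("\t")[3]) for line in mapped_lines)
    return sorted(counts.items())


# Made-up rows in the u_data_new layout: userid, movieid, rating, weekday.
MAPPED = ["1\t10\t5\t4", "2\t10\t3\t4", "3\t11\t4\t7"]

print(count_by_weekday(MAPPED))  # [(4, 2), (7, 1)]
```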
