
Big Data Basics (9): Building a Hadoop Log Cleaning Project with Maven (Part 1)

Maven Hadoop Log Cleaning Project (Part 1)

hadoop 2.7.2

References:

Maven + Hadoop:

http://www.cnblogs.com/Leo_wl/p/4862820.html
http://blog.csdn.net/kongxx/article/details/42339581

Log cleaning:

http://www.cnblogs.com/edisonchou/p/4458219.html
1. Create a new Maven project

In Eclipse, create a new Maven project. The Hadoop artifact coordinates can be looked up at:

http://mvnrepository.com/search?q=hadoop-mapreduce-client

groupId: com

artifactId: first

Dependencies:

hadoop-common

hadoop-hdfs

hadoop-mapreduce-client-core

hadoop-mapreduce-client-jobclient

hadoop-mapreduce-client-common

I also added hadoop-yarn-common, but it is optional.

pom.xml (note: change the versions to match your own Hadoop version):

<dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>2.7.2</version>
        </dependency>
        <dependency>
            <groupId>jdk.tools</groupId>
            <artifactId>jdk.tools</artifactId>
            <version>1.8</version>
            <scope>system</scope>
            <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-yarn-common</artifactId>
            <version>2.7.2</version>
        </dependency>
</dependencies>

Click save and Maven starts resolving and building.

Once the build finishes, the dependencies appear under Maven Dependencies.
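If you want to double-check that everything resolved, you can also print the resolved dependency tree from the command line (a standard Maven goal, not something the original article used):

E:\fm-workspace\workspace_2\first>mvn dependency:tree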

2. Create the LogCleanJob class

The code is in the appendix below (for a detailed walkthrough of the code, see the original article: http://www.cnblogs.com/edisonchou/p/4458219.html).

Note: add the maven-assembly-plugin to pom.xml; exporting the jar directly from Eclipse kept failing, and I never found the cause.

Also, the @Override annotation on the run method in the original article would not compile in my setup (likely a compiler compliance level below 1.6, where @Override is not allowed on interface implementations), so I commented it out.

3. Package the jar

E:\fm-workspace\workspace_2\first>mvn assembly:assembly

Then cd into first\target; the assembled jar is:

first-0.0.1-SNAPSHOT-jar-with-dependencies.jar

E:\fm-workspace\workspace_2\first\target>dir

2016/08/13  18:21    <DIR>          .

2016/08/13  18:21    <DIR>          ..

2016/08/13  18:19    <DIR>          archive-tmp

2016/08/13  17:34    <DIR>          classes

2016/08/13  18:21        42,996,951 first-0.0.1-SNAPSHOT-jar-with-dependencies.jar

2016/08/13  18:21             9,266 first-0.0.1-SNAPSHOT.jar

2016/08/13  18:19    <DIR>          maven-archiver

2016/08/13  17:31    <DIR>          maven-status

2016/08/13  18:19    <DIR>          surefire-reports

2016/08/13  17:34    <DIR>          test-classes

               2 File(s)      43,006,217 bytes

               8 Dir(s)  113,821,888,512 bytes free

4. Copy the jar to the Linux server

Rename first-0.0.1-SNAPSHOT-jar-with-dependencies.jar to first.jar and copy it to the Linux machine; a one-command sketch follows.
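One way to do the rename and copy in a single step, assuming an scp client is available on the build machine and reusing the host and directory shown in the listing below:

scp first-0.0.1-SNAPSHOT-jar-with-dependencies.jar root@py-server:/projects/data/first.jar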

root@py-server:/projects/data# ll

total 42008

drwxr-xr-x 4 root root     4096 Aug 13 18:52 ./

drwxr-xr-x 7 root root     4096 Aug 11 16:29 ../

-rw-r--r-- 1 root root 42996951 Aug 13 18:21 first.jar

drwxr-xr-x 2 root root     4096 Aug 13 15:36 hadoop-logs/

drwxr-xr-x 2 root root     4096 Aug  3 21:04 test/

5. Upload the data to HDFS

Get the data files from the original article (http://www.cnblogs.com/edisonchou/p/4458219.html); they total roughly 200 MB.

You can also use your own log files, as long as the format matches; each raw line should look like the sample shown below.
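For reference, this is the sample line used in LogParser.main in the appendix:

27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127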

root@py-server:/projects/data/hadoop-logs# ll

total 213056

drwxr-xr-x 2 root root      4096 Aug 13 15:36 ./

drwxr-xr-x 4 root root      4096 Aug 13 18:25 ../

-rw-r--r-- 1 root root  61084192 Apr 26  2015 access_2013_05_30.log

-rw-r--r-- 1 root root 157069653 Apr 26  2015 access_2013_05_31.log

The HDFS home directory for root is /user/root/ by default, so the relative paths below resolve there.

root@py-server:/projects/data# hadoop fs -put hadoop-logs/ .

root@py-server:/projects/data# hadoop fs -ls 

Found 14 items

drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 .sparkStaging

drwxr-xr-x   - root supergroup          0 2016-08-13 15:38 hadoop-logs

-rw-r--r--   2 root supergroup      85285 2016-08-06 07:59 imdb_labelled.txt

-rw-r--r--   2 root supergroup         72 2016-08-04 09:29 kmeans_data.txt

drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 kmeans_result

drwxr-xr-x   - root supergroup          0 2016-08-05 16:16 kmeans_result.txt

-rw-r--r--   2 root supergroup      43914 2016-08-04 12:33 ks_aio.py

drwxr-xr-x   - root supergroup          0 2016-08-09 10:51 mymlresult

drwxr-xr-x   - root supergroup          0 2016-08-09 10:28 naive_bayes_result

-rw-r--r--   2 root supergroup      66288 2016-08-09 23:57 price_data.txt

-rw-r--r--   2 root supergroup       1619 2016-08-08 17:54 price_data2.txt

-rw-r--r--   2 root supergroup       1619 2016-08-09 09:13 price_train_data.txt

-rw-r--r--   2 root supergroup        120 2016-08-04 09:24 sample_kmeans_data.txt

-rw-r--r--   2 root supergroup     104736 2016-08-08 17:14 sample_libsvm_data.txt

6. Run the job on Hadoop

root@py-server:/projects/data# hadoop jar first.jar /user/root/hadoop-logs/ /user/root/logcleanjob_output
Result (the job ran fast, about 36 seconds):

Check it in the Hadoop web UI (mine is at py-server:8088):

User:root
Name:LogCleanJob
Application Type:MAPREDUCE
Application Tags: 
YarnApplicationState:FINISHED
FinalStatus Reported by AM:SUCCEEDED
Started: Sat Aug 13 18:46:18 +0800 2016
Elapsed:36sec
Tracking URL:History
Diagnostics: 
Clean process success!
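The same status can also be read from the command line with a standard YARN command (not used in the original article):

root@py-server:/projects/data# yarn application -list -appStates FINISHED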

root@py-server:/projects/data# hadoop fs -ls /user/root/

Found 15 items

drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 /user/root/.sparkStaging

drwxr-xr-x   - root supergroup          0 2016-08-13 18:45 /user/root/hadoop-logs

-rw-r--r--   2 root supergroup      85285 2016-08-06 07:59 /user/root/imdb_labelled.txt

-rw-r--r--   2 root supergroup         72 2016-08-04 09:29 /user/root/kmeans_data.txt

drwxr-xr-x   - root supergroup          0 2016-08-09 23:59 /user/root/kmeans_result

drwxr-xr-x   - root supergroup          0 2016-08-05 16:16 /user/root/kmeans_result.txt

-rw-r--r--   2 root supergroup      43914 2016-08-04 12:33 /user/root/ks_aio.py

drwxr-xr-x   - root supergroup          0 2016-08-13 18:46 /user/root/logcleanjob_output

drwxr-xr-x   - root supergroup          0 2016-08-09 10:51 /user/root/mymlresult

drwxr-xr-x   - root supergroup          0 2016-08-09 10:28 /user/root/naive_bayes_result

-rw-r--r--   2 root supergroup      66288 2016-08-09 23:57 /user/root/price_data.txt

-rw-r--r--   2 root supergroup       1619 2016-08-08 17:54 /user/root/price_data2.txt

-rw-r--r--   2 root supergroup       1619 2016-08-09 09:13 /user/root/price_train_data.txt

-rw-r--r--   2 root supergroup        120 2016-08-04 09:24 /user/root/sample_kmeans_data.txt

-rw-r--r--   2 root supergroup     104736 2016-08-08 17:14 /user/root/sample_libsvm_data.txt

root@py-server:/projects/data# hadoop fs -ls /user/root/logcleanjob_output

Found 2 items

-rw-r--r--   2 root supergroup          0 2016-08-13 18:46 /user/root/logcleanjob_output/_SUCCESS

-rw-r--r--   2 root supergroup   50810594 2016-08-13 18:46 /user/root/logcleanjob_output/part-r-00000

root@py-server:/projects/data# hadoop fs -cat /user/root/logcleanjob_output/part-r-00000

118.112.191.88  20130530204006  source/plugin/wsh_wx/img/wsh_zk.css

113.107.237.31  20130530204005  thread-10500-1-1.html

110.251.129.203 20130531081904  forum.php?mod=ajax&action=forumchecknew&fid=111&time=1369959258&inajax=yes

118.112.191.88  20130530204006  data/cache/style_1_common.css?y7a

220.231.55.69   20130530204005  home.php?mod=spacecp&ac=pm&op=checknewpm&rand=1369917603

110.75.174.58   20130531081903  thread-21066-1-1.html

118.112.191.88  20130530204006  data/cache/style_1_forum_viewthread.css?y7a

110.75.174.55   20130531081904  home.php?do=thread&from=space&mod=space&uid=71469&view=me

14.17.29.89     20130530204006  home.php?mod=misc&ac=sendmail&rand=1369917604

121.25.131.148  20130531081906  data/attachment/common/c2/common_12_usergroup_icon.jpg

59.174.191.135  20130530204003  forum.php?mod=forumdisplay&fid=111&page=1&filter=author&orderby=dateline

118.112.191.88  20130530204007  data/attachment/common/65/common_11_usergroup_icon.jpg

121.25.131.148  20130531081905  home.php?mod=misc&ac=sendmail&rand=1369959541

101.229.199.98  20130530204007  data/cache/style_1_widthauto.css?y7a

59.174.191.135  20130530204005  home.php?mod=space&uid=71081&do=profile&from=space
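Only a few lines are reproduced above; since part-r-00000 is about 50 MB, it is handier to pipe the output through head (plain shell piping, not from the original article):

root@py-server:/projects/data# hadoop fs -cat /user/root/logcleanjob_output/part-r-00000 | head -n 20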

#######################################

Troubleshooting:

1. What to do if the Maven build is interrupted
http://www.cnblogs.com/tangyanbo/p/4329303.html
Right-click the project: Maven -> Update Project and check the Force option. With Force checked you do not need to delete the partially downloaded leftovers first, which is handy when a large number of jars failed to download.

Then rebuild the project; a command-line alternative is sketched below.
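Roughly the same effect can be had from the command line by forcing Maven to re-check and re-download dependencies with the -U flag (a standard Maven option, not from the original article):

mvn -U clean package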

2. hadoop jar: do not put the main class name after first.jar. If you do, the job complains that it cannot find the input folder (see the comparison below the working command).

hadoop jar first.jar /user/root/hadoop-logs/ /user/root/logcleanjob_output
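For comparison, this is the failing form: because Main-Class is already set in the jar's manifest, the class name below is passed to the program as args[0] and gets treated as a (non-existent) input path.

hadoop jar first.jar com.first.LogCleanJob /user/root/hadoop-logs/ /user/root/logcleanjob_output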

#######################################

Appendix: LogCleanJob.java

package com.first;
//package techbbs;

import java.net.URI;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LogCleanJob extends Configured implements Tool {

    public static void main(String[] args) {
        Configuration conf = new Configuration();
        try {
            int res = ToolRunner.run(conf, new LogCleanJob(), args);
            System.exit(res);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    //@Override
    public int run(String[] args) throws Exception {
        final Job job = new Job(new Configuration(),
                LogCleanJob.class.getSimpleName());
        // make the job runnable from a packaged jar
        job.setJarByClass(LogCleanJob.class);
        FileInputFormat.setInputPaths(job, args[0]);
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // delete the output directory if it already exists
        FileSystem fs = FileSystem.get(new URI(args[0]), getConf());
        Path outPath = new Path(args[1]);
        if (fs.exists(outPath)) {
            fs.delete(outPath, true);
        }

        boolean success = job.waitForCompletion(true);
        if (success) {
            System.out.println("Clean process success!");
        } else {
            System.out.println("Clean process failed!");
        }
        return 0;
    }

    static class MyMapper extends
            Mapper<LongWritable, Text, LongWritable, Text> {
        LogParser logParser = new LogParser();
        Text outputValue = new Text();

        protected void map(
                LongWritable key,
                Text value,
                org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, LongWritable, Text>.Context context)
                throws java.io.IOException, InterruptedException {
            final String[] parsed = logParser.parse(value.toString());
            // step 1: filter out requests for static resources
            if (parsed[2].startsWith("GET /static/")
                    || parsed[2].startsWith("GET /uc_server")) {
                return;
            }
            // step 2: strip the leading "GET /" or "POST /"
            if (parsed[2].startsWith("GET /")) {
                parsed[2] = parsed[2].substring("GET /".length());
            } else if (parsed[2].startsWith("POST /")) {
                parsed[2] = parsed[2].substring("POST /".length());
            }
            // step 3: strip the trailing " HTTP/1.1"
            if (parsed[2].endsWith(" HTTP/1.1")) {
                parsed[2] = parsed[2].substring(0, parsed[2].length()
                        - " HTTP/1.1".length());
            }
            // step 4: emit only the first three fields (ip, time, url)
            outputValue.set(parsed[0] + "\t" + parsed[1] + "\t" + parsed[2]);
            context.write(key, outputValue);
        }
    }

    static class MyReducer extends
            Reducer<LongWritable, Text, Text, NullWritable> {
        protected void reduce(
                LongWritable k2,
                java.lang.Iterable<Text> v2s,
                org.apache.hadoop.mapreduce.Reducer<LongWritable, Text, Text, NullWritable>.Context context)
                throws java.io.IOException, InterruptedException {
            for (Text v2 : v2s) {
                context.write(v2, NullWritable.get());
            }
        };
    }

    /*
     * Log line parser
     */
    static class LogParser {
        public static final SimpleDateFormat FORMAT = new SimpleDateFormat(
                "d/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        public static final SimpleDateFormat dateformat1 = new SimpleDateFormat(
                "yyyyMMddHHmmss");

        public static void main(String[] args) throws ParseException {
            final String S1 = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
            LogParser parser = new LogParser();
            final String[] array = parser.parse(S1);
            System.out.println("Sample line: " + S1);
            System.out.format(
                    "Parsed result: ip=%s, time=%s, url=%s, status=%s, traffic=%s",
                    array[0], array[1], array[2], array[3], array[4]);
        }

        /**
         * Parse the English-locale timestamp string.
         *
         * @param string
         * @return
         * @throws ParseException
         */
        private Date parseDateFormat(String string) {
            Date parse = null;
            try {
                parse = FORMAT.parse(string);
            } catch (ParseException e) {
                e.printStackTrace();
            }
            return parse;
        }

        /**
         * Parse one log line.
         *
         * @param line
         * @return an array of 5 elements: ip, time, url, status, traffic
         */
        public String[] parse(String line) {
            String ip = parseIP(line);
            String time = parseTime(line);
            String url = parseURL(line);
            String status = parseStatus(line);
            String traffic = parseTraffic(line);
            return new String[] { ip, time, url, status, traffic };
        }

        private String parseTraffic(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1)
                    .trim();
            String traffic = trim.split(" ")[1];
            return traffic;
        }

        private String parseStatus(String line) {
            final String trim = line.substring(line.lastIndexOf("\"") + 1)
                    .trim();
            String status = trim.split(" ")[0];
            return status;
        }

        private String parseURL(String line) {
            final int first = line.indexOf("\"");
            final int last = line.lastIndexOf("\"");
            String url = line.substring(first + 1, last);
            return url;
        }

        private String parseTime(String line) {
            final int first = line.indexOf("[");
            final int last = line.indexOf("+0800]");
            String time = line.substring(first + 1, last).trim();
            Date date = parseDateFormat(time);
            return dateformat1.format(date);
        }

        private String parseIP(String line) {
            String ip = line.split("- -")[0].trim();
            return ip;
        }
    }
}
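As a quick sanity check of the parser by itself, you can run the nested LogParser's main method against the locally compiled classes (a sketch, assuming the default Maven output directory target\classes):

E:\fm-workspace\workspace_2\first>java -cp target\classes com.first.LogCleanJob$LogParser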

######################################################

The complete pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com</groupId>
  <artifactId>first</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>first</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.7.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.7.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>2.7.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
      <version>2.7.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-common</artifactId>
      <version>2.7.2</version>
    </dependency>
    <dependency>
      <groupId>jdk.tools</groupId>
      <artifactId>jdk.tools</artifactId>
      <version>1.8</version>
      <scope>system</scope>
      <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-yarn-common</artifactId>
      <version>2.7.2</version>
    </dependency>
  </dependencies>

  <build>
    <defaultGoal>compile</defaultGoal>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>com.first.LogCleanJob</mainClass>
            </manifest>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
Tags: hadoop, maven, log cleaning