用Apache Spark做大数据处理——第四部分:Spark MLlib机器学习库(英文版)
2017-07-22 17:25
429 查看
In this article, we'll discuss machine learning concepts and how to use Apache Spark MLlib library for running predictive analytics. We will use a sample application to illustrate the powerful API Spark provides in machine learning area.
Spark Machine Learning API includes two packages called
The spark.mllib package contains the original
Spark machine learning API built on Resilient Distributed Datasets (RDDs).
It offers machine learning techniques which include correlation, classification and regression, collaborative filtering, clustering, and dimensionality reduction.
On the other hand, spark.ml package provides machine
learning API built on theDataFrames which
are becoming the core part of Spark SQL library. This package can be used for developing and managing the machine learning pipelines. It also provides Feature Extractors, Transformers, Selectors, and machine learning techniques like classification and regression,
and clustering.
We’ll focus on Spark MLlib package in this article and discuss various machine learning techniques this package brings to the table. In the next article, we’ll cover the Spark ML package and look at how to create and manage data pipelines.
Machine Learning is about learning from existing data to make predictions about the future. It's based on creating models from input data sets for data-driven decision making.
Data science is
4000
the discipline of extracting the knowledge from large data sets (structured or unstructured) to provide insights to business teams and influence the business strategies and roadmaps. The role
of Data Scientist is more critical than ever in solving the problems that are not easy to solve using the traditional numerical methods.
There are different types of machine learning models like:
Supervised learning
Unsupervised learning
Semi-supervised Learning
Reinforcement learning
Let’s briefly look at each of these machine learning models and how they compare with each other.
Supervised learning: This technique is used to predict an outcome by training the program using an existing set of training data (labeled data). Then we use the program to predict the label for a new unlabeled data set.
There are two sub-models under supervised machine learning, Regression and Classification.
Unsupervised learning: This is used to find hidden patterns and correlations within the raw data. No training data used in this model, so this technique is based on unlabeled data.
Algorithms like k-means and Principle Component Analysis (PCA) fall into this category.
Semi-supervised Learning: This technique uses
both supervised and unsupervised learning models for predictive analytics. It uses labeled and unlabeled data sets for training. It typically involves using a small amount of labeled data with a large amount of unlabeled data. It can be used for machine learning
methods like classification and regression.
Reinforcement learning: The Reinforcement
Learning technique is used to learn how to maximize a numerical reward goal by trying different actions and discovering which actions result in the maximum reward.
Following table shows some examples where these machine learning models can be used to solve the problems.
Table 1. Machine Learning Models and Real-world Examples
There are several algorithms to help
with machine learning solutions. Let’s look at some of the algorithms supported by machine learning frameworks.
Naive Bayes: Naive Bayes is a supervised learning algorithm used for classification. It's based on applying Bayes theorem and a set of conditional independence assumptions.
k-means Clustering: k-means algorithm creates k groups from a set of objects so that the members of a group are more similar.
Support vector machines: Support vector machines (SVMs) is a supervised learning algorithm used to find the boundary that separates classes by as wide a margin as possible. Given a set of training examples,
each marked for belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other. Applications of SVM include bioinformatics, text, and image recognition.
Decision Trees: Decision trees are used in many types of machine learning problems including multi-class classification. MLlib supports both basic decision tree algorithm and ensembles of trees. Two ensemble
algorithms are available, Gradient-Boosted Trees and Random Forests.
As documented on this website,
here is a summary of all the machine learning styles, problems, and methods to solve them, shown in Table 2 below.
Table 2. Machine learning styles, problems, and methods
When working machine learning projects, other tasks like data preparation, cleansing and analysis are also very important tasks in addition to the actual learning models and algorithms used to solve the business problems.
Following are the steps performed in a typical machine learning program.
Featurization
Training
Model Evaluation
Figure 1 below shows the process flow of a typical machine learning solution.
Figure 1. Machine Learning Process Flow
It's important to know that if the raw data isn't cleaned or prepared before running machine learning algorithms, the resulting patterns will not be accurate or useful as well as some anomalies may be missed.
The quality of training data we provide to machine learning programs also plays a critical role in the prediction results. If the training data is not random enough, resulting patterns won’t be accurate. And if the data is too small, the machine learning program
may give inaccurate predictions.
The business use cases for machine learning span different domains and scenarios including recommendation engines (food
recommendation engine), predictive analytics (stock
price prediction or predicting
flights delay), targeted advertising, fraud detection, image and video recognition, self-driving cars and various other artificial intelligence.
Let’s look at two popular machine learning applications, recommendation engine and fraud detection, in more detail.
Recommendation engines use the attributes of an item or a user or the behavior of a user or their peers, to make the predictions. There are different factors that drive an effective recommendation engine model. Some of these factors are list below:
Peer based
Customer behavior
Corporate deals or offers
Item clustering
Market/Store factors
Recommendation engine solutions are implemented by leveraging two algorithms, content-based filtering and collaborative filtering.
Content-based filtering: This is based on how similar a particular item is to other items based on usage and ratings. The model uses the content attributes of items (such as categories, tags, descriptions
and other data) to generate a matrix of each item to other items and calculates similarity based on the ratings provided. Then the most similar items are listed together with a similarity score. Items with the highest score are most similar.
Movie recommendation is a good example of this model. It recommends that "Users who liked a particular movie liked these other movies as well".
These models don’t take into account the overall behavior of other users, so they don't provide personalized recommendations compared to other models like collaborative filtering.
Collaborative Filtering: On the other hand, collaborative filtering model is based on making predictions to find a specific item or user based on similarity with other items or users. The filter applies
weights based on the "peer user" preferences. The assumption is users who display similar profile or behavior have similar preferences for items.
An example of this model is the recommendations on ecommerce websites like Amazon. When you search for an item on the website you would see something like "Customers Who Bought This Item Also Bought."
Items with the highest recommendation score are the most relevant to the user in context.
Collaborative filtering based solutions perform better compared to other models. Spark MLlib implements a collaborative filtering algorithm called Alternating
Least Squares (ALS). There are two variations of the entries in collaborative filtering, called explicit
and implicit feedback. Explicit feedback is based on the direct preferences given by the user to the item (like a movie). Explicit feedback is nice, but many times it's skewed because users who strongly like or dislike a product provide reviews on it.
We don't get the opinion of many people towards the center of the bell shaped curve of data points.
Implicit feedback examples are user's views, clicks, likes etc. Implicit feedback data is used a lot in the industry for predictive analytics because of the ease to gather this type of data.
There are also model based methods for
recommendation engines. These often incorporate methods from collaborative and content-based filtering. Model-based approach gets the best of both worlds, the power and performance of collaborative filtering and the flexibility and adaptability of content-based
filtering. Deep learning techniques are
good examples of this model.
You can also integrate other algorithms like K-Means into
the recommendation engine solution to get more refined predictions. K-Means algorithm works by partitioning "n" observations into "k" clusters in which each observation belongs to the cluster with the nearest mean. Using K-Means technique, we can find similar
items or users based on their attributes.
Figure 2 below shows different components of a recommendation engine, user factors, other factors like market data and different algorithms.
Figure 2. Components of a Recommendation Engine
Fraud detection is another important use case of using machine learning techniques because it addresses a critical problem in financial industry quickly and accurately. Financial services organizations only have few hundred milliseconds to determine if a particular
online transaction is legitimate or a fraud.
Neural network techniques are used for point-of-sale (POS) fraud detection use cases. Organizations like PayPal use
different types of machine learning algorithms
for risk management like linear, neural network, and deep learning.
Spark MLlib library provides several
algorithms for solving this use case, including linear SVMs, logistic regression, decision trees, and naive Bayes. In addition, ensemble models (which combine the predictions of a set of models) such as random forests or gradient-boosting trees are also
available.
When you are working on a machine learning project in your organization, as recommended inthis
article, you can follow the steps listed below to implement machine learning solutions in your own projects.
Select the programming language
Select the appropriate algorithm or algorithms
Select problem
Research the algorithms
Unit test all the functions in the ML solution
Now, let’s look at how Apache Spark framework implements the machine learning algorithms.
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality
reduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
Like we learned earlier, there are two different ways of using Spark Machine Learning API, Spark MLlib and Spark ML.
In addition to various algorithms Spark MLlib supports, there are also data processing functions and data analytics utilities
and tools available in this library.
Frequent itemset mining via FP-growth and association rules
Sequential pattern mining via PrefixSpan
Summary statistics and hypothesis testing
Feature transformations
Model evaluation and hyper-parameter tuning
Figure 3 below shows Apache Spark framework with Spark MLlib library.
Figure 3. Spark Ecosystem with Spark Machine Learning Library
Spark MLlib API is available in Scala, Java, and Python programming languages. Here are the links to API in each of these languages.
Spark MLlib Scala
API
Java API
Python API
We’ll develop a sample Machine Learning application using classification technique, specifically collaborative filtering method, to predict the movies to recommend to a user based on other users’ ratings on different movies.
Our recommendation engine solution will use Alternating Least Squares (ALS) machine learning algorithm.
Even though the data sets used in the code example in this article are not very large in size and complexity, Spark MLlib can be used for any type of real world problems that are complex in nature and deal with data containing several dimensions, and very complex
predictor functions. Remember machine learning can be used to solve problems that cannot be solved by numerical means alone.
To keep the machine learning application simple so we can focus on Spark MLlib API, we’ll follow the Movie Recommendations example discussed in Spark
Summit workshop. This exercise is a very good resource to learn more about Spark MLlib library. For more details on the application, visit their website.
The use case we want to implement Spark based machine learning solution is a recommendation engine.
Recommendation engines are used to make predictions for unknown user-item associations (e.g. movie ratings) based on known user-item associations. They can make predictions based on based on user’s affinity to other items and other users’ affinity to this specific
item. The engine builds a prediction model based on the known data (called Training Data) and then make predictions for unknown user-item associations (called Test Data).
Our program includes the following steps to arrive at the top movie recommendations for the user.
Load movies data file
Load the data file with ratings provided by a specific user (you)
Load the ratings data provided by other users (community)
Join the user ratings data with community ratings into a single RDD
Train the model using ALS algorithm using ratings data
Identify the movies not rated by a particular user (userId = 1)
Predict the ratings of the items not rated by user
Get top N recommendations (N=5 in our example)
Display the recommendation data on the console
If you want to process or run further analysis on output data, you can store the results in a NoSQL database like Cassandra or MongoDB.
We will use the movie datasets provided by MovieLens group. There are few different
data files we need for the sample application. These datasets are available for download from GroupLens
website. We will use one of the latest datasets (smaller version with 100K ratings). Download thedataset
zip file from the website.
Following table shows the different datasets used in the application.
User ratings file is for the user in context. You can update the ratings in this file based on your movie preferences. We’ll assign a user id called “User Id 0” to represent these ratings.
When we run the recommendation engine program, we’ll join the specific user ratings data with ratings from the community (other users).
We will use Java to write the Spark MLlib program which can be executed as a stand-alone application. The program uses the following MLlib Java classes (all are located in
We will use the following technologies in the sample application to illustrate how Spark MLlib API is used for performing predictive analytics.
Apache Spark 1.6.1
Spark Java API
JDK version 1.8.0_65
Apache Maven 3.3.3
Eclipse IDE
Architecture components of the sample application are shown in Figure 4 below.
Figure 4. Spark Machine Learning Sample Application Architecture
There are several implementations of movie recommendation example available in different languages supported by Spark, like Scala (Databricks and MapR),
Java (Spark
Examplesand Java
based Recommendation Engine), and Python.
We’ll use the Java solution in our sample application. Download the Java
program from Spark
examples website to run the example on your local machine. Create a new Maven based Java project called
in the data sets discussed in the previous section.
Make sure you include the required Spark Java libraries in the dependencies section of Maven
Set environment parameters for JDK (
For Windows Operating System:
If you are using a Linux or Mac OSX system, you can run the following commands:
If application build is successful, the packaged JAR file will be created in
We will use
Windows:
Linux/Mac:
We can monitor the Spark program status on the web console which is available at URL.
Let’s look at some of these visualization tools showing the Spark machine learning program statistics.
We can view the details of all jobs in the sample machine learning program. Click on “Jobs” tab on the web console screen to navigate the Spark Jobs web page that shows these job details.
Figure 5 below shows the status of the jobs from the sample program.
(Click on the image to enlarge it)
Figure 5. Spark jobs statistics for the machine learning program
Direct Acyclic Graph (DAG) shows the dependency graph of different RDD operations we ran in the program. Figure 6 below shows the screenshot of this visualization of Spark machine learning job.
(Click on the image to enlarge it)
Figure 6. DAG visualization graph of Spark machine learning program
Spark MLlib is one of the two machine learning libraries from Apache Spark. It's used for implementing predictive analytics solutions for business use cases like recommendation engine and fraud detection system. In this article, we learned about how to use Spark
MLlib to create a recommendation engine application to predict the movies that a user might like.
Make sure you perform sufficient testing to assess the effectiveness as well as performance of different machine learning techniques to find out the best solution for your requirements and use cases.
In the next article, we will focus on the other Machine Learning API from Spark, called Spark
ML, which is the recommended solution for big data projects developed on data pipelines.
Spark Machine Learning API includes two packages called
spark.mlliband
spark.ml.
The spark.mllib package contains the original
Spark machine learning API built on Resilient Distributed Datasets (RDDs).
It offers machine learning techniques which include correlation, classification and regression, collaborative filtering, clustering, and dimensionality reduction.
On the other hand, spark.ml package provides machine
learning API built on theDataFrames which
are becoming the core part of Spark SQL library. This package can be used for developing and managing the machine learning pipelines. It also provides Feature Extractors, Transformers, Selectors, and machine learning techniques like classification and regression,
and clustering.
We’ll focus on Spark MLlib package in this article and discuss various machine learning techniques this package brings to the table. In the next article, we’ll cover the Spark ML package and look at how to create and manage data pipelines.
Machine Learning & Data Science
Machine Learning is about learning from existing data to make predictions about the future. It's based on creating models from input data sets for data-driven decision making.Data science is
4000
the discipline of extracting the knowledge from large data sets (structured or unstructured) to provide insights to business teams and influence the business strategies and roadmaps. The role
of Data Scientist is more critical than ever in solving the problems that are not easy to solve using the traditional numerical methods.
There are different types of machine learning models like:
Supervised learning
Unsupervised learning
Semi-supervised Learning
Reinforcement learning
Let’s briefly look at each of these machine learning models and how they compare with each other.
Supervised learning: This technique is used to predict an outcome by training the program using an existing set of training data (labeled data). Then we use the program to predict the label for a new unlabeled data set.
There are two sub-models under supervised machine learning, Regression and Classification.
Unsupervised learning: This is used to find hidden patterns and correlations within the raw data. No training data used in this model, so this technique is based on unlabeled data.
Algorithms like k-means and Principle Component Analysis (PCA) fall into this category.
Semi-supervised Learning: This technique uses
both supervised and unsupervised learning models for predictive analytics. It uses labeled and unlabeled data sets for training. It typically involves using a small amount of labeled data with a large amount of unlabeled data. It can be used for machine learning
methods like classification and regression.
Reinforcement learning: The Reinforcement
Learning technique is used to learn how to maximize a numerical reward goal by trying different actions and discovering which actions result in the maximum reward.
Following table shows some examples where these machine learning models can be used to solve the problems.
ML Model | Examples |
Supervised learning | Fraud detection |
Unsupervised learning | Social network applications, language prediction |
Semi-supervised Learning | Image categorization, Voice recognition |
Reinforcement learning | Artificial Intelligence (AI) applications |
Machine Learning Algorithms
There are several algorithms to helpwith machine learning solutions. Let’s look at some of the algorithms supported by machine learning frameworks.
Naive Bayes: Naive Bayes is a supervised learning algorithm used for classification. It's based on applying Bayes theorem and a set of conditional independence assumptions.
k-means Clustering: k-means algorithm creates k groups from a set of objects so that the members of a group are more similar.
Support vector machines: Support vector machines (SVMs) is a supervised learning algorithm used to find the boundary that separates classes by as wide a margin as possible. Given a set of training examples,
each marked for belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other. Applications of SVM include bioinformatics, text, and image recognition.
Decision Trees: Decision trees are used in many types of machine learning problems including multi-class classification. MLlib supports both basic decision tree algorithm and ensembles of trees. Two ensemble
algorithms are available, Gradient-Boosted Trees and Random Forests.
As documented on this website,
here is a summary of all the machine learning styles, problems, and methods to solve them, shown in Table 2 below.
ML Model | Problems | Algorithms |
Supervised Learning | Classification, Regression, Anomaly Detection | Logistic Regression, Back Propagation Neural Network |
Unsupervised Learning | Clustering, Dimensionality reduction | k-Means , Apriori algorithm |
Semi-Supervised Learning | Classification, Regression | Self training, Semi-supervised Support Vector Machines (S3VMs) |
Steps in a Machine Learning Program
When working machine learning projects, other tasks like data preparation, cleansing and analysis are also very important tasks in addition to the actual learning models and algorithms used to solve the business problems.Following are the steps performed in a typical machine learning program.
Featurization
Training
Model Evaluation
Figure 1 below shows the process flow of a typical machine learning solution.
Figure 1. Machine Learning Process Flow
It's important to know that if the raw data isn't cleaned or prepared before running machine learning algorithms, the resulting patterns will not be accurate or useful as well as some anomalies may be missed.
The quality of training data we provide to machine learning programs also plays a critical role in the prediction results. If the training data is not random enough, resulting patterns won’t be accurate. And if the data is too small, the machine learning program
may give inaccurate predictions.
Use Cases
The business use cases for machine learning span different domains and scenarios including recommendation engines (foodrecommendation engine), predictive analytics (stock
price prediction or predicting
flights delay), targeted advertising, fraud detection, image and video recognition, self-driving cars and various other artificial intelligence.
Let’s look at two popular machine learning applications, recommendation engine and fraud detection, in more detail.
Recommendation Engines
Recommendation engines use the attributes of an item or a user or the behavior of a user or their peers, to make the predictions. There are different factors that drive an effective recommendation engine model. Some of these factors are list below:Peer based
Customer behavior
Corporate deals or offers
Item clustering
Market/Store factors
Recommendation engine solutions are implemented by leveraging two algorithms, content-based filtering and collaborative filtering.
Content-based filtering: This is based on how similar a particular item is to other items based on usage and ratings. The model uses the content attributes of items (such as categories, tags, descriptions
and other data) to generate a matrix of each item to other items and calculates similarity based on the ratings provided. Then the most similar items are listed together with a similarity score. Items with the highest score are most similar.
Movie recommendation is a good example of this model. It recommends that "Users who liked a particular movie liked these other movies as well".
These models don’t take into account the overall behavior of other users, so they don't provide personalized recommendations compared to other models like collaborative filtering.
Collaborative Filtering: On the other hand, collaborative filtering model is based on making predictions to find a specific item or user based on similarity with other items or users. The filter applies
weights based on the "peer user" preferences. The assumption is users who display similar profile or behavior have similar preferences for items.
An example of this model is the recommendations on ecommerce websites like Amazon. When you search for an item on the website you would see something like "Customers Who Bought This Item Also Bought."
Items with the highest recommendation score are the most relevant to the user in context.
Collaborative filtering based solutions perform better compared to other models. Spark MLlib implements a collaborative filtering algorithm called Alternating
Least Squares (ALS). There are two variations of the entries in collaborative filtering, called explicit
and implicit feedback. Explicit feedback is based on the direct preferences given by the user to the item (like a movie). Explicit feedback is nice, but many times it's skewed because users who strongly like or dislike a product provide reviews on it.
We don't get the opinion of many people towards the center of the bell shaped curve of data points.
Implicit feedback examples are user's views, clicks, likes etc. Implicit feedback data is used a lot in the industry for predictive analytics because of the ease to gather this type of data.
There are also model based methods for
recommendation engines. These often incorporate methods from collaborative and content-based filtering. Model-based approach gets the best of both worlds, the power and performance of collaborative filtering and the flexibility and adaptability of content-based
filtering. Deep learning techniques are
good examples of this model.
You can also integrate other algorithms like K-Means into
the recommendation engine solution to get more refined predictions. K-Means algorithm works by partitioning "n" observations into "k" clusters in which each observation belongs to the cluster with the nearest mean. Using K-Means technique, we can find similar
items or users based on their attributes.
Figure 2 below shows different components of a recommendation engine, user factors, other factors like market data and different algorithms.
Figure 2. Components of a Recommendation Engine
Fraud Detection
Fraud detection is another important use case of using machine learning techniques because it addresses a critical problem in financial industry quickly and accurately. Financial services organizations only have few hundred milliseconds to determine if a particularonline transaction is legitimate or a fraud.
Neural network techniques are used for point-of-sale (POS) fraud detection use cases. Organizations like PayPal use
different types of machine learning algorithms
for risk management like linear, neural network, and deep learning.
Spark MLlib library provides several
algorithms for solving this use case, including linear SVMs, logistic regression, decision trees, and naive Bayes. In addition, ensemble models (which combine the predictions of a set of models) such as random forests or gradient-boosting trees are also
available.
When you are working on a machine learning project in your organization, as recommended inthis
article, you can follow the steps listed below to implement machine learning solutions in your own projects.
Select the programming language
Select the appropriate algorithm or algorithms
Select problem
Research the algorithms
Unit test all the functions in the ML solution
Now, let’s look at how Apache Spark framework implements the machine learning algorithms.
Spark MLlib
MLlib is Spark’s machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionalityreduction, as well as lower-level optimization primitives and higher-level pipeline APIs.
Like we learned earlier, there are two different ways of using Spark Machine Learning API, Spark MLlib and Spark ML.
In addition to various algorithms Spark MLlib supports, there are also data processing functions and data analytics utilities
and tools available in this library.
Frequent itemset mining via FP-growth and association rules
Sequential pattern mining via PrefixSpan
Summary statistics and hypothesis testing
Feature transformations
Model evaluation and hyper-parameter tuning
Figure 3 below shows Apache Spark framework with Spark MLlib library.
Figure 3. Spark Ecosystem with Spark Machine Learning Library
Spark MLlib API is available in Scala, Java, and Python programming languages. Here are the links to API in each of these languages.
Spark MLlib Scala
API
Java API
Python API
Sample Application
We’ll develop a sample Machine Learning application using classification technique, specifically collaborative filtering method, to predict the movies to recommend to a user based on other users’ ratings on different movies.Our recommendation engine solution will use Alternating Least Squares (ALS) machine learning algorithm.
Even though the data sets used in the code example in this article are not very large in size and complexity, Spark MLlib can be used for any type of real world problems that are complex in nature and deal with data containing several dimensions, and very complex
predictor functions. Remember machine learning can be used to solve problems that cannot be solved by numerical means alone.
To keep the machine learning application simple so we can focus on Spark MLlib API, we’ll follow the Movie Recommendations example discussed in Spark
Summit workshop. This exercise is a very good resource to learn more about Spark MLlib library. For more details on the application, visit their website.
Use Case
The use case we want to implement Spark based machine learning solution is a recommendation engine.Recommendation engines are used to make predictions for unknown user-item associations (e.g. movie ratings) based on known user-item associations. They can make predictions based on based on user’s affinity to other items and other users’ affinity to this specific
item. The engine builds a prediction model based on the known data (called Training Data) and then make predictions for unknown user-item associations (called Test Data).
Our program includes the following steps to arrive at the top movie recommendations for the user.
Load movies data file
Load the data file with ratings provided by a specific user (you)
Load the ratings data provided by other users (community)
Join the user ratings data with community ratings into a single RDD
Train the model using ALS algorithm using ratings data
Identify the movies not rated by a particular user (userId = 1)
Predict the ratings of the items not rated by user
Get top N recommendations (N=5 in our example)
Display the recommendation data on the console
If you want to process or run further analysis on output data, you can store the results in a NoSQL database like Cassandra or MongoDB.
Data Sets
We will use the movie datasets provided by MovieLens group. There are few differentdata files we need for the sample application. These datasets are available for download from GroupLens
website. We will use one of the latest datasets (smaller version with 100K ratings). Download thedataset
zip file from the website.
Following table shows the different datasets used in the application.
# | Dataset | File Name | Description | Data Fields |
1 | Movies Data | movies.csv | Movie details. | movieId,title,genres |
2 | User Ratings Data | user-ratings.csv | Ratings by a specific user. | userId,movieId,rating,timestamp |
3 | Community Ratings Data | ratings.csv | Ratings by other users. | userId,movieId,rating,timestamp |
When we run the recommendation engine program, we’ll join the specific user ratings data with ratings from the community (other users).
Technologies
We will use Java to write the Spark MLlib program which can be executed as a stand-alone application. The program uses the following MLlib Java classes (all are located inorg.apache.spark.mllib.recommendationpackage):
ALS
MatrixFactorizationModel
Rating
We will use the following technologies in the sample application to illustrate how Spark MLlib API is used for performing predictive analytics.
Apache Spark 1.6.1
Spark Java API
JDK version 1.8.0_65
Apache Maven 3.3.3
Eclipse IDE
Architecture components of the sample application are shown in Figure 4 below.
Figure 4. Spark Machine Learning Sample Application Architecture
There are several implementations of movie recommendation example available in different languages supported by Spark, like Scala (Databricks and MapR),
Java (Spark
Examplesand Java
based Recommendation Engine), and Python.
We’ll use the Java solution in our sample application. Download the Java
program from Spark
examples website to run the example on your local machine. Create a new Maven based Java project called
spark-mllib-sample-appand copy the Java class into the project. Modify the Java class to pass
in the data sets discussed in the previous section.
Make sure you include the required Spark Java libraries in the dependencies section of Maven
pom.xmlfile. To do a clean build and download Spark library JAR files, you can run the following commands.
Set environment parameters for JDK (
JAVA_HOME), Maven (
MAVEN_HOME), and Spark (
SPARK_HOME)
For Windows Operating System:
set JAVA_HOME=[JDK_INSTALL_DIRECTORY] set PATH=%PATH%;%JAVA_HOME%\bin set MAVEN_HOME=[MAVEN_INSTALL_DIRECTORY] set PATH=%PATH%;%MAVEN_HOME%\bin set SPARK_HOME=[SPARK_INSTALL_DIRECTORY] set PATH=%PATH%;%SPARK_HOME%\bin cd c:\dev\projects\spark-mllib-sample-app mvn clean install mvn eclipse:clean eclipse:eclipse
If you are using a Linux or Mac OSX system, you can run the following commands:
export JAVA_HOME=[JDK_INSTALL_DIRECTORY] export PATH=$PATH:$JAVA_HOME/bin export MAVEN_HOME=[MAVEN_INSTALL_DIRECTORY] export PATH=$PATH:$MAVEN_HOME/bin export SPARK_HOME=[SPARK_INSTALL_DIRECTORY] export PATH=$PATH:$SPARK_HOME/bin cd /Users/USER_NAME/spark-mllib-sample-app mvn clean install mvn eclipse:clean eclipse:eclipse
If application build is successful, the packaged JAR file will be created in
targetdirectory.
We will use
spark-submitcommand to execute the Spark program. Here are the commands for running the program in Windows and Linux/Mac respectively.
Windows:
%SPARK_HOME%\bin\spark-submit --class "org.apache.spark.examples.mllib.JavaRecommendationExample" --master local[*] target\spark-mllib-sample-1.0.jar
Linux/Mac:
$SPARK_HOME/bin/spark-submit --class "org.apache.spark.examples.mllib.JavaRecommendationExample" --master local[*] target/spark-mllib-sample-1.0.jar
Monitoring of Spark MLlib Application
We can monitor the Spark program status on the web console which is available at URL.Let’s look at some of these visualization tools showing the Spark machine learning program statistics.
We can view the details of all jobs in the sample machine learning program. Click on “Jobs” tab on the web console screen to navigate the Spark Jobs web page that shows these job details.
Figure 5 below shows the status of the jobs from the sample program.
(Click on the image to enlarge it)
Figure 5. Spark jobs statistics for the machine learning program
Direct Acyclic Graph (DAG) shows the dependency graph of different RDD operations we ran in the program. Figure 6 below shows the screenshot of this visualization of Spark machine learning job.
(Click on the image to enlarge it)
Figure 6. DAG visualization graph of Spark machine learning program
Conclusions
Spark MLlib is one of the two machine learning libraries from Apache Spark. It's used for implementing predictive analytics solutions for business use cases like recommendation engine and fraud detection system. In this article, we learned about how to use SparkMLlib to create a recommendation engine application to predict the movies that a user might like.
Make sure you perform sufficient testing to assess the effectiveness as well as performance of different machine learning techniques to find out the best solution for your requirements and use cases.
What's Next
In the next article, we will focus on the other Machine Learning API from Spark, called SparkML, which is the recommended solution for big data projects developed on data pipelines.
相关文章推荐
- 用Apache Spark做大数据处理——第五部分:Spark机器学习数据流水线
- 用Apache Spark做大数据处理之Spark机器学习(5)
- Apache Spark进行大数据处理 -- 第二部分:Spark SQL
- 用Apache Spark进行大数据处理——第二部分:Spark SQL
- 用Apache Spark进行大数据处理——第三部分:Spark流
- 用Apache Spark进行大数据处理——第二部分:Spark SQL
- 用Apache Spark进行大数据处理 - 第六部分: 用Spark GraphX进行图数据分析
- 指南:优化Apache Spark作业(第2部分)
- Spark-sql执行部分语句是报错:WARN org.apache.spark.scheduler.TaskSchedulerImpl: Initial job has not accepted a
- 指南:优化Apache Spark作业(第1部分)
- 用Apache Spark进行大数据处理——第二部分:Spark SQL
- 用Apache Spark进行大数据处理-第三部分:Spark流
- Apache Spark 2.2.0 中文文档 - Spark Streaming 编程指南 | ApacheCN
- Apache Spark
- Apache Spark技术实战之1 -- KafkaWordCount
- 超声波传感器知识(第四部分):测量精度的影响
- cocos2d-x简单游戏<打飞机>代码实现|第四部分:主场景<Helloworld.m>
- 第四部分 电 子 电 路 仿 真 实 验
- javascript教程 - 第四部分 关于form对象
- Apache Spark学习:将Spark部署到Hadoop 2.2.0上