
39 Machine Learning Libraries for Spark

2016-03-10 17:49


Apache Spark itself 

1. MLlib


AMPLab 

Spark originally came out of Berkeley's AMPLab, and even today AMPLab projects, even though they are not part of the Apache Software Foundation, enjoy a status a bit above your everyday GitHub project.


ML Base

Spark's own MLlib forms the bottom layer of the three-layer ML Base, with MLI being the middle layer and ML Optimizer being the most abstract layer.

2. MLI

3. ML Optimizer (aka Ghostface)

Ghostface was described in 2014 but never released. Of the 39 machine learning libraries, this is the only one that is vaporware, and it is included only due to its AMPLab and ML Base status.



Other than ML Base 

4. Splash

A recent project from June, 2015, this set of stochastic learning algorithms claims 25x - 75x faster performance than Spark MLlib on Stochastic Gradient Descent (SGD). Plus it's an AMPLab project that begins with the letters "sp", so it's worth watching. 

5. Keystone ML

Brought machine learning pipelines to Spark, but pipelines have matured in recent versions of Spark. Also promises some computer vision capability, but there are limitations I previously blogged about.

6. Velox

A server to manage a large collection of machine learning models. 

7. CoCoA

Faster machine learning on Spark by optimizing communication patterns and shuffles, as described in the paper "Communication-Efficient Distributed Dual Coordinate Ascent".


Frameworks 



GPU-based 

8. DeepLearning4j

I previously blogged about this in "DeepLearning4j Adds Spark GPU Support".

9. Elephas

Brand new, and frankly the reason I started this list for this blog post. Provides an interface to Keras.



Non-GPU-based 

10. DistML

Parameter server for model-parallel training, rather than the data-parallel training that Spark's MLlib does.

11. Aerosolve

From Airbnb, used in their automated pricing 

12. Zen

Logistic regression, LDA, Factorization machines, Neural Network, Restricted Boltzmann Machines 

13. Distributed Data Frame

Similar to Spark DataFrames, but agnostic to engine (i.e. will run on engines other than Spark in the future). Includes cross-validation and interfaces to external machine learning libraries. 


Interfaces to other Machine Learning systems 

14. spark-corenlp

Wraps Stanford CoreNLP

15. Sparkit-learn

Interface to Python's Scikit-learn

16. Sparkling Water

Interface to H2O

17. hivemall-spark

Wraps Hivemall, machine learning in Hive 

18. spark-pmml-exporter-validator

Export PMML, an industry standard XML format for transporting machine learning models. 


Add-ons that enhance MLlib's existing algorithms 

19. MLlib-dropout

Adds dropout capability to Spark MLlib, based on the paper "Dropout: A Simple Way to Prevent Neural Networks from Overfitting".

20. generalized-kmeans-clustering

Adds arbitrary distance functions to K-Means 
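
To make the idea of a pluggable distance concrete, here is a minimal single-machine sketch in plain Python. It is not the library's API; the function names (`assign`, `update_means`, `kmeans`) are made up for illustration. It also keeps the ordinary mean as the center update, which is only guaranteed to reduce the objective for some distances (squared Euclidean, for example), so treat it as a simplification.

```python
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
Distance = Callable[[Point, Point], float]


def squared_euclidean(a: Point, b: Point) -> float:
    """The distance that standard K-Means uses."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a: Point, b: Point) -> float:
    """An alternative distance, passed in exactly the same way."""
    return sum(abs(x - y) for x, y in zip(a, b))


def assign(points: List[Point], centers: List[Point], distance: Distance) -> List[int]:
    """Return, for each point, the index of its nearest center under `distance`."""
    return [
        min(range(len(centers)), key=lambda i: distance(p, centers[i]))
        for p in points
    ]


def update_means(points: List[Point], labels: List[int], centers: List[Point]) -> List[List[float]]:
    """Recompute each center as the coordinate-wise mean of its members.

    Assumption: the mean is kept for every distance; an empty cluster
    keeps its previous center.
    """
    new_centers = []
    for i, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            new_centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centers.append(list(old))
    return new_centers


def kmeans(points: List[Point], centers: List[Point], distance: Distance,
           iterations: int = 10) -> Tuple[List[List[float]], List[int]]:
    """Run a fixed number of assign/update rounds with the given distance."""
    for _ in range(iterations):
        labels = assign(points, centers, distance)
        centers = update_means(points, labels, centers)
    return centers, assign(points, centers, distance)
```

The only thing that changes between variants is the `distance` argument, for example `kmeans(points, centers, manhattan)`. The choice can change which cluster a point lands in: the point `[0, 0]` is closer to `[3, 3]` than to `[5, 0]` under squared Euclidean distance, but the reverse is true under Manhattan distance.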

21. spark-ml-streaming

Visualize the Streaming Machine Learning algorithms built into Spark MLlib 


Algorithms 



Supervised learning 

22. spark-libFM

Factorization Machines 

23. ScalaNetwork

Recursive Neural Networks (RNNs) 

24. dissolve-struct

SVM based on the performant Spark communication framework CoCoA listed above. 

25. Sparkling Ferns

Based on "Image Classification using Random Forests and Ferns".

26. streaming-matrix-factorization

Matrix Factorization Recommendation System 



Unsupervised learning 

27. PatchWork

40x faster clustering than Spark MLlib K-Means 

28. Bisecting K-Means Clustering

K-Means that produces more uniformly-sized clusters, based on "A Comparison of Document Clustering Techniques".

29. spark-knn-graphs

Build graphs using k-nearest-neighbors and locality sensitive hashing (LSH) 
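
As a rough illustration of how LSH can cut down the work of building a k-nearest-neighbor graph, here is a small single-machine sketch in plain Python. It assumes random-hyperplane (sign) hashing, which may not be the scheme spark-knn-graphs actually uses, and the function names are hypothetical. Points are bucketed by hash signature, and exact neighbors are searched only within each bucket, so the result is approximate.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def signature(point: Point, hyperplanes: List[Point]) -> Tuple[int, ...]:
    """One bit per hyperplane: which side of the plane the point falls on."""
    return tuple(
        1 if sum(h * p for h, p in zip(plane, point)) >= 0 else 0
        for plane in hyperplanes
    )


def knn_graph(points: List[Point], hyperplanes: List[Point], k: int) -> Dict[int, List[int]]:
    """Approximate k-NN graph: neighbors are searched only inside each LSH bucket."""
    buckets: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for idx, p in enumerate(points):
        buckets[signature(p, hyperplanes)].append(idx)

    graph: Dict[int, List[int]] = {}
    for members in buckets.values():
        for i in members:
            # Exact Euclidean ranking, but only against points in the same bucket.
            others = sorted(
                (j for j in members if j != i),
                key=lambda j: math.dist(points[i], points[j]),
            )
            graph[i] = others[:k]
    return graph
```

In practice the hyperplanes would be drawn at random and several hash tables would be combined so that true neighbors separated by one unlucky plane are not missed; both are left out here to keep the sketch short.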

30. TopicModeling

Online Latent Dirichlet Allocation (LDA), Gibbs Sampling LDA, Online Hierarchical Dirichlet Process (HDP) 



Algorithm building blocks 

31. sparkboost

Adaboost and MP-Boost 

32. spark-tfocs

Port to Spark of TFOCS: Templates for First-Order Conic Solvers. If your machine learning cost function happens to be convex, then TFOCS can solve it.

33. lazy-linalg

Linear algebra operators to work with Spark MLlib's linalg package 



Feature extractors 

34. spark-infotheoretic-feature-selection

Information-theoretic basis for feature selection, based on "Conditional likelihood maximisation: a unifying framework for information theoretic feature selection".

35. spark-MDLP-discretization

Given labeled data, "discretize" one of the continuous numeric dimensions such that each bin is relatively homogeneous in terms of data classes. This is a foundational idea behind the CART and ID3 algorithms for generating decision trees. Based on "Multi-interval discretization of continuous-valued attributes for classification learning".
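
The core step, picking a cut point so that the two resulting bins are as class-pure as possible, can be sketched in a few lines of plain Python. This is only the single-cut search by weighted class entropy; the recursive splitting and the MDL-based stopping rule that give MDLP its name are left out, and the function names are made up for illustration rather than taken from the library.

```python
import math
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the class labels in one bin."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values: Sequence[float], labels: Sequence[Hashable]) -> Tuple[Optional[float], float]:
    """Find the threshold that minimizes the size-weighted entropy of the two bins.

    Returns (cut_point, weighted_entropy); cut_point is None if all values are equal.
    """
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    n = len(pairs)
    best_entropy = float("inf")
    best_point: Optional[float] = None
    for i in range(1, n):
        # A cut is only possible between two distinct values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left: List[Hashable] = [label for _, label in pairs[:i]]
        right: List[Hashable] = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if weighted < best_entropy:
            best_entropy = weighted
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_point, best_entropy
```

For example, `best_cut([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])` places the cut at 6.5, midway between 3 and 10, where both bins contain a single class and the weighted entropy is zero.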

36. spark-tsne

Distributed t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction.

37. modelmatrix

Sparse feature vectors 


Domain-specific 

38. Spatial and time-series data

K-Means, Regression, and Statistics 

39. Twitter data