39 Machine Learning Libraries for Spark (English)
2016-03-10 17:49
Apache Spark itself
1. MLlib
AMPLab
Spark originally came out of Berkeley's AMPLab, and even today AMPLab projects, though not part of Apache Spark itself, enjoy a status a bit above that of your everyday GitHub project.
ML Base
Spark's own MLlib forms the bottom layer of the three-layer ML Base, with MLI being the middle layer and ML Optimizer being the most abstract layer.
2. MLI
3. ML Optimizer (aka Ghostface)
Ghostface was described in 2014 but never released. Of the 39 machine learning libraries, this is the only one that is vaporware, and it is included only due to its AMPLab and ML Base status.
Other than ML Base
4. Splash
A recent project from June 2015, this set of stochastic learning algorithms claims 25x to 75x faster performance than Spark MLlib on Stochastic Gradient Descent (SGD). Plus it's an AMPLab project that begins with the letters "sp", so it's worth watching.
5. Keystone ML
Brought machine learning pipelines to Spark, though pipelines have since matured in recent versions of Spark itself. Also promises some computer vision capability, but there are limitations I previously blogged about.
6. Velox
A server to manage a large collection of machine learning models.
7. CoCoA
Faster machine learning on Spark by optimizing communication patterns and shuffles, as described in the paper "Communication-Efficient Distributed Dual Coordinate Ascent".
Frameworks
GPU-based
8. DeepLearning4j
I previously blogged "DeepLearning4j Adds Spark GPU Support".
9. Elephas
Brand new and frankly why I started this list for this blog post. Provides an interface to Keras.
Non-GPU-based
10. DistML
Parameter server for model-parallel training, rather than data-parallel training (as in Spark's MLlib).
11. Aerosolve
From Airbnb, used in their automated pricing
12. Zen
Logistic regression, LDA, Factorization machines, Neural Network, Restricted Boltzmann Machines
13. Distributed Data Frame
Similar to Spark DataFrames, but agnostic to engine (i.e. will run on engines other than Spark in the future). Includes cross-validation and interfaces to external machine learning libraries.
Interfaces to other Machine Learning systems
14. spark-corenlp
Wraps Stanford CoreNLP.
15. Sparkit-learn
Interface to Python's Scikit-learn
16. Sparkling Water
Interface to H2O
17. hivemall-spark
Wraps Hivemall, machine learning in Hive
18. spark-pmml-exporter-validator
Export PMML, an industry standard XML format for transporting machine learning models.
Add-ons that enhance MLlib's existing algorithms
19. MLlib-dropout
Adds dropout capability to Spark MLlib, based on the paper "Dropout: A simple way to prevent neural networks from overfitting".
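To make the idea of dropout concrete, here is a minimal plain-Python sketch. It is not MLlib-dropout's API: the function name `dropout` and its parameters are hypothetical, and it uses the "inverted" variant (survivors are rescaled during training) rather than the paper's approach of scaling weights at test time.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout on a list of activations (illustrative helper).

    Each unit is zeroed with probability drop_prob; surviving units are
    divided by (1 - drop_prob) so the expected activation is unchanged.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # At prediction time the activations pass through untouched.
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


if __name__ == "__main__":
    rng = random.Random(42)  # seeded only to make the demo repeatable
    print(dropout([0.3, 1.2, 0.7, 2.0], drop_prob=0.5, rng=rng))
```

Passing the random generator in explicitly keeps the sketch reproducible, which is handy when comparing training runs.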
20. generalized-kmeans-clustering
Adds arbitrary distance functions to K-Means
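As a rough illustration of what a pluggable distance function means for K-Means, the sketch below runs Lloyd-style iterations where the assignment step takes any `distance(point, center)` callable. All names here are hypothetical and unrelated to the library's actual API, and the centroid update is assumed to be the coordinate-wise mean, which is the right update for squared Euclidean distance but only a heuristic for many other distances.

```python
def assign(points, centers, distance):
    """Return, for each point, the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: distance(p, centers[j]))
            for p in points]


def update(points, labels, centers):
    """Recompute each center as the coordinate-wise mean of its members."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centers.append(old)  # keep an empty cluster's center as-is
            continue
        new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(len(old))))
    return new_centers


def kmeans(points, centers, distance, max_iter=100):
    """Lloyd-style K-Means with a caller-supplied distance function."""
    centers = [tuple(c) for c in centers]
    labels = assign(points, centers, distance)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers, distance)
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return centers, labels


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(kmeans(pts, [(0, 0), (10, 10)], manhattan))
```

Swapping `manhattan` for `squared_euclidean` changes only the assignment step, which is the extension point the library description refers to.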
21. spark-ml-streaming
Visualize the Streaming Machine Learning algorithms built into Spark MLlib
Algorithms
Supervised learning
22. spark-libFM
Factorization Machines
23. ScalaNetwork
Recursive Neural Networks (RNNs)
24. dissolve-struct
SVM based on the performant Spark communication framework CoCoA listed above.
25. Sparkling Ferns
Based on the paper "Image Classification using Random Forests and Ferns"
26. streaming-matrix-factorization
Matrix Factorization Recommendation System
Unsupervised learning
27. PatchWork
40x faster clustering than Spark MLlib K-Means
28. Bisecting K-Means Clustering
K-Means that produces more uniformly-sized clusters, based on the paper "A Comparison of Document Clustering Techniques"
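The bisecting idea can be sketched in a few lines of plain Python: start with one cluster and repeatedly split a cluster in two with 2-means until there are k clusters. This is only an illustration under stated assumptions, not the listed package's code. The helper names are hypothetical, the rule "split the largest cluster" is one common selection choice, and the deterministic seeding of the 2-means step is a simplification.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def two_means(points, max_iter=20):
    """Split points into two groups with Lloyd's algorithm (k = 2).

    Seeds: the point farthest from the overall mean, and the point
    farthest from that first seed (an assumed, deterministic choice).
    """
    c = mean(points)
    first = max(points, key=lambda p: sq_dist(p, c))
    second = max(points, key=lambda p: sq_dist(p, first))
    centers = [first, second]
    left, right = list(points), []
    for _ in range(max_iter):
        left = [p for p in points
                if sq_dist(p, centers[0]) <= sq_dist(p, centers[1])]
        right = [p for p in points
                 if sq_dist(p, centers[0]) > sq_dist(p, centers[1])]
        if not left or not right:
            break  # degenerate split, e.g. all points identical
        new_centers = [mean(left), mean(right)]
        if new_centers == centers:
            break
        centers = new_centers
    return left, right


def bisecting_kmeans(points, k):
    """Return up to k clusters by repeatedly bisecting the largest one."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        for i, cluster in enumerate(clusters):
            if len(cluster) < 2:
                continue
            left, right = two_means(cluster)
            if left and right:
                clusters[i:i + 1] = [left, right]
                break
        else:
            break  # no cluster can be split any further
    return clusters


if __name__ == "__main__":
    data = [(0.0,), (1.0,), (10.0,), (11.0,)]
    print(bisecting_kmeans(data, 2))
```

Because the largest cluster is always the next one to be split, cluster sizes tend to stay closer together than with plain K-Means, which matches the "more uniformly-sized clusters" claim above.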
29. spark-knn-graphs
Build graphs using k-nearest-neighbors and locality sensitive hashing (LSH)
30. TopicModeling
Online Latent Dirichlet Allocation (LDA), Gibbs Sampling LDA, Online Hierarchical Dirichlet Process (HDP)
Algorithm building blocks
31. sparkboost
AdaBoost and MP-Boost
32. spark-tfocs
Port to Spark of TFOCS: Templates for First-Order Conic Solvers. If your machine learning cost function happens to be convex, then TFOCS can solve it.
33. lazy-linalg
Linear algebra operators to work with Spark MLlib's linalg package
Feature extractors
34. spark-infotheoretic-feature-selection
Information-theoretic basis for feature selection, based on the paper "Conditional likelihood maximisation: a unifying framework for information theoretic feature selection"
35. spark-MDLP-discretization
Given labeled data, "discretize" one of the continuous numeric dimensions such that each bin is relatively homogeneous in terms of data classes. This is a foundational idea behind the CART and ID3 algorithms for generating decision trees. Based on the paper "Multi-interval discretization of continuous-valued attributes for classification learning".
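The core step of this kind of discretization can be sketched as a search for the cut point that makes the two resulting bins as class-pure as possible, measured by weighted class entropy. The sketch below is a simplified illustration: the function names are hypothetical, it finds only a single cut, and it omits the MDL-based stopping criterion that decides whether to keep splitting recursively.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Find the cut point minimizing the weighted entropy of the two bins.

    Returns (cut_point, weighted_entropy). Candidate cuts are midpoints
    between consecutive distinct sorted values. If every value is
    identical there is no candidate and the result is (None, inf).
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_point, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_point = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_point, best_score


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    ys = ["a", "a", "a", "b", "b", "b"]
    print(best_cut(xs, ys))
```

Applying `best_cut` recursively to each resulting bin, with a rule for when to stop, is what turns one cut into a multi-interval discretization.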
36. spark-tsne
Distributed t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction.
37. modelmatrix
Sparse feature vectors
Domain-specific
38. Spatial and time-series data
K-Means, Regression, and Statistics
39. Twitter data