
A Brief History of Machine Learning


Original article: Brief History of Machine Learning



My subjective ML timeline


Since the early days of science, technology and AI, scientists following Blaise Pascal and Von Leibniz have pondered a machine that is as intellectually capable as humans, and famous writers like Jules Verne dreamed of such artificial beings in their fiction.



Pascal’s machine performing subtraction and summation - 1642


Machine Learning is one of the important branches of AI and a very hot subject in both research and industry. Companies and universities devote many resources to advancing this knowledge. Recent advances in the field have propelled very solid results on different tasks, comparable to human performance (98.98% on Traffic Signs - higher than humans).

Here I would like to share a crude timeline of Machine Learning and flag some of the milestones, by no means complete. In addition, you should add "up to my knowledge" to the beginning of any argument in the text.

The first step toward prevalent ML was proposed by Hebb in 1949, based on a neuropsychological learning formulation. It is called Hebbian Learning theory. Explained simply, it pursues correlations between the nodes of a Recurrent Neural Network (RNN). It memorizes commonalities on the network and serves as a memory later. Formally, the argument states that:

Let us assume that the persistence or repetition of a reverberatory activity (or “trace”) tends to induce lasting cellular changes that add to its stability.… When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.[1]

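To make the Hebbian idea concrete, here is a minimal sketch (not Hebb's own formulation, just the common "fire together, wire together" reading of it): weights between units grow in proportion to how often those units are active together. The learning rate and the random binary activity patterns are invented for illustration.

```python
import numpy as np

# Minimal sketch of the Hebbian idea: a weight grows in proportion to how
# often the two units it connects are active at the same time.
# (Illustrative only; learning rate and data are assumptions.)

rng = np.random.default_rng(0)
eta = 0.1                                       # assumed learning rate
patterns = rng.integers(0, 2, size=(20, 5))     # 20 binary activity patterns over 5 units

W = np.zeros((5, 5))
for x in patterns:
    W += eta * np.outer(x, x)                   # strengthen co-active connections
np.fill_diagonal(W, 0)                          # ignore self-connections

print(W)                                        # large entries mark frequently co-active unit pairs
```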



Arthur Samuel

In 1952, Arthur Samuel at IBM developed a program that played Checkers. The program was able to observe positions and learn an implicit model that gives better moves for later cases. Samuel played many games against the program and observed that it was able to play better over time.

With that program Samuel refuted the general contention that machines cannot go beyond their written code and learn patterns like human beings. He coined the term "machine learning," which he defined as: a field of study that gives computers the ability to learn without being explicitly programmed.



F. Rosenblatt

In 1957, Rosenblatt's Perceptron was the second model to be proposed, again with a neuroscientific background, and it is more similar to today's ML models. It was a very exciting discovery at the time and was practically more applicable than Hebb's idea. Rosenblatt introduced the Perceptron with the following lines:

The perceptron is designed to illustrate some of the fundamental properties of intelligent systems in general, without becoming too deeply enmeshed in the special, and frequently unknown, conditions which hold for particular biological organisms.[2]


Three years later, Widrow [4] engraved the Delta Learning rule, which was then used as the practical procedure for Perceptron training. It is also known as the Least Squares problem. The combination of those two ideas creates a good linear classifier. However, the excitement around the Perceptron was checked by Minsky [3] in 1969. He proposed the famous XOR problem and the inability of Perceptrons on such linearly inseparable data distributions. It was Minsky's tackle against the NN community. Thereafter, NN research would lie dormant until the 1980s.
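As a rough sketch of what a Perceptron trained with the delta rule looks like in practice, the toy loop below learns the linearly separable AND function; the learning rate, epoch count, and data are arbitrary choices for this example. Running the same loop on XOR targets never converges, which is exactly Minsky's point.

```python
import numpy as np

# Toy perceptron trained with the delta rule on the linearly separable AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1], dtype=float)

w, b, eta = np.zeros(2), 0.0, 0.1

for _ in range(50):                               # a few epochs are enough for separable data
    for x, t in zip(X, y_and):
        out = 1.0 if w @ x + b > 0 else 0.0       # threshold unit
        err = t - out                             # delta rule: move weights along the error
        w += eta * err * x
        b += eta * err

print([1.0 if w @ x + b > 0 else 0.0 for x in X])  # AND is learned; XOR targets would never converge
```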



The XOR problem: a data configuration that is not linearly separable.

There was not much further progress until the intuition of the Multi-Layer Perceptron (MLP) was suggested by Werbos [6] in 1981 with the NN-specific Backpropagation (BP) algorithm, albeit the BP idea had been proposed before by Linnainmaa [5] in 1970 under the name "reverse mode of automatic differentiation." BP is still the key ingredient of today's NN architectures. With those new ideas, NN research accelerated again. In 1985-1986, NN researchers successively presented the idea of MLPs with practical BP training (Rumelhart, Hinton, Williams [7]; Hecht-Nielsen [8]).



From Hecht-Nielsen [8]
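A toy illustration of why the MLP plus BP combination revived the field: the sketch below, with one hidden layer of sigmoid units trained by plain backpropagation on a squared-error loss, learns the XOR mapping that a single-layer Perceptron cannot. The layer sizes, learning rate, and iteration count are arbitrary assumptions for this example.

```python
import numpy as np

# Tiny MLP (2 -> 4 -> 1, sigmoid units) trained by backpropagation on XOR.
rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))
eta = 1.0

for _ in range(5000):
    h = sigmoid(X @ W1 + b1)              # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)   # backward pass, squared-error loss
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= eta * h.T @ d_out; b2 -= eta * d_out.sum(0)
    W1 -= eta * X.T @ d_h;   b1 -= eta * d_h.sum(0)

print(out.round(3).ravel())               # typically close to [0, 1, 1, 0]
```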

At the other end of the spectrum, a very well-known ML algorithm was proposed by J. R. Quinlan [9] in 1986 that we call Decision Trees, more specifically the ID3 algorithm. This was the spark point of another mainstream branch of ML. Moreover, ID3 was also released as software able to find more real-life use cases, thanks to its simple rules and clear inference, in contrast to the still black-box NN models.

After ID3, many different alternatives or improvements have been explored by the community (e.g. ID4, Regression Trees, CART ...), and it is still one of the active topics in ML.
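The heart of ID3 is choosing, at each node, the attribute with the highest information gain (entropy reduction). A minimal sketch of that criterion, on a tiny invented dataset, might look like this:

```python
import math
from collections import Counter

# Information gain, the splitting criterion at the heart of ID3.
def entropy(labels):
    counts, total = Counter(labels), len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute_index):
    """Entropy reduction obtained by splitting on one attribute."""
    splits = {}
    for row, label in zip(rows, labels):
        splits.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in splits.values())
    return entropy(labels) - remainder

rows = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))  # outlook separates the classes perfectly: gain 1.0
print(information_gain(rows, labels, 1))  # temperature is a weaker split: gain 0.5
```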



From Quinlan [9]

One of the most important ML breakthroughs was Support Vector Machines (Networks) (SVM), proposed by Vapnik and Cortes [10] in 1995, with very strong theoretical standing and empirical results.

That was the time when the ML community split into two crowds, NN and SVM advocates. However, the competition between the two communities was not easy for the NN side after the kernelized version of SVM appeared near the 2000s (I was not able to find the first paper on the topic): SVM got the best of many tasks that had been occupied by NN models before. In addition, SVM was able to exploit all the profound knowledge of convex optimization, generalization margin theory, and kernels against NN models. Therefore, it found a large push from different disciplines, causing very rapid theoretical and practical improvements.



From Vapnik and Cortes [10]
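As a rough illustration of the kernel trick that gave SVM the upper hand on many tasks, the sketch below (assuming scikit-learn is available; the data is synthetic) compares a linear SVM with an RBF-kernel SVM on data that is not linearly separable:

```python
# Linear vs. kernelized SVM on data that is not linearly separable (scikit-learn assumed).
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)  # concentric circles
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_tr, y_tr)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_tr, y_tr)

print("linear kernel:", linear_svm.score(X_te, y_te))  # struggles on the circular data
print("RBF kernel:   ", rbf_svm.score(X_te, y_te))     # the kernel trick handles it
```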

NN took another blow from the work of Hochreiter's thesis [40] in 1991 and Hochreiter et al. [11] in 2001, showing the gradient loss after the saturation of NN units as we apply BP learning. Simply put, it is redundant to train NN units after a certain number of epochs owing to saturated units; hence NNs are very inclined to over-fit within a small number of epochs.
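A quick numerical illustration of the saturation argument (a simplification of the cited analyses, not their derivation): once a sigmoid unit's pre-activation grows in magnitude, its derivative, and therefore the gradient passed back through it, shrinks toward zero.

```python
import numpy as np

# Saturation of a sigmoid unit: large pre-activations leave almost no gradient to backpropagate.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(z)
    print(f"pre-activation {z:5.1f} -> activation {s:.4f}, gradient factor {s * (1 - s):.6f}")
```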

A little before, another solid ML model was proposed by Freund and Schapire in 1997, prescribed as a boosted ensemble of weak classifiers called AdaBoost. This work also earned its authors the Gödel Prize. AdaBoost trains a set of weak classifiers that are easy to train, by giving more importance to hard instances. This model is still the basis of many different tasks like face recognition and detection. It is also a realization of PAC (Probably Approximately Correct) learning theory. In general, the so-called weak classifiers are chosen as simple decision stumps (single decision tree nodes). They introduced AdaBoost as:

The model we study can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting…[11]
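A hedged sketch of boosting in practice, using scikit-learn's AdaBoostClassifier, whose default weak learner is a depth-1 decision stump; the synthetic dataset and parameter values are illustrative only, not taken from the paper.

```python
# Boosting a set of weak decision stumps on synthetic data (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

booster = AdaBoostClassifier(n_estimators=100, random_state=0)  # default weak learner: depth-1 stump
booster.fit(X_tr, y_tr)                                         # hard instances get reweighted each round
print(booster.score(X_te, y_te))
```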

Another ensemble model was explored by Breiman [12] in 2001, which ensembles multiple decision trees, where each tree is curated from a random subset of instances and each node split is selected from a random subset of features. Owing to its nature, it is called Random Forests (RF). RF also has theoretical and empirical proofs of endurance against over-fitting. Even where AdaBoost shows weakness to over-fitting and outlier instances in the data, RF is a more robust model against these caveats. (For more detail about RF, refer to my old post.) RF shows its success in many different tasks, such as Kaggle competitions, as well.

Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. [12]
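A minimal Random Forest sketch (scikit-learn assumed, synthetic data): each tree is grown on a bootstrap sample of the rows and considers only a random subset of features at every split, which is what gives the ensemble its robustness.

```python
# Random forest on synthetic data (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Bootstrap rows per tree, random feature subset per split.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```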

As we come closer to today, a new era of NN called Deep Learning has commenced. This phrase simply refers to NN models with many wide successive layers. The third rise of NN began roughly in 2005 with the conjunction of many different discoveries from past and present by recent mavens Hinton, LeCun, Bengio, Andrew Ng, and other valuable older researchers. I have listed some of the important headings (I guess I will dedicate a complete post to Deep Learning specifically); a toy sketch combining a few of these ingredients follows the list:

GPU programming

Convolutional NNs [18][20][40]

Deconvolutional Networks [21]

Optimization algorithms

Stochastic Gradient Descent [19][22]

BFGS and L-BFGS [23]

Conjugate Gradient Descent [24]

Backpropagation [40][19]

Rectifier Units

Sparsity [15][16]

Dropout Nets [26]

Maxout Nets [25]

Unsupervised NN models [14]

Deep Belief Networks [13]

Stacked Auto-Encoders [16][39]

Denoising NN models [17]
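The toy sketch below ties a few of the listed ingredients together, rectifier units, dropout, and SGD-driven backpropagation, using PyTorch on random stand-in data; the architecture and hyperparameters are arbitrary assumptions, not a recipe from any of the cited papers.

```python
# A toy deep net combining ReLU units, dropout, and SGD-trained backpropagation (PyTorch assumed).
import torch
from torch import nn, optim

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5),   # rectifier units + dropout
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 2),
)
optimizer = optim.SGD(model.parameters(), lr=0.1)       # stochastic gradient descent
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(256, 20)                                # random stand-in data
y = torch.randint(0, 2, (256,))

for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()                                     # backpropagation
    optimizer.step()
print(float(loss))
```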

With the combination of all those ideas and the ones not listed, NN models are able to beat the state of the art on very different tasks such as Object Recognition, Speech Recognition, NLP, etc. However, it should be noted that this absolutely does not mean it is the end of other ML streams. Even as Deep Learning success stories grow rapidly, there are many criticisms directed at the training cost and the tuning of exogenous parameters of these models. Moreover, SVM is still used more commonly owing to its simplicity. (So said, but it may cause a huge debate :) )

Before I finish, I need to touch on one other relatively young ML trend. After the growth of the WWW and Social Media, a new term, BigData, emerged and affected ML research wildly. Because of the large problems arising from BigData, many strong ML algorithms are useless for reasonable systems (not for the giant Tech Companies, of course). Hence, researchers came up with a new set of simple models, dubbed Bandit Algorithms [27 - 38] (formally predicated on Online Learning), that make learning easier and adaptable for large-scale problems.
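As a flavor of how simple these bandit-style learners can be, here is an epsilon-greedy sketch (the arm payout probabilities and epsilon value are invented for illustration): it explores a random arm occasionally, otherwise exploits the best running estimate, and updates incrementally so it scales to streams of data.

```python
import random

# Epsilon-greedy bandit: explore with probability epsilon, otherwise exploit the best estimate.
true_probs = [0.2, 0.5, 0.7]                 # unknown to the learner
counts = [0] * len(true_probs)
values = [0.0] * len(true_probs)             # running mean reward per arm
epsilon = 0.1

for t in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(len(true_probs))                      # explore
    else:
        arm = max(range(len(true_probs)), key=lambda a: values[a])   # exploit
    reward = 1.0 if random.random() < true_probs[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]              # incremental mean update

print(values)   # estimates approach the true payout probabilities, mostly pulling the best arm
```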

I would like to conclude this infant sheet of ML history here. If you found something wrong (you should :) ), insufficient, or unreferenced, please don't hesitate to warn me in any manner.

References

[1] Hebb, D. O. The Organization of Behavior. New York: Wiley & Sons, 1949.

[2] Rosenblatt, Frank. “The perceptron: a probabilistic model for information storage and organization in the brain.” Psychological review 65.6 (1958): 386.

[3] Minsky, Marvin, and Papert Seymour. “Perceptrons.” (1969).

[4]Widrow, Hoff “Adaptive switching circuits.” (1960): 96-104.

[5] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's thesis, Univ. Helsinki, 1970.

[6] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In Proceedings of the 10th IFIP Conference, 31.8 - 4.9, NYC, pages 762–770, 1981.

[7] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. Learning internal representations by error propagation. No. ICS-8506. CALIFORNIA UNIV SAN DIEGO LA JOLLA INST FOR COGNITIVE SCIENCE, 1985.

[8] Hecht-Nielsen, Robert. “Theory of the backpropagation neural network.” Neural Networks, 1989. IJCNN., International Joint Conference on. IEEE, 1989.

[9] Quinlan, J. Ross. “Induction of decision trees.” Machine learning 1.1 (1986): 81-106.

[10] Cortes, Corinna, and Vladimir Vapnik. “Support-vector networks.” Machine learning 20.3 (1995): 273-297.

[11] Freund, Yoav, Robert Schapire, and N. Abe. “A short introduction to boosting.” Journal-Japanese Society For Artificial Intelligence 14.771-780 (1999): 1612.

[12] Breiman, Leo. “Random forests.” Machine learning 45.1 (2001): 5-32.

[13] Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. “A fast learning algorithm for deep belief nets.” Neural computation 18.7 (2006): 1527-1554.

[14] Bengio, Lamblin, Popovici, Larochelle, "Greedy Layer-Wise Training of Deep Networks", NIPS'2006.

[15] Ranzato, Poultney, Chopra, LeCun ” Efficient Learning of Sparse Representations with an Energy-Based Model “, NIPS’2006

[16] Olshausen, B. A., and Field, D. J. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Res. 1997;37(23):3311–25. Available at: http://www.ncbi.nlm.nih.gov/pubmed/9425546.

[17] Vincent, H. Larochelle Y. Bengio and P.A. Manzagol, Extracting and Composing Robust Features with Denoising Autoencoders , Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML‘08), pages 1096 - 1103, ACM, 2008.

[18] Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.

[19] LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324.

[20] LeCun, Yann, and Yoshua Bengio. “Convolutional networks for images, speech, and time series.” The handbook of brain theory and neural networks3361 (1995).

[21] Zeiler, Matthew D., et al. “Deconvolutional networks.” Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010.

[22] S. Vishwanathan, N. Schraudolph, M. Schmidt, and K. Mur- phy. Accelerated training of conditional random fields with stochastic meta-descent. In International Conference on Ma- chine Learning (ICML ’06), 2006.

[23] Nocedal, J. (1980). "Updating Quasi-Newton Matrices with Limited Storage." Mathematics of Computation 35 (151): 773–782. doi:10.1090/S0025-5718-1980-0572855-

[24] S. Yun and K.-C. Toh, “A coordinate gradient descent method for l1- regularized convex minimization,” Computational Optimizations and Applications, vol. 48, no. 2, pp. 273–307, 2011.

[25] Goodfellow I, Warde-Farley D. Maxout networks. arXiv Prepr arXiv …. 2013. Available at: http://arxiv.org/abs/1302.4389. Accessed March 20, 2014.

[26] Wan L, Zeiler M. Regularization of neural networks using dropconnect. Proc …. 2013;(1). Available at: http://machinelearning.wustl.edu/mlpapers/papers/icml2013_wan13. Accessed March 13, 2014.

[27] Alekh Agarwal , Olivier Chapelle , Miroslav Dudik , John Langford , A Reliable Effective Terascale Linear Learning System , 2011

[28] M. Hoffman , D. Blei , F. Bach , Online Learning for Latent Dirichlet Allocation , in Neural Information Processing Systems (NIPS) 2010.

[29] Alina Beygelzimer , Daniel Hsu , John Langford , and Tong Zhang Agnostic Active Learning Without Constraints NIPS 2010.

[30] John Duchi , Elad Hazan , and Yoram Singer , Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , JMLR 2011 & COLT 2010.

[31] H. Brendan McMahan , Matthew Streeter , Adaptive Bound Optimization for Online Convex Optimization , COLT 2010.

[32] Nikos Karampatziakis and John Langford , Importance Weight Aware Gradient Updates UAI 2010.

[33] Kilian Weinberger , Anirban Dasgupta , John Langford , Alex Smola , Josh Attenberg , Feature Hashing for Large Scale Multitask Learning , ICML 2009.

[34] Qinfeng Shi , James Petterson , Gideon Dror , John Langford , Alex Smola , and SVN Vishwanathan , Hash Kernels for Structured Data , AISTAT 2009.

[35] John Langford , Lihong Li , and Tong Zhang , Sparse Online Learning via Truncated Gradient , NIPS 2008.

[36] Leon Bottou , Stochastic Gradient Descent , 2007.

[37] Avrim Blum , Adam Kalai , and John Langford Beating the Holdout: Bounds for KFold and Progressive Cross-Validation . COLT99 pages 203-208.

[38] Nocedal, J. (1980). “Updating Quasi-Newton Matrices with Limited Storage”. Mathematics of Computation 35: 773–782.

[39] D. H. Ballard. Modular learning in neural networks. In AAAI, pages 279–284, 1987.

[40] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991. Advisor: J. Schmidhuber.