
Paper Notes: Deep Learning [Nature review by LeCun, Bengio & Hinton]
Machine-learning technology now plays an ever larger role in our lives, from search engines to recommender systems, from image recognition to speech recognition. Increasingly, these applications are built on a class of techniques called deep learning.
Conventional machine-learning algorithms are limited in their ability to process natural data in its raw form, such as a raw RGB image. Building a traditional machine-learning system therefore usually requires experienced engineers to design a feature extractor that transforms the raw data into a feature representation the machine can work with.
Representation learning is a class of algorithms that lets the machine automatically discover useful features from the raw input data. Deep learning is exactly such a class of algorithms.
Here is the formal definition of deep learning given by LeCun et al.:
"Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level."
This definition points to three core elements of deep learning:

- A kind of representation-learning method. The essence of deep learning is that the features at each layer are not designed by human engineers: they are learned from data by a general-purpose learning procedure.
- Multiple levels of representation, from raw to abstract. Take an image as an example: the raw data is just a matrix of meaningless pixel values. The first layer of learned features can detect the presence of edges at particular orientations in the image; higher layers combine lower-level features to perform detection at a more abstract level, for example of particular motifs.
- Non-linear transformations of the representation. In principle, by composing enough such non-linear transformations, very complex functions can be learned.
Evidently deep learning is very good at discovering intricate structure in high-dimensional data, and it has accordingly achieved astonishing results in many domains.
Supervised Learning
Supervised learning is the most common form of machine learning. The task is to train a model that, given an input, produces the desired output value. For this we need an error function that measures the discrepancy between the output value and the desired value, and we reduce this error by adjusting the model's internal parameters. Gradient descent and stochastic gradient descent (SGD) are two common algorithms for adjusting the parameters; a minimal sketch of SGD follows.
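As a concrete illustration, here is a minimal sketch of SGD fitting a one-parameter linear model. The toy data, learning rate, and squared-error loss are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Toy data: y = 3x + noise. Both the data and the model are made up here.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 0.1 * rng.normal(size=200)

w = 0.0    # the single adjustable parameter (a "knob")
lr = 0.1   # learning rate (assumed value)

for epoch in range(20):
    for i in rng.permutation(len(x)):   # one example at a time: "stochastic"
        pred = w * x[i]
        err = pred - y[i]
        grad = err * x[i]               # d(0.5 * err**2) / dw by the chain rule
        w -= lr * grad                  # step against the gradient
print(w)   # approaches 3.0
```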
At present, most machine-learning systems for supervised problems run a linear classifier on top of hand-picked features. The drawback of a linear classifier is that it can only carve its input space into very simple regions, which leaves it largely helpless on problems such as image and speech recognition; these problems require the model to be extremely sensitive to minute variations of particular features while remaining extremely insensitive to irrelevant variations. For example, at the pixel level two images of the same Samoyed against different backgrounds can differ greatly, while images of a Samoyed and a wolf against similar backgrounds can be very alike. For a linear classifier, or any other shallow classifier, distinguishing the Samoyed from the wolf in the second pair while putting the two Samoyeds of the first pair in the same category is close to an impossible mission. This is known as the selectivity–invariance dilemma: we need a set of features that respond selectively to the important aspects of an image while remaining invariant to changes in its unimportant aspects.
The traditional remedy is to design feature extractors by hand. With deep learning, however, we can hope to learn such features automatically from the data.
Backpropagation
A multilayer network can be trained with stochastic gradient descent (SGD), with the required gradients computed by the backpropagation algorithm. Behind this algorithm is nothing more than the chain rule of derivatives from the first lesson of calculus: we express the partial derivative of the error function with respect to the input of a module in one layer in terms of the partial derivative with respect to the input of the next layer, and on that basis obtain the gradients of the model parameters.
A feedforward neural network is exactly such a multilayer network, and many deep-learning models use similar architectures. In the forward pass, each layer of units computes a weighted sum of the outputs of the previous layer and passes the result through a non-linear transformation to the next layer. The non-linearity most widely used in deep networks today is the ReLU (rectified linear unit), f(z) = max(z, 0). Compared with the traditional smooth non-linearities (tanh(z) or the logistic function), the ReLU learns considerably faster. Through the non-linear transformations of the input space performed by the hidden layers, we finally obtain a feature space in which the classes are linearly separable. The sketch below walks through one forward and one backward pass of such a network.
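To make the chain-rule description concrete, here is a minimal numpy sketch of one forward and one backward pass through a net with a single ReLU hidden layer. The layer sizes and the squared-error cost are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                # input vector
t = np.array([1.0])                   # target
W1 = rng.normal(size=(3, 4)) * 0.5    # hidden-layer weights
W2 = rng.normal(size=(1, 3)) * 0.5    # output-layer weights

# Forward pass: weighted sum, then the ReLU non-linearity f(z) = max(z, 0).
z1 = W1 @ x
h = np.maximum(z1, 0.0)
y = W2 @ h
E = 0.5 * np.sum((y - t) ** 2)        # squared-error objective

# Backward pass: each dE/d(input) is computed from dE/d(next layer's input).
dE_dy = y - t                         # derivative of the 0.5*(y - t)^2 cost
dE_dW2 = np.outer(dE_dy, h)           # gradient for the output weights
dE_dh = W2.T @ dE_dy                  # propagate back through the output layer
dE_dz1 = dE_dh * (z1 > 0)             # ReLU gradient: 1 where z1 > 0, else 0
dE_dW1 = np.outer(dE_dz1, x)          # gradient for the hidden weights
```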
In the late 1990s, however, neural-network research ran into a serious obstacle: it was widely believed that gradient descent would trap models in poor local optima far from the true solution. In fact, recent research indicates that the apparent traps are mostly saddle points in the error landscape, and that they have very similar values of the error function; it therefore hardly matters which of them the algorithm finally settles at.
The revival of deep neural networks came in 2006, when a group of CIFAR researchers introduced a layer-by-layer unsupervised training procedure: the units in each hidden layer learn to reconstruct the features learned by the layer below, thereby acquiring ever higher-level feature representations. Finally, an output layer is added and the parameters are fine-tuned by backpropagation, yielding a supervised model.
Convolutional Neural Networks
Compared with fully connected feedforward networks, convolutional neural networks (CNNs) are much easier to train. Indeed, even when neural-network research as a whole was in a trough, CNNs stood apart and performed remarkably well on many practical problems. In the last few years, CNNs have been adopted throughout computer vision (CV).
CNNs are generally used on inputs that come in the form of multiple arrays: a piece of text (1D array), an image (2D array), or a video (3D array). Their ability to handle such raw data effectively rests on four key ideas:
- local connections
- shared weights
- pooling
- the use of multiple layers
A typical convolutional network is built mainly from two types of layers: convolutional layers and pooling layers.
A convolutional layer consists of several feature maps (analogous to the channels of the raw input). Each unit in a feature map is connected, through a set of weights called a filter bank, to local patches of units in all the feature maps of the previous layer (local connections); it computes a weighted sum of the outputs of those units and passes it through a non-linearity, usually a ReLU. Note that all units within one feature map share the same filter bank, while different feature maps use different filter banks (shared weights). The reasons are twofold: first, in array data, neighbouring values are usually highly correlated, so local connections favour the detection of local features; second, these local statistics tend to be invariant to location, which allows units at different positions to share weights and detect the same feature anywhere. Mathematically, the operation a feature map performs on its input is a discrete convolution, which is where convolutional networks get their name; the sketch below makes this concrete.
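A few lines of numpy show the equivalence. This sketch uses cross-correlation (as most deep-learning libraries do, and still call "convolution") with an assumed 3×3 vertical-edge filter:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared filter (kernel) over every local patch of the image:
    the same weights detect the same motif at every position."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]    # local connection
            out[i, j] = np.sum(patch * kernel)   # shared weights
    return out

image = np.zeros((8, 8)); image[:, 4:] = 1.0     # dark left half, bright right half
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)        # responds to vertical edges
feature_map = np.maximum(conv2d(image, kernel), 0)   # convolution + ReLU
```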
The role of the convolutional layer is to detect local conjunctions of features from the previous layer; the role of the pooling layer is to merge nearby detections of similar features into one, so that small changes in the relative positions of features have less effect on the final result. A common pooling operation is max pooling, which outputs the maximum over a local patch of units. Pooling effectively reduces the dimension of the representation and creates invariance to small shifts and distortions; a sketch follows.
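Continuing the sketch above, a 2×2 max-pooling step over that feature map might look like this (non-overlapping patches are assumed for simplicity):

```python
def max_pool(fmap, size=2):
    """Output the maximum over each non-overlapping size-by-size patch,
    reducing the dimension and gaining tolerance to small shifts."""
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

pooled = max_pool(feature_map)   # 6x6 feature map -> 3x3
```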
Stacking two or three such stages of convolution, non-linearity and pooling, followed by a fully connected output layer, yields a complete convolutional network. Backpropagation can still be used to train all the connection weights in this network.
Like many deep neural networks, convolutional networks successfully exploit the hierarchical, compositional property of natural signals: higher-level features are composed of lower-level ones. In an image, for example, an object can be decomposed into its parts; each part can be further decomposed into elementary motifs; and each motif is in turn composed of still more elementary edges.
Image Understanding with Deep Convolutional Networks
Although convolutional networks had already achieved good results in image recognition since the early 2000s, it was only after the 2012 ImageNet competition that CNNs were accepted by mainstream computer-vision and machine-learning researchers. The rise of CNNs rested on four factors: high-performance GPU computing; the ReLU; a regularization technique called dropout; and techniques for generating additional training examples by deforming existing ones. A deep convolutional network typically has 10 to 20 convolutional layers and hundreds of millions of weights and connections; it is the rapid progress of computing hardware and parallelization that made training such networks feasible. Deep CNNs have since brought about a revolution in computer vision and are applied in almost every task involving image recognition (for example, perception for self-driving cars). Recent work even shows that combining the high-level features learned by a deep CNN with an RNN can teach a computer to "understand" the content of an image. A minimal sketch of dropout follows.
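Of these four factors, dropout is the easiest to show in isolation. Below is a minimal sketch of the commonly used "inverted dropout" variant; the train-time scaling convention is an assumption of this sketch (the original formulation instead rescales at test time):

```python
import numpy as np

def dropout(h, p_drop=0.5, train=True):
    """Randomly zero a fraction p_drop of the activations during training,
    scaling up the survivors so the expected activation is unchanged."""
    if not train:
        return h                                # identity at test time
    mask = np.random.rand(*h.shape) >= p_drop   # keep with probability 1 - p_drop
    return h * mask / (1.0 - p_drop)

h = dropout(np.ones(8))   # roughly half the entries become 0, the rest 2.0
```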
Distributed Representations and Natural Language Processing
Deep-learning theory shows that, compared with traditional shallow models, deep networks enjoy two exponential advantages:
- Distributed feature representations make the space over which the model generalizes grow exponentially (even samples never seen in training can be composed from combinations of the learned features); a toy illustration follows this list.
- Hierarchical feature representations compound this exponential growth with depth.
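A toy illustration of the first advantage: n binary feature detectors can distinguish 2^n input patterns, including combinations never seen in training, whereas a one-detector-per-pattern (local) code would need 2^n detectors for the same coverage.

```python
from itertools import product

n = 3
# n binary features -> 2**n representable patterns, most of which the model
# may never have seen during training yet can still represent by composition.
patterns = list(product([0, 1], repeat=n))
print(len(patterns))   # 8 = 2**3; a local (one-hot) code would need 8 units
```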
The following application of deep networks to natural language processing illustrates what a distributed representation is.
Suppose we want to train a deep network to predict the next word of a text sequence, representing each word in the context as a one-of-N binary vector. The network first maps each input vector, through an embedding layer, to a word vector, and then uses the remaining hidden layers to transform those word vectors into a word vector for the target word. The word vector here is a distributed representation: each component of the vector corresponds to some semantic feature of the original word, and these features, which are not mutually exclusive, jointly characterize it. Note that the semantic features are neither explicitly present in the raw input nor specified in advance by experts; they are mined automatically by the network from the structural relations between inputs and outputs. For our word-prediction problem, the learned word vectors therefore capture semantic similarity well (in this task, for instance, the vectors produced for Tuesday and Wednesday are highly similar), something that traditional statistical language models, which treat words as indivisible atomic units, can hardly do. The toy sketch below shows the lookup and the resulting similarity.
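Here is a toy sketch of the embedding lookup and of the similarity that trained word vectors exhibit. The vocabulary and the vector values are made up for illustration; in a real model the table E is learned by backpropagation:

```python
import numpy as np

vocab = {"tuesday": 0, "wednesday": 1, "banana": 2}
# One row (word vector) per word. Values are invented for this example.
E = np.array([[ 0.9, 0.1, 0.8],
              [ 0.8, 0.2, 0.9],
              [-0.7, 0.9, 0.0]])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

v_tue, v_wed, v_ban = (E[vocab[w]] for w in ("tuesday", "wednesday", "banana"))
print(cosine(v_tue, v_wed))   # high: treated as semantically close
print(cosine(v_tue, v_ban))   # low: unrelated words
```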
Today, such techniques for learning word vectors from text are applied throughout natural language processing.
Recurrent Neural Networks
Recurrent neural networks (RNNs) are typically used for sequential inputs such as speech or text. The basic idea is to process one element of the input sequence at a time while maintaining in the hidden units a state vector that implicitly encodes the history of all past inputs. If we unfold the hidden units at different time steps in space, we obtain a deep network (deep in time), and we can evidently apply backpropagation to this unfolded network to train the RNN.
In an RNN, the state vector s_t at each time step is determined by the previous state s_{t-1} and the current input x_t. Through this recursion, the RNN maps each input x_t to an output o_t that depends on all of the inputs before it. Note that the parameters (U, V, W) are weights shared across all time steps; the sketch below implements one common parameterization.
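The recurrence in code: a minimal numpy sketch in which s_t = tanh(U x_t + W s_{t-1}) and o_t = V s_t. The tanh non-linearity and the dimensions are assumptions, since the text describes the general scheme rather than one fixed parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 5, 8, 5
U = rng.normal(size=(d_h, d_in)) * 0.1    # input -> state
W = rng.normal(size=(d_h, d_h)) * 0.1     # previous state -> state
V = rng.normal(size=(d_out, d_h)) * 0.1   # state -> output

def rnn_forward(xs):
    """xs: list of input vectors. The same U, V, W are reused at every step."""
    s = np.zeros(d_h)                  # state vector: a summary of the history
    outputs = []
    for x_t in xs:
        s = np.tanh(U @ x_t + W @ s)   # s_t depends on x_t and s_{t-1}
        outputs.append(V @ s)          # o_t therefore depends on all past inputs
    return outputs

outs = rnn_forward([rng.normal(size=d_in) for _ in range(4)])
```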
RNNs have many applications in natural language processing. For example, one RNN can be trained to "encode" an English sentence into a semantic vector, and another RNN trained to "decode" that vector into French, which yields a deep-learning-based translation system. Alternatively, in the "encoding" stage a deep convolutional network can turn a raw image into high-level semantic features, on top of which an RNN "decoder" is trained; this yields a system that describes images in words.
Although RNNs were designed precisely to learn long-term dependencies, theoretical and empirical studies show that "it is difficult to learn to store information for very long". The long short-term memory (LSTM) model was proposed for this reason: it augments the RNN with special intermediate units (gates) that control the balance between long- and short-term memory, and it has proved more effective and more powerful than the plain RNN. A single LSTM step is sketched below.
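Here is one LSTM step in numpy, following the standard gate formulation. This is one common variant, not the paper's prescription; biases are omitted for brevity and the weight shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, params):
    """One LSTM step. The gates (input i, forget f, output o) control what is
    written into, kept in, and read out of the memory cell c."""
    Wi, Wf, Wo, Wg = params            # each maps concat(h, x) to the cell size
    hx = np.concatenate([h, x])
    i = sigmoid(Wi @ hx)               # how much new content to let in
    f = sigmoid(Wf @ hx)               # how much old memory to keep
    o = sigmoid(Wo @ hx)               # how much of the cell to expose
    g = np.tanh(Wg @ hx)               # candidate new content
    c = f * c + i * g                  # long-term memory update
    h = o * np.tanh(c)                 # short-term (output) state
    return h, c

rng = np.random.default_rng(0)
d_h, d_x = 4, 3
params = [rng.normal(size=(d_h, d_h + d_x)) * 0.1 for _ in range(4)]
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_x), h, c, params)
```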
Another family of models strengthens an RNN's memory by adding an explicit memory store; the Neural Turing Machine and memory networks are of this kind. They have proved very effective in inference systems for knowledge-based question answering.
The Future of Deep Learning
Unsupervised learning: it is fair to say that research on unsupervised learning catalysed the revival of deep learning, even though it has since been overshadowed by the spectacular successes of supervised learning. Considering that humans and animals learn about the world largely in an unsupervised way, research on unsupervised learning will become ever more important in the long run.
Combining deep learning with reinforcement learning: building on CNNs and RNNs, reinforcement learning lets computers go a step further and learn to make decisions. This line of research is still in its infancy but has already shown impressive results, AlphaGo being a recent example.
Natural language understanding: although RNNs are already widely used in natural language processing, there is still a long way to go towards teaching machines to truly understand natural language.
Combining representation learning with reasoning and inference: this could greatly advance the development of artificial intelligence.