Error Analysis of Neural Networks
A Neural Network Approach to the Problem of Recovering Lost Data in a Network of Marine Buoys
S. Puca*, B. Tirozzi*, G. Arena**, S. Corsini**, R. Inghilesi**
* University of Rome "La Sapienza", Department of Physics
** SIMN (National Hydrological and Marine Survey – National Technical Surveys Dept. of Italy)

Abstract

Neural Network (NN) technology provides several reliable tools for analysis in many science and technology applications. In particular, NNs are often applied to the development of statistical models for intrinsically non-linear systems, since NNs usually behave better than ARMA or GARCH models in complex conditions. A member of this class of problems is the analysis of time series of significant wave heights from a network of buoys. A project is being carried out by the Italian DSTN-SIMN (Technical Surveys Dept. – National Hydrological and Marine Survey) and the Dept. of Physics of the University of Rome "La Sapienza" in order to reproduce the time series collected by the Italian SWAN network of buoys (Sea Wave monitoring Network). The aim of the project is to determine the best way to fill gaps and long periods of missing data with the best accuracy by means of a reanalysis of the whole ten years' data set of the SWAN. Here a NN model is proposed for time-space analyses of the marine data. The main feature of the tool is its ability to reproduce long time series of data without any increase of the error. The method is based on a preliminary spatial analysis of the wave climates in order to classify the degree of overlapping of information from different stations. This overlapping, where possible, led to an optimal and selective training of the NN by means of data collected at different, nearby locations. NN numerical simulations of some important historical storms are compared with the data originally observed at the stations of Crotone (Ionian Sea), Pescara (Adriatic Sea) and Monopoli (Adriatic Sea).
Introduction

The Sea WAve monitoring Network (SWAN) is a network of 10 buoys moored all round the coasts of Italy (Arena et al., 1997). It has been working since June 1989 with the original 8 stations of Alghero, La Spezia, Ponza, Mazara, Catania, Crotone, Monopoli and Ortona. Two stations, Cetraro and Ancona, have been added since 1999. At present six buoys provide real-time data, and the whole network will be working in real time within the current year. In more than eleven years of activity, the efficiency of the monitoring system was high on average: overall statistics indicate that less than 5% of data were lost for 6 out of 8 buoys. For only two stations, namely Mazara and Ponza, the percentage is higher, about 10-15% (Corsini et al., 2000). But even in the best of circumstances no system can provide 100% of the data for indefinitely long periods. Causes of relatively long periods of missing data include incidents with ships and prolonged radio-transmission problems. The occurrence of problems in the measurement, or even in the transmission, of data affects both principal activities of the network: it decreases the real-time monitoring efficiency (every buoy fault produces a decrease of 16% in the overall efficiency of the system), and it degrades the quality of the long-term statistics. It must be observed that most of the problems at sea are associated with severe weather conditions; that is to say, more often than not the lost information would have been the most valuable. Another aspect of the problem is that in ordinary time-series statistical analysis there are tasks, like the evaluation of the autocorrelation function, Fourier analysis or extreme-wave analysis, which require some sort of replacement of the gaps. The reliability of such methods usually depends on the length of the gap; univariate methods can hardly provide reliable estimates of entire storm episodes hidden by large (two weeks long, for instance) gaps.

FL - 04, S. Puca, Page 1 of 4
The aim of the present work is the assessment of a practical and reliable method of recovering lost information by means of multivariate Neural Network methods.

NN approach

There are several methods currently available to fill the gaps of a time series or, at least, to evaluate their possible influence on the statistical analysis. Most of them are univariate techniques, i.e. methods that operate on the same time series. These can be statistical or empirical methods, which just help to fill the gaps in a way that preserves some feature (the overall expected value, as an example). More sophisticated methods aim to simulate the actual time series by means of its past history (ARMA or Neural Networks). The intrinsic error in the case of an ARMA model comes from the fact that the prediction (or the estimate) of missing data is obtained by conditional expectation with respect to the known data. In these algorithms each estimate of the variate at a certain time implies a prediction error, which adds up at every time step. The same situation occurs with neural networks applied in the same way, that is, using a NN which captures the relationship between the data at time t+1 and those at previous times t, t-1, ..., t-m. The ordinary NN approach would result in errors in the estimates growing very fast with the size of the gap, giving meaningless estimates after a few iterations. The problem of numerically evaluating the time series of observations collected at a single station therefore has little chance of being successfully solved. Dealing with networks of evenly positioned stations, on the other hand, allows the application of different strategies; in particular it suggests the use of adaptive methods with multivariate data.
The method which led to the best results in simulating the behaviour of a long time series for the SWAN was found to be the use of a superposition of the direction and significant wave height (Hm0) information provided by the nearby stations, with non-linear coefficients (weights) estimated by means of a SWAN-tailored NN. The directional information was responsible for the weights (correlations) attributed by the NN to Hm0 data collected at different locations.

Neural Network Architecture

Neural Networks are a class of algorithms particularly suitable for the modelling of time series. The data can be chaotic or stochastic, usually governed by deterministic evolution equations with stochastic terms (Herz et al., 1993). These algorithms can detect any form of non-linear relation which generates a sequence of input-output pairs (patterns) $\{(x^u, y^u)\}_{u=1}^{P}$, where $x^u$ is an n-dimensional vector, $y^u$ is an m-dimensional vector and $u$ is a temporal index. Once the existence of a function $f: R^n \to R^m$ such that $y^u = f(x^u)$ is assumed, the aim of the algorithm is to determine the best approximation of the function $f$. If the output of the neural network is called $z^u$, a learning process is introduced in order to minimise the learning error $E_L$,

1)  $E_L = \frac{1}{P} \sum_{u=1}^{P} \sum_{i=1}^{m} (z_i^u - y_i^u)^2$,

on a sequence of P known patterns by means of the optimisation of the synaptic weights. The complete data set is divided into two different subsets. One, the Learning Set (LS), coincides with the collection of patterns on which the weights are estimated by means of the minimisation of $E_L$. The generalisation skill of the NN is then evaluated by applying the error $E_T$ to the second subset of data, called the Testing Set (TS). The expression for the $E_T$ error is

2)  $E_T = \frac{1}{P_T} \sum_{u=1}^{P_T} \sum_{i=1}^{m} (z_i^u - y_i^u)^2$,

where $P_T$ is the number of patterns present in the TS.

The method described in the present paper was based on a two-layered neural network, in which the input layer was made of eight neurons, the hidden layer of ten, and the output layer of three neurons, as shown in fig. 2. The input vector $x^{u+1}$ was defined as

3)  $x^{u+1} = (x_{H1}^u, x_{H1}^{u+1}, x_{d1}^u, x_{d1}^{u+1}, x_{H2}^u, x_{H2}^{u+1}, x_{d2}^u, x_{d2}^{u+1})$,

where $u$ is a time index and $x_{H1,2}^u$ and $x_{d1,2}^u$ are respectively the significant wave heights and mean wave directions at time $u$ of two stations near the buoy whose series is to be simulated. Each of the 8 input neurons was connected to all 10 neurons of the hidden layer by the synaptic interactions (weights) $w(k,j)$ with $k=1,\dots,10$ and $j=1,\dots,8$. In the same way each neuron of the hidden layer is connected with the three neurons of the output layer by the synaptic weights $v(h,k)$ with $h=1,\dots,3$ and $k=1,\dots,10$. The values of the output vector $\alpha^{u+1} = (\alpha_0^{u+1}, \alpha_1^{u+1}, \alpha_2^{u+1})$ are

4)  $\alpha_h^{u+1} = \sigma_2\!\left(\sum_{k=1}^{10} v_{h,k}\, \sigma_1\!\left(\sum_{j=1}^{8} w_{k,j}\, x_j^{u+1}\right)\right)$,

with $h = 1, 2, 3$, where $\sigma_i(x)$ is the non-linear input-output function of the neurons, defined as

5)  $\sigma_i(x) = \dfrac{1}{1 + e^{-\lambda_i x}}$,  $i = 1, 2$.

The components of the output vector $\alpha^{u+1}$ are the coefficients of the linear combination of the vector $(D_P, x_{H1}^{u+1}, x_{H2}^{u+1})$, where $D_P$ is an average value of the significant heights at the site 0 where the prediction is to be made:

6)  $D_P = \max_u (x_{H0}^u) - \min_u (x_{H0}^u)$.

The final output of the model $z^{u+1}$ is the following scalar value,

7)  $z^{u+1} = \alpha_0^{u+1} D_P + \alpha_1^{u+1} x_{H1}^{u+1} + \alpha_2^{u+1} x_{H2}^{u+1}$,

which represents the best estimate for the unknown $x_{H0}^{u+1}$. The NN learns how to evaluate the different Hm0 contributions of the correlated stations at every time step. The $E_L$ actually adopted in the application of the learning algorithm differs from (1); in fact $E_L(w, v, \lambda_1, \lambda_2)$ was defined as

8)  $E_L = \frac{1}{P} \sum_{u=1}^{P} x_{H0}^u \, (x_{H0}^u - z^u)^2$.

This particular form was chosen in order to give more emphasis to the events of greater magnitude, i.e. to calibrate the NN on storms rather than on calm periods. To find the global minimum in the learning phase, the Monte Carlo method was used in order to explore all the possible values of the free variables $\{w_{k,j}\}$, $\{v_{h,k}\}$, $\lambda_1$ and $\lambda_2$ belonging to a discrete bounded set. The Simulated Annealing (SA) algorithm was adopted as the most effective Monte Carlo method in the present conditions. The SA skill in determining the global minimum was theoretically established in the Geman and Geman theorem (Geman and Geman, 1984) and checked numerically many times (Aarts et al., 1990). A problem with the SA method, sometimes cited in the literature (for example Mhaskar, 1996), concerns the convergence speed of the algorithm. Nevertheless, in the present work the convergence speed was found to be high enough to operate efficiently. One of the principal reasons for the use of SA is its stability with respect to the choice of random initial conditions. This aspect was carefully tested during the application of the method. After the learning phase, in the following testing phase, a different set of data was used to evaluate the error $E_T$.

Test and preliminary results

The NN method was applied to simulate the time series observed at Monopoli during the first ten years of activity. The learning phase was set up by means of the first 3000 available patterns (which correspond to the first year of data, since the 1st of June 1989, of the buoys of Crotone, Monopoli and Pescara).
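The two-layer 8-10-3 architecture and the output combination described above can be sketched numerically as follows. This is only an illustration of the network's structure: the weight values are random placeholders, since the paper's trained weights are not given, and the input values are made up.

```python
import numpy as np

rng = np.random.default_rng(42)
w = rng.normal(size=(10, 8))    # input (8) -> hidden (10) weights w(k, j)
v = rng.normal(size=(3, 10))    # hidden (10) -> output (3) weights v(h, k)
lam1, lam2 = 1.0, 1.0           # sigmoid slopes (placeholder values)

def sigma(x, lam):
    """Non-linear input-output function of the neurons (Eq. 5)."""
    return 1.0 / (1.0 + np.exp(-lam * x))

def estimate(x, xH1, xH2, D_P):
    """Eqs. 4 and 7: x is the 8-component input vector (heights and
    directions at times u and u+1 for the two nearby stations);
    returns the Hm0 estimate z for the target site."""
    alpha = sigma(v @ sigma(w @ x, lam1), lam2)   # (alpha_0, alpha_1, alpha_2)
    return alpha[0] * D_P + alpha[1] * xH1 + alpha[2] * xH2

x = rng.uniform(0, 1, size=8)   # illustrative normalized inputs
z = estimate(x, xH1=2.5, xH2=3.0, D_P=4.0)
# The alphas are sigmoid outputs in (0, 1), so z is bounded by
# D_P + xH1 + xH2 = 9.5 regardless of the random weights.
print(0.0 < z < 9.5)   # True
```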
In the learning phase the weights in expression (4) were evaluated; then in the test phase the whole data sets of Crotone and Pescara were used to simulate the time series at Monopoli until December 1999. Overall statistics indicate an average learning error of $E_L = 0.021$ m, while the average testing error was $E_T = 0.023$ m. The closeness of $E_T$ to $E_L$ strongly suggests that the learning phase was successful in determining the behaviour of the system, so any further increase in the dimension of the learning set would slow down the convergence process without any significant enhancement in the accuracy of the results. In order to assess the reliability of the results, i.e. to ascertain that the simulated time series can trustfully be used to recover large gaps in the original time series of observations in real-time monitoring and statistical analysis, the three major storm-activity periods were selected for a direct comparison with observations at Monopoli. A further comparison was introduced with the numerical simulation given by the deterministic 'physical' Wave Model (Komen et al., 1994) operationally run at the European Centre for Medium-Range Weather Forecasts (ECMWF). For the comparison, the nearest grid point to Monopoli was singled out from the Mediterranean 0.25/0.25 degree high-resolution ECMWF WAM analysis. The obtained 6-hourly numerical time series (Hwam) is consistent (in the sense that it is meaningfully comparable) with the 3-hourly observations averaged over 30 minutes, as can be verified by comparing the averaging period of the measurement with the time scale Ts which characterises the dynamics of the model. Ts can be roughly estimated as the time it takes a physical signal travelling with a typical speed of 10 m/s to cover the model grid length at 40° of latitude. Ts estimates for high-resolution models give times not greater than 80 minutes, which is of the same order of magnitude as the averaging periods of the measurements.

1st comparison period: 28.12.94-10.02.95

The most severe storm episode of the buoy's activity was observed in the period considered. Less intense episodes were observed respectively one week before and one month after the peak. It can be seen in fig. 3 that there is a gap in the observed time series (Hm0) ranging from immediately after the extreme peak to the next (smaller) one. One may wonder whether significant episodes were hidden from the analysis by the gap. The comparison between the numerical WAM simulation (HWAM) and the NN simulation (HNN) showed that the NN performs much better than WAM over the whole period considered. This is not really a surprise, as it is well known that WAM outputs in the Mediterranean area (especially before 1996, when significant improvements were made at the ECMWF), and in particular in the Adriatic Sea, tend to underestimate Hm0. It is perhaps worth mentioning that all three episodes were correctly reproduced by the NN, but during the strongest peak there was a significant loss of data at the nearby station of Crotone. This caused the relative failure of the NN in predicting the exact maximum in real time: the HNN maximum was shifted 3 hours forward (HNN = 4 m), when the actual measurement Hm0 was significantly smaller (Hm0 = 3 m). Nevertheless, taken as a whole the episode was the best possible simulation of the event (considering the simulated hypothetical operational conditions: no information from Monopoli, and none from Crotone during the storm rise; what really happened was that at least one of the two was always working in the period considered).

2nd comparison period: 15.03.95-8.04.95

Two distinct storm episodes were observed in the period: the first one (Hm0 = 3 m) occurred around the 20th of March, and the second, more severe (Hm0 = 4 m), at the end of the month. Both episodes, shown in fig. 4, were very well reproduced by the NN, the differences from Hm0 being of the order of centimetres at the peaks. HWAM was found again to underestimate the event systematically. As in the previous comparison period, a part of the series could not be reproduced by the NN due to a minor loss of data in the leading edge of the higher peak curve. Nevertheless, in the present case this did not significantly affect the results.

3rd comparison period: 15.02.97-15.03.97

Three episodes were observed in the period (fig. 5): the first occurring in mid-January, the second (the extreme one) at the beginning of March, and the last one just two weeks later. The major feature to be observed is a clear overestimation (a difference between peaks of about 1 m) of Hm0 by HNN in the January episode. Further investigations will be carried out in order to assess whether the overestimate can be related to the gap in the time series of the nearby buoy, which can be seen along the leading (rising) edge of the peak line, or just to an inadequate representation of the directional information in the learning phase. The higher peak is well reproduced by the NN, giving a reliable representation of the descending tail of the curve, which was lost in the observations.

Conclusions

The preliminary analysis carried out on the test cases proposed showed that the numerical simulation of a very long time series, such as the Monopoli time series, by means of the information collected at the nearby locations (Pescara and Crotone) processed by optimal NN algorithms, is at least promising as a method to recover the effects of the loss of data over periods of whatever length. In all the test periods considered the NN was found to be far more effective in simulating the physical process than the numerical high-resolution ECMWF Wave Model. Comparison with the observations showed that the NN could reproduce most of the sea storms almost exactly in terms of the Hm0 time series. Only one of the episodes considered in the three periods was overestimated by 30% by the NN, the failure being possibly related to the loss of nearby-location data in the development stage of the storm.
Minor weak points of the NN were found to be the impossibility of data recovery when more than one nearby station fails (a situation that fortunately was seldom seen to occur) and the uncertainty in the mean wave directional information. It must be said that, despite the poorer ability of the HWAM simulation, WAM was found to be quite effective in the determination of the mean wave direction. Further analysis is being carried out in order to ensure the optimisation of the learning sets for all the SWAN time series. More complete results will indicate the effects of the recovery of the gaps currently present in the time series on standard statistical analyses (i.e. wave climate) as well as on the analysis of extreme waves.

References

Aarts, E. and J. Korst, "Simulated Annealing and Boltzmann Machines", John Wiley & Sons, New York, 1990.
Arena, G. and S. Corsini, "Activities of the National Hydrographic and Oceanographic Service in the maritime field", PIANC-PIC Congress, Venice, 1997.
Corsini, S., F. Guiducci and R. Inghilesi, "Statistical Extreme Wave Analysis of the Italian Sea Wave Measurement Network Data in the period", ISOPE 2000 proc., Seattle.
Hertz, J., A. Krogh and R. G. Palmer, "Introduction to the Theory of Neural Computation", Addison-Wesley, 1993.
Geman, S. and D. Geman, ****, IEEE Trans. on Pattern Analysis and Machine Intelligence, 6, pp. 721-741, 1984.
Komen, G. J., L. Cavaleri, M. Donelan, K. Hasselmann, S. Hasselmann and P. A. E. M. Janssen, "Dynamics and Modelling of Ocean Waves", Cambridge Univ. Press, 1994.
Mhaskar, H. N., "Neural Networks for optimal approximation of smooth and analytic functions", Neural Computation, 8, 164-177, 1996.
Tirozzi, B., "Modelli Matematici di Reti Neurali", CEDAM, Milano, 1995.
Related recommendation: Zero-Basics Introduction to Deep Learning (4): Recurrent Neural Networks
Contributed by: Intelligent Software Development

About the team: the team members come from front-line Internet companies and work at the forefront of architecture design and optimization and of engineering-method research and practice. They have taken part in the design, development and technical optimization of large products such as search, online advertising and public/private clouds. They currently focus on machine learning, microservice architecture design, virtualization/containerization and continuous delivery/DevOps, aiming to maximize the competitiveness of software and services through advanced technology and engineering methods.
In the previous articles in this series we introduced fully connected neural networks and convolutional neural networks, together with their training and use. Both of them can only process inputs one by one, in isolation: the previous input and the next input are treated as completely unrelated. Some tasks, however, require handling sequential information properly, where earlier and later inputs are related.

For example, when we try to understand the meaning of a sentence, understanding each word in isolation is not enough; we need to process the whole sequence formed by connecting those words. When we process video, we likewise cannot analyze each frame separately, but must analyze the whole sequence of frames. This is where another very important class of networks in deep learning comes in: the recurrent neural network (RNN). There are many kinds of RNNs, and they can be quite mind-bending.

Don't worry, though: as usual, this article will peel the complexity apart layer by layer to help you understand RNNs and their training algorithm, and then implement a recurrent neural network by hand.
RNNs were first put to use in natural language processing; for example, an RNN can be used to build a language model. So what is a language model?

We can play a game with the computer: we write down the first few words of a sentence and then let the computer fill in the next word. For example:

我昨天上学迟到了,老师批评了____。 ("I was late for school yesterday, and the teacher criticized ____.")

We show the computer the words before the blank and ask it to write the next word. In this example, the next word is most likely 『我』 ("me"), rather unlikely to be 『小明』 ("Xiao Ming"), and certainly not 『吃饭』 ("eat").

A language model is exactly this kind of thing: given the first part of a sentence, it predicts the most likely next word.

A language model captures the statistical characteristics of a language, and it has many, many uses. In speech-to-text (STT), for example, the acoustic model outputs several candidate words, and a language model is needed to pick the most probable one among them. The same applies to recognizing text in images (OCR).
Before RNNs, language models were mainly built with N-grams. N is a natural number, such as 2 or 3; the assumption is that the probability of a word appearing depends only on the N-1 words before it. Let's take 2-grams as an example. First, segment the earlier sentence into words:

我 昨天 上学 迟到 了 ,老师 批评 了 ____。

With a 2-gram model, when predicting, the computer only sees the preceding 『了』 and then searches the corpus for the word most likely to follow 『了』. Whether or not it ends up choosing 『我』, we know this model is unreliable, because everything said before 『了』 is simply not used. With a 3-gram model it would search for the most likely word after 『批评了』, which feels considerably more reliable than the 2-gram, but still far from enough, because the key piece of information in this sentence, 『我』, is nine words before the blank!

Now you might think: just keep raising N, say 4-grams, 5-grams, and so on. In fact this idea is not practical. We want to handle sentences of arbitrary length, so no value of N is appropriate; moreover, the model size grows exponentially with N, and even a 4-gram model would already occupy a huge amount of storage.
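To see concretely why a 2-gram model throws away everything before the last word, here is a small sketch; the toy corpus and its counts are made up for illustration:

```python
from collections import Counter, defaultdict

# Toy corpus of pre-segmented sentences (illustrative only).
corpus = [
    ["我", "昨天", "上学", "迟到", "了", "老师", "批评", "了", "我"],
    ["他", "上学", "迟到", "了", "老师", "批评", "了", "他"],
]

# Count bigram successors: counts[w][w_next] = how often w_next follows w.
counts = defaultdict(Counter)
for sent in corpus:
    for w, w_next in zip(sent, sent[1:]):
        counts[w][w_next] += 1

def predict_next(word):
    """Return the word that most frequently follows `word` in the corpus."""
    return counts[word].most_common(1)[0][0]

# The 2-gram model only sees the single preceding word 『了』, so it has no
# way of knowing whether the subject nine words back was 我 or 他.
print(predict_next("了"))   # 老师
```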
So it is time for RNNs to come on stage: in theory, an RNN can look back (or forward) over arbitrarily many words.

What is a recurrent neural network?

There are many kinds of recurrent neural networks. Let's start with the simplest one, the basic recurrent neural network.

Basic recurrent neural networks

The figure below shows a simple recurrent neural network consisting of an input layer, one hidden layer and an output layer:

What?! If this is your first time seeing this thing, you are probably as bewildered as I was. Recurrent neural networks are genuinely hard to draw, and everyone online has had to resort to this kind of abstract art. Look at it calmly, though, and it is actually easy to understand: if you remove the arrowed loop labelled W, it becomes an ordinary fully connected neural network.

Here x is a vector representing the values of the input layer (the circles for individual neuron nodes are not drawn); s is a vector representing the values of the hidden layer (a single node is drawn for the hidden layer, but you can imagine this layer containing as many nodes as the dimension of s); U is the weight matrix from the input layer to the hidden layer (see the second article in this series for how matrices represent the computation of a fully connected network); o is also a vector, representing the values of the output layer; and V is the weight matrix from the hidden layer to the output layer. So what is W? The value s of a recurrent network's hidden layer depends not only on the current input x but also on the previous value of the hidden layer. The weight matrix W is the weight with which the hidden layer's previous value enters its current computation.

If we unroll the figure above in time, the recurrent network can also be drawn like this:
Now it looks much clearer: at time t, after receiving the input $x_t$, the value of the hidden layer is $s_t$ and the output value is $o_t$. The key point is that $s_t$ depends not only on $x_t$ but also on $s_{t-1}$. We can express the computation of a recurrent network with the following formulas:

Equation 1:  $o_t = g(V s_t)$
Equation 2:  $s_t = f(U x_t + W s_{t-1})$

Equation 1 is the computation for the output layer, which is a fully connected layer: each of its nodes is connected to every node of the hidden layer. V is the output layer's weight matrix and g is an activation function. Equation 2 is the computation for the hidden layer, which is the recurrent layer: U is the weight matrix for the input x, W is the weight matrix with which the previous value enters the current computation, and f is an activation function.

From these formulas we can see that the recurrent layer differs from a fully connected layer only by the extra weight matrix W.

If we substitute Equation 2 into Equation 1 repeatedly, we get:

$o_t = g(V s_t) = g\big(V f(U x_t + W f(U x_{t-1} + W f(U x_{t-2} + \dots)))\big)$

From this we can see that the output value $o_t$ of a recurrent network is influenced by all the previous inputs $x_t, x_{t-1}, x_{t-2}, \dots$, which is why a recurrent network can look back over arbitrarily many input values.
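Equations 1 and 2 can be sketched numerically as follows; the dimensions and the choice of tanh and identity activations are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, out_dim = 4, 3, 2           # input, hidden, output sizes (arbitrary)
U = rng.normal(size=(n, m))       # input -> hidden weights
W = rng.normal(size=(n, n))       # hidden -> hidden (recurrent) weights
V = rng.normal(size=(out_dim, n)) # hidden -> output weights

def step(x_t, s_prev):
    """One RNN time step: Eq. 2 then Eq. 1 (f = tanh, g = identity)."""
    s_t = np.tanh(U @ x_t + W @ s_prev)   # s_t depends on x_t AND s_{t-1}
    o_t = V @ s_t
    return s_t, o_t

# Feed a short random sequence; the hidden state carries history forward,
# so every o_t is influenced by all earlier inputs.
s = np.zeros(n)
for x in rng.normal(size=(5, m)):
    s, o = step(x, s)
print(o.shape)   # (2,)
```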
Bidirectional recurrent neural networks

For language models, looking only at the preceding words is often not enough. Take this sentence:

我的手机坏了,我打算____一部新手机。 ("My phone broke; I plan to ____ a new phone.")

If we only look at the words before the blank, my phone broke, then am I planning to repair it? Replace it with a new one? Or have a good cry? None of these can be decided. But if we also see that the words after the blank are 『一部新手机』 ("a new phone"), then the probability of the blank being 『买』 ("buy") becomes much higher.

The basic recurrent network from the previous section cannot model this, so we need a bidirectional recurrent neural network, shown below:

Faced with this scene of information travelling back from the future, it is easy to feel lost. But we can use the tried-and-true approach: first analyze a special case, then generalize. Let's consider the computation of y2 in the figure above.

As the figure shows, the hidden layer of a bidirectional recurrent network keeps two values: A, which takes part in the forward computation, and A', which takes part in the backward computation. The final output value y2 depends on both A2 and A'2:

$y_2 = g(V A_2 + V' A'_2)$

A2 and A'2 are in turn computed as:

$A_2 = f(U x_2 + W A_1)$
$A'_2 = f(U' x_2 + W' A'_3)$

From these three formulas we can see that the forward and backward computations do not share weights; that is, U and U', W and W', V and V' are all different weight matrices.
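The three formulas above can be sketched as follows; the sizes and activations are illustrative assumptions, and note that the forward chain runs left-to-right while the backward chain runs right-to-left, with separate weight matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
U, W = rng.normal(size=(n, m)), rng.normal(size=(n, n))     # forward weights
U2, W2 = rng.normal(size=(n, m)), rng.normal(size=(n, n))   # backward weights (U', W')
V, V2 = rng.normal(size=(2, n)), rng.normal(size=(2, n))    # output weights (V, V')

xs = rng.normal(size=(6, m))          # a sequence of 6 input vectors
A = np.zeros(n)                       # forward states A_t
A2 = np.zeros(n)                      # backward states A'_t
fwd, bwd = [], [None] * len(xs)
for t in range(len(xs)):              # left-to-right pass
    A = np.tanh(U @ xs[t] + W @ A)
    fwd.append(A)
for t in range(len(xs) - 1, -1, -1):  # right-to-left pass
    A2 = np.tanh(U2 @ xs[t] + W2 @ A2)
    bwd[t] = A2

# Each output combines both directions: y_t = g(V A_t + V' A'_t), g = identity.
ys = [V @ f + V2 @ b for f, b in zip(fwd, bwd)]
print(len(ys), ys[0].shape)   # 6 (2,)
```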
Deep recurrent neural networks

The recurrent networks introduced so far have a single hidden layer. Of course we can also stack two or more hidden layers, which gives a deep recurrent neural network, as shown below:
Training recurrent neural networks

The training algorithm for recurrent networks: BPTT

BPTT (Back-Propagation Through Time) is the training algorithm for the recurrent layer. Its basic principle is the same as that of the BP algorithm, and it consists of the same three steps:

Forward-compute the output value of every neuron;

Backward-compute the error term $\delta_j$ of every neuron, i.e. the partial derivative of the error function E with respect to neuron j's weighted input $net_j$;

Compute the gradient of every weight.

Finally, the weights are updated with stochastic gradient descent.
The recurrent layer is shown below:

The forward computation of the recurrent layer uses Equation 2 from before:

$s_t = f(U x_t + W s_{t-1})$

Note that $x_t$ and $s_t$ above are vectors (written in bold in the original figures), while U and W are matrices, written as capital letters. A vector's subscript denotes the time step; for example, $s_t$ is the value of the vector s at time t.

Suppose the input vector x has dimension m and the state vector s has dimension n; then the matrix U has dimension $n \times m$ and the matrix W has dimension $n \times n$. Written out in matrix form, the equation above looks like this, which is more intuitive:

$\begin{bmatrix} s_1^t \\ s_2^t \\ \vdots \\ s_n^t \end{bmatrix} = f\left( \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1m} \\ u_{21} & u_{22} & \cdots & u_{2m} \\ \vdots & & & \vdots \\ u_{n1} & u_{n2} & \cdots & u_{nm} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{bmatrix} + \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & & & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nn} \end{bmatrix} \begin{bmatrix} s_1^{t-1} \\ s_2^{t-1} \\ \vdots \\ s_n^{t-1} \end{bmatrix} \right)$

Here we use lower-case letters for the elements of a vector, with a subscript indicating which element it is and a superscript indicating the time step; for example, $s_j^t$ is the value of the j-th element of the vector s at time t. $u_{ji}$ is the weight from the i-th neuron of the input layer to the j-th neuron of the recurrent layer, and $w_{ji}$ is the weight from the i-th neuron of the recurrent layer at time t-1 to the j-th neuron of the recurrent layer at time t.
Computing the error terms

Likewise, the second factor of the expression above is also a Jacobian matrix, the derivative of the layer's output with respect to its weighted input:

$\dfrac{\partial s_{t-1}}{\partial net_{t-1}} = \mathrm{diag}[f'(net_{t-1})]$

where diag[a] denotes the diagonal matrix whose diagonal entries are the elements of the vector a.

Finally, putting the two factors together, we obtain:

Equation 3:  $\delta_k^T = \delta_t^T \prod_{i=k}^{t-1} W\, \mathrm{diag}[f'(net_i)]$

Equation 3 is the algorithm for propagating the error term backward through time.

The recurrent layer passes the error term down to the layer below in exactly the same way as an ordinary fully connected layer; this was covered in detail in earlier articles, so here we only describe it briefly.

Equation 4 propagates the error term to the layer below:

Equation 4:  $\delta_t^{l-1,T} = \delta_t^{l,T}\, U\, \mathrm{diag}\!\left[f'^{\,l-1}(net_t^{l-1})\right]$
Computing the weight gradients

Now we finally come to the last step of BPTT: computing the gradient of every weight.

Following the pattern above, the matrix in Equation 5 can be written down: the gradient of the error with respect to W at time t has entries given by the error term times the previous state,

Equation 5:  $(\nabla_{W_t} E)_{ij} = \delta_i^t\, s_j^{t-1}$

Equation 6 is the formula for the gradient of the error function with respect to the recurrent layer's weight matrix W: it is the sum of the per-time-step gradients,

Equation 6:  $\nabla_W E = \sum_{k=1}^{t} \nabla_{W_k} E$
------ Warning: heavy math ahead ------

We have already shown how to compute the gradient, and it looks fairly intuitive. You may wonder, however, why the final gradient is the sum of the gradients at each time step. Above we simply used this conclusion; there is in fact a real justification behind it, but the derivation is rather convoluted. Interested readers can work through this passage carefully; it uses rules for matrix-by-matrix derivatives and for multiplying tensors by vectors.

We start again from the full expression for the error as a function of the weights, compute the part to the right of the plus sign in Equation 7, and thereby obtain a recurrence which, unrolled over time, yields exactly the sum in Equation 6.

------ Heavy-math warning lifted ------

The gradient of the weight matrix U is computed exactly as in a fully connected network, so we will not repeat it here; interested readers can consult the code implementation later in the article.
The exploding- and vanishing-gradient problems of RNNs

Unfortunately, in practice the RNNs introduced above cannot handle longer sequences well. A major reason is that RNNs easily suffer from exploding or vanishing gradients during training, which prevents the gradient from propagating over long stretches of the sequence, so the RNN cannot capture long-distance influences.

Why do RNNs produce exploding and vanishing gradients? From Equation 3 we can derive the bound

$\|\delta_k^T\| \le \|\delta_t^T\| \left(\beta_W \beta_f\right)^{t-k}$,

where $\beta_W$ and $\beta_f$ are upper bounds on the norms of the matrices W and diag[f'(net)] respectively. Because this bound is a power, when t-k is large (that is, when looking far back in time) the corresponding error term grows or shrinks extremely quickly, depending on whether $\beta_W \beta_f$ is greater than or less than 1, and this produces the exploding- or vanishing-gradient problem.
Generally speaking, exploding gradients are the easier of the two to deal with: when gradients explode, the program will throw NaN errors. We can also set a gradient threshold and clip the gradient whenever it exceeds that threshold.
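Threshold-based clipping can be sketched as follows; the threshold value is an arbitrary choice:

```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Rescale `grad` so its L2 norm never exceeds `threshold`."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, 40.0])   # norm 50, far above the threshold
print(clip_gradient(g))      # rescaled to norm 5 -> [3. 4.]
```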
Vanishing gradients are harder to detect and also harder to handle. Broadly speaking, there are three approaches to the vanishing-gradient problem:

Initialize the weights sensibly, so that each neuron avoids extremely large or small values and stays away from the regions where the gradient vanishes.

Use ReLU instead of sigmoid or tanh as the activation function (for the rationale, see the activation-function section of the previous article).

Use RNNs with other structures, such as the Long Short-Term Memory network (LSTM) and the Gated Recurrent Unit (GRU); this is the most popular approach. We will introduce both of these networks in later articles.
An application of RNNs: an RNN-based language model

Now let's introduce an RNN-based language model. We feed the words of a sentence into the recurrent network one by one; after each word, the network outputs the most likely next word given everything seen so far. For example, suppose we feed in, one at a time:

我 昨天 上学 迟到 了

The network's outputs are as shown in the figure below:

Here s and e are two special tokens, marking the start and the end of a sequence respectively.

As we know, a neural network's inputs and outputs are vectors. For a language model to be processed by a neural network, words must be expressed in vector form.

The network's input is a word, which we can vectorize with the following steps:

Build a dictionary containing every word, with each word assigned a unique index in the dictionary.

Any word can then be represented as an N-dimensional one-hot vector, where N is the number of words in the dictionary. If a word's index in the dictionary is i, v is the vector representing the word and $v_j$ is the vector's j-th element, then

$v_j = \begin{cases} 1, & j = i \\ 0, & j \ne i \end{cases}$

In other words, the vector is all zeros except for a single 1 at position i.

With this vectorization we get high-dimensional, sparse vectors (sparse meaning that the vast majority of elements are 0). Processing such vectors gives the neural network a very large number of parameters and a heavy computational load, so in practice dimensionality-reduction methods are often used to turn the high-dimensional sparse vectors into low-dimensional dense ones. We will not discuss that topic further in this article.
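The one-hot scheme in the steps above can be sketched as follows; the five-word vocabulary is a made-up example:

```python
vocab = ["我", "昨天", "上学", "迟到", "了"]   # toy dictionary
index = {w: i for i, w in enumerate(vocab)}    # word -> unique index

def one_hot(word):
    """N-dimensional vector with a single 1 at the word's index."""
    v = [0] * len(vocab)
    v[index[word]] = 1
    return v

print(one_hot("上学"))   # [0, 0, 1, 0, 0]
```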
The output the language model needs is the most likely next word. We can have the recurrent network compute, for every word in the dictionary, the probability that it is the next word; the word with the largest probability is then the most likely next word. The network's output is therefore also an N-dimensional vector, each element being the probability that the corresponding dictionary word comes next, as shown in the figure below:

As mentioned earlier, a language model models the probability of the next word. So how do we make a neural network output probabilities? The answer is to use a softmax layer as the network's output layer.

Let's first look at the definition of the softmax function:

$y_i = \dfrac{e^{x_i}}{\sum_j e^{x_j}}$

This formula may look bewildering, so let's work an example. A softmax layer is shown below:

From the figure we can see that the softmax layer's input is a vector and its output is also a vector, of the same dimension (4 in this example). The input vector x = [1 2 3 4] passes through the softmax layer and becomes the output vector y = [0.03 0.09 0.24 0.64]: each element is exponentiated and then divided by the sum of all the exponentials, e.g. $y_1 = e^1 / (e^1 + e^2 + e^3 + e^4) \approx 0.03$.

Look at the characteristics of the output vector y:

each element is a positive number between 0 and 1;

the elements sum to 1.

These are exactly the characteristics of a probability distribution, so we can treat the outputs as probabilities. For a language model, we can read off that the model predicts the next word to be the first dictionary word with probability 0.03, the second with probability 0.09, and so on.
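The worked example above can be checked directly; a subtract-max step is added for numerical stability, a standard refinement not mentioned in the text:

```python
import numpy as np

def softmax(x):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    e = np.exp(x - np.max(x))   # subtracting the max avoids overflow
    return e / e.sum()

y = softmax(np.array([1.0, 2.0, 3.0, 4.0]))
print(np.round(y, 2))   # [0.03 0.09 0.24 0.64]
```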
Training the language model

The language model can be trained with supervised learning. First we need to prepare a training data set. Next, we show how to turn a corpus such as:

我 昨天 上学 迟到 了

into a training set for the language model.

First, we form input-label pairs: the label of each input word is simply the word that follows it in the sentence, with the end-of-sequence token as the label of the final word.

Then, using the vectorization method introduced above, we vectorize the inputs x and the labels y. Interestingly, vectorizing a label y also yields a one-hot vector. For example, vectorizing the label 『我』 gives a vector in which only the 2019th element is 1 and every other element is 0. Its meaning is that the probability of the next word being 『我』 is 1, and the probability of it being any other word is 0.

Finally, we use the cross-entropy error function as the optimization objective and optimize the model.

In real engineering practice we can train the model on a large corpus; the way the training data are obtained and the training procedure remain the same.
Cross-entropy error

Generally, when a neural network's output layer is a softmax layer, the error function E is chosen to be the cross-entropy error, defined as:

$E = -\dfrac{1}{N} \sum_{n=1}^{N} \sum_{i} y_i^{(n)} \ln o_i^{(n)}$

Here N is the number of training samples, the vector y is a sample's label and the vector o is the network's output. The label y is a one-hot vector, so for a single training sample (N = 1) the cross-entropy error reduces to minus the logarithm of the output element at the position where the label is 1:

$E = -\ln o_i$,  where i is the index of the label's 1.

We could of course choose other error functions, such as mean squared error (MSE), but when modelling probabilities the cross-entropy error makes more sense. For the specific reasons, interested readers can consult reference 7.
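With a one-hot label, the sum collapses to a single term: minus the log of the probability the network assigned to the correct word. A sketch, with made-up vectors:

```python
import math

def cross_entropy(y, o):
    """E = -sum_i y_i * ln(o_i); y is the one-hot label, o the softmax output."""
    return -sum(yi * math.log(oi) for yi, oi in zip(y, o) if yi > 0)

y = [0, 1, 0, 0]                # the correct word is the 2nd dictionary entry
o = [0.03, 0.09, 0.24, 0.64]    # network output (illustrative values)
print(round(cross_entropy(y, o), 3))   # -ln(0.09) = 2.408
```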
To deepen our understanding of what we have covered, let's implement an RNN layer by hand. We reuse some code from the previous article, so we import that first.

We implement the recurrent layer with a RecurrentLayer class, whose constructor initializes the layer and lets you set its hyperparameters. Note that the recurrent layer has two weight arrays, U and W.

The forward method implements the recurrent layer's forward computation; this part is straightforward.

The backward method implements the BPTT algorithm.

Interestingly, although the mathematical derivation of BPTT is laborious, writing it as code is not complicated.

The update method implements the gradient-descent update.

The update code does not include updating the weights U; that part is exactly the same as in a fully connected network and is left for interested readers to complete.

The recurrent layer is a stateful layer: every forward call changes its internal state, which complicates gradient checking. We therefore need a reset_state method to reset the layer's internal state.
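The original code listings did not survive in this copy of the article, so here is a minimal sketch of what such a RecurrentLayer might look like. The class and method names follow the text; the tanh activation, the learning-rate default and the small gradient check at the end are assumptions added for illustration.

```python
import numpy as np

class RecurrentLayer:
    def __init__(self, input_size, state_size, learning_rate=0.01):
        self.learning_rate = learning_rate
        # The two weight arrays mentioned in the text: U and W.
        self.U = np.random.uniform(-0.1, 0.1, (state_size, input_size))
        self.W = np.random.uniform(-0.1, 0.1, (state_size, state_size))
        self.reset_state()

    def reset_state(self):
        """Clear the internal state so a new forward pass starts fresh."""
        self.times = 0
        self.s_list = [np.zeros((self.W.shape[0], 1))]   # s_0 = 0
        self.x_list = [None]

    def forward(self, x):
        """Eq. 2: s_t = f(U x_t + W s_{t-1}), with f = tanh."""
        self.times += 1
        s = np.tanh(self.U @ x + self.W @ self.s_list[-1])
        self.x_list.append(x)
        self.s_list.append(s)
        return s

    def backward(self, delta_s):
        """BPTT: `delta_s` is dE/ds at the final time step; convert it to the
        error term dE/dnet and walk backward in time, accumulating gradients."""
        self.grad_W = np.zeros_like(self.W)
        self.grad_U = np.zeros_like(self.U)
        delta = delta_s * (1 - self.s_list[self.times] ** 2)   # tanh' = 1 - s^2
        for t in range(self.times, 0, -1):
            self.grad_W += delta @ self.s_list[t - 1].T   # per-step gradients
            self.grad_U += delta @ self.x_list[t].T       # (Eqs. 5 and 6)
            # Eq. 3: delta_{t-1} = (W^T delta_t) * f'(net_{t-1})
            delta = (self.W.T @ delta) * (1 - self.s_list[t - 1] ** 2)

    def update(self):
        """Gradient-descent step on W (and here also U, for completeness)."""
        self.W -= self.learning_rate * self.grad_W
        self.U -= self.learning_rate * self.grad_U

# Quick gradient check on one element of W; the error is taken to be the
# sum of the final state's components, so dE/ds_T is a vector of ones.
np.random.seed(0)
layer = RecurrentLayer(3, 2)
xs = [np.random.randn(3, 1) for _ in range(4)]

def error():
    layer.reset_state()            # stateful layer: reset before each run
    for x in xs:
        s = layer.forward(x)
    return float(s.sum())

error()                            # forward pass to fill the state
layer.backward(np.ones((2, 1)))    # analytic gradients via BPTT
eps = 1e-5
layer.W[0, 1] += eps; e_plus = error()
layer.W[0, 1] -= 2 * eps; e_minus = error()
layer.W[0, 1] += eps               # restore the weight
numeric = (e_plus - e_minus) / (2 * eps)
print(abs(numeric - layer.grad_W[0, 1]) < 1e-4)   # True
```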
Finally, there is the gradient-checking code.

Note that before each computation of the error, reset_state must be called to reset the recurrent layer's internal state. The gradient check passes with no problems!
With that, we have covered the basic recurrent neural network, its training algorithm BPTT, and its application to language models. RNNs are brain-burning, but readers who conquered the previous articles should be able to handle this one too! The story of recurrent networks is not over, though. As mentioned above, the basic recurrent network suffers from exploding and vanishing gradients and cannot truly handle long-distance dependencies well (although some tricks can mitigate these problems).

In fact, the variant of the recurrent network that has found truly wide application is the Long Short-Term Memory (LSTM) network. Its special internal structure handles long-distance dependencies well, and we will introduce it in detail in the next article. For now, take a short break and get ready to take on the even more brain-burning LSTM.