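An impressive model. Feed it a pile of text and it outputs new text that looks a lot like the input. It can be used to write novels, math books, and even music (by feeding it scores in ASCII notation).

The output is still very rough, though. A lot more learning is needed.

Below are three articles I have read, reposted here. The images are not reposted.


It can also output Markdown, and even paper abstracts, URLs, and XML. The generated code, Markdown, XML, LaTeX, and foreign-language text are convincing in form, easily good enough to fool a layperson (someone who does not know the language in question).

I noticed that the #include at the start of some code samples was being treated as Markdown markup, so I added code-block markers. While I was at it, I also marked up the generated Markdown and LaTeX output.

Amusingly, the machine learned to print the GNU license first whenever it outputs code, word for word.

The original post also has two very interesting comments, one about feeding the model paper abstracts and one about feeding it guitar tabs. Both are great, so I copied them over as well.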


人工智能谈论写作时,他们在谈些什么

(用深度学习理论去学习武侠小说、网络小说、唐诗宋词,乃至色情小说、政府报告,人工智能将写出什么?本文一步步揭示了人工智能学习写作的过程。)

三月是人工智能的季节。月初,AlphaGo在韩国大胜李世石。月末,人工智能写出的小说在日本入围大奖赛。2016年,最热的科技概念无疑就是“人工智能”。人们对于人工智能的态度从最初的怀疑,到后来的震惊,再到现在的叹服。我们不仅在议论“人工智能能干些什么”,更在心里嘀咕:我们的职业还能存在多久?

不过,在热议之外,人工智能到底是什么,大多数人如同盲人摸象,仅仅只有一知半解。至于其内部的工作原理,一般公众更是一无所知。由于巨大的技术障碍,舆论对于人工智能仅仅停留于“高深、前沿、无所不能”的印象,而无法对之进行深入解析。公众的怀疑、震惊、叹服等等心态,说到底都是无知的体现。

知己知彼,方能百战不殆。在打响自己的“职业保卫战”之前,让我们解剖一只麻雀,看看人工智能到底是如何思考的,它到底有什么能耐,又有什么缺陷。

我们选取的麻雀是写作。

语言是人类的关键特征,它不仅是人类区别于动物,也是人类区别于机器的独特能力。著名的“图灵测试”,鉴别机器是否具备人类智能,就是以语言为考察指标的。

NLP(Natural Language Processing,自然语言处理)是一门研究如何让机器理解人类语言、写出人类文字的学科。这门学科有两种主要思路。一种是“符号法”,由人工建立语言的模型,提炼出语言的语法、词法,标注每个单词的含义、词性,然后教会机器这套原理,并让它根据语法去写作文章。——这种方法你一定很熟悉:不就是背单词、学语法、然后写作文吗?对,这就是我们在学校里学英语的方法。

另一种思路是“统计法”。与传统的老师教学生不同,这种方法更注重“自学”:扔一大批文字给机器,让它自己去寻找语言的规律,自己去尝试写作。

这种方法其实我们更熟悉。每个孩子不都是这样学习母语的吗?没有哪个父母会告诉孩子什么是名词、什么是动词、什么是疑问句强调句,孩子只是天天听、天天说,自然而然就掌握了一门语言。

当然,“符号法”与“统计法”并不是完全对立的,大多数NLP算法都同时包含两种思路。只不过在计算机发展的初级阶段,无力提供“统计法”所需要的大量计算资源,一直都是“符号法”更占上风。直到近十年以来,随着计算能力的大幅提高,尤其是深度学习理论的崛起,“统计法”才得到越来越广泛的应用。

一般认为,深度学习是通往真正人工智能的重要一步,如今声名鹊起的AlphaGo,即是基于深度学习理论而设计的。它并不理解围棋中诸如“势”、“厚薄”等种类繁多的术语,它只是学习了大量棋谱,从中总结出了围棋的规律,最终击败了李世石。AlphaGo取胜后,聂卫平一度苦思而不得其解,AlphaGo团队中没有一个围棋高手,为什么能打败世界冠军呢?其实,这就是聂卫平还停留在老师带学生的思路上。老师带学生,学生很难超过老师的高度,而AlphaGo是自学成才,不仅超越了顶尖的人类高手,还自创了许多精彩的招式。在人机大战后的春兰杯、围甲联赛中,人类棋手也开始应用AlphaGo所创造的定势,“AlphaGo流”迅速被围棋界所承认、吸收、并发扬光大。

在自然语言处理方面,也有一个著名的深度学习模型:卡帕西模型。

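Andrej Karpathy of Stanford University is a well-known figure in deep learning. In May 2015 he designed an NLP model based on a recurrent neural network and released the code on GitHub.

The model is very compact: only a few thousand lines of code, which any engineer with basic programming skills can read, download, and run. Yet it is remarkably capable. It has no built-in grammar and no dictionary, and it is not tied to English, French, or any other particular language. All it requires is text as input. Feed it a large amount of text and the recurrent neural network will analyze the passages and work out the relationships between one character and the next.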

所有文章说到底,不过是文字的排列组合,从数学角度而言,也就是一种文字的序列。如果计算机能够破解这个序列,掌握其规律,那么从理论上来说,它也能够生成这样的序列,写出一篇类似的作文。

理论非常简单,实战效果如何?现在,让我们假设有一个机器人,装备了卡帕西模型,他立志要成为一名中文小说家,他的表现会怎么样呢?

首先,要找一个学习范本。机器人找来的是中文世界最成功的流行小说:金庸的《射雕英雄传》。

机器人学习了一遍教材。《射雕英雄传》只有八十万字,学起来不费多少时间,一台普通的台式机,只要五六秒钟就可以学完。然后,机器人写出了如下文字:

呃,完全不知所云,是吧?

机器人所生成的文字,只是一些单字的随机组合。显然,他还没有找出任何文字的规律。此刻的机器人,如同一个初生婴儿,只会发出一些毫无意义的音调,完全不知语言为何物。

幸好,机器人有的是时间,他最擅长的就是重复性劳动。不过一分钟时间,他又学习了十遍,再次交出一篇作文:

还是看不懂。不过,比起上一篇也有进步,机器人写出的字不再是随机的,而是教材中出现频率较高的文字,如靖、黄、锋、全真等,都是书中主要人物的名号。可见那位婴儿正在注意聆听周围的语言,开始牙牙学语了。

机器人继续成长。十多分钟后,他已学习到一百遍:

这篇作文里的标点符号大幅增多。虽然其用法一塌糊涂,比如引号放在冒号之前,还常常会连用两个标点符号。但看来机器人已经意识到句子的存在,试图用句子来组织文字了。

当机器人学习到一千遍:

这篇作文简直是令人惊喜,就好像不知什么时候小孩子突然学会了说话,机器人对语言的领悟也突然有了一个飞跃。

最明显的变化是,他识别出了常用词汇,人物名字全部写对,连《九阴真经》也认识了。

更了不起的进步,则是机器人学会了句法,他会用标点符号了!每句话用逗号分隔,用句号结尾,双引号一一对应,而且全都位于冒号之后。

同样,他领会了“主谓宾”的句法结构,每句话都是以主语开头,后接人物动作。他也识别出了许多词汇的词性,并把它们放在正确的位置,形容词位于名词之前,副词位于动词之前。

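Overall the text is still incomprehensible, but it is much easier on the eye and is beginning to look like a real piece of human writing. The robot's language level is now roughly that of a child a little over one year old, able to stumble through a few sentences.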

这充分显示了机器学习的威力。在没有预设任何语法库、词汇库的情况下,机器人愣是通过反复学习,掌握了人类语言的规律!

再接再厉,机器人学习到一万遍:

很遗憾,一万遍的学习并没有带来明显进步,这次写出来的作文跟上一篇区别不大,仍然不太通顺,也仍然是词法、语法基本正确,而句子的含义还是让人无法理解。

机器人不知道什么叫挫折和放弃,他继续埋头苦干,又学习到十万遍。但这一次,他的文字依旧没有起色。新的一份作业,通顺程度还是跟一万遍的一样。

坚持一天一夜,当机器人学习到上百万遍,又交出若干份类似的作业,我们不得不失望地发现,机器人的金庸之路到此为止,他的《射雕英雄传》没法写得更好了。

不过,世界很大,中文作家除了金庸,还有很多。机器人又踏上古龙之路,学习了数万遍《古龙全集》之后,他拿出一份习作:

新的学习结果很有趣。熟悉武侠小说的朋友都知道,同为武侠大师,金庸、古龙的风格却是迥然不同。实际上,他们二人只不过恰好都写了武侠题材而已,从文学理论来看,金庸是一个古典作家,而古龙是一个现代作家。

而机器人的模仿之作,也明显能看出两位老师的风格差异:

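Gu Long's paragraphs are shorter, usually a single sentence each, whereas one of Jin Yong's paragraphs typically contains many actions and lines of dialogue.

Gu Long uses more psychological description and his language is more modern, whereas Jin Yong reads as distinctly classical.

Gu Long likes to show off: his characters always talk cool, as if they would rather silence an opponent with one line than lift a finger.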

作为人类,我们可以评论,机器人的作文还不是很通顺。然而神奇的是,他一不懂词汇,二不懂语法,更不了解句子含义,不认识金庸古龙是谁,但他却照样表现出了人类作家的细微差异。他所写出的作文,同样具备段落短、语言酷、深入人物心理这些典型的古龙式文风。

这就是机器学习的威力所在,他不再需要人类专家去煞费苦心地提炼模型,他自己就能从材料中总结出模型。这是人类认识世界的一个重大进步。以往,科学发展的典型流程是,开普勒这样的科学家观测、实验、收集数据,牛顿这样的科学大师研究、发明科学体系。而今后,我们或许不需要、至少是不那么依赖牛顿级别的天才,只要有一批开普勒提供数据,计算机就能帮我们完成最后的临门一脚,挖掘出人类想象不到的科学规律。

再审视一遍作文,我们还可以看出,同样是模仿之作,“吉龙”也比“全庸”写得更加通顺一些。结合机器学习的特点,我们不难找到原因。首先,机器人学习的两份材料样本,一个是《射雕英雄传》,八十万字,另一个是《古龙全集》,高达一千七百万字,是前者的二十倍。这方面机器和人是一样的,都是学得越多,写得越好。

另一方面,古龙的语言也比金庸简单,他的句子短,段落短,语法结构也较为简单。《射雕英雄传》篇幅不长,却用到三千多个单字,而厚厚的《古龙全集》,才用到两千多个。可以说,金庸小说的复杂性较高,更有利于人类审美,而古龙小说的规律性较强,更有利于机器人学习。
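The character-count comparison above is easy to reproduce for any corpus. The following is a minimal sketch, assuming the text is already loaded as a plain string; the helper name `char_stats` and the two toy strings are made up for illustration and stand in for the real novels.

```python
from collections import Counter

def char_stats(text):
    """Return (total characters, distinct characters), ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return len(chars), len(Counter(chars))

# Toy strings standing in for the real corpora (hypothetical data).
sample_a = "郭靖黄蓉欧阳锋郭靖黄蓉"
sample_b = "他笑了笑。他没有说话。"

for name, text in [("A", sample_a), ("B", sample_b)]:
    total, distinct = char_stats(text)
    # A lower distinct/total ratio means more repetition for the model to exploit.
    print(name, total, distinct, round(distinct / total, 2))
```

Run on full texts, the second number is the "distinct characters used" figure the article compares across authors.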

为了验证这个设想,机器人第三次踏上学习之路。这一次,他学习的是规律性要求最高的中文作品:《全唐诗》。

果然,机器人写出了一手漂亮的唐诗。以第一首诗为例,遣字造句像模像样,“花开”对“风吹”,人类诗人也不过如此。格律按古声为:平仄平平仄,平平仄仄平,平平仄仄仄,仄仄仄平平,算是相当工整。唯一的缺憾,就是还没有押韵。

从意境而言,这首诗生动形象,浅白明晰,自然流露出的那份性情,也颇有几分唐诗的风流味道。

其他几首诗,或是五言,或是七言,格式也都整整齐齐。虽然不全通顺,但奇妙的是,却大都能够读出一种独特的诗味。

这大概要归功于诗歌的艺术特色:重视语言的独创性、多义性。越是不符合惯例的诗歌,越是具有独创性,越是具备想象的空间。

前面我们已经总结出,机器学习的长处,在于从大量数据中寻找特征,总结模型。短处则是他并没有真正理解数据,他只是按照自己找出来的规律,似是而非地写作。这里,写作唐诗宋词,正好可以扬长避短。古诗词有明确的规律,只要读得够多,机器人可以准确地提炼出平仄、韵脚等规则。另一方面,古诗词强调意境,强调联想,看似不合道理的汉字组合,被机器人放到一起,反而别具一番风味。如今网上有许多写诗软件,或许就是这么来的。

经过三次考试,机器人的写作水平大致暴露在我们眼前。看起来,除了唐诗宋词,他似乎写不出什么像样的东西。那么,只要你是写现代文的,是不是就可以松一口气呢?

不一定。

也许,写作正经文章,机器人还差一口气。但是,不那么正经的文章,机器人还是有用武之地的。

首先是网络小说。以最著名的网络小说之一《斗破苍穹》为例:

这篇作文读起来仍然不太流畅,但相对金庸古龙的模仿之作而言,却算得上大体通顺,至少,大部分句子都读得懂了。似乎机器人一旦学习网络小说,就突然悟性大增了。

为什么呢?难道网络小说就特别适合机器写作吗?

答案是YES。从语言文字而言,古龙比金庸单薄,而网络小说又比古龙大大地单薄。前面提到,《古龙全集》用到两千多个单字,这已经比《射雕英雄传》少了许多,而《斗破苍穹》用到的单字,只有可怜的一千多个。

简单的对比就知道,古龙写作二十多年,也才写到一千七百万字,而《斗破苍穹》只花了两年,就写到六百万字,其中灌水的程度也就可想而知了。

简单、稀薄的文字,或许会被真正的作家所鄙视,但这却是机器人最喜欢的。越是单调重复,他越是能找出规律,模仿出类似的作品。也许,用不了多久,网络写手就会发现他们多出一个高产、全能、永不疲倦的竞争对手。

第二种体裁,是色情小说。机器人通读了《少妇白洁》之后,兴奋地写出以下文字:

这篇小黄文当然也是看不懂,没几个通顺的句子。但没关系,文中充斥的那些敏感词汇,到处都在挑逗读者的下半身。不用看懂情节,读者也知道他们在搞那男女之事。

色情文学的读者,我们不否认有一些是来看故事,看人物的,但大多数读者,还只是看一个性交场面,寻求一点动物本能的刺激而已。堆砌一批人体器官的敏感词,就足以激活他们的勃起神经。对于这些读者,机器人也就足够了。

至于文字通顺的缺点,上一篇之所以写得渣,很大程度上是因为《少妇白洁》非常短,只有区区六万字,机器人不足以掌握色情文学的精髓。只要机器人再多学几本描写抽插运动的专著,把句子写得更加流畅一些,我们就可以期盼色情文学的春天来临了。

还有一种体裁,也具备唐诗宋词一样严格的格式要求,其内容又包含大量单调、重复的段落,可预测性极强。这种体裁,自然也是机器学习的绝佳对象。

对,这篇流畅的作文,就是机器人学习了《历年政府工作报告(1954-2011)》后的结果。

不过,这个我们就不评论了。机器人学习这些无所谓,作者君不想被送去学习党章。

好了,现在我们大致知道,当机器人谈论写作时,他们在谈些什么。就卡帕西模型而言,他有一个天花板。他能够描绘出一种外表类似的文字风格,但却无法写出真正让人理解、让人心动的文章。

只要你不写唐诗宋词、不写网络小说、不写色情小说、不写八股文,暂时可以放心了。目前机器人的语言能力,也就跟一两岁孩子相当。

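But this does not mean that artificial intelligence will never catch up with ordinary human intelligence. As noted above, the Karpathy model is very simple: a few thousand lines of code, with no semantic or vocabulary database, a purely statistical model. Its main purpose is to demonstrate how recurrent neural networks work, and its value is more educational than practical. Commercial language bots in real use are far more complex and powerful.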

另一方面,也许你已经注意到,卡帕西模型的语言能力之所以上不去,是因为跟真正的孩子学说话相比,他还缺乏最重要的一环——外部反馈。

孩子学说话不仅仅是听,更重要的是说。语言学家指出,婴儿需要在母亲怀里亲密互动,需要与父母一起玩耍游戏。孩子每一句有意无意的依依唔唔,都能得到父母的回应与反馈,这种情境对于学习语言是最有利的。让孩子每天看电视,虽然他也听到了大量的对话,但由于缺乏双向沟通,孩子仍然无法习得正常的人类语言。

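In March 2016 Microsoft launched the chatbot Tay. She could learn from and imitate what users said, and converse in the style users liked. Tay immediately attracted attention: users flocked to her and trained her with all sorts of outrageous remarks, trying to push the bot past every limit. Bombarded with countless crude conversations, it took Tay only a day to turn into a delinquent who swore constantly, insulted users, and promoted Nazism.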

微软不得不将Tay紧急关闭,至今她还没有归来的迹象。但从另一方面来看,这也说明了对话在语言学习中的威力。如果机器人能够与人类频繁互动,从人类的反馈中得知什么样的句子是好的,什么样的表达方式是对的,什么样的文字是美的,那么,他离真正掌握人类语言就不远了。

这样的工作,谷歌、脸书等科技巨头已在开展。他们不仅有最先进的人工智能技术,还拥有世界上规模最大的用户,能够为机器人找到足够多的人类老师,进行足够多的训练。也许,谷歌和脸书的许多服务,已经是机器人在提供,只是用户并不知晓而已。我们在不知不觉之中,已经在充当训练机器人的人肉材料。

变革正在发生,未来无从知晓。对于文字工作者而言,人工智能仍然是一把悬在头顶的利剑。

[原创翻译]循环神经网络惊人的有效性(上)

作者:杜客
链接:https://zhuanlan.zhihu.com/p/22107715
来源:知乎
著作权归作者所有。商业转载请联系作者获得授权,非商业转载请注明出处。

版权声明:本文智能单元首发,本人原创翻译,禁止未授权转载。
译者注:经知友推荐,将The Unreasonable Effectiveness of Recurrent Neural Networks一文翻译作为CS231n课程无RNN和LSTM笔记的补充,感谢@堃堃的校对。
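Contents

Recurrent Neural Networks
Character-Level Language Models
Fun with RNNs
Paul Graham Generator
Shakespeare
Wikipedia
Algebraic Geometry
Linux Source Code
Generating Baby Names (translator's note: Part 1 ends here)
Understanding the Training Process
The Evolution of Samples During Training
Visualizing Predictions and Neuron Activations in the RNN
Source Code
Further Reading
Conclusion
Translator's Feedback
The original text follows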

循环神经网络(RNN)简直像是魔法一样不可思议。我为图像标注项目训练第一个循环网络时的情景到现在都还历历在目。当时才对第一个练手模型训练了十几分钟(超参数还都是随手设置的),它就开始生成一些对于图像的描述,描述内容看起来很不错,几乎让人感到语句是通顺的了。有时候你会遇到模型简单,结果的质量却远高预期的情况,这就是其中一次。当时这个结果让我非常惊讶是因为我本以为RNN是非常难以训练的(随着实践的增多,我的结论基本与之相反了)。让我们快进一年:即使现在我成天都在训练RNN,也常常看到它们的能力和鲁棒性,有时候它们那充满魔性的输出还是能够把我给逗乐。这篇博文就是来和你分享RNN中的一些魔法。

我们将训练RNN,让它们生成一个又一个字母。同时好好思考这个问题:这怎么可能呢?

顺便说一句,和这篇博文一起,我在Github上发布了一个项目。项目基于多层的LSTM,使得你可以训练字母级别的语言模型。你可以输入一大段文本,然后它能学习并按照一次一个字母的方式生成文本。你也可以用它来复现我下面的实验。但是现在我们要超前一点:RNN到底是什么?
循环神经网络

序列。基于知识背景,你可能会思考:是什么让RNN如此独特呢?普通神经网络和卷积神经网络的一个显而易见的局限就是他们的API都过于限制:他们接收一个固定尺寸的向量作为输入(比如一张图像),并且产生一个固定尺寸的向量作为输出(比如针对不同分类的概率)。不仅如此,这些模型甚至对于上述映射的演算操作的步骤也是固定的(比如模型中的层数)。RNN之所以如此让人兴奋,其核心原因在于其允许我们对向量的序列进行操作:输入可以是序列,输出也可以是序列,在最一般化的情况下输入输出都可以是序列。下面是一些直观的例子:

————————————————————————————————————————

上图中每个正方形代表一个向量,箭头代表函数(比如矩阵乘法)。输入向量是红色,输出向量是蓝色,绿色向量装的是RNN的状态(马上具体介绍)。从左至右为:

非RNN的普通过程,从固定尺寸的输入到固定尺寸的输出(比如图像分类)。
输出是序列(例如图像标注:输入是一张图像,输出是单词的序列)。
输入是序列(例如情绪分析:输入是一个句子,输出是对句子属于正面还是负面情绪的分类)。
输入输出都是序列(比如机器翻译:RNN输入一个英文句子输出一个法文句子)。
同步的输入输出序列(比如视频分类中,我们将对视频的每一帧都打标签)。
注意在每个案例中都没有对序列的长度做出预先规定,这是因为循环变换(绿色部分)是固定的,我们想用几次就用几次。

————————————————————————————————————————

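As you might expect, operating over sequences is far more powerful than a fixed network whose computational steps are all decided from the start, and it is also much more appealing to those of us who aspire to build more intelligent systems. As we will see later, an RNN combines its input vector with its state vector through a fixed (but learned) function to produce a new state vector. In programming terms, this can be read as running a fixed program with certain inputs and some internal variables. Seen this way, RNNs essentially describe programs. In fact, RNNs are Turing-complete: given the right weights, they can simulate arbitrary programs. But, much like the universal approximation theorems for neural networks, you should not read too much into this. In fact, forget I said anything.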

如果训练普通神经网络是对函数做最优化,那么训练循环网络就是针对程序做最优化。

无序列也能进行序列化处理。你可能会想,将序列作为输入或输出的情况是相对少见的,但是需要认识到的重要一点是:即使输入或输出是固定尺寸的向量,依然可以使用这个强大的形式体系以序列化的方式对它们进行处理。例如,下图来自于DeepMind的两篇非常不错的论文。左侧动图显示的是一个算法学习到了一个循环网络的策略,该策略能够引导它对图像进行观察;更具体一些,就是它学会了如何从左往右地阅读建筑的门牌号(Ba et al)。右边动图显示的是一个循环网络通过学习序列化地向画布上添加颜色,生成了写有数字的图片(Gregor et al)。

—————————————————————————————————————————

左边:RNN学会如何阅读建筑物门牌号。右边:RNN学会绘出建筑门牌号。 译者注:知乎专栏不支持动图,建议感兴趣读者前往原文查看。

————————————————————————————————————————

必须理解到的一点就是:即使数据不是序列的形式,仍然可以构建并训练出能够进行序列化处理数据的强大模型。换句话说,你是要让模型学习到一个处理固定尺寸数据的分阶段程序。

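RNN computation. So how does an RNN actually work? At its core, an RNN has a deceptively simple API: it accepts an input vector x and returns an output vector y. However, the content of this output vector is influenced not only by the input you just fed in, but also by the entire history of inputs. Written as a class, the RNN's API consists of a single step method: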

rnn = RNN()
y = rnn.step(x) # x is an input vector, y is the RNN's output vector

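Each time the step method is called, the RNN's internal state is updated. In the simplest case, this internal state consists of a single hidden vector h. Here is an implementation of the step method for a vanilla RNN: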

class RNN:
  # ...
  def step(self, x):
    # update the hidden state
    self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
    # compute the output vector
    y = np.dot(self.W_hy, self.h)
    return y

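The code above specifies the forward pass of a vanilla RNN. The RNN's parameters are three matrices: W_hh, W_xh, and W_hy. The hidden state self.h is initialized to the zero vector. np.tanh is a nonlinearity that squashes the activations into the range [-1, 1]. Note how the code works: there are two terms inside the tanh, one based on the previous hidden state and one based on the current input. In numpy, np.dot performs matrix multiplication. The two intermediate terms are added together and then squashed by tanh into the new state vector. If you prefer mathematical notation, the hidden state update is $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, where tanh is applied elementwise.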
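The step method above leaves out how the three matrices and the hidden state are created. A runnable version might look like the sketch below. The class name `VanillaRNN`, the layer sizes, the seed, and the 0.01 weight scale are all assumptions chosen for illustration, not values given in the post.

```python
import numpy as np

class VanillaRNN:
    def __init__(self, input_size, hidden_size, output_size, seed=0):
        rng = np.random.default_rng(seed)
        # Small random weights; the 0.01 scale is an arbitrary illustrative choice.
        self.W_xh = rng.standard_normal((hidden_size, input_size)) * 0.01
        self.W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
        self.W_hy = rng.standard_normal((output_size, hidden_size)) * 0.01
        self.h = np.zeros(hidden_size)  # hidden state starts as the zero vector

    def step(self, x):
        # New state depends on the previous state and the current input.
        self.h = np.tanh(self.W_hh @ self.h + self.W_xh @ x)
        # Output is a linear readout of the hidden state.
        return self.W_hy @ self.h

rnn = VanillaRNN(input_size=4, hidden_size=3, output_size=4)
x = np.array([1.0, 0.0, 0.0, 0.0])
y_first = rnn.step(x)
y_second = rnn.step(x)  # same input, but the hidden state has changed
print(y_first)
print(y_second)
```

Feeding the same vector twice gives two different outputs, which is the point made above: the output depends on the history of inputs through self.h, not only on the current x.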

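We initialize the RNN's matrices with random numbers, and the bulk of the work during training goes into finding the matrices that give rise to the desired behavior, as measured by a loss function that expresses which outputs y you would like to see in response to the inputs x.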

更深层网络。RNN属于神经网络算法,如果你像叠薄饼一样开始对模型进行重叠来进行深度学习,那么算法的性能会单调上升(如果没出岔子的话)。例如,我们可以像下面代码一样构建一个2层的循环网络:

y1 = rnn1.step(x)
y = rnn2.step(y1)

换句话说,我们分别有两个RNN:一个RNN接受输入向量,第二个RNN以第一个RNN的输出作为其输入。其实就RNN本身来说,它们并不在乎谁是谁的输入:都是向量的进进出出,都是在反向传播时梯度通过每个模型。
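The two-line stacking snippet can be made runnable with a small stand-in class. This is only a sketch: `TinyRNN`, its sizes, and the weight scale are hypothetical, and the one constraint that matters is that the output size of the first network matches the input size of the second.

```python
import numpy as np

class TinyRNN:
    """Minimal vanilla RNN used only to illustrate stacking (hypothetical helper)."""
    def __init__(self, n_in, n_hidden, n_out, rng):
        self.W_xh = rng.standard_normal((n_hidden, n_in)) * 0.1
        self.W_hh = rng.standard_normal((n_hidden, n_hidden)) * 0.1
        self.W_hy = rng.standard_normal((n_out, n_hidden)) * 0.1
        self.h = np.zeros(n_hidden)

    def step(self, x):
        self.h = np.tanh(self.W_hh @ self.h + self.W_xh @ x)
        return self.W_hy @ self.h

rng = np.random.default_rng(0)
rnn1 = TinyRNN(4, 8, 8, rng)  # reads the raw input vectors
rnn2 = TinyRNN(8, 8, 4, rng)  # reads rnn1's outputs

for x in np.eye(4):           # four one-hot inputs fed in sequence
    y1 = rnn1.step(x)
    y = rnn2.step(y1)

print(y.shape)
```

Each network keeps its own hidden state, so the second layer sees a history of first-layer outputs rather than the raw inputs.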

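Better networks. It should be briefly noted that in practice a slightly different formulation is normally used: the Long Short-Term Memory network mentioned earlier, or LSTM. The LSTM is a particular type of recurrent network that works somewhat better in practice, thanks to its more powerful update equation and better-behaved backpropagation dynamics. This post will not go into the details, but everything said here about RNNs stays the same; the only change is that the state update (the self.h = ... line) becomes more complicated. From here on I will use the terms RNN and LSTM interchangeably, but all experiments in this post were done with an LSTM.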

字母级别的语言模型

现在我们已经理解了RNN是什么,它们何以令人兴奋,以及它们是如何工作的。现在通过一个有趣的应用来更深入地加以体会:我们将利用RNN训练一个字母级别的语言模型。也就是说,给RNN输入巨量的文本,然后让其建模并根据一个序列中的前一个字母,给出下一个字母的概率分布。这样就使得我们能够一个字母一个字母地生成新文本了。

在下面的例子中,假设我们的字母表只由4个字母组成“helo”,然后利用训练序列“hello”训练RNN。该训练序列实际上是由4个训练样本组成:1.当h为上文时,下文字母选择的概率应该是e最高。2.l应该是he的下文。3.l应该是hel文本的下文。4.o应该是hell文本的下文。

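Concretely, we encode each character as a vector using 1-of-k encoding (all zeros except for a single 1 at the index of that character) and feed the vectors into the RNN one at a time with the step method. We then observe a sequence of 4-dimensional output vectors (one dimension per character), which we interpret as the confidence the RNN currently assigns to each character coming next in the sequence. Here is a diagram: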

————————————————————————————————————————

一个RNN的例子:输入输出是4维的层,隐层神经元数量是3个。该流程图展示了使用hell作为输入时,RNN中激活数据前向传播的过程。输出层包含的是RNN关于下一个字母选择的置信度(字母表是helo)。我们希望绿色数字大,红色数字小。

——————————————————————————————————————————

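For example, in the first time step, when the RNN sees the character h, it assigns a confidence of 1.0 to the next letter being h, 2.2 to e, -3.0 to l, and 4.1 to o. Since in our training data (the string hello) the correct next character is e, we want to increase its confidence (green) and decrease the confidence of all other letters (red). Similarly, at every step there is a target character to which we would like the network to assign greater confidence. Since the RNN consists entirely of differentiable operations, we can run backpropagation (a recursive application of the chain rule from calculus) to work out in which direction each weight should be adjusted to increase the score of the correct target (the green bold numbers). We then perform a parameter update, which nudges every weight a tiny amount in that direction. If we fed the same inputs to the RNN after the update, we would find that the score of the correct character (e in the first step) is slightly higher (for example 2.3 instead of 2.2) and the scores of the incorrect characters slightly lower. We repeat this process many times until the network converges and its predictions are consistent with the training data, in that the correct character is always predicted next.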

更技术派的解释是我们对输出向量同步使用标准的Softmax分类器(也叫作交叉熵损失)。使用小批量的随机梯度下降来训练RNN,使用RMSProp或Adam来让参数稳定更新。
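The Softmax classifier and cross-entropy loss can be worked through on the scores from the first time step (1.0, 2.2, -3.0, 4.1 with target e). The sketch below is illustrative only: it nudges the scores directly, whereas real training backpropagates this same gradient into the weight matrices, and the step size `lr` is an arbitrary assumption.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

alphabet = "helo"
scores = [1.0, 2.2, -3.0, 4.1]            # RNN output after seeing "h" (from the text)
target = alphabet.index("e")              # the correct next character

probs = softmax(scores)
loss = -math.log(probs[target])           # cross-entropy loss for this step

# Gradient of the loss with respect to the scores: probs - one_hot(target).
grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

# Moving against the gradient raises the score of "e" and lowers the others.
lr = 0.1                                  # illustrative step size
new_scores = [s - lr * g for s, g in zip(scores, grad)]

print([round(p, 3) for p in probs], round(loss, 3))
print([round(s, 3) for s in new_scores])
```

The gradient is negative only for the target character, which is why a small step increases the confidence in e and decreases it for h, l, and o.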

注意当字母l第一次输入时,目标字母是l,但第二次的目标是o。因此RNN不能只靠输入数据,必须使用它的循环连接来保持对上下文的跟踪,以此来完成任务。
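The training examples for "hello" and their 1-of-k input vectors can be listed explicitly. This is a small sketch using only the alphabet and training string from the text; the helper name `one_hot` is made up for illustration.

```python
alphabet = "helo"
char_to_ix = {ch: i for i, ch in enumerate(alphabet)}

def one_hot(ch):
    """1-of-k encoding: all zeros except a 1 at the character's index."""
    vec = [0] * len(alphabet)
    vec[char_to_ix[ch]] = 1
    return vec

text = "hello"
# Each example pairs the context seen so far with the character that follows it.
pairs = [(text[:i], text[i]) for i in range(1, len(text))]

for context, target in pairs:
    # The RNN is fed only the last character; earlier ones live in its hidden state.
    print(repr(context), "->", target, "input vector:", one_hot(context[-1]))
```

The third and fourth examples feed the same input vector (for l) but have different targets (l, then o), so the network can only get both right by using its hidden state.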

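At test time, we feed a character into the RNN and get a distribution over which character is likely to come next. We sample from this distribution and feed the sampled character right back in to get the next one. Repeat this process and you are generating text! Let us now train an RNN on different datasets and see what happens.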
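The generation loop can be sketched without a trained network. In the code below a fixed lookup table stands in for the RNN; the table `NEXT_PROBS` and its probabilities are invented for illustration. A real RNN would compute the distribution from its hidden state rather than from the last character alone.

```python
import random

# Stand-in for a trained model: next-character probabilities
# (hypothetical numbers, not taken from a real network).
NEXT_PROBS = {
    "h": {"e": 0.9, "o": 0.1},
    "e": {"l": 1.0},
    "l": {"l": 0.5, "o": 0.5},
    "o": {"h": 1.0},
}

def sample_next(ch, rng):
    """Draw the next character from the distribution for the current one."""
    chars = list(NEXT_PROBS[ch])
    weights = [NEXT_PROBS[ch][c] for c in chars]
    return rng.choices(chars, weights=weights, k=1)[0]

def generate(seed_char, length, rng):
    out = [seed_char]
    for _ in range(length - 1):
        # Feed the character just produced back in as the next input.
        out.append(sample_next(out[-1], rng))
    return "".join(out)

rng = random.Random(0)
print(generate("h", 12, rng))
```

Because each character is drawn at random from the distribution, different seeds for the random generator give different texts from the same model.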

为了更好的进行介绍,我基于教学目的写了代码:minimal character-level RNN language model in Python/numpy,它只有100多行。如果你更喜欢读代码,那么希望它能给你一个更简洁直观的印象。我们下面介绍实验结果,这些实验是用更高效的Lua/Torch代码实现的。

RNN的乐趣

下面介绍的5个字母模型我都放在Github上的项目里了。每个实验中的输入都是一个带有文本的文件,我们训练RNN让它能够预测序列中下一个字母。

Paul Graham生成器

译者注:中文名一般译为保罗•格雷厄姆,著有《黑客与画家》一书,中文版已面世。在康奈尔大学读完本科,在哈佛大学获得计算机科学博士学位。1995年,创办了Viaweb。1998年,Yahoo!收购了Viaweb,收购价约5000万美元。此后架起了个人网站http://paulgraham.com,在上面撰写关于软件和创业的文章,以深刻的见解和清晰的表达而著称。2005年,创建了风险投资公司Y Combinator,目前已经资助了80多家创业公司。现在,他是公认的互联网创业权威。

让我们先来试一个小的英文数据集来进行正确性检查。我最喜欢的数据集是Paul Graham的文集。其基本思路是在这些文章中充满智慧,但Paul Graham的写作速度比较慢,要是能根据需求生成富于创业智慧的文章岂不美哉?那么就轮到RNN上场了。

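I collected Paul Graham's essays from the last five years into a text file of about 1MB, roughly 1 million characters (a very small dataset). Technical details: train a 2-layer LSTM with 512 hidden nodes per layer (about 3.5 million parameters), with dropout of 0.5 after each layer. Each batch contains 100 examples, and backpropagation through time is truncated at 100 characters. With these settings, one batch takes about 0.46 seconds on a TITAN GPU (this can be cut in half with 50-character BPTT at negligible cost in performance). Translator's note: BPTT stands for Backpropagation Through Time. Without further ado, let us look at the text the RNN generated: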

“The surprised in investors weren’t going to raise money. I’m not the company with the time there are all interesting quickly, don’t have to get off the same programmers. There’s a super-angel round fundraising, why do you can do. If you have a different physical investment are become in people who reduced in a startup with the way to argument the acquirer could see them just that you’re also the founders will part of users’ affords that and an alternation to the idea. [2] Don’t work at first member to see the way kids will seem in advance of a bad successful startup. And if you have to act the big company too.”

好吧,显然生成器暂时还无法替代Paul Graham,但是RNN可是完全从头开始学英语的(包括逗号,撇号和空格),而且数据集又如此的小。我还很喜欢它自己学会了如何进行引用(例如上文中的[2])。有时候它甚至会说出一些充满智慧的洞见,比如“a company is a meeting to think to investors(公司就是一个琢磨如何让投资者打钱的会议)”。译者注:RNN你瞎说什么大实话:)如果你想要查看更多细节,点击这里。
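The figure of roughly 3.5 million parameters quoted for the 2-layer, 512-unit LSTM can be sanity-checked with a little arithmetic. The sketch below uses the common textbook LSTM layout (four gates, one bias vector per gate) and assumes a vocabulary of 100 characters; both are assumptions, and the exact count depends on the implementation's bias conventions and the true vocabulary size.

```python
def lstm_layer_params(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def char_lstm_params(vocab_size, hidden_size, num_layers):
    total = 0
    layer_input = vocab_size               # first layer reads one-hot characters
    for _ in range(num_layers):
        total += lstm_layer_params(layer_input, hidden_size)
        layer_input = hidden_size          # deeper layers read the layer below
    total += vocab_size * hidden_size + vocab_size   # output (softmax) layer
    return total

# vocab_size=100 is an assumed value for illustration.
print(char_lstm_params(vocab_size=100, hidden_size=512, num_layers=2))
```

Under these assumptions the total comes out at a little over 3.4 million, in line with the "about 3.5 million" stated in the text. Most of it sits in the second layer, whose input is already 512 wide.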

温度。在生成文本的时候,我们可以调节Softmax的温度。将温度从1降到更低的数值(比如0.5)可以让RNN更加自信的同时变得更加保守。相反,如果将温度设置的更高,结果会更加多样化,但是代价就是可能会出现错误(比如更多的拼写错误)。如果将温度设置得非常接近于0,我们就会得到最像Paul Graham说的话:

“is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same”

看来我们陷入到连续创业的无限循环中去了。
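The effect of temperature is easy to see numerically. In this sketch the scores are divided by the temperature before the Softmax; the scores reuse the four numbers from the hello example and are otherwise arbitrary, and the function name is made up for illustration.

```python
import math

def softmax_with_temperature(scores, temperature):
    scaled = [s / temperature for s in scores]
    m = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, 2.2, -3.0, 4.1]
for t in (2.0, 1.0, 0.5, 0.05):
    probs = softmax_with_temperature(scores, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature approaches zero, nearly all the probability mass lands on the single highest-scoring character, so sampling becomes almost deterministic and the model falls into repetitive loops like the one above. Higher temperatures flatten the distribution and make unlikely characters (and spelling mistakes) more frequent.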

莎士比亚

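It looks like we can get an RNN to learn to spell words. But what if the data has more structure and style? To find out, I downloaded all the works of Shakespeare and concatenated them into a single file of about 4.4MB. We can now afford to train a larger network, in this case a 3-layer RNN with 512 hidden nodes per layer. After a few hours of training, we obtained samples such as: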

PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.

DUKE VINCENTIO:
Well, your wit is in the care of side and that.

Second Lord:
They would be ruled after this chamber, and
my fair nues begun out of the fact, to be conveyed,
Whose noble souls I'll have the heart of the wars.

Clown:
Come, sir, I will make did behold your worship.

VIOLA:
I'll drink it.

记住,RNN只知道字符,所以它学会了说话者的名字和说话内容的格式,有时候我们还会得到类似独白的文字,比如:

VIOLA:
Why, Salisbury must find his flesh and thought
That which I am not aps, not a man and in fire,
To show the reining of the raven and the wars
To grace my hand reproach within, and not a fair are hand,
That Caesar and my goodly father's world;
When I was heaven of presence and our fleets,
We spare with hours, but cut thy council I am great,
Murdered and by thy master's ready there
My power to give thee but so much as hell:
Some service in the noble bondman here,
Would show him to her wine.

KING LEAR:
O, if you were a feeble sight, the courtesy of your law,
Your sight and several breath, will wear the gods
With his heads, and my hands are wonder'd at the deeds,
So drop upon your lordship's head, and your opinion
Shall be against your honour.

我个人是很难把这段话从莎士比亚的原作中分辨出来啦:)如果你喜欢莎士比亚,那可以来看看这篇长达1000000字符的生成文本。当然,你可以使用我提供的代码,在不同的温度设置下来生成不同的文本。

维基百科

我们看见LSTM能够拼写单词,复现语法结构。那么现在就提高难度,使用markdown文本对它进行训练。我使用了Hutter Prize的100MB的数据集,数据集内容是原始的维基百科内容,然后在LSTM上训练。根据Graves等的论文,我使用了其中96MB用于训练,剩下的用做验证集。模型跑了有一晚上,然后可以生成维基百科文章了。下面是一些有趣的文本片段。首先,一些基本的markdown输出:

Naturalism and decision for the majority of Arab countries' capitalide was grounded
by the Irish language by [[John Clair]], [[An Imperial Japanese Revolt]], associated 
with Guangzham's sovereignty. His generals were the powerful ruler of the Portugal 
in the [[Protestant Immineners]], which could be said to be directly in Cantonese 
Communication, which followed a ceremony and set inspired prison, training. The 
emperor travelled back to [[Antioch, Perth, October 25|21]] to note, the Kingdom 
of Costa Rica, unsuccessful fashioned the [[Thrales]], [[Cynth's Dajoard]], known 
in western [[Scotland]], near Italy to the conquest of India with the conflict. 
Copyright was the succession of independence in the slop of Syrian influence that 
was a famous German movement based on a more popular servicious, non-doctrinal 
and sexual power post. Many governments recognize the military housing of the 
[[Civil Liberalization and Infantry Resolution 265 National Party in Hungary]], 
that is sympathetic to be to the [[Punjab Resolution]]
(PJS)[http://www.humah.yahoo.com/guardian.
cfm/7754800786d17551963s89.htm Official economics Adjoint for the Nazism, Montgomery 
was swear to advance to the resources for those Socialism's rule, 
was starting to signing a major tripad of aid exile.]]

如果你注意到的话,yahoo的那个url是不存在的,是模型生造了它。还有,可以看见模型学会了对于圆括号要成对出现。模型还学会了很多markdown结构,比如标题,列表等:

{ { cite journal | id=Cerling Nonforest Department|format=Newlymeslated|none } }
''www.e-complete''.

'''See also''': [[List of ethical consent processing]]

== See also ==
*[[Iender dome of the ED]]
*[[Anti-autism]]

===[[Religion|Religion]]===
*[[French Writings]]
*[[Maria]]
*[[Revelation]]
*[[Mount Agamul]]

== External links==
* [http://www.biblegateway.nih.gov/entrepre/ Website of the World Festival. The labour of India-county defeats at the Ripper of California Road.]

==External links==
* [http://www.romanology.com/ Constitution of the Netherlands and Hispanic Competition for Bilabial and Commonwealth Industry (Republican Constitution of the Extent of the Netherlands)]

有时候模型也会生成一些随机但是合法的XML:

<page>
  <title>Antichrist</title>
  <id>865</id>
  <revision>
    <id>15900676</id>
    <timestamp>2002-08-03T18:14:12Z</timestamp>
    <contributor>
      <username>Paris</username>
      <id>23</id>
    </contributor>
    <minor />
    <comment>Automated conversion</comment>
    <text xml:space="preserve">#REDIRECT [[Christianity]]</text>
  </revision>
</page>

模型生成了时间戳,id和其他一些东西。同时模型也能正确地让标示符成对出现,嵌套规则也合乎逻辑。如果你对文本感兴趣,点击这里。

代数几何

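The results above suggest that the model is quite good at learning complex syntactic structures. Encouraged by these results, my colleague Justin Johnson and I decided to push further into structured territory. We found this book on algebraic geometry on the Stacks website, downloaded its raw LaTeX source (a 16MB file), and trained a multilayer LSTM on it. Amazingly, the sampled LaTeX almost compiles. After we fixed a few issues by hand, we got a plausible-looking piece of mathematics, which is quite astonishing: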

————————————————————————————————————————

生成的代数几何。这里是源文件。

——————————————————————————————————————————

这是另一个例子:

——————————————————————————————————————————

更像代数几何了,右边还出现了图表。

——————————————————————————————————————————

由上可见,模型有时候尝试生成latex图表,但是没有成功。我个人还很喜欢它跳过证明的部分(“Proof omitted”,在顶部左边)。当然,需要注意的是latex是相对困难的结构化语法格式,我自己都还没有完全掌握呢。下面是模型生成的一个源文件:

\begin{proof}
We may assume that $\mathcal{I}$ is an abelian sheaf on $\mathcal{C}$.
\item Given a morphism $\Delta : \mathcal{F} \to \mathcal{I}$
is an injective and let $\mathfrak q$ be an abelian sheaf on $X$.
Let $\mathcal{F}$ be a fibered complex. Let $\mathcal{F}$ be a category.
\begin{enumerate}
\item \hyperref[setain-construction-phantom]{Lemma}
\label{lemma-characterize-quasi-finite}
Let $\mathcal{F}$ be an abelian quasi-coherent sheaf on $\mathcal{C}$.
Let $\mathcal{F}$ be a coherent $\mathcal{O}_X$-module. Then
$\mathcal{F}$ is an abelian catenary over $\mathcal{C}$.
\item The following are equivalent
\begin{enumerate}
\item $\mathcal{F}$ is an $\mathcal{O}_X$-module.
\end{lemma}

这份模型输出的文本展示了一些常见错误。比如模型起了\begin{proof}的头,结尾却是\end{lemma}。这种错误我们就必须手动改正,错误产生的原因可能在于依赖关系过于长了:当模型完成证明后,它已经忘了自己的开头是proof还是lemma了。类似的,模型起了\begin{enumerate}的头,却忘了结尾。我们观察到,在使用更大规模或更好的模型后,这些错误变少了。然而,以上这些错误总是存在出现的可能性的。

Linux源码

我想要把结构化数据的实验推到极限,所以最后一个实验我们决定使用代码。具体说来,就是从Linux在Github上的项目提取了所有代码和能找到的头文件,把他们装进一个巨大的文件中(474MB的C代码)。我原本计划是只训练Linux的核心文件的,但是其大小只有约16MB。然后我在GPU上使用3层LSTM训练了好几天。这些模型有大约1000万的参数,这对于RNN模型来说还不算多的。实验结果超级有趣:

/*
 * Increment the size file of the new incorrect UI_FILTER group information
 * of the size generatively.
 */
static int indicate_policy(void)
{
  int error;
  if (fd == MARN_EPT) {
    /*
     * The kernel blank will coeld it to userspace.
     */
    if (ss->segment < mem_total)
      unblock_graph_and_set_blocked();
    else
      ret = 1;
    goto bail;
  }
  segaddr = in_SB(in.addr);
  selector = seg / 16;
  setup_works = true;
  for (i = 0; i < blocks; i++) {
    seq = buf[i++];
    bpf = bd->bd.next + i * search;
    if (fd) {
      current = blocked;
    }
  }
  rw->name = "Getjbbregs";
  bprm_self_clearl(&iv->version);
  regs->new = blocks[(BPF_STATS << info->historidac)] | PFMR_CLOBATHINC_SECONDS << 12;
  return segtable;
}

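This code looks really quite good. I doubt it would compile, but reading through it feels very much like browsing a giant C code base. Note that the RNN sprinkles comments over its own code here and there. It also makes very few syntactic errors: for example, it uses strings and pointer notation properly. It learned to pair up braces and square brackets and to indent its code correctly. A common error is that it cannot keep track of variable names: it often uses undeclared variables (such as rw above), declares variables it never uses (such as int error), or returns variables that do not exist. Let us look at a few more examples. Here is another snippet that shows a wider array of operations the RNN has learned: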

/*
 * If this error is set, we will need anything right after that BSD.
 */
static void action_new_function(struct s_stat_info *wb)
{
  unsigned long flags;
  int lel_idx_bit = e->edd, *sys & ~((unsigned long) *FIRST_COMPAT);
  buf[0] = 0xFFFFFFFF & (bit << 4);
  min(inc, slist->bytes);
  printk(KERN_WARNING "Memory allocated %02x/%02x, "
    "original MLL instead\n"),
    min(min(multi_run - s->len, max) * num_data_in),
    frame_pos, sz + first_seg);
  div_u64_w(val, inb_p);
  spin_unlock(&disk->queue_lock);
  mutex_unlock(&s->sock->mutex);
  mutex_unlock(&func->mutex);
  return disassemble(info->pending_bh);
}

static void num_serial_settings(struct tty_struct *tty)
{
  if (tty == tty)
    disable_single_st_p(dev);
  pci_disable_spool(port);
  return 0;
}

static void do_command(struct seq_file *m, void *v)
{
  int column = 32 << (cmd[2] & 0x80);
  if (state)
    cmd = (int)(int_state ^ (in_8(&ch->ch_flags) & Cmd) ? 2 : 1);
  else
    seq = 1;
  for (i = 0; i < 16; i++) {
    if (k & (1 << 1))
      pipe = (in_use & UMXTHREAD_UNCCA) +
        ((count & 0x00000000fffffff8) & 0x000000f) << 8;
    if (count == 0)
      sub(pid, ppc_md.kexec_handle, 0x20000000);
    pipe_set_bytes(i, 0);
  }
  /* Free our user pages pointer to place camera if all dash */
  subsystem_info = &of_changes[PAGE_SIZE];
  rek_controls(offset, idx, &soffset);
  /* Now we want to deliberately put it to device */
  control_check_polarity(&context, val, 0);
  for (i = 0; i < COUNTER; i++)
    seq_puts(s, "policy ");
}

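Note that in the second function the model compares tty == tty, which is always true. On the other hand, at least the parameter name appears in the function body this time. The last function returns nothing, which is correct since its signature declares it void. However, the first two functions are also declared void and yet do return values. This is again a common error caused by long-range dependencies.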

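Sometimes the model decides it is time to start a new file. This is usually a very amusing part: the model first recites the GNU license, adds a few includes, generates some macros, and then dives into the code: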


/*
 *  Copyright (c) 2006-2010, Intel Mobile Communications.  All rights reserved.
 *
 *   This program is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 as published by
 * the Free Software Foundation.
 *
 *        This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *
 *  GNU General Public License for more details.
 *
 *   You should have received a copy of the GNU General Public License
 *    along with this program; if not, write to the Free Software Foundation,
 *  Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 */

#include <linux/kexec.h>
#include <linux/errno.h>
#include <linux/io.h>
#include <linux/platform_device.h>
#include <linux/multi.h>
#include <linux/ckevent.h>

#include <asm/io.h>
#include <asm/prom.h>
#include <asm/e820.h>
#include <asm/system_info.h>
#include <asm/setew.h>
#include <asm/pgproto.h>

#define REG_PG    vesa_slot_addr_pack
#define PFM_NOCOMP  AFSR(0, load)
#define STACK_DDR(type)     (func)

#define SWAP_ALLOCATE(nr)     (e)
#define emulate_sigs()  arch_get_unaligned_child()
#define access_rw(TST)  asm volatile("movd %%esp, %0, %3" : : "r" (0));   \
  if (__type & DO_READ)

static void stat_PC_SEC __read_mostly offsetof(struct seq_argsqueue, \
          pC>[1]);

static void
os_prefix(unsigned long sys)
{
#ifdef CONFIG_PREEMPT
  PUT_PARAM_RAID(2, sel) = get_state_state();
  set_pid_sum((unsigned long)state, current_state_str(),
           (unsigned long)-1->lr_full; low;
}

这里面有太多有趣的地方可以讨论,我几乎可以写一整个博客,所以我现在还是暂停,感兴趣的可以查看这里。

生成婴儿姓名

让我们再试一个。给RNN输入一个包含8000个小孩儿姓名的文本文件,一行只有一个名字。(名字是从这里获得的)我们可以把这些输入RNN然后生成新的名字。下面是一些名字例子,只展示了那些没有在训练集中出现过的名字:

Rudi Levette Berice Lussa Hany Mareanne Chrestina Carissy Marylen Hammine Janye Marlise Jacacrie Hendred Romand Charienna Nenotto Ette Dorane Wallen Marly Darine Salina Elvyn Ersia Maralena Minoria Ellia Charmin Antley Nerille Chelon Walmor Evena Jeryly Stachon Charisa Allisa Anatha Cathanie Geetra Alexie Jerin Cassen Herbett Cossie Velen Daurenge Robester Shermond Terisa Licia Roselen Ferine Jayn Lusine Charyanne Sales Sanny Resa Wallon Martine Merus Jelen Candica Wallin Tel Rachene Tarine Ozila Ketia Shanne Arnande Karella Roselina Alessia Chasty Deland Berther Geamar Jackein Mellisand Sagdy Nenc Lessie Rasemy Guen Gavi Milea Anneda Margoris Janin Rodelin Zeanna Elyne Janah Ferzina Susta Pey Castina

点击这里可以查看更多。我个人最喜欢的名字包括“Baby” (哈), “Killie”,“Char”,“R”,“More”,“Mars”,“Hi”,“Saddie”,“With”和“Ahbort”。这真的蛮有意思,你还可以畅想在写小说或者给创业公司起名字的时候,这个能给你灵感。

上篇截止。

译者反馈

翻译不到位的地方,欢迎知友们评论批评指正;
在计算机视觉方面,个人对于图像标注(image caption)比较感兴趣,正在入坑。欢迎有同样兴趣的知友投稿讨论;
感谢@七月和@陈子服 的细节指正。

[原创翻译]循环神经网络惊人的有效性(下)

作者:杜客
链接:https://zhuanlan.zhihu.com/p/22230074
来源:知乎
著作权归作者所有。商业转载请联系作者获得授权,非商业转载请注明出处。

版权声明:本文智能单元首发,本人原创翻译,禁止未授权转载。
译者注:在CS231n课程笔记止步于CNN,没有循环神经网络(RNN和LSTM),实为憾事。经知友推荐,将The Unreasonable Effectiveness of Recurrent Neural Networks一文翻译完毕,作为补充。感谢@猴子,@堃堃和@李艺颖的校对。
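Contents

Recurrent Neural Networks
Character-Level Language Models
Fun with RNNs
Paul Graham Generator
Shakespeare
Wikipedia
Algebraic Geometry
Linux Source Code
Generating Baby Names
Understanding the Training Process (translator's note: Part 2 starts here)
The Evolution of Samples During Training
Visualizing Predictions and Neuron Activations in the RNN
Source Code
Further Reading
Conclusion
Translator's Feedback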
理解训练过程

我们已经看见训练结束后的结果令人印象深刻,但是它到底是如何运作的呢?现在跑两个小实验来一探究竟。

训练时输出文本的进化

首先,观察模型在训练时输出文本的不断进化是很有意思的。例如,我使用托尔斯泰的《战争与和平》来训练LSTM,并在训练过程中每迭代100次就输出一段文本。在第100次迭代时,模型输出的文本是随机排列的:

tyntd-iafhatawiaoihrdemot  lytdws  e ,tfti, astai f ogoh eoase rrranbyne 'nhthnee e 
plia tklrgd t o idoe ns,smtt   h ne etie h,hregtrs nigtike,aoaenns lng

但是至少可以看到它学会了单词是被空格所分割的,只是有时候它使用了两个连续空格。它还没学到逗号后面总是有个空格。在迭代到第300次的时候,可以看到模型学会使用引号和句号。

"Tmont thithey" fomesscerliund
Keushey. Thom here
sheulke, anmerenith ol sivh I lalterthend Bleipile shuwy fil on aseterlome
coaniogennc Phe lism thond hon at. MeiDimorotion in ther thize."

单词被空格所分割,模型开始知道在句子末尾使用句号。在第500次迭代时:

we counter. He stutn co des. His stanted out one ofler that concossions and was 
to gearang reay Jotrets and with fre colt otf paitt thin wall. Which das stimn 

模型开始学会使用最短和最常用的单词,比如“we”、“He”、“His”、“Which”、“and”等。从第700次迭代开始,可以看见更多和英语单词形似的文本:

Aftair fall unsuch that the hall for Prince Velzonski's that me of
her hearly, and behs to so arwage fiving were to it beloge, pavu say falling misfort 
how, and Gogition is so overelical and ofter.

By iteration 1200, we see the use of quotation marks, question marks, and exclamation marks, and longer words have started to appear.

"Kite vouch!" he repeated by her
door. "But I would be done and quarts, feeling, then, son is people...."

By iteration 2000, the model starts to produce correctly spelled words, quotations, and names.

"Why do what that day," replied Natasha, and wishing to himself the fact the
princess, Princess Mary was easier, fed in had oftened him.
Pierre aking his soul came to the packs and drove up his father-in-law women.

从上述结果中可见,模型首先发现的是一般的单词加空格结构,然后开始学习单词;从短单词开始,然后学习更长的单词。由多个单词组成的话题和主题词要到训练后期才会出现。

RNN中的预测与神经元激活可视化

另一个有趣的实验内容就是将模型对于字符的预测可视化。下面的图示是我们对用维基百科内容训练的RNN模型输入验证集数据(蓝色和绿色的行)。在每个字母下面我们列举了模型预测的概率最高的5个字母,并用深浅不同的红色着色。深红代表模型认为概率很高,白色代表模型认为概率较低。注意有时候模型对于预测的字母是非常有信心的。比如在http://www. 序列中就是。

输入字母序列也被着以蓝色或者绿色,这代表的是RNN隐层表达中的某个随机挑选的神经元是否被激活。绿色代表非常兴奋,蓝色代表不怎么兴奋。LSTM中细节也与此类似,隐藏状态向量中的值是[-1, 1],这就是经过各种操作并使用tanh计算后的LSTM细胞状态。直观地说,这就是当RNN阅读输入序列时,它的“大脑”中的某些神经元的激活率。不同的神经元关注的是不同的模式。在下面我们会看到4种不同的神经元,我认为比较有趣和能够直观理解(当然也有很多不能直观理解)。

————————————————————————————————————————

本图中高亮的神经元看起来对于URL的开始与结束非常敏感。LSTM看起来是用这个神经元来记忆自己是不是在一个URL中。

——————————————————————————————————————————

高亮的神经元看起来对于markdown符号[[]]的开始与结束非常敏感。有趣的是,一个[符号不足以激活神经元,必须等到两个[[同时出现。而判断有几个[的任务看起来是由另一个神经元完成的。

——————————————————————————————————————————

这是一个在[[]]中线性变化的神经元。换句话说,在[[]]中,它的激活是为RNN提供了一个以时间为准的坐标系。RNN可以使用该信息来根据字符在[[]]中出现的早晚来决定其出现的频率(也许?)。

——————————————————————————————————————————

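Here is a neuron that acts locally: it stays quiet most of the time and turns off sharply right after the first w in a www sequence. The RNN may be using this neuron to count how far into the www sequence it is, so that it knows whether to emit another w or to start the URL.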

——————————————————————————————————————————

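Of course, since the RNN's hidden state is a huge, high-dimensional, and largely distributed representation, many of these conclusions are somewhat hand-wavy. The visualizations above were produced with custom HTML/CSS/JavaScript; if you would like to build something similar, you can take a look here.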

我们可以进一步简化可视化效果:不显示预测字符仅仅显示文本,文本的着色代表神经元的激活情况。可以看到大部分的细胞做的事情不是那么直观能理解,但是其中5%看起来是学到了一些有趣并且能理解的算法:

—————————————————————————————————————————

—————————————————————————————————————————

在预测下个字符的过程中优雅的一点是:我们不用进行任何的硬编码。比如,不用去实现判断我们到底是不是在一个引号之中。我们只是使用原始数据训练LSTM,然后它自己决定这是个有用的东西于是开始跟踪。换句话说,其中一个单元自己在训练中变成了引号探测单元,只因为这样有助于完成最终任务。这也是深度学习模型(更一般化地说是端到端训练)强大能力的一个简洁有力的证据。
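For contrast, this is the kind of hand-written rule that the LSTM was never given and had to discover on its own. It is a minimal sketch of an explicit "am I inside a quote?" tracker; the function name and the choice to count the quotation marks themselves as inside are assumptions made for illustration.

```python
def inside_quote_flags(text):
    """Return one boolean per character: True while inside double quotes."""
    flags = []
    inside = False
    for ch in text:
        if ch == '"':
            inside = not inside
            flags.append(True)    # count the quotation marks themselves as inside
        else:
            flags.append(inside)
    return flags

sample = 'He said "hello there" and left.'
for ch, flag in zip(sample, inside_quote_flags(sample)):
    print(ch, int(flag))
```

A quote-detection cell in the trained network behaves roughly like the `inside` variable here, except that it is a continuous activation learned from data rather than a boolean written by a programmer.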

源代码

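I hope this post has convinced you that training a character-level language model is a fun exercise. You can train your own models with the char-rnn code I released on GitHub. It takes one large text file, trains a character-level model on it, and can then sample text from it. If you have a GPU, training will be roughly 10 times faster than on a CPU. If you end up with something amusing, please let me know. And if the Torch/Lua code gives you a headache, remember that it is just a fancier version of this 100-line project.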

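A brief digression. The code is written in Torch 7, which has recently become my favorite deep learning framework. I have only been working with Torch/Lua for a few months and it has not been easy (I spent a good amount of time digging through the raw Torch code on GitHub and asking its maintainers questions to get things done), but once you get the hang of it, it offers a lot of flexibility and speed. I have worked with Caffe and Theano before, and although Torch is not perfect, I believe it gets its levels of abstraction and its philosophy right better than the other two. In my view, an effective framework should have the following features: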

有丰富函数(例如切片,数组/矩阵操作等)的,对底层CPU/GPU透明的张量库。
一整个基于脚本语言(比如Python)的分离的代码库,能够对张量进行操作,实现所有深度学习内容(前向、反向传播,计算图等)。
分享预训练模型非常容易(Caffe做得很好,其他的不行)。
最关键的:没有编译过程!或者至少不要像Theano现在这样!深度学习的趋势是更大更复杂的网络,这些网络都有随着时间展开的复杂计算流程。编译时间不能太长,不然开发过程将充满痛苦。其次,编译导致开发者放弃解释能力,不能高效地进行调试。如果在流程开发完成后有个选项能进行编译,那也可以。
拓展阅读

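Before ending this post, I would like to place RNNs in a wider context and sketch some current research directions. RNNs have recently generated a significant amount of excitement in deep learning. Like convolutional networks, they have been around for decades, but their full potential has only recently started to be recognized, in large part thanks to our growing computational resources. Here is a brief sketch of some recent developments (definitely not a complete list, and much of this work draws on research going back to the 1990s):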

在NLP/语音领域,RNN将语音转化为文字,进行机器翻译,生成手写文本,当然也是强大的语言模型 (Sutskever等) (Graves) (Mikolov等)。字符级别和单词级别的模型都有,目前看来是单词级别的模型更领先,但是这只是暂时的。

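Computer vision. RNNs are also quickly becoming pervasive in computer vision. For example, RNNs are being used for video classification, image captioning (including my own work and many others), video captioning, and most recently visual question answering. My personal favorite RNN paper in computer vision is Recurrent Models of Visual Attention, both for its high-level direction (sequential processing of an image with glances) and for its low-level modeling (the REINFORCE learning rule, a special case of policy gradient methods in reinforcement learning, which makes it possible to train models that perform non-differentiable computation, in this case taking glances around the image). I am confident that this kind of hybrid model, with a CNN for raw perception and an RNN glance policy on top, will become pervasive in perception, especially for complex tasks that go beyond simply classifying objects.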

归纳推理,记忆和注意力(Inductive Reasoning, Memories and Attention)。另一个令人激动的研究方向是要解决普通循环网络自身的局限。RNN的一个问题是它不具有归纳性:它能够很好地记忆序列,但是从其表现上来看,它不能很好地在正确的方向上对其进行归纳(一会儿会举例让这个更加具体一些)。另一个问题是RNN在运算的每一步都将表达数据的尺寸和计算量联系起来,而这并非必要。比如,假设将隐藏状态向量尺寸扩大为2倍,那么由于矩阵乘法操作,在每一步的浮点运算量就要变成4倍。理想状态下,我们希望保持大量的表达和记忆(比如存储全部维基百科或者很多中间变量),但同时每一步的运算量不变。
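The claim that doubling the hidden state quadruples the per-step cost follows from the size of the recurrent matrix. Here is a tiny worked check, counting one multiply and one add per weight in W_hh; the function name and the sizes are arbitrary and chosen only for illustration.

```python
def recurrent_matmul_flops(hidden_size):
    # W_hh is hidden_size x hidden_size; each weight costs one multiply and one add.
    return 2 * hidden_size * hidden_size

for h in (256, 512, 1024):
    print(h, recurrent_matmul_flops(h))
```

Each doubling of hidden_size multiplies the count by four, which is why storing more in the hidden state directly makes every step more expensive.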

在该方向上第一个具有说服力的例子来自于DeepMind的神经图灵机(Neural Turing Machines)论文。该论文展示了一条路径:模型可以在巨大的外部存储数组和较小的存储寄存器集(将其看做工作的存储器)之间进行读写操作,而运算是在存储寄存器集中进行。更关键的一点是,神经图灵机论文提出了一个非常有意思的存储解决机制,该机制是通过一个(soft和全部可微分的)注意力模型来实现的。译者注:这里的soft取自softmax。基于概率的“软”注意力机制(soft attention)是一个强有力的建模特性,已经在面向机器翻译的《 Neural Machine Translation by Jointly Learning to Align and Translate》一文和面向问答的《Memory Networks》中得以应用。实际上,我想说的是:

注意力概念是近期神经网络领域中最有意思的创新。

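I do not want to go into too many details here, but a soft attention scheme for memory addressing is convenient because it keeps the model fully differentiable. The downside is efficiency, because everything that can be attended to is attended to (though softly). Think of a C pointer that does not point to one specific address but instead defines a distribution over all addresses in memory; dereferencing the pointer returns a weighted sum of the contents it points to (a very expensive operation). This has motivated many researchers to move from soft attention to hard attention, in which a particular chunk of memory is selected to attend to (for example, a read or write action on some memory cells rather than on all of them to some degree). This model is philosophically much more appealing, scalable, and efficient, but unfortunately it is no longer differentiable. That in turn calls for techniques from reinforcement learning (such as REINFORCE), where researchers are perfectly used to non-differentiable interactions. This is very much ongoing work, but hard attention models have been explored in, for example, Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets, Reinforcement Learning Neural Turing Machines, and Show Attend and Tell.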
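The pointer analogy can be made concrete with a toy memory. The sketch below contrasts a soft read (a weighted sum over every slot) with a hard read (committing to one slot). The memory contents, the addressing scores, and the seed are all invented for illustration and are not taken from any of the cited papers.

```python
import numpy as np

memory = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [2.0, 2.0]])        # 3 memory slots, 2 values each
scores = np.array([0.1, 0.2, 3.0])     # hypothetical addressing scores

# Soft attention: a distribution over all slots, then a weighted sum.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
soft_read = weights @ memory           # touches every slot; differentiable in scores

# Hard attention: sample one slot and read only that slot.
rng = np.random.default_rng(0)
slot = rng.choice(len(memory), p=weights)
hard_read = memory[slot]               # cheap, but the sampling step is not differentiable

print(weights.round(3), soft_read.round(3))
print(slot, hard_read)
```

The soft read costs work proportional to the size of the whole memory on every access, while the hard read costs the same no matter how large the memory is. The price is the discrete sampling step, which is where REINFORCE-style training comes in.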

研究者。如果你想在RNN方面继续研究,我推荐Alex Graves,Ilya Sutskever和Tomas Mikolov三位研究者。想要知道更多增强学习和策略梯度方法(REINFORCE算法是其中一个特例),可以学习David Silver的课程,或Pieter Abbeel的课程。

代码。如果你想要继续训练RNN,我听说Theano上的keras或passage还不错。我使用Torch写了一个项目,也用numpy实现了一个可以前向和后向传播的LSTM。你还可以在Github上看看我的NeuralTalk项目,是用RNN/LSTM来进行图像标注。或者看看Jeff Donahue用Caffe实现的项目。

结论

我们已经学习了RNN,知道了它如何工作,以及为什么它如此重要。我们还利用不同的数据集将RNN训练成字母级别的语言模型,观察了它是如何进行这个过程的。可以预见,在未来将会出现对RNN的巨大创新,我个人认为它们将成为智能系统的关键组成部分。

最后,为了给文章增添一点格调,我使用本篇博文对RNN进行了训练。然而由于博文的长度很短,不足以很好地训练RNN。但是返回的一段文本如下(使用低的温度设置来返回更典型的样本):

I've the RNN with and works, but the computed with program of the 
RNN with and the computed of the RNN with with and the code

是的,这篇博文就是讲RNN和它如何工作的,所以显然模型是有用的:)下次见!

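Translator's notes

Readers are welcome to point out and criticize any poorly translated passages in the comments;
On Torch versus TensorFlow: AK now works at OpenAI and mainly uses TF, but he still has a strong preference for Torch, as his latest blog post shows;
In computer vision, I am personally interested in image captioning and am just getting started with it. Submissions and discussion from readers with the same interest are welcome;
If you would like to join the translation group, please offer careful criticism and corrections in the comments on three of our newest translations in a row, after which the group will vote on whether to accept new members :) This small threshold helps us find people who truly enjoy machine learning and translation.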

The Unreasonable Effectiveness of Recurrent Neural Networks

There’s something magical about Recurrent Neural Networks (RNNs). I still remember when I trained my first recurrent network for Image Captioning. Within a few dozen minutes of training my first baby model (with rather arbitrarily-chosen hyperparameters) started to generate very nice looking descriptions of images that were on the edge of making sense. Sometimes the ratio of how simple your model is to the quality of the results you get out of it blows past your expectations, and this was one of those times. What made this result so shocking at the time was that the common wisdom was that RNNs were supposed to be difficult to train (with more experience I’ve in fact reached the opposite conclusion). Fast forward about a year: I’m training RNNs all the time and I’ve witnessed their power and robustness many times, and yet their magical outputs still find ways of amusing me. This post is about sharing some of that magic with you.

We’ll train RNNs to generate text character by character and ponder the question “how is that even possible?”
By the way, together with this post I am also releasing code on Github that allows you to train character-level language models based on multi-layer LSTMs. You give it a large chunk of text and it will learn to generate text like it one character at a time. You can also use it to reproduce my experiments below. But we’re getting ahead of ourselves; What are RNNs anyway?

Recurrent Neural Networks

Sequences. Depending on your background you might be wondering: What makes Recurrent Networks so special? A glaring limitation of Vanilla Neural Networks (and also Convolutional Networks) is that their API is too constrained: they accept a fixed-sized vector as input (e.g. an image) and produce a fixed-sized vector as output (e.g. probabilities of different classes). Not only that: These models perform this mapping using a fixed amount of computational steps (e.g. the number of layers in the model). The core reason that recurrent nets are more exciting is that they allow us to operate over sequences of vectors: Sequences in the input, the output, or in the most general case both. A few examples may make this more concrete:

Each rectangle is a vector and arrows represent functions (e.g. matrix multiply). Input vectors are in red, output vectors are in blue and green vectors hold the RNN's state (more on this soon). From left to right: (1) Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification). (2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words). (3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment). (4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French). (5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video). Notice that in every case there are no pre-specified constraints on the lengths of the sequences because the recurrent transformation (green) is fixed and can be applied as many times as we like.
As you might expect, the sequence regime of operation is much more powerful compared to fixed networks that are doomed from the get-go by a fixed number of computational steps, and hence also much more appealing for those of us who aspire to build more intelligent systems. Moreover, as we’ll see in a bit, RNNs combine the input vector with their state vector with a fixed (but learned) function to produce a new state vector. This can in programming terms be interpreted as running a fixed program with certain inputs and some internal variables. Viewed this way, RNNs essentially describe programs. In fact, it is known that RNNs are Turing-Complete in the sense that they can simulate arbitrary programs (with proper weights). But similar to universal approximation theorems for neural nets you shouldn’t read too much into this. In fact, forget I said anything.

If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs.
Sequential processing in absence of sequences. You might be thinking that having sequences as inputs or outputs could be relatively rare, but an important point to realize is that even if your inputs/outputs are fixed vectors, it is still possible to use this powerful formalism to process them in a sequential manner. For instance, the figure below shows results from two very nice papers from DeepMind. On the left, an algorithm learns a recurrent network policy that steers its attention around an image; In particular, it learns to read out house numbers from left to right (Ba et al.). On the right, a recurrent network generates images of digits by learning to sequentially add color to a canvas (Gregor et al.):

Left: RNN learns to read house numbers. Right: RNN learns to paint house numbers.
The takeaway is that even if your data is not in the form of sequences, you can still formulate and train powerful models that learn to process it sequentially. You’re learning stateful programs that process your fixed-sized data.

RNN computation. So how do these things work? At the core, RNNs have a deceptively simple API: They accept an input vector x and give you an output vector y. However, crucially this output vector’s contents are influenced not only by the input you just fed in, but also by the entire history of inputs you’ve fed in in the past. Written as a class, the RNN’s API consists of a single step function:

rnn = RNN()
y = rnn.step(x) # x is an input vector, y is the RNN's output vector

The RNN class has some internal state that it gets to update every time step is called. In the simplest case this state consists of a single hidden vector h. Here is an implementation of the step function in a Vanilla RNN:

class RNN:
  # ...
  def step(self, x):
    # update the hidden state
    self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
    # compute the output vector
    y = np.dot(self.W_hy, self.h)
    return y

The above specifies the forward pass of a vanilla RNN. This RNN’s parameters are the three matrices W_hh, W_xh, W_hy. The hidden state self.h is initialized with the zero vector. The np.tanh function implements a non-linearity that squashes the activations to the range [-1, 1]. Notice briefly how this works: There are two terms inside of the tanh: one is based on the previous hidden state and one is based on the current input. In numpy np.dot is matrix multiplication. The two intermediates interact with addition, and then get squashed by the tanh into the new state vector. If you’re more comfortable with math notation, we can also write the hidden state update as $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, where tanh is applied elementwise.

We initialize the matrices of the RNN with random numbers and the bulk of work during training goes into finding the matrices that give rise to desirable behavior, as measured with some loss function that expresses your preference to what kinds of outputs y you’d like to see in response to your input sequences x.
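For readers who want to run the step function, here is a minimal self-contained version of the class. The constructor is an assumption added for illustration (the sizes, the seed and the 0.1 weight scale are arbitrary choices, not values from the post); the `step` body is the one shown above.

```python
import numpy as np

class RNN:
    def __init__(self, input_size, hidden_size, output_size, seed=0):
        rng = np.random.default_rng(seed)
        # random initial weights; the 0.1 scale is an arbitrary illustrative choice
        self.W_xh = rng.standard_normal((hidden_size, input_size)) * 0.1
        self.W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
        self.W_hy = rng.standard_normal((output_size, hidden_size)) * 0.1
        self.h = np.zeros(hidden_size)  # hidden state starts as the zero vector

    def step(self, x):
        # update the hidden state
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # compute the output vector
        return np.dot(self.W_hy, self.h)

rnn = RNN(input_size=4, hidden_size=3, output_size=4)
x = np.array([1.0, 0.0, 0.0, 0.0])
y1 = rnn.step(x)
y2 = rnn.step(x)  # same input again, but the hidden state has moved on
```

Feeding the same `x` twice gives two different outputs, which is the "entire history of inputs" property in action.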

Going deep. RNNs are neural networks and everything works monotonically better (if done right) if you put on your deep learning hat and start stacking models up like pancakes. For instance, we can form a 2-layer recurrent network as follows:

y1 = rnn1.step(x)
y = rnn2.step(y1)
In other words we have two separate RNNs: One RNN is receiving the input vectors and the second RNN is receiving the output of the first RNN as its input. Except neither of these RNNs know or care - it’s all just vectors coming in and going out, and some gradients flowing through each module during backpropagation.

Getting fancy. I’d like to briefly mention that in practice most of us use a slightly different formulation than what I presented above called a Long Short-Term Memory (LSTM) network. The LSTM is a particular type of recurrent network that works slightly better in practice, owing to its more powerful update equation and some appealing backpropagation dynamics. I won’t go into details, but everything I’ve said about RNNs stays exactly the same, except the mathematical form for computing the update (the line self.h = ... ) gets a little more complicated. From here on I will use the terms “RNN/LSTM” interchangeably but all experiments in this post use an LSTM.
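As a rough idea of what "a little more complicated" means, here is a sketch of one commonly used LSTM formulation (input, forget and output gates plus a cell state). It is an assumption that this matches the variant in the released code; the class layout, the stacked weight matrix and the 0.1 scale are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LSTM:
    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        self.n = hidden_size
        # a single matrix yields the pre-activations of all four gates
        self.W = rng.standard_normal((4 * hidden_size, input_size + hidden_size)) * 0.1
        self.b = np.zeros(4 * hidden_size)
        self.h = np.zeros(hidden_size)  # hidden state (what the next layer sees)
        self.c = np.zeros(hidden_size)  # cell state (the additive memory)

    def step(self, x):
        n = self.n
        z = np.dot(self.W, np.concatenate([x, self.h])) + self.b
        i = sigmoid(z[:n])           # input gate
        f = sigmoid(z[n:2 * n])      # forget gate
        o = sigmoid(z[2 * n:3 * n])  # output gate
        g = np.tanh(z[3 * n:])       # candidate values
        self.c = f * self.c + i * g  # additive update of the memory
        self.h = o * np.tanh(self.c)
        return self.h

lstm = LSTM(input_size=4, hidden_size=3)
h1 = lstm.step(np.array([1.0, 0.0, 0.0, 0.0]))
```

The interface is unchanged: a vector goes in, a vector comes out, and internal state is carried between calls. Only the update rule differs from the vanilla RNN.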

Character-Level Language Models

Okay, so we have an idea about what RNNs are, why they are super exciting, and how they work. We’ll now ground this in a fun application: We’ll train RNN character-level language models. That is, we’ll give the RNN a huge chunk of text and ask it to model the probability distribution of the next character in the sequence given a sequence of previous characters. This will then allow us to generate new text one character at a time.

As a working example, suppose we only had a vocabulary of four possible letters “helo”, and wanted to train an RNN on the training sequence “hello”. This training sequence is in fact a source of 4 separate training examples: 1. The probability of “e” should be likely given the context of “h”, 2. “l” should be likely in the context of “he”, 3. “l” should also be likely given the context of “hel”, and finally 4. “o” should be likely given the context of “hell”.
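The four training examples can be enumerated mechanically. A small sketch (the helper name is made up for illustration):

```python
def next_char_examples(text):
    """Return (context, target) pairs: every prefix paired with the character that follows it."""
    return [(text[:i], text[i]) for i in range(1, len(text))]

examples = next_char_examples("hello")
for context, target in examples:
    print(repr(context), "->", repr(target))
```

A training sequence of length n therefore yields n - 1 next-character prediction examples.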

Concretely, we will encode each character into a vector using 1-of-k encoding (i.e. all zero except for a single one at the index of the character in the vocabulary), and feed them into the RNN one at a time with the step function. We will then observe a sequence of 4-dimensional output vectors (one dimension per character), which we interpret as the confidence the RNN currently assigns to each character coming next in the sequence. Here’s a diagram:

An example RNN with 4-dimensional input and output layers, and a hidden layer of 3 units (neurons). This diagram shows the activations in the forward pass when the RNN is fed the characters "hell" as input. The output layer contains confidences the RNN assigns for the next character (vocabulary is "h,e,l,o"); We want the green numbers to be high and red numbers to be low.
For example, we see that in the first time step when the RNN saw the character “h” it assigned confidence of 1.0 to the next letter being “h”, 2.2 to letter “e”, -3.0 to “l”, and 4.1 to “o”. Since in our training data (the string “hello”) the next correct character is “e”, we would like to increase its confidence (green) and decrease the confidence of all other letters (red). Similarly, we have a desired target character at every one of the 4 time steps that we’d like the network to assign a greater confidence to. Since the RNN consists entirely of differentiable operations we can run the backpropagation algorithm (this is just a recursive application of the chain rule from calculus) to figure out in what direction we should adjust every one of its weights to increase the scores of the correct targets (green bold numbers). We can then perform a parameter update, which nudges every weight a tiny amount in this gradient direction. If we were to feed the same inputs to the RNN after the parameter update we would find that the scores of the correct characters (e.g. “e” in the first time step) would be slightly higher (e.g. 2.3 instead of 2.2), and the scores of incorrect characters would be slightly lower. We then repeat this process over and over many times until the network converges and its predictions are eventually consistent with the training data in that correct characters are always predicted next.

A more technical explanation is that we use the standard Softmax classifier (also commonly referred to as the cross-entropy loss) on every output vector simultaneously. The RNN is trained with mini-batch Stochastic Gradient Descent and I like to use RMSProp or Adam (per-parameter adaptive learning rate methods) to stabilize the updates.
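Here is the Softmax loss for the first time step of the example, using the confidences quoted above (1.0, 2.2, -3.0, 4.1) with "e" as the target. This is a simplified sketch: it nudges the scores directly, whereas real training backpropagates this same gradient into the weights. The 0.1 step size is arbitrary.

```python
import numpy as np

scores = np.array([1.0, 2.2, -3.0, 4.1])  # confidences for "h", "e", "l", "o"
target = 1                                 # index of the correct next character "e"

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(z, target):
    return -np.log(softmax(z)[target])

probs = softmax(scores)
loss = cross_entropy(scores, target)

# gradient of the loss with respect to the scores: probabilities minus the one-hot target
dscores = probs.copy()
dscores[target] -= 1.0

# one small step against the gradient
new_scores = scores - 0.1 * dscores
new_loss = cross_entropy(new_scores, target)
```

The gradient is negative only at the target index, so the step raises the score of "e" and lowers every other score, which is the green-up, red-down behavior described above.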

Notice also that the first time the character “l” is input, the target is “l”, but the second time the target is “o”. The RNN therefore cannot rely on the input alone and must use its recurrent connection to keep track of the context to achieve this task.
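The 1-of-k encoding makes this point easy to check: the two "l" inputs are literally the same vector, yet their targets differ. A small sketch (variable names are illustrative):

```python
import numpy as np

vocab = ["h", "e", "l", "o"]

def one_hot(ch):
    """1-of-k encoding: all zeros except a single one at the character's index."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(ch)] = 1.0
    return vec

inputs = [one_hot(ch) for ch in "hell"]
targets = list("ello")

same_input = np.array_equal(inputs[2], inputs[3])  # both are the vector for "l"
different_target = targets[2] != targets[3]        # "l" versus "o"
```

A model without state would have to give the same answer at both steps, so it could get at most one of them right.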

At test time, we feed a character into the RNN and get a distribution over what characters are likely to come next. We sample from this distribution, and feed it right back in to get the next letter. Repeat this process and you’re sampling text! Let’s now train an RNN on different datasets and see what happens.
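The sampling loop itself is short. The sketch below uses an untrained vanilla RNN with random weights, so the output is gibberish; the point is only the feed-back structure. The weight shapes, seed and hidden size are assumptions for illustration.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
hidden_size = 8
rng = np.random.default_rng(0)

# untrained, randomly initialized weights
W_xh = rng.standard_normal((hidden_size, len(vocab))) * 0.1
W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.1
W_hy = rng.standard_normal((len(vocab), hidden_size)) * 0.1

def sample(seed_char, n):
    """Generate n characters after seed_char, feeding each sample back in as the next input."""
    h = np.zeros(hidden_size)
    ix = vocab.index(seed_char)
    out = [seed_char]
    for _ in range(n):
        x = np.zeros(len(vocab))
        x[ix] = 1.0                           # 1-of-k encoding of the current character
        h = np.tanh(W_hh @ h + W_xh @ x)      # update the hidden state
        scores = W_hy @ h
        p = np.exp(scores - scores.max())
        p /= p.sum()                          # distribution over the next character
        ix = rng.choice(len(vocab), p=p)      # sample, then feed it right back in
        out.append(vocab[ix])
    return "".join(out)

text = sample("h", 20)
print(text)
```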

To further clarify, for educational purposes I also wrote a minimal character-level RNN language model in Python/numpy. It is only about 100 lines long and hopefully it gives a concise, concrete and useful summary of the above if you’re better at reading code than text. We’ll now dive into example results, produced with the much more efficient Lua/Torch codebase.

Fun with RNNs

All 5 example character models below were trained with the code I’m releasing on Github. The input in each case is a single file with some text, and we’re training an RNN to predict the next character in the sequence.

Paul Graham generator

Let’s first try a small dataset of English as a sanity check. My favorite fun dataset is the concatenation of Paul Graham’s essays. The basic idea is that there’s a lot of wisdom in these essays, but unfortunately Paul Graham is a relatively slow generator. Wouldn’t it be great if we could sample startup wisdom on demand? That’s where an RNN comes in.

Concatenating all pg essays over the last ~5 years we get an approximately 1MB text file, or about 1 million characters (this is considered a very small dataset by the way). Technical: Let’s train a 2-layer LSTM with 512 hidden nodes (approx. 3.5 million parameters), and with dropout of 0.5 after each layer. We’ll train with batches of 100 examples and truncated backpropagation through time of length 100 characters. With these settings one batch on a TITAN Z GPU takes about 0.46 seconds (this can be cut in half with 50 character BPTT at negligible cost in performance). Without further ado, let’s see a sample from the RNN:

“The surprised in investors weren’t going to raise money. I’m not the company with the time there are all interesting quickly, don’t have to get off the same programmers. There’s a super-angel round fundraising, why do you can do. If you have a different physical investment are become in people who reduced in a startup with the way to argument the acquirer could see them just that you’re also the founders will part of users’ affords that and an alternation to the idea. [2] Don’t work at first member to see the way kids will seem in advance of a bad successful startup. And if you have to act the big company too.”

Okay, clearly the above is unfortunately not going to replace Paul Graham anytime soon, but remember that the RNN had to learn English completely from scratch and with a small dataset (including where you put commas, apostrophes and spaces). I also like that it learns to support its own arguments (e.g. [2], above). Sometimes it says something that offers a glimmer of insight, such as “a company is a meeting to think to investors”. Here’s a link to a 50K character sample if you’d like to see more.

Temperature. We can also play with the temperature of the Softmax during sampling. Decreasing the temperature from 1 to some lower number (e.g. 0.5) makes the RNN more confident, but also more conservative in its samples. Conversely, higher temperatures will give more diversity but at the cost of more mistakes (e.g. spelling mistakes, etc). In particular, setting temperature very near zero will give the most likely thing that Paul Graham might say:

“is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same thing that was a startup is that they were all the same”

looks like we’ve reached an infinite loop about startups.
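The effect of temperature is easy to see numerically: the scores are divided by the temperature before the Softmax. A minimal sketch, reusing the four example confidences from the "hello" diagram as stand-in scores (the function name is made up):

```python
import numpy as np

def softmax_with_temperature(scores, temperature):
    """Divide the scores by the temperature, then apply a numerically stable softmax."""
    z = np.asarray(scores, dtype=float) / temperature
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = [1.0, 2.2, -3.0, 4.1]
cold = softmax_with_temperature(scores, 0.5)     # sharper: more confident, more conservative
neutral = softmax_with_temperature(scores, 1.0)  # the plain softmax
hot = softmax_with_temperature(scores, 2.0)      # flatter: more diverse, more mistakes

for name, p in [("0.5", cold), ("1.0", neutral), ("2.0", hot)]:
    print(name, np.round(p, 3))
```

As the temperature approaches zero, almost all probability mass lands on the single highest score, so sampling becomes nearly deterministic and can fall into loops like the one above.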

Shakespeare

It looks like we can learn to spell English words. But how about if there is more structure and style in the data? To examine this I downloaded all the works of Shakespeare and concatenated them into a single (4.4MB) file. We can now afford to train a larger network, in this case let’s try a 3-layer RNN with 512 hidden nodes on each layer. After we train the network for a few hours we obtain samples such as:

PANDARUS:
Alas, I think he shall be come approached and the day
When little srain would be attain'd into being never fed,
And who is but a chain and subjects of his death,
I should not sleep.

Second Senator:
They are away this miseries, produced upon my soul,
Breaking and strongly should be buried, when I perish
The earth and thoughts of many states.

DUKE VINCENTIO:
Well, your wit is in the care of side and that.

Second Lord:
They would be ruled after this chamber, and
my fair nues begun out of the fact, to be conveyed,
Whose noble souls I'll have the heart of the wars.

Clown:
Come, sir, I will make did behold your worship.

VIOLA:
I'll drink it.
Remember, all the RNN knows are characters, so in particular it samples both speakers’ names and the contents. Sometimes we also get relatively extended monologue passages, such as:

VIOLA:
Why, Salisbury must find his flesh and thought
That which I am not aps, not a man and in fire,
To show the reining of the raven and the wars
To grace my hand reproach within, and not a fair are hand,
That Caesar and my goodly father's world;
When I was heaven of presence and our fleets,
We spare with hours, but cut thy council I am great,
Murdered and by thy master's ready there
My power to give thee but so much as hell:
Some service in the noble bondman here,
Would show him to her wine.

KING LEAR:
O, if you were a feeble sight, the courtesy of your law,
Your sight and several breath, will wear the gods
With his heads, and my hands are wonder'd at the deeds,
So drop upon your lordship's head, and your opinion
Shall be against your honour.

I can barely recognize these samples from actual Shakespeare :) If you like Shakespeare, you might appreciate this 100,000 character sample. Of course, you can also generate an infinite amount of your own samples at different temperatures with the provided code.

Wikipedia

We saw that the LSTM can learn to spell words and copy general syntactic structures. Let’s further increase the difficulty and train on structured markdown. In particular, let’s take the Hutter Prize 100MB dataset of raw Wikipedia and train an LSTM. Following Graves et al., I used the first 96MB for training, the rest for validation and ran a few models overnight. We can now sample Wikipedia articles! Below are a few fun excerpts. First, some basic markdown output:

Naturalism and decision for the majority of Arab countries' capitalide was grounded
by the Irish language by [[John Clair]], [[An Imperial Japanese Revolt]], associated 
with Guangzham's sovereignty. His generals were the powerful ruler of the Portugal 
in the [[Protestant Immineners]], which could be said to be directly in Cantonese 
Communication, which followed a ceremony and set inspired prison, training. The 
emperor travelled back to [[Antioch, Perth, October 25|21]] to note, the Kingdom 
of Costa Rica, unsuccessful fashioned the [[Thrales]], [[Cynth's Dajoard]], known 
in western [[Scotland]], near Italy to the conquest of India with the conflict. 
Copyright was the succession of independence in the slop of Syrian influence that 
was a famous German movement based on a more popular servicious, non-doctrinal 
and sexual power post. Many governments recognize the military housing of the 
[[Civil Liberalization and Infantry Resolution 265 National Party in Hungary]], 
that is sympathetic to be to the [[Punjab Resolution]]
(PJS)[http://www.humah.yahoo.com/guardian.
cfm/7754800786d17551963s89.htm Official economics Adjoint for the Nazism, Montgomery 
was swear to advance to the resources for those Socialism's rule, 
was starting to signing a major tripad of aid exile.]]

In case you were wondering, the yahoo url above doesn’t actually exist, the model just hallucinated it. Also, note that the model learns to open and close the parenthesis correctly. There’s also quite a lot of structured markdown that the model learns, for example sometimes it creates headings, lists, etc.:

{ { cite journal | id=Cerling Nonforest Department|format=Newlymeslated|none } }
''www.e-complete''.

'''See also''': [[List of ethical consent processing]]

== See also ==
*[[Iender dome of the ED]]
*[[Anti-autism]]

===[[Religion|Religion]]===
*[[French Writings]]
*[[Maria]]
*[[Revelation]]
*[[Mount Agamul]]

== External links==
* [http://www.biblegateway.nih.gov/entrepre/ Website of the World Festival. The labour of India-county defeats at the Ripper of California Road.]

==External links==
* [http://www.romanology.com/ Constitution of the Netherlands and Hispanic Competition for Bilabial and Commonwealth Industry (Republican Constitution of the Extent of the Netherlands)]

Sometimes the model snaps into a mode of generating random but valid XML:

<page>
  <title>Antichrist</title>
  <id>865</id>
  <revision>
    <id>15900676</id>
    <timestamp>2002-08-03T18:14:12Z</timestamp>
    <contributor>
      <username>Paris</username>
      <id>23</id>
    </contributor>
    <minor />
    <comment>Automated conversion</comment>
    <text xml:space="preserve">#REDIRECT [[Christianity]]</text>
  </revision>
</page>

The model completely makes up the timestamp, id, and so on. Also, note that it closes the correct tags appropriately and in the correct nested order. Here are 100,000 characters of sampled wikipedia if you’re interested to see more.

Algebraic Geometry (Latex)

The results above suggest that the model is actually quite good at learning complex syntactic structures. Impressed by these results, my labmate (Justin Johnson) and I decided to push even further into structured territories and got a hold of this book on algebraic stacks/geometry. We downloaded the raw Latex source file (a 16MB file) and trained a multilayer LSTM. Amazingly, the resulting sampled Latex almost compiles. We had to step in and fix a few issues manually but then you get plausible looking math, it’s quite astonishing:

Sampled (fake) algebraic geometry. Here's the actual pdf.
Here’s another sample:

More hallucinated algebraic geometry. Nice try on the diagram (right).
As you can see above, sometimes the model tries to generate latex diagrams, but clearly it hasn’t really figured them out. I also like the part where it chooses to skip a proof (“Proof omitted.”, top left). Of course, keep in mind that latex has a relatively difficult structured syntactic format that I haven’t even fully mastered myself. For instance, here is a raw sample from the model (unedited):

\begin{proof}
We may assume that $\mathcal{I}$ is an abelian sheaf on $\mathcal{C}$.
\item Given a morphism $\Delta : \mathcal{F} \to \mathcal{I}$
is an injective and let $\mathfrak q$ be an abelian sheaf on $X$.
Let $\mathcal{F}$ be a fibered complex. Let $\mathcal{F}$ be a category.
\begin{enumerate}
\item \hyperref[setain-construction-phantom]{Lemma}
\label{lemma-characterize-quasi-finite}
Let $\mathcal{F}$ be an abelian quasi-coherent sheaf on $\mathcal{C}$.
Let $\mathcal{F}$ be a coherent $\mathcal{O}_X$-module. Then
$\mathcal{F}$ is an abelian catenary over $\mathcal{C}$.
\item The following are equivalent
\begin{enumerate}
\item $\mathcal{F}$ is an $\mathcal{O}_X$-module.
\end{lemma}

This sample from a relatively decent model illustrates a few common mistakes. For example, the model opens a \begin{proof} environment but then ends it with a \end{lemma}. This is an example of a problem we’d have to fix manually, and is likely due to the fact that the dependency is too long-term: By the time the model is done with the proof it has forgotten whether it was doing a proof or a lemma. Similarly, it opens an \begin{enumerate} but then forgets to close it. We observed that these became less common with larger/better models, but nonetheless, these are the kinds of mistakes that come up.
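Mistakes of this kind are easy to detect mechanically with a stack. The checker below is a hypothetical helper, not part of the released code, and it only looks at \begin/\end pairs in a shortened version of the sample above:

```python
import re

def check_environments(latex):
    """Return a list of \\begin/\\end nesting problems (empty if well-formed)."""
    problems = []
    stack = []
    for kind, name in re.findall(r"\\(begin|end)\{([^}]*)\}", latex):
        if kind == "begin":
            stack.append(name)
        elif not stack:
            problems.append(f"\\end{{{name}}} without a matching \\begin")
        elif stack[-1] != name:
            problems.append(f"\\begin{{{stack[-1]}}} closed by \\end{{{name}}}")
            stack.pop()
        else:
            stack.pop()
    problems.extend(f"\\begin{{{name}}} never closed" for name in stack)
    return problems

sample = r"""
\begin{proof}
\begin{enumerate}
\item The following are equivalent
\begin{enumerate}
\item $\mathcal{F}$ is an $\mathcal{O}_X$-module.
\end{lemma}
"""

problems = check_environments(sample)
for problem in problems:
    print(problem)
```

The checker needs an explicit stack to remember which environments are open. The LSTM has to encode the same information in its hidden state across hundreds of characters, which is why long proofs are where it slips.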

Linux Source Code

I wanted to push structured data to its limit, so for the final challenge I decided to use code. In particular, I took all the source and header files found in the Linux repo on Github, concatenated all of them in a single giant file (474MB of C code) (I was originally going to train only on the kernel but that by itself is only ~16MB). Then I trained several as-large-as-fits-on-my-GPU 3-layer LSTMs over a period of a few days. These models have about 10 million parameters, which is still on the lower end for RNN models. The results are superfun:

/*
 * Increment the size file of the new incorrect UI_FILTER group information
 * of the size generatively.
 */
static int indicate_policy(void)
{
  int error;
  if (fd == MARN_EPT) {
    /*
     * The kernel blank will coeld it to userspace.
     */
    if (ss->segment < mem_total)
      unblock_graph_and_set_blocked();
    else
      ret = 1;
    goto bail;
  }
  segaddr = in_SB(in.addr);
  selector = seg / 16;
  setup_works = true;
  for (i = 0; i < blocks; i++) {
    seq = buf[i++];
    bpf = bd->bd.next + i * search;
    if (fd) {
      current = blocked;
    }
  }
  rw->name = "Getjbbregs";
  bprm_self_clearl(&iv->version);
  regs->new = blocks[(BPF_STATS << info->historidac)] | PFMR_CLOBATHINC_SECONDS << 12;
  return segtable;
}
The code looks really quite great overall. Of course, I don’t think it compiles but when you scroll through the generated code it feels very much like a giant C code base. Notice that the RNN peppers its code with comments here and there at random. It is also very good at making very few syntactic errors. For example, it uses strings properly, pointer notation, etc. It also opens and closes brackets {[ correctly and learns to indent its code very well. A common error is that it can’t keep track of variable names: It often uses undefined variables (e.g. rw above), declares variables it never uses (e.g. int error), or returns non-existing variables. Let’s see a few more examples. Here’s another snippet that shows a wider array of operations that the RNN learns:

/*
 * If this error is set, we will need anything right after that BSD.
 */
static void action_new_function(struct s_stat_info *wb)
{
  unsigned long flags;
  int lel_idx_bit = e->edd, *sys & ~((unsigned long) *FIRST_COMPAT);
  buf[0] = 0xFFFFFFFF & (bit << 4);
  min(inc, slist->bytes);
  printk(KERN_WARNING "Memory allocated %02x/%02x, "
    "original MLL instead\n"),
    min(min(multi_run - s->len, max) * num_data_in),
    frame_pos, sz + first_seg);
  div_u64_w(val, inb_p);
  spin_unlock(&disk->queue_lock);
  mutex_unlock(&s->sock->mutex);
  mutex_unlock(&func->mutex);
  return disassemble(info->pending_bh);
}

static void num_serial_settings(struct tty_struct *tty)
{
  if (tty == tty)
    disable_single_st_p(dev);
  pci_disable_spool(port);
  return 0;
}

static void do_command(struct seq_file *m, void *v)
{
  int column = 32 << (cmd[2] & 0x80);
  if (state)
    cmd = (int)(int_state ^ (in_8(&ch->ch_flags) & Cmd) ? 2 : 1);
  else
    seq = 1;
  for (i = 0; i < 16; i++) {
    if (k & (1 << 1))
      pipe = (in_use & UMXTHREAD_UNCCA) +
        ((count & 0x00000000fffffff8) & 0x000000f) << 8;
    if (count == 0)
      sub(pid, ppc_md.kexec_handle, 0x20000000);
    pipe_set_bytes(i, 0);
  }
  /* Free our user pages pointer to place camera if all dash */
  subsystem_info = &of_changes[PAGE_SIZE];
  rek_controls(offset, idx, &soffset);
  /* Now we want to deliberately put it to device */
  control_check_polarity(&context, val, 0);
  for (i = 0; i < COUNTER; i++)
    seq_puts(s, "policy ");
}

Notice that in the second function the model compares tty == tty, which is vacuously true. On the other hand, at least the variable tty exists in the scope this time! In the last function, notice that the code does not return anything, which happens to be correct since the function signature is void. However, the first two functions were also declared void and did return values. This is again a form of a common mistake due to long-term interactions.

Sometimes the model decides that it’s time to sample a new file. This is usually a very amusing part: The model first recites the GNU license character by character, samples a few includes, generates some macros and then dives into the code:

/*
 *  Copyright (c) 2006-2010, Intel Mobile Communications.  All rights reserved.
 *
 *   This program is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 as published by
 * the Free Software Foundation.
 *
 *        This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *
 *  GNU General Public License for more details.
 *
 *   You should have received a copy of the GNU General Public License
 *    along with this program; if not, write to the Free Software Foundation,
 *  Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 */

#include <linux/kexec.h>
#include <linux/errno.h>
#include <linux/io.h>
#include <linux/platform_device.h>
#include <linux/multi.h>
#include <linux/ckevent.h>

#include <asm/io.h>
#include <asm/prom.h>
#include <asm/e820.h>
#include <asm/system_info.h>
#include <asm/setew.h>
#include <asm/pgproto.h>

#define REG_PG    vesa_slot_addr_pack
#define PFM_NOCOMP  AFSR(0, load)
#define STACK_DDR(type)     (func)

#define SWAP_ALLOCATE(nr)     (e)
#define emulate_sigs()  arch_get_unaligned_child()
#define access_rw(TST)  asm volatile("movd %%esp, %0, %3" : : "r" (0));   \
  if (__type & DO_READ)

static void stat_PC_SEC __read_mostly offsetof(struct seq_argsqueue, \
          pC>[1]);

static void
os_prefix(unsigned long sys)
{
#ifdef CONFIG_PREEMPT
  PUT_PARAM_RAID(2, sel) = get_state_state();
  set_pid_sum((unsigned long)state, current_state_str(),
           (unsigned long)-1->lr_full; low;
}

There are too many fun parts to cover- I could probably write an entire blog post on just this part. I’ll cut it short for now, but here is 1MB of sampled Linux code for your viewing pleasure.

Generating Baby Names

Let’s try one more for fun. Let’s feed the RNN a large text file that contains 8000 baby names listed out, one per line (names obtained from here). We can feed this to the RNN and then generate new names! Here are some example names, only showing the ones that do not occur in the training data (90% don’t):

Rudi Levette Berice Lussa Hany Mareanne Chrestina Carissy Marylen Hammine Janye Marlise Jacacrie Hendred Romand Charienna Nenotto Ette Dorane Wallen Marly Darine Salina Elvyn Ersia Maralena Minoria Ellia Charmin Antley Nerille Chelon Walmor Evena Jeryly Stachon Charisa Allisa Anatha Cathanie Geetra Alexie Jerin Cassen Herbett Cossie Velen Daurenge Robester Shermond Terisa Licia Roselen Ferine Jayn Lusine Charyanne Sales Sanny Resa Wallon Martine Merus Jelen Candica Wallin Tel Rachene Tarine Ozila Ketia Shanne Arnande Karella Roselina Alessia Chasty Deland Berther Geamar Jackein Mellisand Sagdy Nenc Lessie Rasemy Guen Gavi Milea Anneda Margoris Janin Rodelin Zeanna Elyne Janah Ferzina Susta Pey Castina

You can see many more here. Some of my favorites include “Baby” (haha), “Killie”, “Char”, “R”, “More”, “Mars”, “Hi”, “Saddie”, “With” and “Ahbort”. Well that was fun. Of course, you can imagine this being quite useful inspiration when writing a novel, or naming a new startup :)
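Filtering out names that already occur in the training data is a simple set lookup. A sketch with tiny stand-in lists (the real training file has 8000 names; the helper name and the case-insensitive comparison are assumptions):

```python
def novel_names(generated, training):
    """Keep only generated names that do not appear in the training set (case-insensitive)."""
    seen = {name.strip().lower() for name in training}
    return [name for name in generated if name.strip().lower() not in seen]

training_names = ["Mary", "Anna", "Emma"]              # stand-in for the training file
generated_names = ["Rudi", "Mary", "Levette", "Emma"]  # hypothetical samples from the model

fresh = novel_names(generated_names, training_names)
novel_fraction = len(fresh) / len(generated_names)
print(fresh, novel_fraction)
```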

Understanding what’s going on

We saw that the results at the end of training can be impressive, but how does any of this work? Let’s run two quick experiments to briefly peek under the hood.

The evolution of samples while training

First, it’s fun to look at how the sampled text evolves while the model trains. For example, I trained an LSTM on Leo Tolstoy’s War and Peace and then generated samples every 100 iterations of training. At iteration 100 the model samples random jumbles:

tyntd-iafhatawiaoihrdemot lytdws e ,tfti, astai f ogoh eoase rrranbyne 'nhthnee e
plia tklrgd t o idoe ns,smtt   h ne etie h,hregtrs nigtike,aoaenns lng

However, notice that at least it is starting to get an idea about words separated by spaces. Except sometimes it inserts two spaces. It also doesn’t know that a comma is almost always followed by a space. At 300 iterations we see that the model starts to get an idea about quotes and periods:

"Tmont thithey" fomesscerliund
Keushey. Thom here
sheulke, anmerenith ol sivh I lalterthend Bleipile shuwy fil on aseterlome
coaniogennc Phe lism thond hon at. MeiDimorotion in ther thize."

The words are now also separated with spaces and the model starts to get the idea about periods at the end of a sentence. At iteration 500:

we counter. He stutn co des. His stanted out one ofler that concossions and was
to gearang reay Jotrets and with fre colt otf paitt thin wall. Which das stimn

The model has now learned to spell the shortest and most common words such as “we”, “He”, “His”, “Which”, “and”, etc. At iteration 700 we’re starting to see more and more English-like text emerge:

Aftair fall unsuch that the hall for Prince Velzonski's that me of
her hearly, and behs to so arwage fiving were to it beloge, pavu say falling misfort
how, and Gogition is so overelical and ofter.

At iteration 1200 we’re now seeing use of quotations and question/exclamation marks. Longer words have now been learned as well:

"Kite vouch!" he repeated by her
door. "But I would be done and quarts, feeling, then, son is people...."

Until at last we start to get properly spelled words, quotations, names, and so on by about iteration 2000:

"Why do what that day," replied Natasha, and wishing to himself the fact the
princess, Princess Mary was easier, fed in had oftened him.
Pierre aking his soul came to the packs and drove up his father-in-law women.

The picture that emerges is that the model first discovers the general word-space structure and then rapidly starts to learn the words, first starting with the short words and then eventually the longer ones. Topics and themes that span multiple words (and in general longer-term dependencies) start to emerge only much later.

Visualizing the predictions and the “neuron” firings in the RNN

Another fun visualization is to look at the predicted distributions over characters. In the visualizations below we feed a Wikipedia RNN model character data from the validation set (shown along the blue/green rows) and under every character we visualize (in red) the top 5 guesses that the model assigns for the next character. The guesses are colored by their probability (so dark red = judged as very likely, white = not very likely). For example, notice that there are stretches of characters where the model is extremely confident about the next letter (e.g., the model is very confident about characters during the http://www. sequence).
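
To make the "top 5 guesses" idea concrete, here is a minimal sketch of how a ranked list of next-character guesses could be read off a model's output scores. It assumes the model emits one unnormalized score per character; the score values and the helper names `softmax` and `top_guesses` are hypothetical and are not taken from the visualization code.

```python
import math


def softmax(scores):
    """Turn a dict of unnormalized scores into probabilities that sum to 1."""
    peak = max(scores.values())  # subtract the max for numerical stability
    exps = {ch: math.exp(s - peak) for ch, s in scores.items()}
    total = sum(exps.values())
    return {ch: e / total for ch, e in exps.items()}


def top_guesses(scores, k=5):
    """Return the k most probable next characters, most likely first."""
    probs = softmax(scores)
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical scores right after reading "http://ww": "w" dominates.
scores = {"w": 4.0, ".": 1.0, "a": 0.5, "e": 0.2, "t": 0.1, "x": -2.0}
for ch, p in top_guesses(scores):
    print(repr(ch), round(p, 3))
```

A confident stretch such as the http://www. sequence corresponds to one character holding most of the probability mass, as "w" does in this made-up example.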

The input character sequence (blue/green) is colored based on the firing of a randomly chosen neuron in the hidden representation of the RNN. Think about it as green = very excited and blue = not very excited (for those familiar with details of LSTMs, these are values between [-1,1] in the hidden state vector, which is just the gated and tanh’d LSTM cell state). Intuitively, this is visualizing the firing rate of some neuron in the “brain” of the RNN while it reads the input sequence. Different neurons might be looking for different patterns; below we’ll look at 4 different ones that I found and thought were interesting or interpretable (many also aren’t):

The neuron highlighted in this image seems to get very excited about URLs and turns off outside of the URLs. The LSTM is likely using this neuron to remember if it is inside a URL or not.

The highlighted neuron here gets very excited when the RNN is inside the [[ ]] markdown environment and turns off outside of it. Interestingly, the neuron can't turn on right after it sees the character "[", it must wait for the second "[" and then activate. This task of counting whether the model has seen one or two "[" is likely done with a different neuron.

Here we see a neuron that varies seemingly linearly across the [[ ]] environment. In other words its activation is giving the RNN a time-aligned coordinate system across the [[ ]] scope. The RNN can use this information to make different characters more or less likely depending on how early/late it is in the [[ ]] scope (perhaps?).

Here is another neuron that has very local behavior: it is relatively silent but sharply turns off right after the first "w" in the "www" sequence. The RNN might be using this neuron to count up how far in the "www" sequence it is, so that it can know whether it should emit another "w", or if it should start the URL.

Of course, a lot of these conclusions are slightly hand-wavy as the hidden state of the RNN is a huge, high-dimensional and largely distributed representation. These visualizations were produced with custom HTML/CSS/Javascript; you can see a sketch of what’s involved here if you’d like to create something similar.

We can also condense this visualization by excluding the most likely predictions and only visualize the text, colored by activations of a cell. We can see that in addition to a large portion of cells that do not do anything interpretable, about 5% of them turn out to have learned quite interesting and interpretable algorithms:

Again, what is beautiful about this is that we didn’t have to hardcode at any point that if you’re trying to predict the next character it might, for example, be useful to keep track of whether or not you are currently inside or outside of a quote. We just trained the LSTM on raw data and it decided that this is a useful quantity to keep track of. In other words one of its cells gradually tuned itself during training to become a quote detection cell, since this helps it better perform the final task. This is one of the cleanest and most compelling examples of where the power in Deep Learning models (and more generally end-to-end training) is coming from.
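
For contrast, this is roughly what the hand-coded version of that quantity would look like: a tiny state machine that flags whether each character sits inside a quoted span. It only illustrates the behavior the cell appears to approximate. The learned cell is a continuous activation rather than a boolean, and the function name `quote_state` and the choice to count the quote marks themselves as "inside" are assumptions made for this sketch.

```python
def quote_state(text, quote='"'):
    """Return one flag per character: 1 inside a quoted span, 0 outside."""
    inside = False
    flags = []
    for ch in text:
        if ch == quote:
            # Count the quote mark itself as inside, then flip the state.
            flags.append(1)
            inside = not inside
        else:
            flags.append(1 if inside else 0)
    return flags


sample = 'a "b" c'
print(list(zip(sample, quote_state(sample))))
```

The point of the paragraph above is that nobody wrote this function for the LSTM; a cell with similar on/off behavior emerged from training on raw text.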

Source Code

I hope I’ve convinced you that training character-level language models is a very fun exercise. You can train your own models using the char-rnn code I released on Github (under MIT license). It takes one large text file and trains a character-level model that you can then sample from. It helps to have a GPU; otherwise training on a CPU will be about 10x slower. In any case, if you end up training on some data and getting fun results let me know! And if you get lost in the Torch/Lua codebase remember that all it is is just a more fancy version of this 100-line gist.

Brief digression. The code is written in Torch 7, which has recently become my favorite deep learning framework. I’ve only started working with Torch/Lua over the last few months and it hasn’t been easy (I spent a good amount of time digging through the raw Torch code on Github and asking questions on their gitter to get things done), but once you get the hang of things it offers a lot of flexibility and speed. I’ve also worked with Caffe and Theano in the past and I believe Torch, while not perfect, gets its levels of abstraction and philosophy right better than others. In my view the desirable features of an effective framework are:

CPU/GPU transparent Tensor library with a lot of functionality (slicing, array/matrix operations, etc.)
An entirely separate code base in a scripting language (ideally Python) that operates over Tensors and implements all Deep Learning stuff (forward/backward, computation graphs, etc.)
It should be possible to easily share pretrained models (Caffe does this well, others don’t), and crucially
NO compilation step (or at least not as currently done in Theano). The trend in Deep Learning is towards larger, more complex networks that are time-unrolled in complex graphs. It is critical that these do not compile for a long time or development time greatly suffers. Second, by compiling one gives up interpretability and the ability to log/debug effectively. If there is an option to compile the graph once it has been developed for efficiency in prod that’s fine.

Further Reading

Before the end of the post I also wanted to position RNNs in a wider context and provide a sketch of the current research directions. RNNs have recently generated a significant amount of buzz and excitement in the field of Deep Learning. Similar to Convolutional Networks they have been around for decades but their full potential has only recently started to get widely recognized, in large part due to our growing computational resources. Here’s a brief sketch of a few recent developments (definitely not a complete list, and a lot of this work draws from research back to the 1990s, see related work sections):

In the domain of NLP/Speech, RNNs transcribe speech to text, perform machine translation, generate handwritten text, and of course, they have been used as powerful language models (Sutskever et al.) (Graves) (Mikolov et al.) (both on the level of characters and words). Currently it seems that word-level models work better than character-level models, but this is surely a temporary thing.

Computer Vision. RNNs are also quickly becoming pervasive in Computer Vision. For example, we’re seeing RNNs in frame-level video classification, image captioning (also including my own work and many others), video captioning and very recently visual question answering. My personal favorite RNNs in Computer Vision paper is Recurrent Models of Visual Attention, both due to its high-level direction (sequential processing of images with glances) and the low-level modeling (REINFORCE learning rule that is a special case of policy gradient methods in Reinforcement Learning, which allows one to train models that perform non-differentiable computation (taking glances around the image in this case)). I’m confident that this type of hybrid model that consists of a blend of CNN for raw perception coupled with an RNN glance policy on top will become pervasive in perception, especially for more complex tasks that go beyond classifying some objects in plain view.

Inductive Reasoning, Memories and Attention. Another extremely exciting direction of research is oriented towards addressing the limitations of vanilla recurrent networks. One problem is that RNNs are not inductive: They memorize sequences extremely well, but they don’t necessarily always show convincing signs of generalizing in the correct way (I’ll provide pointers in a bit that make this more concrete). A second issue is they unnecessarily couple their representation size to the amount of computation per step. For instance, if you double the size of the hidden state vector you’d quadruple the amount of FLOPS at each step due to the matrix multiplication. Ideally, we’d like to maintain a huge representation/memory (e.g. containing all of Wikipedia or many intermediate state variables), while maintaining the ability to keep computation per time step fixed.
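
The "double the hidden state, quadruple the FLOPS" claim follows from the hidden-to-hidden matrix having n x n entries. A minimal sketch that counts the multiply-adds directly, assuming a plain dense matrix-vector product (the function name and the sizes 64 and 128 are arbitrary choices for illustration):

```python
def matvec_multiply_adds(n):
    """Count the multiply-adds in one n-by-n matrix times n-vector product."""
    W = [[0.0] * n for _ in range(n)]  # placeholder hidden-to-hidden weights
    h = [0.0] * n                      # placeholder hidden state
    ops = 0
    for row in W:
        acc = 0.0
        for w, x in zip(row, h):
            acc += w * x
            ops += 1                   # one multiply-add per weight
    return ops


small = matvec_multiply_adds(64)
large = matvec_multiply_adds(128)
print(small, large, large / small)
```

The count grows with the square of the hidden size, which is why representation size and per-step computation are tied together in a vanilla recurrent network.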

The first convincing example of moving towards these directions was developed in DeepMind’s Neural Turing Machines paper. This paper sketched a path towards models that can perform read/write operations between large, external memory arrays and a smaller set of memory registers (think of these as our working memory) where the computation happens. Crucially, the NTM paper also featured very interesting memory addressing mechanisms that were implemented with a (soft, and fully-differentiable) attention model. The concept of soft attention has turned out to be a powerful modeling feature and was also featured in Neural Machine Translation by Jointly Learning to Align and Translate for Machine Translation and Memory Networks for (toy) Question Answering. In fact, I’d go as far as to say that

The concept of attention is the most interesting recent architectural innovation in neural networks.
Now, I don’t want to dive into too many details but a soft attention scheme for memory addressing is convenient because it keeps the model fully-differentiable, but unfortunately one sacrifices efficiency because everything that can be attended to is attended to (but softly). Think of this as declaring a pointer in C that doesn’t point to a specific address but instead defines an entire distribution over all addresses in the entire memory, and dereferencing the pointer returns a weighted sum of the pointed content (that would be an expensive operation!). This has motivated multiple authors to swap soft attention models for hard attention where one samples a particular chunk of memory to attend to (e.g. a read/write action for some memory cell instead of reading/writing from all cells to some degree). This model is significantly more philosophically appealing, scalable and efficient, but unfortunately it is also non-differentiable. This then calls for use of techniques from the Reinforcement Learning literature (e.g. REINFORCE) where people are perfectly used to the concept of non-differentiable interactions. This is very much ongoing work but these hard attention models have been explored, for example, in Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets, Reinforcement Learning Neural Turing Machines, and Show Attend and Tell.
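
The pointer analogy can be sketched in a few lines. The soft read below returns a weighted sum over every memory slot, so it touches all of memory, while the hard read samples a single slot. This is a toy illustration under stated assumptions: the memory contents, the scores, and the helper names `soft_read` and `hard_read` are made up here and do not come from any of the cited papers.

```python
import math
import random


def softmax(scores):
    """Normalize a list of scores into attention weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_read(memory, scores):
    """Soft attention: a weighted sum over ALL slots (differentiable, costly)."""
    weights = softmax(scores)
    dim = len(memory[0])
    return [sum(w * slot[i] for w, slot in zip(weights, memory)) for i in range(dim)]


def hard_read(memory, scores, rng=random):
    """Hard attention: sample ONE slot (cheap, but not differentiable)."""
    weights = softmax(scores)
    index = rng.choices(range(len(memory)), weights=weights, k=1)[0]
    return memory[index]


memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy memory slots
scores = [2.0, 0.5, 0.1]                        # hypothetical addressing scores

print(soft_read(memory, scores))
print(hard_read(memory, scores, random.Random(0)))
```

Because the hard read involves a sampling step, gradients cannot flow through the choice of slot, which is where techniques such as REINFORCE come in.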

People. If you’d like to read up on RNNs I recommend theses from Alex Graves, Ilya Sutskever and Tomas Mikolov. For more about REINFORCE and more generally Reinforcement Learning and policy gradient methods (which REINFORCE is a special case of), see David Silver’s class, or one of Pieter Abbeel’s classes.

Code. If you’d like to play with training RNNs I hear good things about keras or passage for Theano, the code released with this post for Torch, or this gist for raw numpy code I wrote a while ago that implements an efficient, batched LSTM forward and backward pass. You can also have a look at my numpy-based NeuralTalk which uses an RNN/LSTM to caption images, or maybe this Caffe implementation by Jeff Donahue.

Conclusion

We’ve learned about RNNs, how they work, why they have become a big deal, we’ve trained an RNN character-level language model on several fun datasets, and we’ve seen where RNNs are going. You can confidently expect a large amount of innovation in the space of RNNs, and I believe they will become a pervasive and critical component to intelligent systems.

Lastly, to add some meta to this post, I trained an RNN on the source file of this blog post. Unfortunately, at about 46K characters I haven’t written enough data to properly feed the RNN, but the returned sample (generated with low temperature to get a more typical sample) is:

I've the RNN with and works, but the computed with program of the
RNN with and the computed of the RNN with with and the code

Yes, the post was about RNN and how well it works, so clearly this works :). See you next time!
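
Sampling "with low temperature" means sharpening the predicted distribution before drawing a character, which makes likely characters even more likely. A minimal sketch, assuming the model hands back a probability per character; the function names and the three-character distribution are hypothetical, and raising probabilities to the power 1/T is equivalent to dividing the pre-softmax scores by T:

```python
import random


def apply_temperature(probs, temperature):
    """Rescale a distribution as p ** (1 / T), then renormalize."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_char(chars, probs, temperature=1.0, rng=random):
    """Draw one character from the temperature-adjusted distribution."""
    weights = apply_temperature(probs, temperature)
    return rng.choices(chars, weights=weights, k=1)[0]


chars = ["e", "t", "q"]
probs = [0.6, 0.3, 0.1]                 # made-up next-character probabilities

cold = apply_temperature(probs, 0.5)    # low temperature: more conservative
hot = apply_temperature(probs, 2.0)     # high temperature: more diverse
print(cold, hot)
print(sample_char(chars, probs, temperature=0.5, rng=random.Random(0)))
```

A low temperature produces safer, more repetitive text like the sample above, while a high temperature gives more variety at the cost of more mistakes.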

EDIT (extra links):

Videos:

I gave a talk on this work at the London Deep Learning meetup (video).

Discussions:

HN discussion
Reddit discussion on r/machinelearning
Reddit discussion on r/programming

Replies:

Yoav Goldberg compared these RNN results to n-gram maximum likelihood (counting) baseline
@nylk trained char-rnn on cooking recipes. They look great!
@MrChrisJohnson trained char-rnn on Eminem lyrics and then synthesized a rap song with robotic voice reading it out. Hilarious :)
@samim trained char-rnn on Obama Speeches. They look fun!
João Felipe trained char-rnn on Irish folk music and sampled music
Bob Sturm also trained char-rnn on music in ABC notation
RNN Bible bot by Maximilien
Learning Holiness learning the Bible
Terminal.com snapshot that has char-rnn set up and ready to go in a browser-based virtual machine (thanks @samim)

Featured Comment

4ω⁴/3c³ • a year ago

I used 400 MB of NSF Research Awards abstracts 1990-2003 for learning this char-RNN with 3 layers and size 1024. The generated abstracts seem almost reasonable and leave you with a feeling that you didn't quite understand the meaning because you're not familiar with nuances of special terms. Here they are, and here's one example:

Title       : Electoral Research on Presynaptic Problems in Subsequent Structures
Type        : Award
NSF Org     : DMS
Latest
Amendment
Date        : July 10,  1993
File        : a9213310

Award Number: 9261720
Award Instr.: Standard Grant
Prgm Manager: Phillip R. Taylor
	      OCE  DIVISION OF OCEAN SCIENCES
	      GEO  DIRECTORATE FOR GEOSCIENCES
Start Date  : September 1,  1992
Expires     : February 28,  1992   (Estimated)
Expected
Total Amt.  : $96200              (Estimated)
Investigator: Mark F. Schwartz   (Principal Investigator current)
Sponsor     : U of Cal Davis
	      OVCR/Sponsorptinch Ave AMbEr, Med Ot CTs, IN  428823462    812/471-6424

NSF Program : 1670      CHEMICAL OCEANOGRAPHY
Fld Applictn: 0204000   Oceanography
Program Ref : 9178,9267,SMET,
Abstract    :

              This project will investigate the surface microscopy of North Atlantic
              differential
              properties of the core conditions of the production
              of the decomposer system.

              This project seeks to develop a new approach to a
              control of hormone and the control of
              selection and fluxes in the early interactions of
              material determinations.  This project will be investigated
              to develop and exploit a combination of computational and
              controlled networking and engineering programs and the
              computational component of such event enhanced and operating
              concepts, and an electrode for the transition and interviews of
              molecular biology and in such systems.  The
              conference is to realize the relationships between these phases and
              physical sciences, and the effect of physical properties
              with processes in the possible constraints of relationships.
              The results will be used in a second part of this project
              in several courses with the experience of scientific and molecular
              backgrounds and the proposed research in the international
              sciences.

              The experimental research will test the robustness of
              more the structural conditions and the correlation of the neurons
              and to establish the more solution of the flux of
              the relevant complexity in structure.

              The research will be done by the consequences of
              extraction to be analyzed by means of advanced
              engineering type of starlings.

              This research is a collaborative research project between
              a contribution to the work on the development of a
              fundamental role in the construction of a state-of-the-art
              and related components of the estimation of the interaction of
              the control of proteins in the polymer system will be
              conducted at the American Element and the Forward and
              Conservation of Change and Atlantic and Atmospheric Synthesis of
              fluids and the functional conditional properties.  The
              research will provide a basic study of
              mechanisms

I wonder if NSF will be able to pass the Turing test if someone sends one of these generated proposals their way. :)

RalfD • 2 years ago

Maybe it would be fun to feed in musical notation and then hear the output? The Fake-Shakespeare and fake-programming produced here is impressive (even though nonsensical), but I wonder what would Beethoven, Bach or Mozart sound like? Chaotic or actually melodic?
James Blaha RalfD • 2 years ago
I fed in famous guitar tabs (about 500 MB worth), in ASCII. It now generates guitar tabs in ASCII well enough for me to import them into GuitarPro, where I recorded it playing back. I took that and imported to FL Studio, added some filters and a drum loop, but the notes and rhythms are otherwise totally unedited. This is what it sounds like:

https://soundcloud.com/optomet...

It is only about 20% done training on the file, and already getting good results!
bmilde James Blaha • a year ago
Hi James!

Really cool idea and something I want to give as an exercise to my students in a seminar. Could you send me your dataset? Thanks!
James Blaha bmilde • a year ago
Hi! I'd be happy to! Here is a bunch of stuff from it, including the dataset, trained models, and the files I used to convert the tabs back and forth.

https://drive.google.com/folde...
karpathy Mod James Blaha • 2 years ago
Neat, this is fun! What format is the input in?
James Blaha karpathy • 2 years ago
I formatted the input like:

%
E|------------------------------|--------------------------------------------------|
B|------------------------------|--------------------------------------------------|
G|------14--12--12--9---9---7---|-------5--------5---------------------------------|
D|------14--12--12--9---9---7---|----3--------3--------5--7-----5--7--5--7--5------|
A|------------------------------|-3--------3--------5--------5-----------------7---|
E|-0----------------------------|--------------------------------------------------|
%
E|--------------------------------|-------------------------------------|---------------------------|
B|--------------------------------|-------------------------------------|---------------------------|
G|--------------------------------|-5-----------7---7--7--0-------------|---------------------------|
D|------5---7---7---5---7--5------|-5----0------7---7--7----------------|---------------------------|
A|----------------------------7---|-3-----------5---5---------7--5------|------5---7---7---5---7----|
E|-0------------------------------|---------------------------------7---|-0-------------------------|
%

(Everything lines up with a monospace font) Standard ASCII tab format. I exported like this but removed all lyrics/comments and added the % between tab lines to help it differentiate new lines of tabs.

It gives me output that is much more consistent than the input:

%
E|-0-----------0-----------0-------|-0-------------------0-----------|-0-------------------------------|
B|---------1-----------------------|---------1-----------1-----------|---------0-----------0-----------|
G|-----2-----------2-----------2---|-----0---------------0-----------|-----0---------------0-----------|
D|-2-----------0-------0-----------|---------------------------------|---------------------------------|
A|---------------------------------|---------------------------------|---------------------------------|
E|---------------------------------|---------------------------------|---------------------------------|
%
E|---------------------------------|---------------------------------|---------------------------------|
B|-------------0-------------------|-------------0-------------------|-------------0-------------------|
G|-----0-----------0-----------0---|-----0-----------0-----------0---|-----0-----------0-------0-------|
D|---------------------------------|---------------------------------|-----------------0---------------|
A|---------------------------------|---------------------------------|---------------------------------|
E|---------------------------------|---------------------------------|---------------------------------|
%

This is what I put back into GuitarPro and have it play; it usually only needs minor fixes. Sometimes it likes to add another empty bar on just one line, or remove one, for instance.
karpathy Mod James Blaha • 2 years ago
hmm I'm not sure that this is very RNN friendly input form. Wouldn't it be better if you gave it contiguous chunks in time? E.g. instead of
E 1 2 3
B 4 5 6
G 7 8 9
you'd pass in something like 123.456.789. In other words, you're passing in groups of 6 things that all happened at the same time, and they are always delimited with a special character such as the dot.

I'd expect that to work significantly better
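
One way to read this suggestion is as a column-wise regrouping of each tab block: every time step becomes one group holding what all strings play at that moment, followed by a delimiter. The sketch below is an interpretation, not code from the thread. The function name `tab_to_timesteps` is made up, it assumes each row starts with a string label and a "|", and it treats every text column as one time step, so two-digit frets such as 14 would be split across two groups and need extra handling.

```python
def tab_to_timesteps(lines, sep="."):
    """Regroup an ASCII tab block column by column, one group per time step."""
    # Drop the "E|" style string label at the start of each row.
    rows = [line.split("|", 1)[1] for line in lines]
    # zip(*rows) walks all strings in lockstep and stops at the shortest row.
    columns = ["".join(col) for col in zip(*rows)]
    return sep.join(columns) + sep


block = [
    "E|-0-",
    "B|1--",
    "G|--2",
]
print(tab_to_timesteps(block))
```

The trade-off raised later in the thread still applies: this layout keeps simultaneous notes adjacent, while the row-wise layout keeps each string's melody contiguous.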
Umut Ozertem karpathy • 10 months ago
I agree. Also one other problem (maybe not a problem, not sure) with this is that the tablature takes fingering into account and for feeding data to an rnn to produce some music (more specifically chords, phrasing and some basic harmony) that shouldn't matter. There are multiple spots on the guitar to get the same note. 12th fret on E string = 7th fret on A string = 2nd fret on D string etc... (i.e. piano is a 1-d instrument and guitar is a 2-d instrument). Perhaps the network can learn that too but why not remove ambiguity first and then feed the data?

BTW how do you handle the timing of each note? Tablature is rather weak in that sense, no?
James Blaha karpathy • a year ago
This input format is significantly better on the formatting, it pretty much never messes up now. The music itself doesn't seem quite as good but maybe I have to let it train longer. I am getting a lower validation loss (from 0.24 on the old one to 0.19 on the new one).

When you say the training loss is MUCH lower, what do you mean by that? I'm currently getting a training loss of ~0.16 with a validation loss of ~0.19, and it is still improving as training goes on a small amount, so I think it isn't overfitting but I'm not sure.
theguy126 James Blaha • a year ago
how do you do validation loss on generated stuff?
karpathy Mod James Blaha • a year ago
0.16 < 0.19 so you're overfitting a small bit. I'd suggest you mix in a bit of dropout, maybe 0.1 or 0.25, or so. Make sure to sync your code with the one that's on Github now, I issued several improvements recently that should make things train quite a bit better.
James Blaha karpathy • 2 years ago
I was going to try that as well, but it seems like it can still figure it out this way since none of the lines are too long. I'll train one with the same settings on a dataset like that and let you know if it works better.

This way the structure of a single string, for instance for melody, should be better, whereas the other way chords, bars, and tempo should work better.