diff --git a/.DS_Store b/.DS_Store new file mode 100644 index 0000000..3e4a100 Binary files /dev/null and b/.DS_Store differ diff --git "a/1_Getting_Started_\345\205\245\351\227\250.md" "b/1_Getting_Started_\345\205\245\351\227\250.md" index 53b867c..fbe27ad 100644 --- "a/1_Getting_Started_\345\205\245\351\227\250.md" +++ "b/1_Getting_Started_\345\205\245\351\227\250.md" @@ -3,14 +3,13 @@ 这个教程并不是为了巩固研究生或者本科生的机器学习课程,但我们确实对一些重要的概念(和公式)做了的快速的概述,来确保我们在谈论同个概念。同时,你也需要去下载数据集,以便可以跑未来课程的样例代码。 -###下载 +### 下载 在每一个学习算法的页面,你都需要去下载相关的文件。加入你想要一次下载所有的文件,你可以克隆本教程的git仓库。 git clone git://github.com/lisa-lab/DeepLearningTutorials.git -###数据集 -####MNIST数据集 -(mnist.pkl.gz) +### 数据集 +#### MNIST数据集(mnist.pkl.gz) [MNIST](http://yann.lecun.com/exdb/mnist)是一个包含60000个训练样例和10000个测试样例的手写数字图像的数据集。在许多论文,包括本教程,都将60000个训练样例分为50000个样例的训练集和10000个样例的验证集(为了超参数,例如学习率、模型尺寸等等)。所有的数字图像都被归一化和中心化为28*28的像素,256位图的灰度图。 为了方便在Python中的使用,我们对数据集进行了处理。你可以在这里[下载](http://deeplearning.net/data/mnist/mnist.pkl.gz)。这个文件被表示为包含3个lists的tuple:训练集、验证集和测试集。每个lists都是都是两个list的组合,一个list是有numpy的1维array表示的784(28*28)维的0~1(0是黑,1是白)的float值,另一个list是0~9的图像标签。下面的代码显示了如何去加载这个数据集。 @@ -53,33 +52,33 @@ def shared_dataset(data_xy): 如果你再GPU上跑代码,并且数据集太大,可能导致内存崩溃。在这个时候,你就应当把数据存储为共享变量。你可以将数据储存为一个充分小的大块(几个minibatch)在一个共享变量里面,然后在训练的时候使用它。一旦你使用了这个大块,更新它储存的值。这将最小化CPU和GPU的内存交换。 -###标记 -####数据集标记 +### 标记 +#### 数据集标记 我们定义数据集为D,包括3个部分,D_train,D_valid,D_test三个集合。D内每个索引都是一个(x,y)对。 -####数学约定 +#### 数学约定 * W:大写字母表示矩阵(除非特殊说明) * W(i,j):矩阵内(i,j)点的数据 * W(i.):矩阵的一行 * W(.j):矩阵的一列 * b:小些字母表示向量(除非特殊说明) * b(i):向量内的(i)点的数据 -####符号和缩略语表 +#### 符号和缩略语表 * D:输入维度的数目 * D_h(i):第i层个隐层的输入单元数目 * L:标签的数目 * NLL:负对数似然函数 * theta:给定模型的参数集合 -####Python命名空间 +#### Python命名空间 ```Python import theano import theano.tensor as T import numpy ``` -###深度学习的监督优化入门 -####学习一个分类器 -#####0-1损失函数 +### 深度学习的监督优化入门 +#### 学习一个分类器 +##### 0-1损失函数 ![0-1_loss_1](/images/1_0-1_loss_1.png) ![0-1_loss_2](/images/1_0-1_loss_2.png) @@ -94,7 +93,7 @@ import numpy zero_one_loss = T.sum(T.neq(T.argmax(p_y_given_x), y)) ``` -#####负对数似然损失函数 +##### 负对数似然损失函数 由于0-1损失函数不可微分,在大型模型中对它优化会造成巨大开销。因此我们通过最大化给定数据标签的似然函数来训练模型。 ![nll_1](/images/1_negative_log_likelihod_1.png) @@ -114,7 +113,7 @@ NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y]) # syntax to retrieve the log-probability of the correct labels, y. 
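# Hedged illustration of the indexing trick above (the example values are
# hypothetical, not from the tutorial): if a minibatch holds 3 examples with
# labels y = [2, 0, 1], then T.arange(y.shape[0]) evaluates to [0, 1, 2], and
# T.log(p_y_given_x)[T.arange(y.shape[0]), y] selects LP[0, 2], LP[1, 0],
# LP[2, 1], i.e. the log-probability assigned to the correct class of each
# example, before the negated sum is taken.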
``` -####随机梯度下降 +#### 随机梯度下降 什么是普通的梯度下降?梯度下降是一个简单的算法,利用负梯度方向来决定每次迭代的新的搜索方向,使得每次迭代能使待优化的目标函数逐步减小。 伪代码如下所示。 @@ -180,9 +179,9 @@ for (x_batch, y_batch) in train_batches: return params ``` -####正则化 +#### 正则化 正则化是为了防止在MSGD训练过程中出现过拟合。为了应对过拟合,我们提出了几个方法:L1/L2正则化和early-stopping。 -#####L1/L2正则化 +##### L1/L2正则化 L1/L2正则化就是在损失函数中添加额外的项,用以惩罚一定的参数结构。对于L2正则化,又被称为“权制递减(weight decay)”。 ![l1_l2_regularization_1](/images/1_l1_l2_regularization_1.png) @@ -204,7 +203,7 @@ L2_sqr = T.sum(param ** 2) loss = NLL + lambda_1 * L1 + lambda_2 * L2 ``` -#####Early-stopping +##### Early-stopping Early-stopping通过监控模型在验证集上的表现来应对过拟合。验证集是一个我们从未在梯度下降中使用,也不在测试集的数据集合,它被认为是为了测试数据的一个表达。当在验证集上,模型的表现不再提高,或者表现更差,那么启发式算法应该放弃继续优化。 在选择何时终止优化方面,主要基于主观判断和一些启发式的方法,但在这个教程里,我们使用一个几何级数增加的patience量的策略。 @@ -266,17 +265,17 @@ while (epoch < n_epochs) and (not done_looping): 这个`validation_frequency`应该要比`patience`更小。这个代码应该至少检查了两次,在使用`patience`之前。这就是我们使用这个等式`validation_frequency = min( value, patience/2.`的原因。 这个算法可能会有更好的表现,当我们通过统计显著性的测试来代替简单的比较来决定是否增加patient。 -####测试 +#### 测试 我们依据在验证集上表现最好的参数作为模型的参数,去在测试集上进行测试。 -####总结 +#### 总结 这是对优化章节的总结。Early-stopping技术需要我们将数据分割为训练集、验证集、测试集。测试集使用minibatch的随机梯度下降来对目标函数进行逼近。同时引入L1/L2正则项来应对过拟合。 -###Theano/Python技巧 -####载入和保存模型 +### Theano/Python技巧 +#### 载入和保存模型 当你做实验的时候,用梯度下降算法可能要好几个小时去发现一个最优解。你可能在发现解的时候,想要保存这些权值。你也可能想要保存搜索进程中当前最优化的解。 -#####使用Pickle在共享变量中储存numpy的ndarrays +##### 使用Pickle在共享变量中储存numpy的ndarrays ```Python >>> import cPickle >>> save_file = open('path', 'wb') # this will overwrite current contents diff --git "a/2_Classifying_MNIST_using_LR_\351\200\273\350\276\221\345\233\236\345\275\222\350\277\233\350\241\214MNIST\345\210\206\347\261\273.md" "b/2_Classifying_MNIST_using_LR_\351\200\273\350\276\221\345\233\236\345\275\222\350\277\233\350\241\214MNIST\345\210\206\347\261\273.md" index ae83bfe..57971ca 100644 --- "a/2_Classifying_MNIST_using_LR_\351\200\273\350\276\221\345\233\236\345\275\222\350\277\233\350\241\214MNIST\345\210\206\347\261\273.md" +++ "b/2_Classifying_MNIST_using_LR_\351\200\273\350\276\221\345\233\236\345\275\222\350\277\233\350\241\214MNIST\345\210\206\347\261\273.md" @@ -6,7 +6,7 @@ 在这一节,我们将展示Theano如何实现最基本的分类器:逻辑回归分类器。我们以模型的快速入门开始,复习(refresher)和巩固(anchor)数学负号,也展示了数学表达式如何映射到Theano图中。 -###模型 +## 模型 逻辑回归模型是一个线性概率模型。它由一个权值矩阵W和偏置向量b参数化。分类通过将输入向量提交到一组超平面,每个超平面对应一个类。输入向量和超平面的距离是这个输入属于该类的一个概率量化。 在给定模型下,输入x,输出为y的概率,可以用如下公式表示 @@ -55,7 +55,7 @@ Theano代码如下。 为了获得实际的模型预测,我们使用`T_argmax`操作,来返回`p_y_given_x`的最大值对应的y。 如果想要获得完整的Theano算子,看[算子列表](http://deeplearning.net/software/theano/library/tensor/basic.html#basic-tensor-functionality) -###定义一个损失函数 +## 定义一个损失函数 学习优化模型参数需要最小化一个损失参数。在多分类的逻辑回归中,很显然是使用负对数似然函数作为损失函数。似然函数和损失函数定义如下:
![loss_function](/images/2_defining_a_loss_function_1.png)
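The figure above is an image in the original tutorial. As a hedged reconstruction, assuming the standard multiclass negative log-likelihood spelled out later in this diff (in the `logistic_sgd.py` docstring), it plausibly corresponds to:

```latex
\mathcal{L}\big(\theta=\{W,b\}, \mathcal{D}\big)
  = \sum_{i=0}^{|\mathcal{D}|} \log P\big(Y = y^{(i)} \mid x^{(i)}, W, b\big),
\qquad
\ell\big(\theta=\{W,b\}, \mathcal{D}\big) = -\mathcal{L}\big(\theta=\{W,b\}, \mathcal{D}\big)
```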
@@ -78,7 +78,7 @@ Theano代码如下。 ``` 在这里我们使用错误的平均来表示损失函数,以减少minibatch尺寸对我们的影响。 -###创建一个逻辑回归类 +## 创建一个逻辑回归类 现在,我们要定义一个`逻辑回归`的类,来概括逻辑回归的基本行为。代码已经是我们之前涵盖的了,不再进行过多解释。 ```Python @@ -223,7 +223,7 @@ class LogisticRegression(object): cost = classifier.negative_log_likelihood(y) ``` -###学习模型 +## 学习模型 在实现MSGD的许多语言中,需要通过手动求解损失函数对每个参数的梯度(微分)来实现。 在Theano中呢,这是非常简单的。它自动微分,并且使用了一定的数学转换来提高数学稳定性。 @@ -257,7 +257,7 @@ class LogisticRegression(object): * 每一次函数调用,它都先用index对应的训练集的切片来更新x,y。然后计算该minibatch下的cost,以及申请`update`操作。 每次`train_model(inedx)`被调用,它都计算并返回该minibatch的cost,当然这也是MSGD的一步。整个学习算法因循环了数据集所有样例。 -###训练模型 +## 训练模型 在之前论述中所说,我们对分类错误的样本感兴趣(不仅仅是可能性)。因此模型中增加了一个额外的实例方法,来纪录每个minibatch中的错误分类样例数。 ```Python @@ -308,7 +308,7 @@ class LogisticRegression(object): } ) ``` -###把它们组合起来 +## 把它们组合起来 最后的代码如下。 ```Python """ diff --git "a/3_Multilayer_Perceptron_\345\244\232\345\261\202\346\204\237\347\237\245\346\234\272.md" "b/3_Multilayer_Perceptron_\345\244\232\345\261\202\346\204\237\347\237\245\346\234\272.md" index 6f893ea..56667ee 100644 --- "a/3_Multilayer_Perceptron_\345\244\232\345\261\202\346\204\237\347\237\245\346\234\272.md" +++ "b/3_Multilayer_Perceptron_\345\244\232\345\261\202\346\204\237\347\237\245\346\234\272.md" @@ -5,7 +5,7 @@ 下一个我们将在Theano中使用的结构是单隐层的多层感知机(MLP)。MLP可以被看作一个逻辑回归分类器。这个中间层被称为隐藏层。一个单隐层对于MLP成为通用近似器是有效的。然而在后面,我们将讲述使用多个隐藏层的好处,例如深度学习的前提。这个课程介绍了[MLP,反向误差传导,如何训练MLPs](http://www.iro.umontreal.ca/~pift6266/H10/notes/mlp.html)。 -###模型 +## 模型 一个多层感知机(或者说人工神经网络——ANN),在只有一个隐藏层时可以被表示为如下的图: ![mlp_model_1](/images/3_the_model_1.png) @@ -17,7 +17,7 @@ 其中b_1,W_1是输出层到隐藏层的偏置向量和权值矩阵,s是该层的激活函数。而b_2,W_2是隐藏层到输出层的偏置向量和权值矩阵,G是该层的激活函数。通常选择s为sigmoid函数,G为softmax函数。 在训练MLP模型的参数时,我们使用minibatch的随机梯度下降,在获得梯度后使用反向误差传导算法来实现参数的训练。由于Theano提供自动的微分,我们不需要在这个教程里面谈及这个方面。 -###从逻辑回归到多层感知机 +## 从逻辑回归到多层感知机 本教程将专注于单隐藏层的MLP。我们以隐藏层的类的实现开始,如果要构建一个MLP,只需要在此基础上添加一个逻辑回归就好。 ```Python @@ -229,7 +229,7 @@ class MLP(object): ) ``` -###把它组合起来 +## 把它组合起来 已经解释了所有的基本该概念,下面的代码就是一个完整的MLP类。 ```Python @@ -647,22 +647,22 @@ The code for file mlp.py ran for 97.34m 读者也可以在[这个页面](http://yann.lecun.com/exdb/mnist)查看MNIST的识别结果。 -###训练MLPs的技巧 +## 训练MLPs的技巧 在上面的代码中国,有一些是不能进行梯度下降来优化的。严格意义上将,发现最优的超参集合是不可能的任务。第一,我们不能独立的优化每一个参数。第二,我们不能很容易的求解所有参数的梯度(有些是离散的值,有些是实数)。第三,这个优化问题是非凸的,容易陷入局部最优。 好消息是,过去25年,研究者发明了一些在神经网络中选择超参数的方法和规则。你可以在LeCun等人的[Efficient BackPro](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)中阅读,这是一个好的综述。这里,我们将总结下我们的代码中用到的几个重要的方法和技术。 -####非线性 +### 非线性 最常见的就是`sigmoid`和`tanh`函数。在[第4.4节](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)中解释的,非线性是关于原点对称的,它倾向去输出0均值的输出(这是被期望的属性)。根据我们的经验,tanh(双曲函数)拥有更好的收敛性。 -####权值初始化 +### 权值初始化 在初始化权值的时候,我们一般需要它们在0附近,要足够小(在激活函数的近似线性区域可以获得最大的梯度)。另一个特性,尤其对深度网络而言,是可以减小层与层之间的激活函数的方差和反向传导梯度的方差。这就可以让信息更好的向下和向上的传导,减少层间差异。数学推倒,请看[Xavier10](http://deeplearning.net/tutorial/references.html#xavier10)。 -####学习率 +### 学习率 有许多文献专注在好的学习速率的选择上。最简单的方案就是选择一个固定速率。经验法则:尝试对数间隔的值(0.1,001,。。),然后缩小(对数)网络搜索的范围(你获得最低验证错误的区域)。 随着时间的推移减小学习速率有时候也是一个好主意。一个简单的方法是使用这个公式:u/(1+d*t),u是初始速率(可以使用上面讲的网格搜索选择),d是减小常量,用以控制学习速率,可以设为0.001或者更小,t是迭代次数或者时间。 [4.7节](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf)讲述了网络中每个参数学习速率选择的方法,然后基于分类错误率自适应的选择它们。 -####隐藏节点数 +### 隐藏节点数 这个超参数是非常基于数据集的。模糊的来说就是,输入分布越复杂,去模拟它的网络就需要更大的容量,那么隐藏单元的数目就要更大。事实上,一个层的权值矩阵就是可以直接度量的(输入维度*输出维度)。 除非我们去使用正则选项(early-stopping或L1/L2惩罚),隐藏节点数和泛化表现的分布图,将呈现U型(即隐藏节点越多,在后期并不能提高泛化性)。 -####正则化参数 +### 正则化参数 典型的方法是使用L1/L2正则化,同时lambda设为0.01,0.001等。尽管在我们之前提及的框架里面,它并没有显著提高性能,但它仍然是一个值得探讨的方法。 diff --git "a/4_Convoltional_Neural_Networks_LeNet_\345\215\267\347\247\257\347\245\236\347\273\217\347\275\221\347\273\234.md" 
"b/4_Convoltional_Neural_Networks_LeNet_\345\215\267\347\247\257\347\245\236\347\273\217\347\275\221\347\273\234.md" index 9851d27..a9796b1 100644 --- "a/4_Convoltional_Neural_Networks_LeNet_\345\215\267\347\247\257\347\245\236\347\273\217\347\275\221\347\273\234.md" +++ "b/4_Convoltional_Neural_Networks_LeNet_\345\215\267\347\247\257\347\245\236\347\273\217\347\275\221\347\273\234.md" @@ -8,12 +8,12 @@ 本节的所有代码,可以在[这里](http://deeplearning.net/tutorial/code/convolutional_mlp.py)下载,还有[3狼月亮图](https://raw.githubusercontent.com/lisa-lab/DeepLearningTutorials/master/doc/images/3wolfmoon.jpg)。 -###动机 +## 动机 卷积神经网络是多层感知机的生物灵感变种。从Hubel和Wiesel先前对猫的视觉皮层的研究,我们知道视皮层中含有细胞的复杂分布。这些细胞只对小的视觉子区域敏感,称为`感受野`。这些子区域平铺来覆盖整个视场。这些细胞表现为输入图像空间的局部滤波器,非常适合检测自然图像中的强空间局部相关性。 此外,两类基础细胞类型被定义:`简单细胞`使用它们的感受野,最大限度的响应特定的棱状图案。`复杂细胞`有更大的感受野,可以局部不变的确定图案精确位置。动物视觉皮层是现存的最强大的视觉处理系统,很显然,我们需要去模仿它的行为。因此,许多类神经模型在文献中出现,包括[NeoCognitron](http://deeplearning.net/tutorial/references.html#fukushima),[HMAX](http://deeplearning.net/tutorial/references.html#serre07)和[LeNet-5](http://deeplearning.net/tutorial/references.html#lecun98),这是本教程需要着重讲解的。 -###稀疏连接 +## 稀疏连接 卷积神经网络通过在相邻层的神经元之间实施局部连接模式来检测局部空间相关性。换句话说就是,第m层的隐藏单元的输入来自第m-1层单元的子集,单元拥有空间上的感受野连接。我们可以通过如下的图来表示: ![sparse_connectivity](/images/4_sparse_con_1.png) @@ -22,12 +22,12 @@ 然而,就像上面展示的,将这些层叠加起来去形成(非线性)滤波器,就可以变得越来越全局化。举例而言,第m+1层的单元可以编码一个宽度为5的非线性特征。 -###权值共享 +## 权值共享 此外,在CNNs中,每一只滤波器共享同一组权值,这样该滤波器就可以形成一个特征映射(feaature map)。梯度下降算法在小改动后可以学习这种共享参数。这个被共享权值的梯度就是被共享的参数的梯度的简单求和。 复制单元使得特征可以无视其在视觉野中的位置而被检测到。此外,权值共享增加了学习效率,减少了需要被学习的自由参数的数目。这样的设定,使得CNNs在视觉问题上有更好的泛化性。 -###细节和注解 +## 细节和注解 一个特征映射是由一个函数在整个图像的某一子区域重复使用来获得的,换句话说,就是通过线性滤波器来卷积输入图像,加上偏置后,再输入到非线性函数。如果我们定义第k个特征映射是为h_k,滤波器有W_k,b_k定义,则特征映射可以被表现为如下形式: ![h_k(i,j)](/images/4_detail_notation_1.png) @@ -45,7 +45,7 @@ 把它们都放一起就是,W_k_l(i,j),表示第m层的第k个特征映射,在第m-1层的l个特征映射的(i,j)参考坐标的连接权值。 -###卷积操作 +## 卷积操作 卷积操作是Theano实现卷积层的主要消耗。卷积操作通过`theano.tensor.signal.conv2d`,它包括两个输入符号: * 与输入的minibatch有关的4维张量,尺寸包括如下:[mini-batch的大小,输入特征映射的数目,图像高度,图像宽度]。 @@ -141,7 +141,7 @@ pylab.show() 注意我们使用了与MLP相同得权值初始化方案。权值在一个范围为[-1/fan-in, 1/fan-in]的均匀分布中随机取样,fan-in是一个隐单元的输入数。对MLP,它是下一层单元的数目。对CNNs,我不得不需要去考虑到输入特征映射的数目和感受野的大小。 -###最大池化 +## 最大池化 卷积神经网络另一个重大的概念是最大池化,一个非线性的降采样形式。最大池化就是将输入图像分割为一系列不重叠的矩阵,然后对每个子区域,输出最大值。 最大池化在视觉中是有用的,由如下2个原因: @@ -197,7 +197,7 @@ With ignore_border set to False: 注意,与其他Theano代码相比,`max_pool_2d`操作有点特殊。它需要缩减因子`ds`(长度维2的tuple,班汉图像长度和宽度的缩减因子)在图构建的时候被告知。这在未来可能会发生改变。 -###整个模型 +## 整个模型 稀疏性、卷积层和最大池化时LeNet系列模型的核心。而准确的模型细节有很大的差异,下图显示了一个LeNet模型。 ![full_model](/images/4_full_model_1.png) @@ -395,7 +395,7 @@ class LeNetConvPoolLayer(object): ``` 我们把进行实际训练和early-stopping代码取出了。因为它和MLP中是一样的。有兴趣的读者,可以阅读教程开头的源代码。 -###运行代码 +## 运行代码 在一台Core i7-2600K CPU clocked at 3.40GHz上,我们使用floatX=float32,获得如下的输出: ``` @@ -422,11 +422,11 @@ The code for file convolutional_mlp.py ran for 32.52m ``` 可以观察到不同实验下验证误差和测试误差的不同,这是由不同硬件的取整结构不同造成的。可以忽略。 -###技巧 -####超参的选择 +## 技巧 +### 超参的选择 卷积神经网络的训练相比与标准的MLP是相当困难的,因为它添加了更多的超参数。当我们在应用学习率和正则化的规则下,下面的方法也需要在优化CNNs被考虑: -#####滤波器的数量 +#### 滤波器的数量 当选择每层滤波器数量的时候,需要记住计算单卷积层的活性比传统的MLP会更加昂贵。 假设第l-1层包含K_(l-1)个特征映射和M*N个像素点(例如,位置数乘以特征映射数),然后第l层有K_(l)个滤波器,尺寸为m*n。那么计算一个特征映射(在(M-m)*(N-n)个像素位置应用每个m*n大小的滤波器)将消耗(M-m)*(N-n)*m*n*K_(l-1)的计算量。然后总共要计算K_l次。如果不是所有的特征只与前一层的所有特征相连,那么事情就变得更加复杂啦。 @@ -436,15 +436,15 @@ The code for file convolutional_mlp.py ran for 32.52m 因为特征映射的尺寸会随着深度的增加而减小,靠近输入层的层将趋向于有更少的滤波器,而更高的层有更多的滤波器。事实上,为了平衡每一层的计算量,特征数和图像位置数的乘积在层的传递过程中都是基本一致的。为了保护输入信息,我们需要保证总的激活数量(特征映射数*像素位置数)在层间传递的时候是至于减少(当然我们在做监督学习的时候当然是希望它减小的)。特征映射的数量直接控制整个容量,同时它依赖于可用样例的数目和任务的复杂度。 -#####滤波器的尺寸 +#### 
滤波器的尺寸 通常在每个文献中滤波器的尺寸都有很大的不同,它常常是基于数据库的。MNIST在第一层的最好结果是5*5层滤波器。当自然图像(每维有几百个像素)趋向于使用更大的滤波器,例如12*12,15*15。 因此这个技巧事实上是去寻找正确等级的“粒度”,以便对给定的数据集去形成合适范围内的抽象。 -#####最大池化的尺寸 +#### 最大池化的尺寸 经典的是2*2,或者没有最大池化。非常大的图可以在较低的层使用4*4的池化。但是需要记住的是,池化在通过16个因子减少信号维度的同时,也可能导致信号细节的大量丢失。 -#####技巧 +#### 技巧 假如你想要在新的数据集上采用这个模型,下面的一些小技巧可能能让你获得更好的结果: * 白化(whitening)数据(例如,使用主成分分析) * 衰减每次迭代的学习速率。 diff --git "a/5_Denoising_Autoencoders_\351\231\215\345\231\252\350\207\252\345\212\250\347\274\226\347\240\201.md" "b/5_Denoising_Autoencoders_\351\231\215\345\231\252\350\207\252\345\212\250\347\274\226\347\240\201.md" index 13649cc..afe634d 100644 --- "a/5_Denoising_Autoencoders_\351\231\215\345\231\252\350\207\252\345\212\250\347\274\226\347\240\201.md" +++ "b/5_Denoising_Autoencoders_\351\231\215\345\231\252\350\207\252\345\212\250\347\274\226\347\240\201.md" @@ -7,7 +7,7 @@ 降噪自动编码机(denoising Autoencoders)是经典自动编码机的扩展。它在[Vincent08](http://deeplearning.net/tutorial/references.html#vincent08)中作为深度网络的一个构建块被介绍。我们通过简短的[自动编码机](http://deeplearning.net/tutorial/dA.html#autoencoders)来开始本教程。 -###自动编码机 +## 自动编码机 在[Bengio09](http://deeplearning.net/tutorial/references.html#bengio09)的第4.6节中,有自动编码机的简介。一个自动编码机,由d维的[0,1]之间的输入向量x,通过第一层映射(使用一个编码器)来获得隐藏的d‘维度的[0,1]的输出表达y。通过如下的决定性映射: ![y_mapping](/images/5_autoencoders_1.png) @@ -252,7 +252,7 @@ class dA(object): 这里有其他方法,使得一个有比输入有更多隐藏单元的自动编码机,去避免只学习它本身,而是在输入的隐藏表达中捕捉到有用的东西。一个是添加稀疏性(迫使许多隐单元是0或者接近0)。稀疏性已经被很成功的发挥了[Ranzato07](http://deeplearning.net/tutorial/references.html#ranzato07)[Lee08](http://deeplearning.net/tutorial/references.html#lee08)。另一个是,在输入到重建过程中,增加从输入到重建的转换中的随机性。这个技术在受限玻尔兹曼机中被使用(Restricted Boltzmann Machines,在后面的章节中讨论),还有降噪自动编码机,在后面讨论。 -###降噪自动编码机 +## 降噪自动编码机 降噪自动编码机的思想是很简单饿。为了迫使隐藏层去发现更加鲁棒性的特征,避免它只是去简单的学习定义,我们训练自动编码机去重建被破坏的输入版本的数据。 @@ -450,7 +450,7 @@ class dA(object): ``` -###将它组合起来 +## 将它组合起来 现在去构建一个`dA`类和训练它变得很简单了。 @@ -509,7 +509,7 @@ image = Image.fromarray(tile_raster_images(X=da.W.get_value(borrow=True).T, image.save('filters_corruption_30.png') ``` -###运行这个代码 +## 运行这个代码 当我们不使用任何噪声的时候,获得的滤波器如下: diff --git "a/7_Restricted_Boltzmann_Machine_\345\217\227\351\231\220\346\263\242\345\260\224\345\205\271\346\233\274\346\234\272.md" "b/7_Restricted_Boltzmann_Machine_\345\217\227\351\231\220\346\263\242\345\260\224\345\205\271\346\233\274\346\234\272.md" index 5510aee..7948286 100644 --- "a/7_Restricted_Boltzmann_Machine_\345\217\227\351\231\220\346\263\242\345\260\224\345\205\271\346\233\274\346\234\272.md" +++ "b/7_Restricted_Boltzmann_Machine_\345\217\227\351\231\220\346\263\242\345\260\224\345\205\271\346\233\274\346\234\272.md" @@ -5,7 +5,7 @@ 本节的所有代码都可以在[这里](http://deeplearning.net/tutorial/code/rbm.py)下载。 -###基于能量模型(Energy-Based Models) +## 基于能量模型(Energy-Based Models) 基于能量的模型(EBM)把我们所关心变量的各种组合和一个标量能量联系在一起。训练模型的过程就是不断改变标量能量的过程,使其能量函数的形状满足期望的形状。比如,如果一个变量组合被认为是合理的,它同时也具有较小的能量。基于能量的概率模型通过能量函数来定义概率分布: ![energy_fun](/images/7_ebm_1.png) @@ -20,7 +20,7 @@ 其中随机梯度为![gradient](/images/7_ebm_4.png),其中theta为模型的参数。 -####包含隐藏单元的EBMs +### 包含隐藏单元的EBMs 在很多情况下,我们无法观察到x样本的全部分布,或者我们需要引进一些没有观察到的变量,以增加模型的表达能力。因而我们考虑将模型分为2部分,一个可见部分(x的观察分布)和一个隐藏部分h,这样得到的就是包含隐含变量的EBM: @@ -47,7 +47,7 @@ 通常我们很难精确计算这个梯度,因为式中第一项涉及到可见单元与隐含单元的联合分布,由于归一化因子Z(θ)的存在,该分布很难获取。 我们只能通过一些采样方法(如Gibbs采样)获取其近似值,其具体方法将在后文中详述。 -###受限波尔兹曼机(RBM) +## 受限波尔兹曼机(RBM) 波尔兹曼机是对数线性马尔可夫随机场(MRF)的一种特殊形式,例如这个能量函数在它的自由参数下是线性的。为了使得它们能更强力的表达复杂分布(从受限的参数设定到一个非参数设定),我们认为一些变量是不可见的(被称为隐藏)。通过拥有更多隐藏变量(也称之为隐藏单元),我们可以增加波尔兹曼机的模型容量。受限波尔兹曼机限制波尔兹曼机可视层和隐藏层的层内连接。RBM模型可以由下图描述: @@ -68,7 +68,7 @@ RBM的能量函数可以被定义如下: ![prob_rbm](/images/7_rbm_4.png) -####二进制单元的RBMs +### 二进制单元的RBMs 
在使用二进制单元(v和h都属于{0,1})的普通研究情况时,概率版的普通神经激活函数表示如下: ![activation_fun](/images/7_rbm_binary_units_1.png) @@ -80,7 +80,7 @@ RBM的能量函数可以被定义如下: ![free_energy_binary](images/7_rbm_binary_units_1.png) -####二进制单元的更新公式 +### 二进制单元的更新公式 我们可以获得如下的一个二进制单元RBM的对数似然梯度: @@ -88,7 +88,7 @@ RBM的能量函数可以被定义如下: 这个公式的更多细节推倒,读者可以阅读[这一页](http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DBNEquations),或者[Learning Deep Architectures for AI](http://www.iro.umontreal.ca/%7Elisa/publications2/index.php/publications/show/239)的第五节。在这里,我们将不使用这些等式,而是通过Theano的`T.grad`来获取梯度。 -###在RBM中进行采样 +## 在RBM中进行采样 p(x)的样本可以通过运行马尔可夫链的汇聚、Gibbs采样的过渡来得到。 @@ -109,20 +109,20 @@ p(x)的样本可以通过运行马尔可夫链的汇聚、Gibbs采样的过渡 在这个理论里面,每个参数在学习进程中的更新都需要运行这样几个链来趋近。毫无疑问这将耗费很大的计算量。一些新的算法已经被提出来,以有效的学习p(v,h)中的样本情况。 -###对比散度算法(CD-k) +## 对比散度算法(CD-k) 对比散度算法,是一种成功的用于求解对数似然函数关于未知参数梯度的近似的方法。它使用两个技巧来技术采样过程: * 因为我们希望p(v)=p_train(v)(数据的真实、底层分布),所以我们使用一个训练样本来初始化马尔可夫链(例如,从一个被预计接近于p的分布,所以这个链已经开始去收敛这个最终的分布p)。 * 对比梯度不需要等待链的收敛。样本在k步Gibbs采样后就可以获得。在实际中,k=1时就可以获得惊人的好的效果。 -####持续的对比散度 +### 持续的对比散度 持续的对比散度[Tieleman08](http://deeplearning.net/tutorial/references.html#tieleman08)使用了另外一种近似方法来从p(v,h)中采样。它建立在一个拥有持续状态的单马尔可夫链上(例如,不是对每个可视样例都重启链)。对每一次参数更新,我们通过简单的运行这个链k步来获得新的样本。然后保存链的状态以便后续的更新。 一般直觉的是,如果参数的更新是足够小相比链的混合率,那么马尔科夫链应该能够“赶上”模型的变化。 -###实现 +## 实现 ![RBM_impl](/images/7_implementation_1.png) @@ -426,21 +426,21 @@ class RBM(object): return monitoring_cost, updates ``` -###进展跟踪 +## 进展跟踪 RBMs的训练是特别困难的。由于归一化函数Z,我们无法在训练的时候估计对数似然函数log(P(x))。因而我们没有直接可以度量超参数优化与否的方法。 而下面的几个选项对用户是有用的。 -####负样本的检查 +### 负样本的检查 在训练中获得的负样本是可以可视化的。在训练进程中,我们知道由RBM定义的模型不断逼近真实分布,p_train(x)。负样例就可以视为训练集中的样本。显而易见的,坏的超参数将在这种方式下被丢弃。 -####滤波器的可视化跟踪 +### 滤波器的可视化跟踪 由模型训练的滤波器是可以可视化的。我们可以将每个单元的权值以灰度图的方式展示。滤波器应该选出数据中强的特征。对于任意的数据集,这个滤波器都是不确定的。例如,训练MNIST,滤波器就表现的像“stroke”检测器,而训练自然图像的稀疏编码的时候,则像Gabor滤波器。 -####似然估计的替代 +### 似然估计的替代 此外,更加容易处理的函数可以被用于做似然估计的替代。当我们使用PCD来训练RBM的时候,可以使用伪似然估计来替代。伪似然估计(Pseudo-likeihood,PL)更加简于计算,因为它假设所有的比特都是相互独立的,因此有: @@ -449,11 +449,11 @@ RBMs的训练是特别困难的。由于归一化函数Z,我们无法在训练 -###主循环 +## 主循环 -###结果 +## 结果 diff --git a/LeNet-5/dA.py b/LeNet-5/dA.py deleted file mode 100644 index e1debf7..0000000 --- a/LeNet-5/dA.py +++ /dev/null @@ -1,413 +0,0 @@ -""" - This tutorial introduces denoising auto-encoders (dA) using Theano. - - Denoising autoencoders are the building blocks for SdA. - They are based on auto-encoders as the ones used in Bengio et al. 2007. - An autoencoder takes an input x and first maps it to a hidden representation - y = f_{\theta}(x) = s(Wx+b), parameterized by \theta={W,b}. The resulting - latent representation y is then mapped back to a "reconstructed" vector - z \in [0,1]^d in input space z = g_{\theta'}(y) = s(W'y + b'). The weight - matrix W' can optionally be constrained such that W' = W^T, in which case - the autoencoder is said to have tied weights. The network is trained such - that to minimize the reconstruction error (the error between x and z). - - For the denosing autoencoder, during training, first x is corrupted into - \tilde{x}, where \tilde{x} is a partially destroyed version of x by means - of a stochastic mapping. Afterwards y is computed as before (using - \tilde{x}), y = s(W\tilde{x} + b) and z as s(W'y + b'). The reconstruction - error is now measured between z and the uncorrupted input x, which is - computed as the cross-entropy : - - \sum_{k=1}^d[ x_k \log z_k + (1-x_k) \log( 1-z_k)] - - - References : - - P. Vincent, H. Larochelle, Y. Bengio, P.A. Manzagol: Extracting and - Composing Robust Features with Denoising Autoencoders, ICML'08, 1096-1103, - 2008 - - Y. Bengio, P. 
Lamblin, D. Popovici, H. Larochelle: Greedy Layer-Wise - Training of Deep Networks, Advances in Neural Information Processing - Systems 19, 2007 - -""" - -import os -import sys -import time - -import numpy - -import theano -import theano.tensor as T -from theano.tensor.shared_randomstreams import RandomStreams - -from logistic_sgd import load_data -from utils import tile_raster_images - -try: - import PIL.Image as Image -except ImportError: - import Image - - -# start-snippet-1 -class dA(object): - """Denoising Auto-Encoder class (dA) - - A denoising autoencoders tries to reconstruct the input from a corrupted - version of it by projecting it first in a latent space and reprojecting - it afterwards back in the input space. Please refer to Vincent et al.,2008 - for more details. If x is the input then equation (1) computes a partially - destroyed version of x by means of a stochastic mapping q_D. Equation (2) - computes the projection of the input into the latent space. Equation (3) - computes the reconstruction of the input, while equation (4) computes the - reconstruction error. - - .. math:: - - \tilde{x} ~ q_D(\tilde{x}|x) (1) - - y = s(W \tilde{x} + b) (2) - - x = s(W' y + b') (3) - - L(x,z) = -sum_{k=1}^d [x_k \log z_k + (1-x_k) \log( 1-z_k)] (4) - - """ - - def __init__( - self, - numpy_rng, - theano_rng=None, - input=None, - n_visible=784, - n_hidden=500, - W=None, - bhid=None, - bvis=None - ): - """ - Initialize the dA class by specifying the number of visible units (the - dimension d of the input ), the number of hidden units ( the dimension - d' of the latent or hidden space ) and the corruption level. The - constructor also receives symbolic variables for the input, weights and - bias. Such a symbolic variables are useful when, for example the input - is the result of some computations, or when weights are shared between - the dA and an MLP layer. When dealing with SdAs this always happens, - the dA on layer 2 gets as input the output of the dA on layer 1, - and the weights of the dA are used in the second stage of training - to construct an MLP. 
- - :type numpy_rng: numpy.random.RandomState - :param numpy_rng: number random generator used to generate weights - - :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams - :param theano_rng: Theano random generator; if None is given one is - generated based on a seed drawn from `rng` - - :type input: theano.tensor.TensorType - :param input: a symbolic description of the input or None for - standalone dA - - :type n_visible: int - :param n_visible: number of visible units - - :type n_hidden: int - :param n_hidden: number of hidden units - - :type W: theano.tensor.TensorType - :param W: Theano variable pointing to a set of weights that should be - shared belong the dA and another architecture; if dA should - be standalone set this to None - - :type bhid: theano.tensor.TensorType - :param bhid: Theano variable pointing to a set of biases values (for - hidden units) that should be shared belong dA and another - architecture; if dA should be standalone set this to None - - :type bvis: theano.tensor.TensorType - :param bvis: Theano variable pointing to a set of biases values (for - visible units) that should be shared belong dA and another - architecture; if dA should be standalone set this to None - - - """ - self.n_visible = n_visible - self.n_hidden = n_hidden - - # create a Theano random generator that gives symbolic random values - if not theano_rng: - theano_rng = RandomStreams(numpy_rng.randint(2 ** 30)) - - # note : W' was written as `W_prime` and b' as `b_prime` - if not W: - # W is initialized with `initial_W` which is uniformely sampled - # from -4*sqrt(6./(n_visible+n_hidden)) and - # 4*sqrt(6./(n_hidden+n_visible))the output of uniform if - # converted using asarray to dtype - # theano.config.floatX so that the code is runable on GPU - initial_W = numpy.asarray( - numpy_rng.uniform( - low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)), - high=4 * numpy.sqrt(6. / (n_hidden + n_visible)), - size=(n_visible, n_hidden) - ), - dtype=theano.config.floatX - ) - W = theano.shared(value=initial_W, name='W', borrow=True) - - if not bvis: - bvis = theano.shared( - value=numpy.zeros( - n_visible, - dtype=theano.config.floatX - ), - borrow=True - ) - - if not bhid: - bhid = theano.shared( - value=numpy.zeros( - n_hidden, - dtype=theano.config.floatX - ), - name='b', - borrow=True - ) - - self.W = W - # b corresponds to the bias of the hidden - self.b = bhid - # b_prime corresponds to the bias of the visible - self.b_prime = bvis - # tied weights, therefore W_prime is W transpose - self.W_prime = self.W.T - self.theano_rng = theano_rng - # if no input is given, generate a variable representing the input - if input is None: - # we use a matrix because we expect a minibatch of several - # examples, each example being a row - self.x = T.dmatrix(name='input') - else: - self.x = input - - self.params = [self.W, self.b, self.b_prime] - # end-snippet-1 - - def get_corrupted_input(self, input, corruption_level): - """This function keeps ``1-corruption_level`` entries of the inputs the - same and zero-out randomly selected subset of size ``coruption_level`` - Note : first argument of theano.rng.binomial is the shape(size) of - random numbers that it should produce - second argument is the number of trials - third argument is the probability of success of any trial - - this will produce an array of 0s and 1s where 1 has a - probability of 1 - ``corruption_level`` and 0 with - ``corruption_level`` - - The binomial function return int64 data type by - default. 
int64 multiplicated by the input - type(floatX) always return float64. To keep all data - in floatX when floatX is float32, we set the dtype of - the binomial to floatX. As in our case the value of - the binomial is always 0 or 1, this don't change the - result. This is needed to allow the gpu to work - correctly as it only support float32 for now. - - """ - return self.theano_rng.binomial(size=input.shape, n=1, - p=1 - corruption_level, - dtype=theano.config.floatX) * input - - def get_hidden_values(self, input): - """ Computes the values of the hidden layer """ - return T.nnet.sigmoid(T.dot(input, self.W) + self.b) - - def get_reconstructed_input(self, hidden): - """Computes the reconstructed input given the values of the - hidden layer - - """ - return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime) - - def get_cost_updates(self, corruption_level, learning_rate): - """ This function computes the cost and the updates for one trainng - step of the dA """ - - tilde_x = self.get_corrupted_input(self.x, corruption_level) - y = self.get_hidden_values(tilde_x) - z = self.get_reconstructed_input(y) - # note : we sum over the size of a datapoint; if we are using - # minibatches, L will be a vector, with one entry per - # example in minibatch - L = - T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1) - # note : L is now a vector, where each element is the - # cross-entropy cost of the reconstruction of the - # corresponding example of the minibatch. We need to - # compute the average of all these to get the cost of - # the minibatch - cost = T.mean(L) - - # compute the gradients of the cost of the `dA` with respect - # to its parameters - gparams = T.grad(cost, self.params) - # generate the list of updates - updates = [ - (param, param - learning_rate * gparam) - for param, gparam in zip(self.params, gparams) - ] - - return (cost, updates) - - -def test_dA(learning_rate=0.1, training_epochs=15, - dataset='mnist.pkl.gz', - batch_size=20, output_folder='dA_plots'): - - """ - This demo is tested on MNIST - - :type learning_rate: float - :param learning_rate: learning rate used for training the DeNosing - AutoEncoder - - :type training_epochs: int - :param training_epochs: number of epochs used for training - - :type dataset: string - :param dataset: path to the picked dataset - - """ - datasets = load_data(dataset) - train_set_x, train_set_y = datasets[0] - - # compute number of minibatches for training, validation and testing - n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size - - # allocate symbolic variables for the data - index = T.lscalar() # index to a [mini]batch - x = T.matrix('x') # the data is presented as rasterized images - - if not os.path.isdir(output_folder): - os.makedirs(output_folder) - os.chdir(output_folder) - #################################### - # BUILDING THE MODEL NO CORRUPTION # - #################################### - - rng = numpy.random.RandomState(123) - theano_rng = RandomStreams(rng.randint(2 ** 30)) - - da = dA( - numpy_rng=rng, - theano_rng=theano_rng, - input=x, - n_visible=28 * 28, - n_hidden=500 - ) - - cost, updates = da.get_cost_updates( - corruption_level=0., - learning_rate=learning_rate - ) - - train_da = theano.function( - [index], - cost, - updates=updates, - givens={ - x: train_set_x[index * batch_size: (index + 1) * batch_size] - } - ) - - start_time = time.clock() - - ############ - # TRAINING # - ############ - - # go through training epochs - for epoch in xrange(training_epochs): - # go through trainng 
set - c = [] - for batch_index in xrange(n_train_batches): - c.append(train_da(batch_index)) - - print 'Training epoch %d, cost ' % epoch, numpy.mean(c) - - end_time = time.clock() - - training_time = (end_time - start_time) - - print >> sys.stderr, ('The no corruption code for file ' + - os.path.split(__file__)[1] + - ' ran for %.2fm' % ((training_time) / 60.)) - image = Image.fromarray( - tile_raster_images(X=da.W.get_value(borrow=True).T, - img_shape=(28, 28), tile_shape=(10, 10), - tile_spacing=(1, 1))) - image.save('filters_corruption_0.png') - - ##################################### - # BUILDING THE MODEL CORRUPTION 30% # - ##################################### - - rng = numpy.random.RandomState(123) - theano_rng = RandomStreams(rng.randint(2 ** 30)) - - da = dA( - numpy_rng=rng, - theano_rng=theano_rng, - input=x, - n_visible=28 * 28, - n_hidden=500 - ) - - cost, updates = da.get_cost_updates( - corruption_level=0.3, - learning_rate=learning_rate - ) - - train_da = theano.function( - [index], - cost, - updates=updates, - givens={ - x: train_set_x[index * batch_size: (index + 1) * batch_size] - } - ) - - start_time = time.clock() - - ############ - # TRAINING # - ############ - - # go through training epochs - for epoch in xrange(training_epochs): - # go through trainng set - c = [] - for batch_index in xrange(n_train_batches): - c.append(train_da(batch_index)) - - print 'Training epoch %d, cost ' % epoch, numpy.mean(c) - - end_time = time.clock() - - training_time = (end_time - start_time) - - print >> sys.stderr, ('The 30% corruption code for file ' + - os.path.split(__file__)[1] + - ' ran for %.2fm' % (training_time / 60.)) - - image = Image.fromarray(tile_raster_images( - X=da.W.get_value(borrow=True).T, - img_shape=(28, 28), tile_shape=(10, 10), - tile_spacing=(1, 1))) - image.save('filters_corruption_30.png') - - os.chdir('../') - - -if __name__ == '__main__': - test_dA() diff --git a/LeNet-5/deep_learning_test1.py b/LeNet-5/deep_learning_test1.py deleted file mode 100644 index c281687..0000000 --- a/LeNet-5/deep_learning_test1.py +++ /dev/null @@ -1,343 +0,0 @@ -"""This tutorial introduces the LeNet5 neural network architecture -using Theano. LeNet5 is a convolutional neural network, good for -classifying images. This tutorial shows how to build the architecture, -and comes with all the hyper-parameters you need to reproduce the -paper's MNIST results. - - -This implementation simplifies the model in the following ways: - - - LeNetConvPool doesn't implement location-specific gain and bias parameters - - LeNetConvPool doesn't implement pooling by average, it implements pooling - by max. - - Digit classification is implemented with a logistic regression rather than - an RBF network - - LeNet5 was not fully-connected convolutions at second layer - -References: - - Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: - Gradient-Based Learning Applied to Document - Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998. 
- http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf - -""" -import os -import sys -import time - -import numpy - -import theano -import theano.tensor as T -from theano.tensor.signal import downsample -from theano.tensor.nnet import conv - -from logistic_sgd import LogisticRegression, load_data -from mlp import HiddenLayer - - -class LeNetConvPoolLayer(object): - """Pool Layer of a convolutional network """ - - def __init__(self, rng, input, filter_shape, image_shape, poolsize=(2, 2)): - """ - Allocate a LeNetConvPoolLayer with shared variable internal parameters. - - :type rng: numpy.random.RandomState - :param rng: a random number generator used to initialize weights - - :type input: theano.tensor.dtensor4 - :param input: symbolic image tensor, of shape image_shape - - :type filter_shape: tuple or list of length 4 - :param filter_shape: (number of filters, num input feature maps, - filter height, filter width) - - :type image_shape: tuple or list of length 4 - :param image_shape: (batch size, num input feature maps, - image height, image width) - - :type poolsize: tuple or list of length 2 - :param poolsize: the downsampling (pooling) factor (#rows, #cols) - """ - - assert image_shape[1] == filter_shape[1] - self.input = input - - # there are "num input feature maps * filter height * filter width" - # inputs to each hidden unit - fan_in = numpy.prod(filter_shape[1:]) - # each unit in the lower layer receives a gradient from: - # "num output feature maps * filter height * filter width" / - # pooling size - fan_out = (filter_shape[0] * numpy.prod(filter_shape[2:]) / - numpy.prod(poolsize)) - # initialize weights with random weights - W_bound = numpy.sqrt(6. / (fan_in + fan_out)) - self.W = theano.shared( - numpy.asarray( - rng.uniform(low=-W_bound, high=W_bound, size=filter_shape), - dtype=theano.config.floatX - ), - borrow=True - ) - - # the bias is a 1D tensor -- one bias per output feature map - b_values = numpy.zeros((filter_shape[0],), dtype=theano.config.floatX) - self.b = theano.shared(value=b_values, borrow=True) - - # convolve input feature maps with filters - conv_out = conv.conv2d( - input=input, - filters=self.W, - filter_shape=filter_shape, - image_shape=image_shape - ) - - # downsample each feature map individually, using maxpooling - pooled_out = downsample.max_pool_2d( - input=conv_out, - ds=poolsize, - ignore_border=True - ) - - # add the bias term. Since the bias is a vector (1D array), we first - # reshape it to a tensor of shape (1, n_filters, 1, 1). 
Each bias will - # thus be broadcasted across mini-batches and feature map - # width & height - self.output = T.tanh(pooled_out + self.b.dimshuffle('x', 0, 'x', 'x')) - - # store parameters of this layer - self.params = [self.W, self.b] - - -def evaluate_lenet5(learning_rate=0.1, n_epochs=200, - dataset='mnist.pkl.gz', - nkerns=[20, 50], batch_size=500): - """ Demonstrates lenet on MNIST dataset - - :type learning_rate: float - :param learning_rate: learning rate used (factor for the stochastic - gradient) - - :type n_epochs: int - :param n_epochs: maximal number of epochs to run the optimizer - - :type dataset: string - :param dataset: path to the dataset used for training /testing (MNIST here) - - :type nkerns: list of ints - :param nkerns: number of kernels on each layer - """ - - rng = numpy.random.RandomState(23455) - - datasets = load_data(dataset) - - train_set_x, train_set_y = datasets[0] - valid_set_x, valid_set_y = datasets[1] - test_set_x, test_set_y = datasets[2] - - # compute number of minibatches for training, validation and testing - n_train_batches = train_set_x.get_value(borrow=True).shape[0] - n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] - n_test_batches = test_set_x.get_value(borrow=True).shape[0] - n_train_batches /= batch_size - n_valid_batches /= batch_size - n_test_batches /= batch_size - - # allocate symbolic variables for the data - index = T.lscalar() # index to a [mini]batch - - # start-snippet-1 - x = T.matrix('x') # the data is presented as rasterized images - y = T.ivector('y') # the labels are presented as 1D vector of - # [int] labels - - ###################### - # BUILD ACTUAL MODEL # - ###################### - print '... building the model' - - # Reshape matrix of rasterized images of shape (batch_size, 28 * 28) - # to a 4D tensor, compatible with our LeNetConvPoolLayer - # (28, 28) is the size of MNIST images. - layer0_input = x.reshape((batch_size, 1, 28, 28)) - - # Construct the first convolutional pooling layer: - # filtering reduces the image size to (28-5+1 , 28-5+1) = (24, 24) - # maxpooling reduces this further to (24/2, 24/2) = (12, 12) - # 4D output tensor is thus of shape (batch_size, nkerns[0], 12, 12) - layer0 = LeNetConvPoolLayer( - rng, - input=layer0_input, - image_shape=(batch_size, 1, 28, 28), - filter_shape=(nkerns[0], 1, 5, 5), - poolsize=(2, 2) - ) - - # Construct the second convolutional pooling layer - # filtering reduces the image size to (12-5+1, 12-5+1) = (8, 8) - # maxpooling reduces this further to (8/2, 8/2) = (4, 4) - # 4D output tensor is thus of shape (nkerns[0], nkerns[1], 4, 4) - layer1 = LeNetConvPoolLayer( - rng, - input=layer0.output, - image_shape=(batch_size, nkerns[0], 12, 12), - filter_shape=(nkerns[1], nkerns[0], 5, 5), - poolsize=(2, 2) - ) - - # the HiddenLayer being fully-connected, it operates on 2D matrices of - # shape (batch_size, num_pixels) (i.e matrix of rasterized images). - # This will generate a matrix of shape (batch_size, nkerns[1] * 4 * 4), - # or (500, 50 * 4 * 4) = (500, 800) with the default values. 
- layer2_input = layer1.output.flatten(2) - - # construct a fully-connected sigmoidal layer - layer2 = HiddenLayer( - rng, - input=layer2_input, - n_in=nkerns[1] * 4 * 4, - n_out=500, - activation=T.tanh - ) - - # classify the values of the fully-connected sigmoidal layer - layer3 = LogisticRegression(input=layer2.output, n_in=500, n_out=10) - - # the cost we minimize during training is the NLL of the model - cost = layer3.negative_log_likelihood(y) - - # create a function to compute the mistakes that are made by the model - test_model = theano.function( - [index], - layer3.errors(y), - givens={ - x: test_set_x[index * batch_size: (index + 1) * batch_size], - y: test_set_y[index * batch_size: (index + 1) * batch_size] - } - ) - - validate_model = theano.function( - [index], - layer3.errors(y), - givens={ - x: valid_set_x[index * batch_size: (index + 1) * batch_size], - y: valid_set_y[index * batch_size: (index + 1) * batch_size] - } - ) - - # create a list of all model parameters to be fit by gradient descent - params = layer3.params + layer2.params + layer1.params + layer0.params - - # create a list of gradients for all model parameters - grads = T.grad(cost, params) - - # train_model is a function that updates the model parameters by - # SGD Since this model has many parameters, it would be tedious to - # manually create an update rule for each model parameter. We thus - # create the updates list by automatically looping over all - # (params[i], grads[i]) pairs. - updates = [ - (param_i, param_i - learning_rate * grad_i) - for param_i, grad_i in zip(params, grads) - ] - - train_model = theano.function( - [index], - cost, - updates=updates, - givens={ - x: train_set_x[index * batch_size: (index + 1) * batch_size], - y: train_set_y[index * batch_size: (index + 1) * batch_size] - } - ) - # end-snippet-1 - - ############### - # TRAIN MODEL # - ############### - print '... training' - # early-stopping parameters - patience = 10000 # look as this many examples regardless - patience_increase = 2 # wait this much longer when a new best is - # found - improvement_threshold = 0.995 # a relative improvement of this much is - # considered significant - validation_frequency = min(n_train_batches, patience / 2) - # go through this many - # minibatche before checking the network - # on the validation set; in this case we - # check every epoch - - best_validation_loss = numpy.inf - best_iter = 0 - test_score = 0. 
- start_time = time.clock() - - epoch = 0 - done_looping = False - - while (epoch < n_epochs) and (not done_looping): - epoch = epoch + 1 - for minibatch_index in xrange(n_train_batches): - - iter = (epoch - 1) * n_train_batches + minibatch_index - - if iter % 100 == 0: - print 'training @ iter = ', iter - cost_ij = train_model(minibatch_index) - - if (iter + 1) % validation_frequency == 0: - - # compute zero-one loss on validation set - validation_losses = [validate_model(i) for i - in xrange(n_valid_batches)] - this_validation_loss = numpy.mean(validation_losses) - print('epoch %i, minibatch %i/%i, validation error %f %%' % - (epoch, minibatch_index + 1, n_train_batches, - this_validation_loss * 100.)) - - # if we got the best validation score until now - if this_validation_loss < best_validation_loss: - - #improve patience if loss improvement is good enough - if this_validation_loss < best_validation_loss * \ - improvement_threshold: - patience = max(patience, iter * patience_increase) - - # save best validation score and iteration number - best_validation_loss = this_validation_loss - best_iter = iter - - # test it on the test set - test_losses = [ - test_model(i) - for i in xrange(n_test_batches) - ] - test_score = numpy.mean(test_losses) - print((' epoch %i, minibatch %i/%i, test error of ' - 'best model %f %%') % - (epoch, minibatch_index + 1, n_train_batches, - test_score * 100.)) - - if patience <= iter: - done_looping = True - break - - end_time = time.clock() - print('Optimization complete.') - print('Best validation score of %f %% obtained at iteration %i, ' - 'with test performance %f %%' % - (best_validation_loss * 100., best_iter + 1, test_score * 100.)) - print >> sys.stderr, ('The code for file ' + - os.path.split(__file__)[1] + - ' ran for %.2fm' % ((end_time - start_time) / 60.)) - -if __name__ == '__main__': - evaluate_lenet5() - - -def experiment(state, channel): - evaluate_lenet5(state.learning_rate, dataset=state.dataset) \ No newline at end of file diff --git a/LeNet-5/logistic_sgd.py b/LeNet-5/logistic_sgd.py deleted file mode 100644 index 83f46d5..0000000 --- a/LeNet-5/logistic_sgd.py +++ /dev/null @@ -1,445 +0,0 @@ -#coding=UTF-8 - -# logistic regression -# http://deeplearning.net/tutorial/logreg.html -# http://www.cnblogs.com/xueliangliu/archive/2013/04/07/3006014.html - - -""" -This tutorial introduces logistic regression using Theano and stochastic -gradient descent. - -Logistic regression is a probabilistic, linear classifier. It is parametrized -by a weight matrix :math:`W` and a bias vector :math:`b`. Classification is -done by projecting data points onto a set of hyperplanes, the distance to -which is used to determine a class membership probability. - -Mathematically, this can be written as: - -.. math:: - P(Y=i|x, W,b) &= softmax_i(W x + b) \\ - &= \frac {e^{W_i x + b_i}} {\sum_j e^{W_j x + b_j}} - - -The output of the model or prediction is then done by taking the argmax of -the vector whose i'th element is P(Y=i|x). - -.. math:: - - y_{pred} = argmax_i P(Y=i|x,W,b) - - -This tutorial presents a stochastic gradient descent optimization method -suitable for large datasets. - - -References: - - - textbooks: "Pattern Recognition and Machine Learning" - - Christopher M. 
Bishop, section 4.3.2 - -""" -__docformat__ = 'restructedtext en' - -import cPickle -import gzip -import os -import sys -import time - -import numpy - -import theano -import theano.tensor as T - - -class LogisticRegression(object): - """Multi-class Logistic Regression Class - - The logistic regression is fully described by a weight matrix :math:`W` - and bias vector :math:`b`. Classification is done by projecting data - points onto a set of hyperplanes, the distance to which is used to - determine a class membership probability. - """ - - def __init__(self, input, n_in, n_out): - """ Initialize the parameters of the logistic regression - - :type input: theano.tensor.TensorType - :param input: symbolic variable that describes the input of the - architecture (one minibatch) - - :type n_in: int - :param n_in: number of input units, the dimension of the space in - which the datapoints lie - - :type n_out: int - :param n_out: number of output units, the dimension of the space in - which the labels lie - - """ - # start-snippet-1 - # initialize with 0 the weights W as a matrix of shape (n_in, n_out) - self.W = theano.shared( - value=numpy.zeros( - (n_in, n_out), - dtype=theano.config.floatX - ), - name='W', - borrow=True - ) - # initialize the baises b as a vector of n_out 0s - self.b = theano.shared( - value=numpy.zeros( - (n_out,), - dtype=theano.config.floatX - ), - name='b', - borrow=True - ) - - # symbolic expression for computing the matrix of class-membership - # probabilities - # Where: - # W is a matrix where column-k represent the separation hyper plain for - # class-k - # x is a matrix where row-j represents input training sample-j - # b is a vector where element-k represent the free parameter of hyper - # plain-k - self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b) - - # symbolic description of how to compute prediction as class whose - # probability is maximal - self.y_pred = T.argmax(self.p_y_given_x, axis=1) - # end-snippet-1 - - # parameters of the model - self.params = [self.W, self.b] - - def negative_log_likelihood(self, y): - """Return the mean of the negative log-likelihood of the prediction - of this model under a given target distribution. - - .. math:: - - \frac{1}{|\mathcal{D}|} \mathcal{L} (\theta=\{W,b\}, \mathcal{D}) = - \frac{1}{|\mathcal{D}|} \sum_{i=0}^{|\mathcal{D}|} - \log(P(Y=y^{(i)}|x^{(i)}, W,b)) \\ - \ell (\theta=\{W,b\}, \mathcal{D}) - - :type y: theano.tensor.TensorType - :param y: corresponds to a vector that gives for each example the - correct label - - Note: we use the mean instead of the sum so that - the learning rate is less dependent on the batch size - """ - # start-snippet-2 - # y.shape[0] is (symbolically) the number of rows in y, i.e., - # number of examples (call it n) in the minibatch - # T.arange(y.shape[0]) is a symbolic vector which will contain - # [0,1,2,... n-1] T.log(self.p_y_given_x) is a matrix of - # Log-Probabilities (call it LP) with one row per example and - # one column per class LP[T.arange(y.shape[0]),y] is a vector - # v containing [LP[0,y[0]], LP[1,y[1]], LP[2,y[2]], ..., - # LP[n-1,y[n-1]]] and T.mean(LP[T.arange(y.shape[0]),y]) is - # the mean (across minibatch examples) of the elements in v, - # i.e., the mean log-likelihood across the minibatch. 
- return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y]) - # end-snippet-2 - - def errors(self, y): - """Return a float representing the number of errors in the minibatch - over the total number of examples of the minibatch ; zero one - loss over the size of the minibatch - - :type y: theano.tensor.TensorType - :param y: corresponds to a vector that gives for each example the - correct label - """ - - # check if y has same dimension of y_pred - if y.ndim != self.y_pred.ndim: - raise TypeError( - 'y should have the same shape as self.y_pred', - ('y', y.type, 'y_pred', self.y_pred.type) - ) - # check if y is of the correct datatype - if y.dtype.startswith('int'): - # the T.neq operator returns a vector of 0s and 1s, where 1 - # represents a mistake in prediction - return T.mean(T.neq(self.y_pred, y)) - else: - raise NotImplementedError() - - -def load_data(dataset): - ''' Loads the dataset - - :type dataset: string - :param dataset: the path to the dataset (here MNIST) - ''' - - ############# - # LOAD DATA # - ############# - - # Download the MNIST dataset if it is not present - data_dir, data_file = os.path.split(dataset) - if data_dir == "" and not os.path.isfile(dataset): - # Check if dataset is in the data directory. - new_path = os.path.join( - os.path.split(__file__)[0], - "..", - "data", - dataset - ) - if os.path.isfile(new_path) or data_file == 'mnist.pkl.gz': - dataset = new_path - - if (not os.path.isfile(dataset)) and data_file == 'mnist.pkl.gz': - import urllib - origin = ( - 'http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz' - ) - print 'Downloading data from %s' % origin - urllib.urlretrieve(origin, dataset) - - print '... loading data' - - # Load the dataset - f = gzip.open(dataset, 'rb') - train_set, valid_set, test_set = cPickle.load(f) - f.close() - #train_set, valid_set, test_set format: tuple(input, target) - #input is an numpy.ndarray of 2 dimensions (a matrix) - #witch row's correspond to an example. target is a - #numpy.ndarray of 1 dimensions (vector)) that have the same length as - #the number of rows in the input. It should give the target - #target to the example with the same index in the input. - - def shared_dataset(data_xy, borrow=True): - """ Function that loads the dataset into shared variables - - The reason we store our dataset in shared variables is to allow - Theano to copy it into the GPU memory (when code is run on GPU). - Since copying data into the GPU is slow, copying a minibatch everytime - is needed (the default behaviour if the data is not in a shared - variable) would lead to a large decrease in performance. - """ - data_x, data_y = data_xy - shared_x = theano.shared(numpy.asarray(data_x, - dtype=theano.config.floatX), - borrow=borrow) - shared_y = theano.shared(numpy.asarray(data_y, - dtype=theano.config.floatX), - borrow=borrow) - # When storing data on the GPU it has to be stored as floats - # therefore we will store the labels as ``floatX`` as well - # (``shared_y`` does exactly that). But during our computations - # we need them as ints (we use labels as index, and if they are - # floats it doesn't make sense) therefore instead of returning - # ``shared_y`` we will have to cast it to int. 
This little hack - # lets ous get around this issue - return shared_x, T.cast(shared_y, 'int32') - - test_set_x, test_set_y = shared_dataset(test_set) - valid_set_x, valid_set_y = shared_dataset(valid_set) - train_set_x, train_set_y = shared_dataset(train_set) - - rval = [(train_set_x, train_set_y), (valid_set_x, valid_set_y), - (test_set_x, test_set_y)] - return rval - - -def sgd_optimization_mnist(learning_rate=0.13, n_epochs=1000, - dataset='mnist.pkl.gz', - batch_size=600): - """ - Demonstrate stochastic gradient descent optimization of a log-linear - model - - This is demonstrated on MNIST. - - :type learning_rate: float - :param learning_rate: learning rate used (factor for the stochastic - gradient) - - :type n_epochs: int - :param n_epochs: maximal number of epochs to run the optimizer - - :type dataset: string - :param dataset: the path of the MNIST dataset file from - http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz - - """ - datasets = load_data(dataset) - - train_set_x, train_set_y = datasets[0] - valid_set_x, valid_set_y = datasets[1] - test_set_x, test_set_y = datasets[2] - - # compute number of minibatches for training, validation and testing - n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size - n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] / batch_size - n_test_batches = test_set_x.get_value(borrow=True).shape[0] / batch_size - - ###################### - # BUILD ACTUAL MODEL # - ###################### - print '... building the model' - - # allocate symbolic variables for the data - index = T.lscalar() # index to a [mini]batch - - # generate symbolic variables for input (x and y represent a - # minibatch) - x = T.matrix('x') # data, presented as rasterized images - y = T.ivector('y') # labels, presented as 1D vector of [int] labels - - # construct the logistic regression class - # Each MNIST image has size 28*28 - classifier = LogisticRegression(input=x, n_in=28 * 28, n_out=10) - - # the cost we minimize during training is the negative log likelihood of - # the model in symbolic format - cost = classifier.negative_log_likelihood(y) - - # compiling a Theano function that computes the mistakes that are made by - # the model on a minibatch - test_model = theano.function( - inputs=[index], - outputs=classifier.errors(y), - givens={ - x: test_set_x[index * batch_size: (index + 1) * batch_size], - y: test_set_y[index * batch_size: (index + 1) * batch_size] - } - ) - - validate_model = theano.function( - inputs=[index], - outputs=classifier.errors(y), - givens={ - x: valid_set_x[index * batch_size: (index + 1) * batch_size], - y: valid_set_y[index * batch_size: (index + 1) * batch_size] - } - ) - - # compute the gradient of cost with respect to theta = (W,b) - g_W = T.grad(cost=cost, wrt=classifier.W) - g_b = T.grad(cost=cost, wrt=classifier.b) - - # start-snippet-3 - # specify how to update the parameters of the model as a list of - # (variable, update expression) pairs. 
- updates = [(classifier.W, classifier.W - learning_rate * g_W), - (classifier.b, classifier.b - learning_rate * g_b)] - - # compiling a Theano function `train_model` that returns the cost, but in - # the same time updates the parameter of the model based on the rules - # defined in `updates` - train_model = theano.function( - inputs=[index], - outputs=cost, - updates=updates, - givens={ - x: train_set_x[index * batch_size: (index + 1) * batch_size], - y: train_set_y[index * batch_size: (index + 1) * batch_size] - } - ) - # end-snippet-3 - - ############### - # TRAIN MODEL # - ############### - print '... training the model' - # early-stopping parameters - patience = 5000 # look as this many examples regardless - patience_increase = 2 # wait this much longer when a new best is - # found - improvement_threshold = 0.995 # a relative improvement of this much is - # considered significant - validation_frequency = min(n_train_batches, patience / 2) - # go through this many - # minibatche before checking the network - # on the validation set; in this case we - # check every epoch - - best_validation_loss = numpy.inf - test_score = 0. - start_time = time.clock() - - done_looping = False - epoch = 0 - while (epoch < n_epochs) and (not done_looping): - epoch = epoch + 1 - for minibatch_index in xrange(n_train_batches): - - minibatch_avg_cost = train_model(minibatch_index) - # iteration number - iter = (epoch - 1) * n_train_batches + minibatch_index - - if (iter + 1) % validation_frequency == 0: - # compute zero-one loss on validation set - validation_losses = [validate_model(i) - for i in xrange(n_valid_batches)] - this_validation_loss = numpy.mean(validation_losses) - - print( - 'epoch %i, minibatch %i/%i, validation error %f %%' % - ( - epoch, - minibatch_index + 1, - n_train_batches, - this_validation_loss * 100. - ) - ) - - # if we got the best validation score until now - if this_validation_loss < best_validation_loss: - #improve patience if loss improvement is good enough - if this_validation_loss < best_validation_loss * \ - improvement_threshold: - patience = max(patience, iter * patience_increase) - - best_validation_loss = this_validation_loss - # test it on the test set - - test_losses = [test_model(i) - for i in xrange(n_test_batches)] - test_score = numpy.mean(test_losses) - - print( - ( - ' epoch %i, minibatch %i/%i, test error of' - ' best model %f %%' - ) % - ( - epoch, - minibatch_index + 1, - n_train_batches, - test_score * 100. - ) - ) - - if patience <= iter: - done_looping = True - break - - end_time = time.clock() - print( - ( - 'Optimization complete with best validation score of %f %%,' - 'with test performance %f %%' - ) - % (best_validation_loss * 100., test_score * 100.) - ) - print 'The code run for %d epochs, with %f epochs/sec' % ( - epoch, 1. * epoch / (end_time - start_time)) - print >> sys.stderr, ('The code for file ' + - os.path.split(__file__)[1] + - ' ran for %.1fs' % ((end_time - start_time))) - -if __name__ == '__main__': - sgd_optimization_mnist() - diff --git a/LeNet-5/mlp.py b/LeNet-5/mlp.py deleted file mode 100644 index 3efd0e4..0000000 --- a/LeNet-5/mlp.py +++ /dev/null @@ -1,404 +0,0 @@ -""" -This tutorial introduces the multilayer perceptron using Theano. - - A multilayer perceptron is a logistic regressor where -instead of feeding the input to the logistic regression you insert a -intermediate layer, called the hidden layer, that has a nonlinear -activation function (usually tanh or sigmoid) . 
One can use many such -hidden layers making the architecture deep. The tutorial will also tackle -the problem of MNIST digit classification. - -.. math:: - - f(x) = G( b^{(2)} + W^{(2)}( s( b^{(1)} + W^{(1)} x))), - -References: - - - textbooks: "Pattern Recognition and Machine Learning" - - Christopher M. Bishop, section 5 - -""" -__docformat__ = 'restructedtext en' - - -import os -import sys -import time - -import numpy - -import theano -import theano.tensor as T - - -from logistic_sgd import LogisticRegression, load_data - - -# start-snippet-1 -class HiddenLayer(object): - def __init__(self, rng, input, n_in, n_out, W=None, b=None, - activation=T.tanh): - """ - Typical hidden layer of a MLP: units are fully-connected and have - sigmoidal activation function. Weight matrix W is of shape (n_in,n_out) - and the bias vector b is of shape (n_out,). - - NOTE : The nonlinearity used here is tanh - - Hidden unit activation is given by: tanh(dot(input,W) + b) - - :type rng: numpy.random.RandomState - :param rng: a random number generator used to initialize weights - - :type input: theano.tensor.dmatrix - :param input: a symbolic tensor of shape (n_examples, n_in) - - :type n_in: int - :param n_in: dimensionality of input - - :type n_out: int - :param n_out: number of hidden units - - :type activation: theano.Op or function - :param activation: Non linearity to be applied in the hidden - layer - """ - self.input = input - # end-snippet-1 - - # `W` is initialized with `W_values` which is uniformely sampled - # from sqrt(-6./(n_in+n_hidden)) and sqrt(6./(n_in+n_hidden)) - # for tanh activation function - # the output of uniform if converted using asarray to dtype - # theano.config.floatX so that the code is runable on GPU - # Note : optimal initialization of weights is dependent on the - # activation function used (among other things). - # For example, results presented in [Xavier10] suggest that you - # should use 4 times larger initial weights for sigmoid - # compared to tanh - # We have no info for other function, so we use the same as - # tanh. - if W is None: - W_values = numpy.asarray( - rng.uniform( - low=-numpy.sqrt(6. / (n_in + n_out)), - high=numpy.sqrt(6. / (n_in + n_out)), - size=(n_in, n_out) - ), - dtype=theano.config.floatX - ) - if activation == theano.tensor.nnet.sigmoid: - W_values *= 4 - - W = theano.shared(value=W_values, name='W', borrow=True) - - if b is None: - b_values = numpy.zeros((n_out,), dtype=theano.config.floatX) - b = theano.shared(value=b_values, name='b', borrow=True) - - self.W = W - self.b = b - - lin_output = T.dot(input, self.W) + self.b - self.output = ( - lin_output if activation is None - else activation(lin_output) - ) - # parameters of the model - self.params = [self.W, self.b] - - -# start-snippet-2 -class MLP(object): - """Multi-Layer Perceptron Class - - A multilayer perceptron is a feedforward artificial neural network model - that has one layer or more of hidden units and nonlinear activations. - Intermediate layers usually have as activation function tanh or the - sigmoid function (defined here by a ``HiddenLayer`` class) while the - top layer is a softamx layer (defined here by a ``LogisticRegression`` - class). 
- """ - - def __init__(self, rng, input, n_in, n_hidden, n_out): - """Initialize the parameters for the multilayer perceptron - - :type rng: numpy.random.RandomState - :param rng: a random number generator used to initialize weights - - :type input: theano.tensor.TensorType - :param input: symbolic variable that describes the input of the - architecture (one minibatch) - - :type n_in: int - :param n_in: number of input units, the dimension of the space in - which the datapoints lie - - :type n_hidden: int - :param n_hidden: number of hidden units - - :type n_out: int - :param n_out: number of output units, the dimension of the space in - which the labels lie - - """ - - # Since we are dealing with a one hidden layer MLP, this will translate - # into a HiddenLayer with a tanh activation function connected to the - # LogisticRegression layer; the activation function can be replaced by - # sigmoid or any other nonlinear function - self.hiddenLayer = HiddenLayer( - rng=rng, - input=input, - n_in=n_in, - n_out=n_hidden, - activation=T.tanh - ) - - # The logistic regression layer gets as input the hidden units - # of the hidden layer - self.logRegressionLayer = LogisticRegression( - input=self.hiddenLayer.output, - n_in=n_hidden, - n_out=n_out - ) - # end-snippet-2 start-snippet-3 - # L1 norm ; one regularization option is to enforce L1 norm to - # be small - self.L1 = ( - abs(self.hiddenLayer.W).sum() - + abs(self.logRegressionLayer.W).sum() - ) - - # square of L2 norm ; one regularization option is to enforce - # square of L2 norm to be small - self.L2_sqr = ( - (self.hiddenLayer.W ** 2).sum() - + (self.logRegressionLayer.W ** 2).sum() - ) - - # negative log likelihood of the MLP is given by the negative - # log likelihood of the output of the model, computed in the - # logistic regression layer - self.negative_log_likelihood = ( - self.logRegressionLayer.negative_log_likelihood - ) - # same holds for the function computing the number of errors - self.errors = self.logRegressionLayer.errors - - # the parameters of the model are the parameters of the two layer it is - # made out of - self.params = self.hiddenLayer.params + self.logRegressionLayer.params - # end-snippet-3 - - -def test_mlp(learning_rate=0.01, L1_reg=0.00, L2_reg=0.0001, n_epochs=1000, - dataset='mnist.pkl.gz', batch_size=20, n_hidden=500): - """ - Demonstrate stochastic gradient descent optimization for a multilayer - perceptron - - This is demonstrated on MNIST. 
- - :type learning_rate: float - :param learning_rate: learning rate used (factor for the stochastic - gradient - - :type L1_reg: float - :param L1_reg: L1-norm's weight when added to the cost (see - regularization) - - :type L2_reg: float - :param L2_reg: L2-norm's weight when added to the cost (see - regularization) - - :type n_epochs: int - :param n_epochs: maximal number of epochs to run the optimizer - - :type dataset: string - :param dataset: the path of the MNIST dataset file from - http://www.iro.umontreal.ca/~lisa/deep/data/mnist/mnist.pkl.gz - - - """ - datasets = load_data(dataset) - - train_set_x, train_set_y = datasets[0] - valid_set_x, valid_set_y = datasets[1] - test_set_x, test_set_y = datasets[2] - - # compute number of minibatches for training, validation and testing - n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size - n_valid_batches = valid_set_x.get_value(borrow=True).shape[0] / batch_size - n_test_batches = test_set_x.get_value(borrow=True).shape[0] / batch_size - - ###################### - # BUILD ACTUAL MODEL # - ###################### - print '... building the model' - - # allocate symbolic variables for the data - index = T.lscalar() # index to a [mini]batch - x = T.matrix('x') # the data is presented as rasterized images - y = T.ivector('y') # the labels are presented as 1D vector of - # [int] labels - - rng = numpy.random.RandomState(1234) - - # construct the MLP class - classifier = MLP( - rng=rng, - input=x, - n_in=28 * 28, - n_hidden=n_hidden, - n_out=10 - ) - - # start-snippet-4 - # the cost we minimize during training is the negative log likelihood of - # the model plus the regularization terms (L1 and L2); cost is expressed - # here symbolically - cost = ( - classifier.negative_log_likelihood(y) - + L1_reg * classifier.L1 - + L2_reg * classifier.L2_sqr - ) - # end-snippet-4 - - # compiling a Theano function that computes the mistakes that are made - # by the model on a minibatch - test_model = theano.function( - inputs=[index], - outputs=classifier.errors(y), - givens={ - x: test_set_x[index * batch_size:(index + 1) * batch_size], - y: test_set_y[index * batch_size:(index + 1) * batch_size] - } - ) - - validate_model = theano.function( - inputs=[index], - outputs=classifier.errors(y), - givens={ - x: valid_set_x[index * batch_size:(index + 1) * batch_size], - y: valid_set_y[index * batch_size:(index + 1) * batch_size] - } - ) - - # start-snippet-5 - # compute the gradient of cost with respect to theta (sotred in params) - # the resulting gradients will be stored in a list gparams - gparams = [T.grad(cost, param) for param in classifier.params] - - # specify how to update the parameters of the model as a list of - # (variable, update expression) pairs - - # given two list the zip A = [a1, a2, a3, a4] and B = [b1, b2, b3, b4] of - # same length, zip generates a list C of same size, where each element - # is a pair formed from the two lists : - # C = [(a1, b1), (a2, b2), (a3, b3), (a4, b4)] - updates = [ - (param, param - learning_rate * gparam) - for param, gparam in zip(classifier.params, gparams) - ] - - # compiling a Theano function `train_model` that returns the cost, but - # in the same time updates the parameter of the model based on the rules - # defined in `updates` - train_model = theano.function( - inputs=[index], - outputs=cost, - updates=updates, - givens={ - x: train_set_x[index * batch_size: (index + 1) * batch_size], - y: train_set_y[index * batch_size: (index + 1) * batch_size] - } - ) - # end-snippet-5 - - 
############### - # TRAIN MODEL # - ############### - print '... training' - - # early-stopping parameters - patience = 10000 # look as this many examples regardless - patience_increase = 2 # wait this much longer when a new best is - # found - improvement_threshold = 0.995 # a relative improvement of this much is - # considered significant - validation_frequency = min(n_train_batches, patience / 2) - # go through this many - # minibatche before checking the network - # on the validation set; in this case we - # check every epoch - - best_validation_loss = numpy.inf - best_iter = 0 - test_score = 0. - start_time = time.clock() - - epoch = 0 - done_looping = False - - while (epoch < n_epochs) and (not done_looping): - epoch = epoch + 1 - for minibatch_index in xrange(n_train_batches): - - minibatch_avg_cost = train_model(minibatch_index) - # iteration number - iter = (epoch - 1) * n_train_batches + minibatch_index - - if (iter + 1) % validation_frequency == 0: - # compute zero-one loss on validation set - validation_losses = [validate_model(i) for i - in xrange(n_valid_batches)] - this_validation_loss = numpy.mean(validation_losses) - - print( - 'epoch %i, minibatch %i/%i, validation error %f %%' % - ( - epoch, - minibatch_index + 1, - n_train_batches, - this_validation_loss * 100. - ) - ) - - # if we got the best validation score until now - if this_validation_loss < best_validation_loss: - #improve patience if loss improvement is good enough - if ( - this_validation_loss < best_validation_loss * - improvement_threshold - ): - patience = max(patience, iter * patience_increase) - - best_validation_loss = this_validation_loss - best_iter = iter - - # test it on the test set - test_losses = [test_model(i) for i - in xrange(n_test_batches)] - test_score = numpy.mean(test_losses) - - print((' epoch %i, minibatch %i/%i, test error of ' - 'best model %f %%') % - (epoch, minibatch_index + 1, n_train_batches, - test_score * 100.)) - - if patience <= iter: - done_looping = True - break - - end_time = time.clock() - print(('Optimization complete. 
Best validation score of %f %% ' - 'obtained at iteration %i, with test performance %f %%') % - (best_validation_loss * 100., best_iter + 1, test_score * 100.)) - print >> sys.stderr, ('The code for file ' + - os.path.split(__file__)[1] + - ' ran for %.2fm' % ((end_time - start_time) / 60.)) - - -if __name__ == '__main__': - test_mlp() \ No newline at end of file diff --git a/LeNet-5/runGPU.py b/LeNet-5/runGPU.py deleted file mode 100644 index fbdcdae..0000000 --- a/LeNet-5/runGPU.py +++ /dev/null @@ -1,22 +0,0 @@ -from theano import function, config, shared, sandbox -import theano.tensor as T -import numpy -import time - -vlen = 10 * 30 * 768 # 10 x #cores x # threads per core -iters = 1000 - -rng = numpy.random.RandomState(22) -x = shared(numpy.asarray(rng.rand(vlen), config.floatX)) -f = function([], T.exp(x)) -print f.maker.fgraph.toposort() -t0 = time.time() -for i in xrange(iters): - r = f() -t1 = time.time() -print 'Looping %d times took' % iters, t1 - t0, 'seconds' -print 'Result is', r -if numpy.any([isinstance(x.op, T.Elemwise) for x in f.maker.fgraph.toposort()]): - print 'Used the cpu' -else: - print 'Used the gpu' \ No newline at end of file diff --git a/LeNet-5/utils.py b/LeNet-5/utils.py deleted file mode 100644 index 3b50019..0000000 --- a/LeNet-5/utils.py +++ /dev/null @@ -1,139 +0,0 @@ -""" This file contains different utility functions that are not connected -in anyway to the networks presented in the tutorials, but rather help in -processing the outputs into a more understandable way. - -For example ``tile_raster_images`` helps in generating a easy to grasp -image from a set of samples or weights. -""" - - -import numpy - - -def scale_to_unit_interval(ndar, eps=1e-8): - """ Scales all values in the ndarray ndar to be between 0 and 1 """ - ndar = ndar.copy() - ndar -= ndar.min() - ndar *= 1.0 / (ndar.max() + eps) - return ndar - - -def tile_raster_images(X, img_shape, tile_shape, tile_spacing=(0, 0), - scale_rows_to_unit_interval=True, - output_pixel_vals=True): - """ - Transform an array with one flattened image per row, into an array in - which images are reshaped and layed out like tiles on a floor. - - This function is useful for visualizing datasets whose rows are images, - and also columns of matrices for transforming those rows - (such as the first layer of a neural net). - - :type X: a 2-D ndarray or a tuple of 4 channels, elements of which can - be 2-D ndarrays or None; - :param X: a 2-D array in which every row is a flattened image. - - :type img_shape: tuple; (height, width) - :param img_shape: the original shape of each image - - :type tile_shape: tuple; (rows, cols) - :param tile_shape: the number of images to tile (rows, cols) - - :param output_pixel_vals: if output should be pixel values (i.e. int8 - values) or floats - - :param scale_rows_to_unit_interval: if the values need to be scaled before - being plotted to [0,1] or not - - - :returns: array suitable for viewing as an image. - (See:`Image.fromarray`.) - :rtype: a 2-d array with same dtype as X. 
- - """ - - assert len(img_shape) == 2 - assert len(tile_shape) == 2 - assert len(tile_spacing) == 2 - - # The expression below can be re-written in a more C style as - # follows : - # - # out_shape = [0,0] - # out_shape[0] = (img_shape[0]+tile_spacing[0])*tile_shape[0] - - # tile_spacing[0] - # out_shape[1] = (img_shape[1]+tile_spacing[1])*tile_shape[1] - - # tile_spacing[1] - out_shape = [ - (ishp + tsp) * tshp - tsp - for ishp, tshp, tsp in zip(img_shape, tile_shape, tile_spacing) - ] - - if isinstance(X, tuple): - assert len(X) == 4 - # Create an output numpy ndarray to store the image - if output_pixel_vals: - out_array = numpy.zeros((out_shape[0], out_shape[1], 4), - dtype='uint8') - else: - out_array = numpy.zeros((out_shape[0], out_shape[1], 4), - dtype=X.dtype) - - #colors default to 0, alpha defaults to 1 (opaque) - if output_pixel_vals: - channel_defaults = [0, 0, 0, 255] - else: - channel_defaults = [0., 0., 0., 1.] - - for i in xrange(4): - if X[i] is None: - # if channel is None, fill it with zeros of the correct - # dtype - dt = out_array.dtype - if output_pixel_vals: - dt = 'uint8' - out_array[:, :, i] = numpy.zeros( - out_shape, - dtype=dt - ) + channel_defaults[i] - else: - # use a recurrent call to compute the channel and store it - # in the output - out_array[:, :, i] = tile_raster_images( - X[i], img_shape, tile_shape, tile_spacing, - scale_rows_to_unit_interval, output_pixel_vals) - return out_array - - else: - # if we are dealing with only one channel - H, W = img_shape - Hs, Ws = tile_spacing - - # generate a matrix to store the output - dt = X.dtype - if output_pixel_vals: - dt = 'uint8' - out_array = numpy.zeros(out_shape, dtype=dt) - - for tile_row in xrange(tile_shape[0]): - for tile_col in xrange(tile_shape[1]): - if tile_row * tile_shape[1] + tile_col < X.shape[0]: - this_x = X[tile_row * tile_shape[1] + tile_col] - if scale_rows_to_unit_interval: - # if we should scale values to be between 0 and 1 - # do this by calling the `scale_to_unit_interval` - # function - this_img = scale_to_unit_interval( - this_x.reshape(img_shape)) - else: - this_img = this_x.reshape(img_shape) - # add the slice to the corresponding position in the - # output array - c = 1 - if output_pixel_vals: - c = 255 - out_array[ - tile_row * (H + Hs): tile_row * (H + Hs) + H, - tile_col * (W + Ws): tile_col * (W + Ws) + W - ] = this_img * c - return out_array diff --git a/Mathematical-Modeling-2014/Project/baidu_spider.py b/Mathematical-Modeling-2014/Project/baidu_spider.py deleted file mode 100644 index 9f6124d..0000000 --- a/Mathematical-Modeling-2014/Project/baidu_spider.py +++ /dev/null @@ -1,140 +0,0 @@ -# -*- coding: utf-8 -*- -#--------------------------------------- -# 程序:百度贴吧爬虫 -# 版本:0.5 -# 作者:why -# 日期:2013-05-16 -# 语言:Python 2.7 -# 操作:输入网址后自动只看楼主并保存到本地文件 -# 功能:将楼主发布的内容打包txt存储到本地。 -#--------------------------------------- - -import string -import urllib2 -import re - -#----------- 处理页面上的各种标签 ----------- -class HTML_Tool: - # 用非 贪婪模式 匹配 \t 或者 \n 或者 空格 或者 超链接 或者 图片 - BgnCharToNoneRex = re.compile("(\t|\n| ||)") - - # 用非 贪婪模式 匹配 任意<>标签 - EndCharToNoneRex = re.compile("<.*?>") - - # 用非 贪婪模式 匹配 任意
<p>
标签 - BgnPartRex = re.compile("<p.*?>") - CharToNewLineRex = re.compile("(<br/>|</p>|<tr>|<div>|</div>
)") - CharToNextTabRex = re.compile("") - - # 将一些html的符号实体转变为原始符号 - replaceTab = [("<","<"),(">",">"),("&","&"),("&","\""),(" "," ")] - - def Replace_Char(self,x): - x = self.BgnCharToNoneRex.sub("",x) - x = self.BgnPartRex.sub("\n ",x) - x = self.CharToNewLineRex.sub("\n",x) - x = self.CharToNextTabRex.sub("\t",x) - x = self.EndCharToNoneRex.sub("",x) - - for t in self.replaceTab: - x = x.replace(t[0],t[1]) - return x - -class Baidu_Spider: - # 申明相关的属性 - def __init__(self,url): - self.myUrl = url + '?see_lz=1' - self.datas = [] - self.myTool = HTML_Tool() - print u'已经启动百度贴吧爬虫,咔嚓咔嚓' - - # 初始化加载页面并将其转码储存 - def baidu_tieba(self): - # 读取页面的原始信息并将其从gbk转码 - myPage = urllib2.urlopen(self.myUrl).read().decode("gbk") - # 计算楼主发布内容一共有多少页 - endPage = self.page_counter(myPage) - # 获取该帖的标题 - title = self.find_title(myPage) - print u'文章名称:' + title - # 获取最终的数据 - self.save_data(self.myUrl,title,endPage) - - #用来计算一共有多少页 - def page_counter(self,myPage): - # 匹配 "共有12页" 来获取一共有多少页 - myMatch = re.search(r'class="red">(\d+?)', myPage, re.S) - if myMatch: - endPage = int(myMatch.group(1)) - print u'爬虫报告:发现楼主共有%d页的原创内容' % endPage - else: - endPage = 0 - print u'爬虫报告:无法计算楼主发布内容有多少页!' - return endPage - - # 用来寻找该帖的标题 - def find_title(self,myPage): - # 匹配
<h1 class="core_title_txt" title="xxxxxxxxxx">xxxxxxxxxx</h1>
找出标题 - myMatch = re.search(r'(.*?)', myPage, re.S) - title = u'暂无标题' - if myMatch: - title = myMatch.group(1) - else: - print u'爬虫报告:无法加载文章标题!' - # 文件名不能包含以下字符: \ / : * ? " < > | - title = title.replace('\\','').replace('/','').replace(':','').replace('*','').replace('?','').replace('"','').replace('>','').replace('<','').replace('|','') - return title - - - # 用来存储楼主发布的内容 - def save_data(self,url,title,endPage): - # 加载页面数据到数组中 - self.get_data(url,endPage) - # 打开本地文件 - f = open(title+'.txt','w+') - f.writelines(self.datas) - f.close() - print u'爬虫报告:文件已下载到本地并打包成txt文件' - print u'请按任意键退出...' - raw_input(); - - # 获取页面源码并将其存储到数组中 - def get_data(self,url,endPage): - url = url + '&pn=' - for i in range(1,endPage+1): - print u'爬虫报告:爬虫%d号正在加载中...' % i - myPage = urllib2.urlopen(url + str(i)).read() - # 将myPage中的html代码处理并存储到datas里面 - self.deal_data(myPage.decode('gbk')) - - - # 将内容从页面代码中抠出来 - def deal_data(self,myPage): - myItems = re.findall('id="post_content.*?>(.*?)',myPage,re.S) - for item in myItems: - data = self.myTool.Replace_Char(item.replace("\n","").encode('gbk')) - self.datas.append(data+'\n') - - - -#-------- 程序入口处 ------------------ -print u"""#--------------------------------------- -# 程序:百度贴吧爬虫 -# 版本:0.5 -# 作者:why -# 日期:2013-05-16 -# 语言:Python 2.7 -# 操作:输入网址后自动只看楼主并保存到本地文件 -# 功能:将楼主发布的内容打包txt存储到本地。 -#--------------------------------------- -""" - -# 以某小说贴吧为例子 -# bdurl = 'http://tieba.baidu.com/p/2296712428?see_lz=1&pn=1' - -print u'请输入贴吧的地址最后的数字串:' -bdurl = 'http://tieba.baidu.com/p/' + str(raw_input(u'http://tieba.baidu.com/p/')) - -#调用 -mySpider = Baidu_Spider(bdurl) -mySpider.baidu_tieba() \ No newline at end of file diff --git a/Mathematical-Modeling-2014/Project/cloud_large.png b/Mathematical-Modeling-2014/Project/cloud_large.png deleted file mode 100644 index f8b17b9..0000000 Binary files a/Mathematical-Modeling-2014/Project/cloud_large.png and /dev/null differ diff --git a/Mathematical-Modeling-2014/Project/myTest/TTT.txt b/Mathematical-Modeling-2014/Project/myTest/TTT.txt deleted file mode 100644 index 5fe7392..0000000 --- a/Mathematical-Modeling-2014/Project/myTest/TTT.txt +++ /dev/null @@ -1,105 +0,0 @@ - SA( 1, 1) 0.000000 - SA( 1, 2) 4.000000 - SA( 1, 3) 0.000000 - SA( 1, 4) 0.000000 - SA( 1, 5) 0.000000 - SA( 2, 1) 0.000000 - SA( 2, 2) 4.000000 - SA( 2, 3) 0.000000 - SA( 2, 4) 0.000000 - SA( 2, 5) 0.000000 - SA( 3, 1) 4.000000 - SA( 3, 2) 0.000000 - SA( 3, 3) 0.000000 - SA( 3, 4) 0.000000 - SA( 3, 5) 0.000000 - SA( 4, 1) 0.000000 - SA( 4, 2) 0.000000 - SA( 4, 3) 0.000000 - SA( 4, 4) 4.000000 - SA( 4, 5) 0.000000 - SA( 5, 1) 0.000000 - SA( 5, 2) 4.000000 - SA( 5, 3) 0.000000 - SA( 5, 4) 0.000000 - SA( 5, 5) 0.000000 - SA( 6, 1) 0.000000 - SA( 6, 2) 0.000000 - SA( 6, 3) 0.000000 - SA( 6, 4) 3.000000 - SA( 6, 5) 0.000000 - SA( 7, 1) 4.000000 - SA( 7, 2) 0.000000 - SA( 7, 3) 0.000000 - SA( 7, 4) 0.000000 - SA( 7, 5) 0.000000 - SA( 8, 1) 0.000000 - SA( 8, 2) 0.000000 - SA( 8, 3) 0.000000 - SA( 8, 4) 4.000000 - SA( 8, 5) 0.000000 - SA( 9, 1) 0.000000 - SA( 9, 2) 4.000000 - SA( 9, 3) 0.000000 - SA( 9, 4) 0.000000 - SA( 9, 5) 0.000000 - SA( 10, 1) 0.000000 - SA( 10, 2) 4.000000 - SA( 10, 3) 0.000000 - SA( 10, 4) 0.000000 - SA( 10, 5) 0.000000 - SA( 11, 1) 0.000000 - SA( 11, 2) 0.000000 - SA( 11, 3) 0.000000 - SA( 11, 4) 4.000000 - SA( 11, 5) 0.000000 - SA( 12, 1) 0.000000 - SA( 12, 2) 4.000000 - SA( 12, 3) 0.000000 - SA( 12, 4) 0.000000 - SA( 12, 5) 0.000000 - SA( 13, 1) 0.000000 - SA( 13, 2) 0.000000 - SA( 13, 3) 0.000000 - SA( 13, 4) 4.000000 - SA( 13, 5) 0.000000 - SA( 14, 1) 
4.000000 - SA( 14, 2) 0.000000 - SA( 14, 3) 0.000000 - SA( 14, 4) 0.000000 - SA( 14, 5) 0.000000 - SA( 15, 1) 0.000000 - SA( 15, 2) 0.000000 - SA( 15, 3) 0.000000 - SA( 15, 4) 4.000000 - SA( 15, 5) 0.000000 - SA( 16, 1) 0.000000 - SA( 16, 2) 0.000000 - SA( 16, 3) 4.000000 - SA( 16, 4) 0.000000 - SA( 16, 5) 0.000000 - SA( 17, 1) 0.000000 - SA( 17, 2) 0.000000 - SA( 17, 3) 4.000000 - SA( 17, 4) 0.000000 - SA( 17, 5) 0.000000 - SA( 18, 1) 0.000000 - SA( 18, 2) 0.000000 - SA( 18, 3) 4.000000 - SA( 18, 4) 0.000000 - SA( 18, 5) 0.000000 - SA( 19, 1) 0.000000 - SA( 19, 2) 0.000000 - SA( 19, 3) 4.000000 - SA( 19, 4) 0.000000 - SA( 19, 5) 0.000000 - SA( 20, 1) 0.000000 - SA( 20, 2) 0.000000 - SA( 20, 3) 4.000000 - SA( 20, 4) 0.000000 - SA( 20, 5) 0.000000 - SA( 21, 1) 0.000000 - SA( 21, 2) 0.000000 - SA( 21, 3) 4.000000 - SA( 21, 4) 0.000000 - SA( 21, 5) 0.000000 \ No newline at end of file diff --git a/Mathematical-Modeling-2014/Project/myTest/ansj_dict.py b/Mathematical-Modeling-2014/Project/myTest/ansj_dict.py deleted file mode 100644 index db1e280..0000000 --- a/Mathematical-Modeling-2014/Project/myTest/ansj_dict.py +++ /dev/null @@ -1,149 +0,0 @@ -#coding:utf-8 - -path = "C:\\Users\\Syndrome\\Desktop\\语料数据\\ansj词典\\".decode('utf8').encode('cp936') -new_path = path + "81W_dict.txt" - -#################################################词典读取 -myFile = open(new_path,"r") - -word_81 = [] -word_length = [] - -line = myFile.readline() - -i = 1 -while line: - line = line.rstrip('\n') - # print line - word_81.append(line) - word_length.append(len(line)/3) - line = myFile.readline() - i += 1 - -max_len = max(word_length) -print "the num of word is " + str(i) -print "the max of length is " + str(max_len) -print "part1" - -myFile.close() - -#################################################词典按长度储存 - -newPath = path + "ansj_simple.txt" - -myFile = open(newPath , 'w') - -for i in range(50,-1,-1): #for循环的书写 - for j in range(0,len(word_length)): - if word_length[j] == i: - newLine = word_81[j] + "\n" - myFile.writelines(newLine) - -myFile.close() - - -print "part2" - -##############################################################词典词语长度坐标文件 -new_word_length = sorted(word_length) -new_len = [811639] - -j = 0 -for i in range(0,12): - while j < len(new_word_length): - if new_word_length[j] == i: - pass - else: - new_len.append(811639-j) - break - j += 1 -new_len.append(811639-j) - -newPath = path + "ansj_word_num.txt" - -myFile = open(newPath , 'w') - -print len(new_len) -print new_len -for i in range(0,len(new_len)): - myFile.writelines(str(new_len[i]) + '\n') - -myFile.close() - -print "part3" - -#################################################分词 - -word = [] - -myFile = open(path + "ansj_simple.txt" , 'r') -line = myFile.readline().rstrip('\n') -i = 0 -while line: - word.append(line) - line = myFile.readline().rstrip('\n') -myFile.close() -print "dictionary is ready!" - -word_num = new_len -print "the position of word is ready!" 
- - -TEST = "一位朴实美丽的渔家姑娘从红树林边的渔村闯入都市,经历了情感的波折和撞击演绎出复杂而\ -又多变的人生。故事发生在有着大面积红树林的小渔村和南海海滨一座新兴的小城里。渔家姑娘珍珠进\ -城打工,珍珠公司总经理大虎对她一见钟情,珍珠却不为所动。大虎企图强占珍珠,珍珠毅然回到红树\ -林。大虎在另两个干部子弟二虎和三虎的挑唆下,轮奸了珍珠。珍珠的意中人大同进行报复,欲杀大虎\ -的母亲、副市长林岚,却刺伤了检查官马叔。大虎又与二虎、三虎轮奸了女工小云,被当场抓获。林岚\ -救子心切,落入了刑侦科长金大川手里。马叔与牛晋顶住压力,使案件终于重审,三个虎被绳之以法。" - -new_sent = [] -T_len = len(TEST)/3 - -if T_len < 10: - s = T_len -else: - s = 9 - -while s > 0: - flag = 0 - # print word_num[s]-1 - # print word_num[s+1] - for i in range(word_num[s]-2,word_num[s+1]-1,-1): - # print i - if TEST[0:s*3] == word[i]: - new_sent.append(word[i]) - print word[i] + "ZZZZZZZZZ" - flag = 1 - break - if flag == 1: - TEST = TEST[s*3:] - if len(TEST)/3 < 10: - s = len(TEST)/3 - else: - s = 9 - else: - s -= 1 - if s == 1: - new_sent.append(TEST[:s*3]) - print "TTTTT" + TEST[:s*3] + " " + str(s) - TEST = TEST[s*3:] - if len(TEST)/3 < 10: - s = len(TEST)/3 - else: - s = 9 - -for item in new_sent: - print item + "\\", - -print "\npart4" - - - - - - - - - - - diff --git a/Mathematical-Modeling-2014/Project/myTest/get_word_length.py b/Mathematical-Modeling-2014/Project/myTest/get_word_length.py deleted file mode 100644 index 1ff701c..0000000 --- a/Mathematical-Modeling-2014/Project/myTest/get_word_length.py +++ /dev/null @@ -1,87 +0,0 @@ -#coding:utf-8 - - -##############################################################词典文件读取 -#中文路径的处理 -path = "C:\\Users\\Syndrome\\Desktop\\语料数据\\360W_字典\\".decode('utf8').encode('cp936') - -myFile = open(path + "dict_360.txt","r") - -word_length = [] -word_line = [] - -line = myFile.readline() -i = 0 -while line: - word_line.append(line) - line = line.rstrip('\n') #去掉换行符 - m = line.split('\t') #以\t为分隔符 - #word_length[i] = len(m[0])/3 - word_length.append(len(m[0])/3) - i += 1 - line = myFile.readline() - # if i >= 1000: - # break -myFile.close() - -print "finish" -print "max of the length of word is " + str(max(word_length)) -print len(word_length) -print len(word_line) - -#写文件 -##############################################################词典文件增加词语长度后,基于长度排序再保存 -newPath = path + "dictionary.txt" -myFile = open(newPath , 'w') - -for i in range(50,-1,-1): #for循环的书写 - for j in range(0,len(word_length)): - if word_length[j] == i: - newLine = str(i) + '\t' + word_line[j] - myFile.writelines(newLine) - -myFile.close() - - -##############################################################简化词典文件,基于长度排序的保存 -newPath = path + "dictionary_simple.txt" - -myFile = open(newPath , 'w') - -for i in range(50,-1,-1): #for循环的书写 - for j in range(0,len(word_length)): - if word_length[j] == i: - m = word_line[j].split('\t') - newLine = m[0] + "\n" - myFile.writelines(newLine) - -myFile.close() - -##############################################################词典词语长度坐标文件 -new_word_length = sorted(word_length) -new_len = [0] - -j = 0 -for i in range(0,50): - while j < len(new_word_length): - if new_word_length[j] == i: - pass - else: - new_len.append(3669216-j) - break - j += 1 -new_len.append(3669216-j) - -newPath = path + "word_num.txt" - -myFile = open(newPath , 'w') - -print len(new_len) -print new_len -for i in range(0,len(new_len)): - myFile.writelines(str(new_len[i]) + '\n') - -myFile.close() - - - diff --git a/Mathematical-Modeling-2014/Project/myTest/math1.py b/Mathematical-Modeling-2014/Project/myTest/math1.py deleted file mode 100644 index 6d008ee..0000000 --- a/Mathematical-Modeling-2014/Project/myTest/math1.py +++ /dev/null @@ -1,16 +0,0 @@ -#coding:utf-8 - - -myFile = open("TTT.txt") - -line = myFile.readline() - -print line - - - - - - - - diff --git 
a/Mathematical-Modeling-2014/Project/myTest/nltk_test.py b/Mathematical-Modeling-2014/Project/myTest/nltk_test.py deleted file mode 100644 index a09d31a..0000000 --- a/Mathematical-Modeling-2014/Project/myTest/nltk_test.py +++ /dev/null @@ -1,67 +0,0 @@ -#coding:utf-8 - -word_num = [0,0,1,2,2,3,3,3,3,3,3,3,3,3,4] -word = ["你是","我今生","唯一的挚爱","你是我今生唯一的挚爱啊啊啊啊"] - -TEST = "他说你是我今生唯一的挚爱" - -T_len = len(TEST)/3 -print T_len -s = T_len - -while s > 0: - flag = 0 - print TEST[0:s*3] - for i in range(word_num[s]-1,word_num[s+1]): - print word[i]+"sss" - if TEST[0:s*3] == word[i]: - print word[i] + "XXXXXX" - flag = 1 - if flag == 1: - TEST = TEST[s*3:] - s = len(TEST)/3 - else: - s -= 1 - if s == 1: - print TEST[:s*3] + "ZZZZZZZ" - TEST = TEST[s*3:] - s = len(TEST)/3 - - -import random -def guess(player): - declare = 'You enter number not between 1 and 99!' - number = int(raw_input('Player %s - Enter a number between 1 and 99:' % player)) - if number < 1: - print declare - elif number > 99: - print declare - else: - pass - return number - -def game(): - i = 1 - count = [0,0,0] - falg = True - rambom_num = random.randrange(1,99) - while falg: - for player in range(0,3): - number = guess(player + 1) - count[player] = i - if number > rambom_num: - print 'Your guess is too high!' - elif number < rambom_num: - print 'Your guess is too low!' - else: - print '--------------------------------------' - print 'Your made the right guess!' - print 'The secret number is %s' % number - for p in range(0,len(count)): - print 'Player %s - Total number of guesses: %s' % (p + 1,count[p]) - falg = False - break - i = i + 1 - -game() - \ No newline at end of file diff --git a/Mathematical-Modeling-2014/Project/myTest/pachong_test.py b/Mathematical-Modeling-2014/Project/myTest/pachong_test.py deleted file mode 100644 index 568428d..0000000 --- a/Mathematical-Modeling-2014/Project/myTest/pachong_test.py +++ /dev/null @@ -1,34 +0,0 @@ - -import urllib2 -url='http://www.baidu.com/s?wd=cloga' -content=urllib2.urlopen(url).read() - - -import re -urls_pat=re.compile(r'(.*?)') -siteUrls=re.findall(urls_pat,content) - -print siteUrls - -strip_tag_pat = re.compile(r'<.*?>') - -rank = 0 -file=open('result.txt','w') -for i in siteUrls: - i0=re.sub(strip_tag_pat,'',i) - i0=i0.strip() - i1=i0.split(' ') - date=i1[-1] - siteUrl=''.join(i1[:-1]) - rank+=1 - file.write(date+','+siteUrl+','+str(rank)+'\n') -file.close() - - - - - - - - - diff --git a/Mathematical-Modeling-2014/Project/myTest/result.txt b/Mathematical-Modeling-2014/Project/myTest/result.txt deleted file mode 100644 index c9869ec..0000000 --- a/Mathematical-Modeling-2014/Project/myTest/result.txt +++ /dev/null @@ -1,9 +0,0 @@ -cloga.info/ 2014-07-26 ,,2 -github.com/cloga 2012-01-10 ,,3 -www.douban.com/people/... 2013-05-12 ,,4 -cn.linkedin.com/in/clo... 2013-01-28 ,,5 -www.weibo.com/cloga 2014-07-31 ,,6 -www.tianya.cn/12911163 2012-01-20 ,,7 -cn.linkedin.com/in/cloga 2011-09-01 ,,8 -space.chinaz.com/Cloga 2014-05-29 ,,9 -i.youku.com/u/UODM5OTU... 
2013-01-27 ,,10 diff --git a/Mathematical-Modeling-2014/Project/myTest/split_sentence.py b/Mathematical-Modeling-2014/Project/myTest/split_sentence.py deleted file mode 100644 index 1750d7d..0000000 --- a/Mathematical-Modeling-2014/Project/myTest/split_sentence.py +++ /dev/null @@ -1,113 +0,0 @@ -#coding:utf-8 - -path = "C:\\Users\\Syndrome\\Desktop\\语料数据\\360W_字典\\".decode('utf8').encode('cp936') - -newPath = path + "dictionary_simple.txt" - -word = [] -word_num = [] - -####################################################################最简字典文件打开并进入内存 -myFile = open(newPath , 'r') - -line = myFile.readline().rstrip('\n') -i = 0 -while line: - word.append(line) - line = myFile.readline().rstrip('\n') - # if i == 2000: - # print word[i] - # i=i+1 - -myFile.close() -print len(word) -print "part1" - -####################################################################词典词语长度坐标文件进去内存 -newPath2 = path + "word_num.txt" -myFile = open(newPath2 , 'r') - -line = myFile.readline().rstrip('\n') - -while line: - word_num.append(int(line)) - line = myFile.readline().rstrip('\n') - -myFile.close() - -print len(word_num) - -print "part2" -####################################################################利用词典进行分词 - -TEST = "你是我一生的挚爱啊我的女神" - -TEST = "一位朴实美丽的渔家姑娘从红树林边的渔村闯入都市,经历了情感的波折和撞击演绎出复杂而\ -又多变的人生。故事发生在有着大面积红树林的小渔村和南海海滨一座新兴的小城里。渔家姑娘珍珠进\ -城打工,珍珠公司总经理大虎对她一见钟情,珍珠却不为所动。大虎企图强占珍珠,珍珠毅然回到红树\ -林。大虎在另两个干部子弟二虎和三虎的挑唆下,轮奸了珍珠。珍珠的意中人大同进行报复,欲杀大虎\ -的母亲、副市长林岚,却刺伤了检查官马叔。大虎又与二虎、三虎轮奸了女工小云,被当场抓获。林岚\ -救子心切,落入了刑侦科长金大川手里。马叔与牛晋顶住压力,使案件终于重审,三个虎被绳之以法。" - - -new_sent = [] - -T_len = len(TEST)/3 - -if T_len < 41: - s = T_len -else: - s = 40 - -while s > 0: - flag = 0 - # print word_num[s]-1 - # print word_num[s+1] - # print s - # print TEST[0:s*3] - for i in range(word_num[s]-1,word_num[s+1],-1): - #print word[i] - if TEST[0:s*3] == word[i]: - new_sent.append(word[i]) - print word[i] + "ZZZZZZZZZ" - flag = 1 - break - if flag == 1: - TEST = TEST[s*3:] - if len(TEST)/3 < 41: - s = len(TEST)/3 - else: - s = 40 - else: - s -= 1 - if s == 1: - new_sent.append(TEST[:s*3]) - print "TTTTT" + TEST[:s*3] + " " + str(s) - TEST = TEST[s*3:] - if len(TEST)/3 < 41: - s = len(TEST)/3 - else: - s = 40 - - -for item in new_sent: - print item + "\\", - - -print "\npart3" - - - - - - - - - - - - - - - - diff --git a/Mathematical-Modeling-2014/Project/myTest/test.py b/Mathematical-Modeling-2014/Project/myTest/test.py deleted file mode 100644 index 338a497..0000000 --- a/Mathematical-Modeling-2014/Project/myTest/test.py +++ /dev/null @@ -1,35 +0,0 @@ -#coding:utf-8 - -import os - -path = "C:\\Users\\Syndrome\\Desktop\\语料数据\\文本分类\\20_newsgroups\\".decode("utf-8").encode("cp936") - -filenamelist=os.listdir(path) -for item in filenamelist : - print item - filenamelist2 = os.listdir(path + "\\" + item) - for item2 in filenamelist2 : - print item2 - newPath = path + "\\" + item +"\\" + item2 - myFile = open (newPath) - - myFile.close() - -print "finish!" 
- - - - -# myFile = open(path) - -# line = myFile.readline() - -# while line : -# print line -# line = myFile.readline() - -# myFile.close() - - - - diff --git a/Mathematical-Modeling-2014/Project/myTest/test2.py b/Mathematical-Modeling-2014/Project/myTest/test2.py deleted file mode 100644 index 228eb48..0000000 --- a/Mathematical-Modeling-2014/Project/myTest/test2.py +++ /dev/null @@ -1,29 +0,0 @@ -#coding: utf-8 - - -###多线程 Multithreading - -import threading -TOTAL = 0 -MY_LOCK = threading.Lock() -class CountThread(threading.Thread): - def run(self): - global TOTAL - for i in range(100): - MY_LOCK.acquire() - TOTAL = TOTAL + 1 - MY_LOCK.release() - print('%s\n' % (TOTAL)) -a = CountThread() -b = CountThread() -a.start() -b.start() - - - -text1 = ["你是","我今生","唯一的挚爱","你是我今生唯一的挚爱啊啊啊啊"] -print text1.count("你是")+1 - - - - diff --git a/Mathematical-Modeling-2014/Project/myTest/test_dict_360.py b/Mathematical-Modeling-2014/Project/myTest/test_dict_360.py deleted file mode 100644 index 13d6c74..0000000 --- a/Mathematical-Modeling-2014/Project/myTest/test_dict_360.py +++ /dev/null @@ -1,38 +0,0 @@ -#coding:utf-8 - - -path = "C:\\Users\\Syndrome\\Desktop\\语料数据\\360W_字典\\dict_360.txt".decode('utf8').encode('cp936') - -f = open(path,"r") - -line = f.readline() -i = 0 -while line: - line = line.rstrip('\n') #去除字符\n - m = line.split('\t') #字符串分割,以\t - - print len(m[0])/3 - - for item in m: - print item # 后面跟 ',' 将忽略换行符 - # print(line, end = '')   # 在 Python 3中使用 - - line = f.readline() - i += 1 - if i == 1000: - break - -f.close() - - - -# 注释代码快捷键,ctrl+/ -# def str_len(str): -# try: -# row_l=len(str) -# utf8_l=len(str.encode('utf-8')) -# return (utf8_l-row_l)/2+row_l -# except: -# return None -# return None - diff --git a/Mathematical-Modeling-2014/Project/qiubai_spider.py b/Mathematical-Modeling-2014/Project/qiubai_spider.py deleted file mode 100644 index b88fb52..0000000 --- a/Mathematical-Modeling-2014/Project/qiubai_spider.py +++ /dev/null @@ -1,141 +0,0 @@ -# -*- coding: utf-8 -*- -#--------------------------------------- -# 程序:糗百爬虫 -# 版本:0.2 -# 作者:why -# 日期:2013-05-15 -# 语言:Python 2.7 -# 操作:输入quit退出阅读糗事百科 -# 功能:按下回车依次浏览今日的糗百热点 -# 更新:解决了命令提示行下乱码的问题 -#--------------------------------------- - -import urllib2 -import urllib -import re -import thread -import time - -# import sys -# reload(sys) -# sys.setdefaultencoding('utf-8') - -#----------- 处理页面上的各种标签 ----------- -class HTML_Tool: - # 用非 贪婪模式 匹配 \t 或者 \n 或者 空格 或者 超链接 或者 图片 - BgnCharToNoneRex = re.compile("(\t|\n| ||)") - - # 用非 贪婪模式 匹配 任意<>标签 - EndCharToNoneRex = re.compile("<.*?>") - - # 用非 贪婪模式 匹配 任意
<p>
标签 - BgnPartRex = re.compile("<p.*?>") - CharToNewLineRex = re.compile("(<br/>|</p>|<tr>|<div>|</div>
)") - CharToNextTabRex = re.compile("") - - # 将一些html的符号实体转变为原始符号 - replaceTab = [("<","<"),(">",">"),("&","&"),("&","\""),(" "," ")] - - def Replace_Char(self,x): - x = self.BgnCharToNoneRex.sub("",x) - x = self.BgnPartRex.sub("\n ",x) - x = self.CharToNewLineRex.sub("\n",x) - x = self.CharToNextTabRex.sub("\t",x) - x = self.EndCharToNoneRex.sub("",x) - - for t in self.replaceTab: - x = x.replace(t[0],t[1]) - return x -#----------- 处理页面上的各种标签 ----------- - - -#----------- 加载处理糗事百科 ----------- -class HTML_Model: - - def __init__(self): - self.page = 1 - self.pages = [] - self.myTool = HTML_Tool() - self.enable = False - - # 将所有的段子都扣出来,添加到列表中并且返回列表 - def GetPage(self,page): - myUrl = "http://m.qiushibaike.com/hot/page/" + page - myResponse = urllib2.urlopen(myUrl) - myPage = myResponse.read() - #encode的作用是将unicode编码转换成其他编码的字符串 - #decode的作用是将其他编码的字符串转换成unicode编码 - unicodePage = myPage.decode("utf-8") - - # 找出所有class="content"的div标记 - #re.S是任意匹配模式,也就是.可以匹配换行符 - myItems = re.findall('(.*?)',unicodePage,re.S) - items = [] - for item in myItems: - # item 中第一个是div的标题,也就是时间 - # item 中第二个是div的内容,也就是内容 - items.append([item[0].replace("\n",""),item[1].replace("\n","")]) - return items - - # 用于加载新的段子 - def LoadPage(self): - # 如果用户未输入quit则一直运行 - while self.enable: - # 如果pages数组中的内容小于2个 - if len(self.pages) < 2: - try: - # 获取新的页面中的段子们 - myPage = self.GetPage(str(self.page)) - self.page += 1 - self.pages.append(myPage) - except: - print '无法链接糗事百科!' - else: - time.sleep(1) - - def ShowPage(self,q,page): - for items in q: - print u'第%d页' % page , items[0] - print self.myTool.Replace_Char(items[1]) - myInput = raw_input() - if myInput == "quit": - self.enable = False - break - - def Start(self): - self.enable = True - page = self.page - - print u'正在加载中请稍候......' - - # 新建一个线程在后台加载段子并存储 - thread.start_new_thread(self.LoadPage,()) - - #----------- 加载处理糗事百科 ----------- - while self.enable: - # 如果self的page数组中存有元素 - if self.pages: - nowPage = self.pages[0] - del self.pages[0] - self.ShowPage(nowPage,page) - page += 1 - - -#----------- 程序的入口处 ----------- -print u""" ---------------------------------------- - 程序:糗百爬虫 - 版本:0.1 - 作者:why - 日期:2013-05-15 - 语言:Python 2.7 - 操作:输入quit退出阅读糗事百科 - 功能:按下回车依次浏览今日的糗百热点 ---------------------------------------- -""" - - -print u'请按下回车浏览今日的糗百内容:' -raw_input(' ') -myModel = HTML_Model() -myModel.Start() \ No newline at end of file diff --git a/Mathematical-Modeling-2014/Project/snownlp_test.py b/Mathematical-Modeling-2014/Project/snownlp_test.py deleted file mode 100644 index d4eb9f6..0000000 --- a/Mathematical-Modeling-2014/Project/snownlp_test.py +++ /dev/null @@ -1,65 +0,0 @@ -#coding:utf-8 -# import sys -# reload(sys) -# sys.setdefaultencoding( "utf-8" ) - - -from snownlp import SnowNLP - -str1 = u'这个东西真心很赞' -s = SnowNLP(str1) - -#print str1 -print str1.encode('utf-8') - -sw=s.words -print sw -#print sw.encode('utf-8') # [u'这个', u'东西', u'真心', - # u'很', u'赞'] - -print s.tags # [(u'这个', u'r'), (u'东西', u'n'), - # (u'真心', u'd'), (u'很', u'd'), - # (u'赞', u'Vg')] - -print s.sentiments # 0.9830157237610916 positive的概率 - -print s.pinyin # [u'zhe', u'ge', u'dong', u'xi', - # u'zhen', u'xin', u'hen', u'zan'] - - -s = SnowNLP(u'「繁體字」「繁體中文」的叫法在臺灣亦很常見。') - -s.han # u'「繁体字」「繁体中文」的叫法 - # 在台湾亦很常见。' - -text = u''' -自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。 -它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。 -自然语言处理是一门融语言学、计算机科学、数学于一体的科学。 -因此,这一领域的研究将涉及自然语言,即人们日常使用的语言, -所以它与语言学的研究有着密切的联系,但又有重要的区别。 -自然语言处理并不是一般地研究自然语言, -而在于研制能有效地实现自然语言通信的计算机系统, -特别是其中的软件系统。因而它是计算机科学的一部分。 -''' - -s = SnowNLP(text) - -s.keywords(3) # [u'语言', 
u'自然', u'计算机'] - -s.summary(3) # [u'自然语言处理是一门融语言学、计算机科学、 - # 数学于一体的科学', - # u'即人们日常使用的语言', - # u'自然语言处理是计算机科学领域与人工智能 - # 领域中的一个重要方向'] -s.sentences - -s = SnowNLP([[u'这篇', u'文章'], - [u'那篇', u'论文'], - [u'这个']]) -s.tf -s.idf -s.sim([u'文章'])# [0.3756070762985226, 0, 0] - - - diff --git a/Mathematical-Modeling-2014/Project/spider.py b/Mathematical-Modeling-2014/Project/spider.py deleted file mode 100644 index 2854f04..0000000 --- a/Mathematical-Modeling-2014/Project/spider.py +++ /dev/null @@ -1,50 +0,0 @@ -# coding=utf-8 - -#--------------------------------------- -# 程序:百度贴吧爬虫 -# 版本:0.1 -# 作者:why -# 日期:2013-05-14 -# 语言:Python 2.7 -# 操作:输入带分页的地址,去掉最后面的数字,设置一下起始页数和终点页数。 -# 功能:下载对应页码内的所有页面并存储为html文件。 -#--------------------------------------- - -import string, urllib - -#定义百度函数 -def baidu_tieba(url,begin_page,end_page): - for i in range(begin_page, end_page+1): - sName = string.zfill(i,5) + '.txt'#自动填充成六位的文件名 - print '正在下载第' + str(i) + '个网页,并将其存储为' + sName + '......' - f = open(sName,'w+') - m = urllib.urlopen(url + str(i)).read() - - #print m - - f.write(m) - f.close() - - -#-------- 在这里输入参数 ------------------ - -# 这个是山东大学的百度贴吧中某一个帖子的地址 -#bdurl = 'http://tieba.baidu.com/p/2296017831?pn=' -#iPostBegin = 1 -#iPostEnd = 10 - -#bdurl = str(raw_input(u'请输入贴吧的地址,去掉pn=后面的数字:\n')) -bdurl = 'http://tieba.baidu.com/p/2296017831?pn=' -#begin_page = int(raw_input(u'请输入开始的页数:\n')) -#end_page = int(raw_input(u'请输入终点的页数:\n')) -begin_page = 1 -end_page = 5 -#-------- 在这里输入参数 ------------------ - - -#调用 -baidu_tieba(bdurl,begin_page,end_page) - -response = urllib.urlopen('http://www.baidu.com/') -html = response.read() -print html diff --git a/Mathematical-Modeling-2014/Project/test1.py b/Mathematical-Modeling-2014/Project/test1.py deleted file mode 100644 index 2d2f883..0000000 --- a/Mathematical-Modeling-2014/Project/test1.py +++ /dev/null @@ -1,28 +0,0 @@ - -import re -import urllib - - -def getHtml(url): - page = urllib.urlopen(url) - html = page.read() - return html - -def getImg(html): - reg = r"src='+(.*?\.jpg)+' width" - imgre = re.compile(reg) - imgList = re.findall(imgre,html) - x = 0 - for imgurl in imgList: - print imgurl - #urllib.urlretrieve(imgurl,'%s.jpg' % x) - x+=1 - - -#a = raw_input() - -html = getHtml("http://tieba.baidu.com/p/2844418574?pn=2") -getImg(html) - - - diff --git a/Mathematical-Modeling-2014/Project/test_test.py b/Mathematical-Modeling-2014/Project/test_test.py deleted file mode 100644 index 13d6f2c..0000000 --- a/Mathematical-Modeling-2014/Project/test_test.py +++ /dev/null @@ -1,6 +0,0 @@ -#coding:utf-8 -s=u"中文" -b=u"我" -print b.encode("gb2312") -print s.encode("gb2312") - diff --git a/Mathematical-Modeling-2014/Project/wordcloud.py b/Mathematical-Modeling-2014/Project/wordcloud.py deleted file mode 100644 index f06fd07..0000000 --- a/Mathematical-Modeling-2014/Project/wordcloud.py +++ /dev/null @@ -1,12 +0,0 @@ -#test of pytagcloud - -from pytagcloud import create_tag_image, make_tags -from pytagcloud.lang.counter import get_tag_counts - -YOUR_TEXT = "A tag cloud is a visual representation for text data, typically\ -used to depict keyword metadata on websites, or to visualize free form text." 
- -tags = make_tags(get_tag_counts(YOUR_TEXT), maxsize=120) - -create_tag_image(tags, 'cloud_large.png', size=(900, 600), fontname='Lobster') - diff --git a/Mathematical-Modeling-2014/car.txt b/Mathematical-Modeling-2014/car.txt deleted file mode 100644 index bb9fd8a..0000000 --- a/Mathematical-Modeling-2014/car.txt +++ /dev/null @@ -1,37 +0,0 @@ -4490 1780 -4466 1705 -4531 1817 -4670 1780 -4747 1820 -4500 1755 -4880 1800 -4865 1805 -4687 1700 -4544 1760 -4608 1743 -4350 1735 -4400 1695 -4789 1765 -5015 1880 -4600 1800 -4930 1795 -4945 1845 -4603 1780 -4855 1780 -5035 1855 -4480 1840 -4580 1725 -4420 1690 -6831 1980 -3745 1615 -4194 1680 -3763 1615 -3460 1618 -4310 1695 -4270 1695 -4245 1680 -4212 1762 -3588 1563 -3998 1640 -4230 1690 -4135 1755 diff --git a/Mathematical-Modeling-2014/car45.txt b/Mathematical-Modeling-2014/car45.txt deleted file mode 100644 index 8b77d39..0000000 --- a/Mathematical-Modeling-2014/car45.txt +++ /dev/null @@ -1,45 +0,0 @@ -4610 1826 1763 4 2 0 3 1 -5015 1880 1475 2 3 0 4 2 -4310 1695 1480 12 6 5 10 7 -4747 1820 1440 15 8 4 9 6 -3460 1618 1465 12 8 7 21 6 -4490 1780 1405 10 12 14 9 13 -4230 1690 1550 7 0 2 5 7 -4270 1695 1480 5 3 12 5 4 -4480 1840 1500 4 0 6 8 5 -4135 1755 1605 6 0 0 3 2 -4600 1800 1475 12 3 5 0 0 -4574 1704 1845 6 4 2 0 0 -4500 1755 1450 15 9 5 7 6 -4420 1690 1590 7 4 3 4 5 -4930 1795 1475 4 2 3 1 2 -4350 1735 1470 8 9 4 2 5 -4945 1695 1970 3 0 0 0 2 -4400 1695 1470 13 7 4 8 5 -4945 1845 1480 4 3 4 1 2 -3588 1563 1533 3 5 15 5 8 -4466 1705 1410 4 5 7 2 0 -4531 1817 1421 4 2 0 4 3 -4880 1800 1450 5 3 2 6 5 -5160 1895 1930 7 2 4 3 2 -4800 1770 1880 4 3 8 2 6 -4590 1766 1767 0 1 5 7 8 -4194 1680 1440 3 4 2 8 7 -4865 1805 1450 12 8 4 2 6 -3763 1615 1440 3 5 14 4 7 -3998 1640 1535 0 3 8 6 9 -4285 1765 1715 0 6 4 12 8 -4608 1743 1465 15 12 4 6 5 -4789 1765 1470 10 8 6 7 0 -4687 1700 1450 0 2 12 6 5 -4580 1725 1500 9 4 3 7 5 -4603 1780 1480 5 6 8 0 9 -3820 1495 1860 0 4 20 8 5 -4212 1762 1531 8 7 10 3 5 -4245 1680 1500 5 7 8 4 9 -3745 1615 1385 0 0 15 8 4 -4855 1780 1480 9 5 0 5 6 -4544 1760 1464 8 7 4 5 5 -5035 1855 1485 12 6 0 4 3 -6831 1980 1478 2 0 0 1 1 -4670 1780 1435 15 13 9 10 6 diff --git a/Mathematical-Modeling-2014/test.py b/Mathematical-Modeling-2014/test.py deleted file mode 100644 index aea4327..0000000 --- a/Mathematical-Modeling-2014/test.py +++ /dev/null @@ -1,37 +0,0 @@ -#!/usr/bin/python -# -*- coding: utf-8 -*- -""" -Function: -【教程】把Sublime Text 2用作Python的IDE去实现Python的开发 - -http://www.crifan.com/use_sublime_text_2_as_python_ide - -Author: Crifan Li -Version: 2013-02-01 -Contact: admin at crifan dot com -""" - -def sublimeText2IdeDemo(): - """ - Demo how to use sublime text 2 as Python IDE - also try to support: - input parameter - autocomplete - """ - print "Demo print in Sublime Text 2" - inputVal = 100 - #raw_input("Now in sublime text 2, please input parameter:") - print "Your inputed parameter is ",inputVal - -if __name__ == "__main__": - sublimeText2IdeDemo() - - - - - - - - - - diff --git a/Mathematical-Modeling-2014/test2.py b/Mathematical-Modeling-2014/test2.py deleted file mode 100644 index ccd1f65..0000000 --- a/Mathematical-Modeling-2014/test2.py +++ /dev/null @@ -1,40 +0,0 @@ -#!/usr/bin/python -#coding=utf-8 -# 数学建模:单辆矫运车装车方案,前四问 -# 输入为:矫运车长度,宽度 -# 输出为:装车方案 - -# 高度超过1.7米的乘用车只能装在1-1、1-2型下层 -# 纵向及横向的安全车距均至少为0.1米 - -Put = [0,1,0,1,0,0] -#长 上宽 下宽 -Truck = [19.1,24.4,19.1] -#长 宽 高 -Car = [4.71,3.715,4.73] - - -for i in range(0,6): - if Put[i] == 0: - for j in range(0,int(Truck[i/2]/Car[0])+2): - for k in 
range(0,int(Truck[i/2]/Car[1])+2): - if j*Car[0]+k*Car[1] > Truck[i/2]: - if k > 0 : - print(i,j,k-1) - break - else: - for j in range(0,int(Truck[i/2]/Car[0])+2): - for k in range(0,int(Truck[i/2]/Car[1])+2): - for l in range(0,int(Truck[i/2]/Car[2])+2): - if j*Car[0]+k*Car[1]+l*Car[2] > Truck[i/2]: - if l > 0 : - print(i,j,k,l-1) - break - - - - - - - - diff --git a/Mathematical-Modeling-2014/test3.py b/Mathematical-Modeling-2014/test3.py deleted file mode 100644 index 71bd68c..0000000 --- a/Mathematical-Modeling-2014/test3.py +++ /dev/null @@ -1,35 +0,0 @@ -#!/usr/bin/python -#coding=utf-8 - -import time -import numpy as np -import pylab as pl -from sklearn.cluster import KMeans -from sklearn.metrics.pairwise import euclidean_distances -from sklearn.datasets.samples_generator import make_blobs - -np.random.seed(0) -centers = [[1,1], [-1,-1], [1, -1]] -k = len(centers) -x , labels = make_blobs(n_samples=3000, centers=centers, cluster_std=.7) - -kmeans = KMeans(init='k-means++', n_clusters=3, n_init = 10) -t0 = time.time() -kmeans.fit(x) -t_end = time.time() - t0 - -colors = ['r', 'b', 'g'] -for k , col in zip( range(k) , colors): - members = (kmeans.labels_ == k ) - pl.plot( x[members, 0] , x[members,1] , 'w', markerfacecolor=col, marker='.') - pl.plot(kmeans.cluster_centers_[k,0], kmeans.cluster_centers_[k,1], 'o', markerfacecolor=col,\ - markeredgecolor='k', markersize=10) -pl.show() - - - - - - - - diff --git a/Mathematical-Modeling-2014/test4.py b/Mathematical-Modeling-2014/test4.py deleted file mode 100644 index 47a58c3..0000000 --- a/Mathematical-Modeling-2014/test4.py +++ /dev/null @@ -1,57 +0,0 @@ -#!/usr/bin/python -#coding=utf-8 - -import string - -datafile = open("car.txt") - - -n = 37 -m = 2 -mat = [[0]*m for i in range(n)] - -i = 0 -car = datafile.readline() -while car: - car_data = car.strip('\n').split(" ") - j = 0 - for items in car_data: - #字符串转换成数字 - data1 = string.atoi(items) - mat[i][j] = data1 - j = j + 1 - #print data1 - i = i + 1 - car = datafile.readline() - - - - -for i in range(n): - for j in range(m): - print mat[i][j], - print - - - -from sklearn.cluster import KMeans - -kmeans = KMeans(init='k-means++', n_clusters = 4, n_init = 10) - -kmeans.fit(mat) - -result = kmeans.predict(mat) - -print result - - - - - - - - - - - - diff --git a/README.md b/README.md index ef79281..7df7f62 100644 --- a/README.md +++ b/README.md @@ -10,8 +10,7 @@ This is a `Chinese tutorial` which is translated from [DeepLearning 0.1 document 这是一个翻译自[深度学习0.1文档](http://deeplearning.net/tutorial/contents.html)的`中文教程`。在这个教程里面所有的算法和模型都是通过Pyhton和[Theano](http://deeplearning.net/software/theano/index.html)实现的。Theano是一个著名的第三方库,允许程序员使用GPU或者CPU去运行他的Python代码。 - -##内容/Contents +## 内容/Contents * [入门(Getting Started)](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/1_Getting_Started_入门.md) * [使用逻辑回归进行MNIST分类(Classifying MNIST digits using Logistic Regression)](https://github.com/Syndrome777/DeepLearningTutorial/blob/master/2_Classifying_MNIST_using_LR_逻辑回归进行MNIST分类.md) @@ -27,10 +26,10 @@ This is a `Chinese tutorial` which is translated from [DeepLearning 0.1 document * Miscellaneous -##版权/Copyright -####作者/Author +## 版权/Copyright +#### 作者/Author [Theano Development Team](http://deeplearning.net/tutorial/LICENSE.html), LISA lab, University of Montreal -####翻译者/Translator +#### 翻译者/Translator [Lifeng Hua](https://github.com/Syndrome777), Zhejiang University diff --git a/images/.DS_Store b/images/.DS_Store new file mode 100644 index 
0000000..bf22167 Binary files /dev/null and b/images/.DS_Store differ