diff --git a/.gitignore b/.gitignore index 4e4b44ae..c3f21a98 100644 --- a/.gitignore +++ b/.gitignore @@ -1,3 +1,4 @@ _site .DS_Store *.swp +.ipynb_checkpoints diff --git a/CNAME b/CNAME new file mode 100644 index 00000000..44d8135d --- /dev/null +++ b/CNAME @@ -0,0 +1 @@ +aikorea.org diff --git a/Readme.md b/Readme.md index dfc3cdf7..01875a65 100644 --- a/Readme.md +++ b/Readme.md @@ -1,3 +1,18 @@ -Notes and assignments for Stanford CS class [CS231n: Convolutional Neural Networks for Visual Recognition](http://vision.stanford.edu/teaching/cs231n/) +English to Korean translation project for the notes and assignments for Stanford CS class [CS231n: Convolutional Neural Networks for Visual Recognition](http://vision.stanford.edu/teaching/cs231n/). +## How to Participate + +1. Fork this repository +2. Translate the assigned file (markdown, ipython-notebook, etc.) into Korean - Please refer to the [glossary](http://aikorea.org/cs231n/glossary) +3. Send a PR + +## Local Development Instructions + +To view the rendered site in your browser, + +1. Install Jekyll - follow the instructions [here](https://jekyllrb.com/docs/installation/) +2. Assuming that you have already forked this repo, `git clone https://github.com/yourUserName/cs231n.git` +3. `cd cs231n` +4. `jekyll serve` +5. View the website at http://127.0.0.1:4000/cs231n/ diff --git a/_config.yml b/_config.yml index 79d50b82..133b1587 100644 --- a/_config.yml +++ b/_config.yml @@ -1,12 +1,12 @@ # Site settings title: CS231n Convolutional Neural Networks for Visual Recognition -email: karpathy@cs.stanford.edu -description: "Course materials and notes for Stanford class CS231n: Convolutional Neural Networks for Visual Recognition." -baseurl: "" -url: "http://cs231n.github.io" -twitter_username: cs231n -github_username: cs231n +email: team.aikorea@gmail.com +description: "스탠포드 CS231n: Convolutional Neural Networks for Visual Recognition 수업자료 번역사이트" +baseurl: "/cs231n" +url: "http://aikorea.org" +twitter_username: kjw6612 +github_username: aikorea # Build settings -markdown: redcarpet +markdown: kramdown permalink: pretty diff --git a/_includes/head.html b/_includes/head.html index 7222af0e..18a47c31 100644 --- a/_includes/head.html +++ b/_includes/head.html @@ -23,5 +23,5 @@ ga('send', 'pageview'); - + diff --git a/acknowledgement.md b/acknowledgement.md new file mode 100644 index 00000000..56053111 --- /dev/null +++ b/acknowledgement.md @@ -0,0 +1,9 @@ +--- +layout: page +mathjax: true +permalink: /acknowledgement/ +--- + +*(프로젝트 완료 시까지 임시 파일입니다)* + +다들 바쁘신 와중에 틈틈이 시간내어 번역 프로젝트에 참여해 주신 myungsub, sandrokim, ygchoi, alexseong, ckyun777, dolai, donghun, gnujoow, j-min, jaywhang, jazzsaxmafia, jihoonl, jslee, junghojin, juyong, kjw0612, maybe, okmin, rollis0825, salopge, sanghun, sora, stats2ml, sungjunhong 님께 이 자리를 빌려 감사 말씀을 드립니다. diff --git a/assets/aws-signin.png b/assets/aws-signin.png index 30413cbf..023d223d 100644 Binary files a/assets/aws-signin.png and b/assets/aws-signin.png differ diff --git a/assets/aws-signup.png b/assets/aws-signup.png index 0fd58901..430d9e64 100644 Binary files a/assets/aws-signup.png and b/assets/aws-signup.png differ diff --git a/assignment1.md b/assignment1.md index 599e5a54..c3d52595 100644 --- a/assignment1.md +++ b/assignment1.md @@ -30,7 +30,7 @@ for the project. If you choose not to use a virtual environment, it is up to you to make sure that all dependencies for the code are installed on your machine.
To set up a virtual environment, run the following: -```bash +~~~bash cd assignment1 sudo pip install virtualenv # This may already be installed virtualenv .env # Create a virtual environment @@ -38,16 +38,16 @@ source .env/bin/activate # Activate the virtual environment pip install -r requirements.txt # Install dependencies # Work on the assignment for a while ... deactivate # Exit the virtual environment -``` +~~~ **Download data:** Once you have the starter code, you will need to download the CIFAR-10 dataset. Run the following from the `assignment1` directory: -```bash +~~~bash cd cs231n/datasets ./get_datasets.sh -``` +~~~ **Start IPython:** After you have the CIFAR-10 data, you should start the IPython notebook server from the diff --git a/assignment2.md b/assignment2.md index f35b2375..9a8750b8 100644 --- a/assignment2.md +++ b/assignment2.md @@ -26,7 +26,7 @@ for the project. If you choose not to use a virtual environment, it is up to you to make sure that all dependencies for the code are installed on your machine. To set up a virtual environment, run the following: -```bash +~~~bash cd assignment2 sudo pip install virtualenv # This may already be installed virtualenv .env # Create a virtual environment @@ -34,7 +34,7 @@ source .env/bin/activate # Activate the virtual environment pip install -r requirements.txt # Install dependencies # Work on the assignment for a while ... deactivate # Exit the virtual environment -``` +~~~ You can reuse the virtual environment that you created for the first assignment, but you will need to run `pip install -r requirements.txt` after activating it @@ -44,16 +44,16 @@ to install additional dependencies required by this assignment. Once you have the starter code, you will need to download the CIFAR-10 dataset. Run the following from the `assignment2` directory: -```bash +~~~bash cd cs231n/datasets ./get_datasets.sh -``` +~~~ **Compile the Cython extension:** Convolutional Neural Networks require a very efficient implementation. We have implemented much of the functionality using [Cython](http://cython.org/); you will need to compile the Cython extension before you can run the code. From the `cs231n` directory, run the following command: -```bash +~~~bash python setup.py build_ext --inplace -``` +~~~ **Start IPython:** After you have the CIFAR-10 data, you should start the IPython notebook server from the diff --git a/assignment3.md b/assignment3.md index 52dddd32..caa2f08a 100644 --- a/assignment3.md +++ b/assignment3.md @@ -28,7 +28,7 @@ for the project. If you choose not to use a virtual environment, it is up to you to make sure that all dependencies for the code are installed on your machine. To set up a virtual environment, run the following: -```bash +~~~bash cd assignment3 sudo pip install virtualenv # This may already be installed virtualenv .env # Create a virtual environment @@ -36,7 +36,7 @@ source .env/bin/activate # Activate the virtual environment pip install -r requirements.txt # Install dependencies # Work on the assignment for a while ... deactivate # Exit the virtual environment -``` +~~~ You can reuse the virtual environment that you created for the first or second assignment, but you will need to run `pip install -r requirements.txt` after @@ -52,18 +52,18 @@ Run the following from the `assignment3` directory: NOTE: After downloading and unpacking, the data and pretrained models will take about 900MB of disk space.
-```bash +~~~bash cd cs231n/datasets ./get_datasets.sh ./get_tiny_imagenet_splits.sh ./get_pretrained_models.sh -``` +~~~ **Compile the Cython extension:** Convolutional Neural Networks require a very efficient implementation. We have implemented much of the functionality using [Cython](http://cython.org/); you will need to compile the Cython extension before you can run the code. From the `cs231n` directory, run the following command: -```bash +~~~bash python setup.py build_ext --inplace -``` +~~~ **Start IPython:** After you have downloaded the data and compiled the Cython extensions, diff --git a/assignments2016/assignment1.md b/assignments2016/assignment1.md index 0bc7efc4..32e11f88 100644 --- a/assignments2016/assignment1.md +++ b/assignments2016/assignment1.md @@ -3,88 +3,84 @@ layout: page mathjax: true permalink: /assignments2016/assignment1/ --- +이번 숙제에서 여러분은 k-Nearest Neighbor 또는 SVM/Softmax 분류기에 기반한 간단한 이미지 분류 파이프라인을 구성하는 방법을 연습하게 됩니다. 이번 숙제의 목표는 다음과 같습니다. -In this assignment you will practice putting together a simple image classification pipeline, based on the k-Nearest Neighbor or the SVM/Softmax classifier. The goals of this assignment are as follows: +- **이미지 분류 파이프라인**의 기초와 데이터 기반 접근법(학습/예측 단계)에 대해 이해합니다. +- 학습/검증/테스트 분할과 **hyperparameter 튜닝**을 위해 검증 데이터를 사용하는 것에 관해 이해합니다. +- 효율적인 **벡터화**된 numpy 코드를 능숙하게 작성하는 능력을 기릅니다. +- k-Nearest Neighbor (**kNN**) 분류기를 구현하고 적용해봅니다. +- Multiclass Support Vector Machine (**SVM**) 분류기를 구현하고 적용해봅니다. +- **Softmax** 분류기를 구현하고 적용해봅니다. +- **Two layer neural network** 분류기를 구현하고 적용해봅니다. +- 위 분류기들의 장단점과 차이에 대해 이해합니다. +- 성능 향상을 위해 단순히 이미지 픽셀(화소)보다 더 고차원의 표현(**higher-level representations**)을 사용하는 이유에 관하여 이해합니다. (색상 히스토그램, 그라데이션의 히스토그램(HOG) 특징) -- understand the basic **Image Classification pipeline** and the data-driven approach (train/predict stages) -- understand the train/val/test **splits** and the use of validation data for **hyperparameter tuning**. -- develop proficiency in writing efficient **vectorized** code with numpy -- implement and apply a k-Nearest Neighbor (**kNN**) classifier -- implement and apply a Multiclass Support Vector Machine (**SVM**) classifier -- implement and apply a **Softmax** classifier -- implement and apply a **Two layer neural network** classifier -- understand the differences and tradeoffs between these classifiers -- get a basic understanding of performance improvements from using **higher-level representations** than raw pixels (e.g. color histograms, Histogram of Gradient (HOG) features) +## 설치 +여러분은 다음 두 가지 방법으로 숙제를 시작할 수 있습니다: Terminal.com을 이용한 가상 환경 또는 로컬 환경. -## Setup -You can work on the assignment in one of two ways: locally on your own machine, or on a virtual machine through Terminal.com. +### Terminal에서의 가상 환경 +Terminal에는 우리의 수업을 위한 서브도메인이 만들어져 있습니다. [www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com) 계정을 등록하세요. 이번 숙제에 대한 스냅샷은 [여기](https://www.stanfordterminalcloud.com/snapshot/49f5a1ea15dc424aec19155b3398784d57c55045435315ce4f8b96b62819ef65)에서 찾아볼 수 있습니다. 만약 수업에 등록되었다면, TA(자세한 내용은 Piazza를 참고하세요)에게 이 수업을 위한 Terminal 크레딧을 요청할 수 있습니다. 처음 스냅샷을 실행시키면, 수업을 위한 모든 것이 설치되어 있어서 바로 숙제를 시작할 수 있습니다. [여기](/terminal-tutorial)에 Terminal을 위한 간단한 튜토리얼을 작성해 뒀습니다. +### 로컬 환경 +[여기](http://vision.stanford.edu/teaching/cs231n/winter1516_assignment1.zip)에서 압축파일을 다운받고 다음을 따르세요. -### Working in the cloud on Terminal -Terminal has created a separate subdomain to serve our class, [www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com). Register your account there.
The Assignment 1 snapshot can then be found [here](https://www.stanfordterminalcloud.com/snapshot/49f5a1ea15dc424aec19155b3398784d57c55045435315ce4f8b96b62819ef65). If you're registered in the class you can contact the TA (see Piazza for more information) to request Terminal credits for use on the assignment. Once you boot up the snapshot everything will be installed for you, and you'll be ready to start on your assignment right away. We've written a small tutorial on Terminal [here](/terminal-tutorial). +**[선택 1] Use Anaconda:** +과학, 수학, 공학, 데이터 분석을 위한 대부분의 주요 패키지들을 담고있는 [Anaconda](https://www.continuum.io/downloads)를 사용하여 설치하는 것이 흔히 사용하는 방법입니다. 설치가 다 되면 모든 요구사항(dependency)을 넘기고 바로 숙제를 시작해도 좋습니다. -### Working locally -Get the code as a zip file [here](http://vision.stanford.edu/teaching/cs231n/winter1516_assignment1.zip). As for the dependencies: +**[선택 2] 수동 설치, virtual environment:** +만약 Anaconda 대신 좀 더 일반적이면서 까다로운 방법을 택하고 싶다면 이번 과제를 위한 [virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/)를 만들 수 있습니다. 만약 virtual environment를 사용하지 않는다면 모든 코드가 컴퓨터에 전역적으로 종속되게 설치됩니다. Virtual environment의 설정은 아래를 참조하세요. -**[Option 1] Use Anaconda:** -The preferred approach for installing all the assignment dependencies is to use [Anaconda](https://www.continuum.io/downloads), which is a Python distribution that includes many of the most popular Python packages for science, math, engineering and data analysis. Once you install it you can skip all mentions of requirements and you're ready to go directly to working on the assignment. - -**[Option 2] Manual install, virtual environment:** -If you'd like to (instead of Anaconda) go with a more manual and risky installation route you will likely want to create a [virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/) for the project. If you choose not to use a virtual environment, it is up to you to make sure that all dependencies for the code are installed globally on your machine. To set up a virtual environment, run the following: - -```bash +~~~bash cd assignment1 -sudo pip install virtualenv # This may already be installed -virtualenv .env # Create a virtual environment -source .env/bin/activate # Activate the virtual environment -pip install -r requirements.txt # Install dependencies +sudo pip install virtualenv # 아마 먼저 설치되어 있을 겁니다. +virtualenv .env # virtual environment를 만듭니다. +source .env/bin/activate # virtual environment를 활성화 합니다. +pip install -r requirements.txt # dependencies 설치합니다. # Work on the assignment for a while ... -deactivate # Exit the virtual environment -``` +deactivate # virtual environment를 종료합니다. +~~~ -**Download data:** -Once you have the starter code, you will need to download the CIFAR-10 dataset. -Run the following from the `assignment1` directory: +**데이터셋 다운로드:** +먼저 숙제를 시작하기전에 CIFAR-10 dataset를 다운로드해야 합니다. 아래 코드를 `assignment1` 폴더에서 실행하세요: -```bash +~~~bash cd cs231n/datasets ./get_datasets.sh -``` +~~~ + +**IPython 시작:** +CIFAR-10 data를 받았다면, `assignment1` 폴더의 IPython notebook server를 시작할 수 있습니다. IPython에 친숙하지 않다면 작성해둔 [IPython tutorial](/ipython-tutorial)를 읽어보는 것을 권장합니다. -**Start IPython:** -After you have the CIFAR-10 data, you should start the IPython notebook server from the -`assignment1` directory. If you are unfamiliar with IPython, you should read our -[IPython tutorial](/ipython-tutorial). +**NOTE:** OSX에서 virtual environment를 실행하면, matplotlib 에러가 날 수 있습니다([이 문제에 관한 이슈](http://matplotlib.org/faq/virtualenv_faq.html)). 
IPython 서버를 `assignment1` 폴더의 `start_ipython_osx.sh`로 실행하면 이 문제를 피해갈 수 있습니다; 이 스크립트는 virtual environment가 `.env`라는 이름이라고 가정하고 작성되었습니다. -**NOTE:** If you are working in a virtual environment on OSX, you may encounter -errors with matplotlib due to the [issues described here](http://matplotlib.org/faq/virtualenv_faq.html). You can work around this issue by starting the IPython server using the `start_ipython_osx.sh` script from the `assignment1` directory; the script assumes that your virtual environment is named `.env`. +### 과제 제출: +로컬 환경이나 Terminal에 상관없이, 이번 숙제를 마쳤다면 `collectSubmission.sh` 스크립트를 실행하세요. 이 스크립트는 `assignment1.zip` 파일을 만듭니다. 이 파일을 [the coursework](https://coursework.stanford.edu/portal/site/W16-CS-231N-01/)에 업로드하세요. -### Submitting your work: -Whether you work on the assignment locally or using Terminal, once you are done -working run the `collectSubmission.sh` script; this will produce a file called -`assignment1.zip`. Upload this file to your dropbox on -[the coursework](https://coursework.stanford.edu/portal/site/W16-CS-231N-01/) -page for the course. -### Q1: k-Nearest Neighbor classifier (20 points) +### Q1: k-Nearest Neighbor 분류기 (20 points) -The IPython Notebook **knn.ipynb** will walk you through implementing the kNN classifier. +IPython Notebook **knn.ipynb**이 kNN 분류기를 구현하는 방법을 안내합니다. -### Q2: Training a Support Vector Machine (25 points) +### Q2: Support Vector Machine 훈련 (25 points) -The IPython Notebook **svm.ipynb** will walk you through implementing the SVM classifier. +IPython Notebook **svm.ipynb**이 SVM 분류기를 구현하는 방법을 안내합니다. -### Q3: Implement a Softmax classifier (20 points) +### Q3: Softmax 분류기 구현하기 (20 points) -The IPython Notebook **softmax.ipynb** will walk you through implementing the Softmax classifier. ### Q4: Two-Layer Neural Network (25 points) -The IPython Notebook **two\_layer\_net.ipynb** will walk you through the implementation of a two-layer neural network classifier. -### Q5: Higher Level Representations: Image Features (10 points) +IPython Notebook **two_layer_net.ipynb**이 two-layer neural network 분류기를 구현하는 방법을 안내합니다. -The IPython Notebook **features.ipynb** will walk you through this exercise, in which you will examine the improvements gained by using higher-level representations as opposed to using raw pixel values. +### Q5: 이미지 특징을 고차원으로 표현하기 (10 points) -### Q6: Cool Bonus: Do something extra! (+10 points) +IPython Notebook **features.ipynb**을 사용하여 단순한 이미지 픽셀(화소)보다 고차원의 표현이 효과적인지 검사해 볼 것입니다. -Implement, investigate or analyze something extra surrounding the topics in this assignment, and using the code you developed. For example, is there some other interesting question we could have asked? Is there any insightful visualization you can plot? Or anything fun to look at? Or maybe you can experiment with a spin on the loss function? If you try out something cool we'll give you up to 10 extra points and may feature your results in the lecture. +### Q6: 추가 과제: 뭔가 더 해보세요! (+10 points) +이번 과제와 관련된 다른 것들을 작성한 코드로 분석하고 연구해보세요. 예를 들어, 질문하고 싶은 흥미로운 질문이 있나요? 통찰력 있는 시각화를 작성할 수 있나요? 아니면 다른 재미있는 살펴볼 거리가 있나요? 또는 손실 함수(loss function)를 조금씩 변형해가며 실험해볼 수도 있을 것입니다. 만약 다른 멋있는 것을 시도해본다면 추가로 10 points를 얻을 수 있고 수행한 결과가 강의에 실릴 수 있습니다. + +--- +

+번역: 배지운 (MaybeS)

diff --git a/assignments2016/assignment1/.gitignore b/assignments2016/assignment1/.gitignore new file mode 100644 index 00000000..b0611d38 --- /dev/null +++ b/assignments2016/assignment1/.gitignore @@ -0,0 +1,3 @@ +*.swp +*.pyc +.env/* diff --git a/assignments2016/assignment1/README.md b/assignments2016/assignment1/README.md new file mode 100644 index 00000000..6aaea415 --- /dev/null +++ b/assignments2016/assignment1/README.md @@ -0,0 +1 @@ +Details about this assignment can be found [on the course webpage](http://cs231n.github.io/), under Assignment #1 of Winter 2016. diff --git a/assignments2016/assignment1/collectSubmission.sh b/assignments2016/assignment1/collectSubmission.sh new file mode 100644 index 00000000..13219057 --- /dev/null +++ b/assignments2016/assignment1/collectSubmission.sh @@ -0,0 +1,2 @@ +rm -f assignment1.zip +zip -r assignment1.zip . -x "*.git*" "*cs231n/datasets*" "*.ipynb_checkpoints*" "*README.md" "*collectSubmission.sh" "*requirements.txt" diff --git a/assignments2016/assignment1/cs231n/__init__.py b/assignments2016/assignment1/cs231n/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/assignments2016/assignment1/cs231n/classifiers/__init__.py b/assignments2016/assignment1/cs231n/classifiers/__init__.py new file mode 100644 index 00000000..cef2b580 --- /dev/null +++ b/assignments2016/assignment1/cs231n/classifiers/__init__.py @@ -0,0 +1,2 @@ +from cs231n.classifiers.k_nearest_neighbor import * +from cs231n.classifiers.linear_classifier import * diff --git a/assignments2016/assignment1/cs231n/classifiers/k_nearest_neighbor.py b/assignments2016/assignment1/cs231n/classifiers/k_nearest_neighbor.py new file mode 100644 index 00000000..7b592485 --- /dev/null +++ b/assignments2016/assignment1/cs231n/classifiers/k_nearest_neighbor.py @@ -0,0 +1,170 @@ +import numpy as np + +class KNearestNeighbor(object): + """ a kNN classifier with L2 distance """ + + def __init__(self): + pass + + def train(self, X, y): + """ + Train the classifier. For k-nearest neighbors this is just + memorizing the training data. + + Inputs: + - X: A numpy array of shape (num_train, D) containing the training data + consisting of num_train samples each of dimension D. + - y: A numpy array of shape (N,) containing the training labels, where + y[i] is the label for X[i]. + """ + self.X_train = X + self.y_train = y + + def predict(self, X, k=1, num_loops=0): + """ + Predict labels for test data using this classifier. + + Inputs: + - X: A numpy array of shape (num_test, D) containing test data consisting + of num_test samples each of dimension D. + - k: The number of nearest neighbors that vote for the predicted labels. + - num_loops: Determines which implementation to use to compute distances + between training points and testing points. + + Returns: + - y: A numpy array of shape (num_test,) containing predicted labels for the + test data, where y[i] is the predicted label for the test point X[i]. + """ + if num_loops == 0: + dists = self.compute_distances_no_loops(X) + elif num_loops == 1: + dists = self.compute_distances_one_loop(X) + elif num_loops == 2: + dists = self.compute_distances_two_loops(X) + else: + raise ValueError('Invalid value %d for num_loops' % num_loops) + + return self.predict_labels(dists, k=k) + + def compute_distances_two_loops(self, X): + """ + Compute the distance between each test point in X and each training point + in self.X_train using a nested loop over both the training data and the + test data. 
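+    (Concretely, the Euclidean distance computed here and in the variants
+    below is dists[i, j] = sqrt(sum_d (X[i, d] - self.X_train[j, d]) ** 2).)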
+ + Inputs: + - X: A numpy array of shape (num_test, D) containing test data. + + Returns: + - dists: A numpy array of shape (num_test, num_train) where dists[i, j] + is the Euclidean distance between the ith test point and the jth training + point. + """ + num_test = X.shape[0] + num_train = self.X_train.shape[0] + dists = np.zeros((num_test, num_train)) + for i in xrange(num_test): + for j in xrange(num_train): + ##################################################################### + # TODO: # + # Compute the l2 distance between the ith test point and the jth # + # training point, and store the result in dists[i, j]. You should # + # not use a loop over dimension. # + ##################################################################### + pass + ##################################################################### + # END OF YOUR CODE # + ##################################################################### + return dists + + def compute_distances_one_loop(self, X): + """ + Compute the distance between each test point in X and each training point + in self.X_train using a single loop over the test data. + + Input / Output: Same as compute_distances_two_loops + """ + num_test = X.shape[0] + num_train = self.X_train.shape[0] + dists = np.zeros((num_test, num_train)) + for i in xrange(num_test): + ####################################################################### + # TODO: # + # Compute the l2 distance between the ith test point and all training # + # points, and store the result in dists[i, :]. # + ####################################################################### + pass + ####################################################################### + # END OF YOUR CODE # + ####################################################################### + return dists + + def compute_distances_no_loops(self, X): + """ + Compute the distance between each test point in X and each training point + in self.X_train using no explicit loops. + + Input / Output: Same as compute_distances_two_loops + """ + num_test = X.shape[0] + num_train = self.X_train.shape[0] + dists = np.zeros((num_test, num_train)) + ######################################################################### + # TODO: # + # Compute the l2 distance between all test points and all training # + # points without using any explicit loops, and store the result in # + # dists. # + # # + # You should implement this function using only basic array operations; # + # in particular you should not use functions from scipy. # + # # + # HINT: Try to formulate the l2 distance using matrix multiplication # + # and two broadcast sums. # + ######################################################################### + pass + ######################################################################### + # END OF YOUR CODE # + ######################################################################### + return dists + + def predict_labels(self, dists, k=1): + """ + Given a matrix of distances between test points and training points, + predict a label for each test point. + + Inputs: + - dists: A numpy array of shape (num_test, num_train) where dists[i, j] + gives the distance between the ith test point and the jth training point. + + Returns: + - y: A numpy array of shape (num_test,) containing predicted labels for the + test data, where y[i] is the predicted label for the test point X[i].
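+    For example, with k = 4, if the labels of the four nearest neighbors are
+    [2, 5, 5, 2], both 2 and 5 appear twice; ties are broken toward the
+    smaller label, so the prediction is 2.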
+ """ + num_test = dists.shape[0] + y_pred = np.zeros(num_test) + for i in xrange(num_test): + # A list of length k storing the labels of the k nearest neighbors to + # the ith test point. + closest_y = [] + ######################################################################### + # TODO: # + # Use the distance matrix to find the k nearest neighbors of the ith # + # testing point, and use self.y_train to find the labels of these # + # neighbors. Store these labels in closest_y. # + # Hint: Look up the function numpy.argsort. # + ######################################################################### + pass + ######################################################################### + # TODO: # + # Now that you have found the labels of the k nearest neighbors, you # + # need to find the most common label in the list closest_y of labels. # + # Store this label in y_pred[i]. Break ties by choosing the smaller # + # label. # + ######################################################################### + pass + ######################################################################### + # END OF YOUR CODE # + ######################################################################### + + return y_pred + diff --git a/assignments2016/assignment1/cs231n/classifiers/linear_classifier.py b/assignments2016/assignment1/cs231n/classifiers/linear_classifier.py new file mode 100644 index 00000000..8e820903 --- /dev/null +++ b/assignments2016/assignment1/cs231n/classifiers/linear_classifier.py @@ -0,0 +1,130 @@ +import numpy as np +from cs231n.classifiers.linear_svm import * +from cs231n.classifiers.softmax import * + +class LinearClassifier(object): + + def __init__(self): + self.W = None + + def train(self, X, y, learning_rate=1e-3, reg=1e-5, num_iters=100, + batch_size=200, verbose=False): + """ + Train this linear classifier using stochastic gradient descent. + + Inputs: + - X: A numpy array of shape (N, D) containing training data; there are N + training samples each of dimension D. + - y: A numpy array of shape (N,) containing training labels; y[i] = c + means that X[i] has label 0 <= c < C for C classes. + - learning_rate: (float) learning rate for optimization. + - reg: (float) regularization strength. + - num_iters: (integer) number of steps to take when optimizing + - batch_size: (integer) number of training examples to use at each step. + - verbose: (boolean) If true, print progress during optimization. + + Outputs: + A list containing the value of the loss function at each training iteration. + """ + num_train, dim = X.shape + num_classes = np.max(y) + 1 # assume y takes values 0...K-1 where K is number of classes + if self.W is None: + # lazily initialize W + self.W = 0.001 * np.random.randn(dim, num_classes) + + # Run stochastic gradient descent to optimize W + loss_history = [] + for it in xrange(num_iters): + X_batch = None + y_batch = None + + ######################################################################### + # TODO: # + # Sample batch_size elements from the training data and their # + # corresponding labels to use in this round of gradient descent. # + # Store the data in X_batch and their corresponding labels in # + # y_batch; after sampling X_batch should have shape (dim, batch_size) # + # and y_batch should have shape (batch_size,) # + # # + # Hint: Use np.random.choice to generate indices. Sampling with # + # replacement is faster than sampling without replacement. 
# + ######################################################################### + pass + ######################################################################### + # END OF YOUR CODE # + ######################################################################### + + # evaluate loss and gradient + loss, grad = self.loss(X_batch, y_batch, reg) + loss_history.append(loss) + + # perform parameter update + ######################################################################### + # TODO: # + # Update the weights using the gradient and the learning rate. # + ######################################################################### + pass + ######################################################################### + # END OF YOUR CODE # + ######################################################################### + + if verbose and it % 100 == 0: + print 'iteration %d / %d: loss %f' % (it, num_iters, loss) + + return loss_history + + def predict(self, X): + """ + Use the trained weights of this linear classifier to predict labels for + data points. + + Inputs: + - X: D x N array of training data. Each column is a D-dimensional point. + + Returns: + - y_pred: Predicted labels for the data in X. y_pred is a 1-dimensional + array of length N, and each element is an integer giving the predicted + class. + """ + y_pred = np.zeros(X.shape[1]) + ########################################################################### + # TODO: # + # Implement this method. Store the predicted labels in y_pred. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + return y_pred + + def loss(self, X_batch, y_batch, reg): + """ + Compute the loss function and its derivative. + Subclasses will override this. + + Inputs: + - X_batch: A numpy array of shape (N, D) containing a minibatch of N + data points; each point has dimension D. + - y_batch: A numpy array of shape (N,) containing labels for the minibatch. + - reg: (float) regularization strength. + + Returns: A tuple containing: + - loss as a single float + - gradient with respect to self.W; an array of the same shape as W + """ + pass + + +class LinearSVM(LinearClassifier): + """ A subclass that uses the Multiclass SVM loss function """ + + def loss(self, X_batch, y_batch, reg): + return svm_loss_vectorized(self.W, X_batch, y_batch, reg) + + +class Softmax(LinearClassifier): + """ A subclass that uses the Softmax + Cross-entropy loss function """ + + def loss(self, X_batch, y_batch, reg): + return softmax_loss_vectorized(self.W, X_batch, y_batch, reg) + diff --git a/assignments2016/assignment1/cs231n/classifiers/linear_svm.py b/assignments2016/assignment1/cs231n/classifiers/linear_svm.py new file mode 100644 index 00000000..19ab753f --- /dev/null +++ b/assignments2016/assignment1/cs231n/classifiers/linear_svm.py @@ -0,0 +1,92 @@ +import numpy as np +from random import shuffle + +def svm_loss_naive(W, X, y, reg): + """ + Structured SVM loss function, naive implementation (with loops). + + Inputs have dimension D, there are C classes, and we operate on minibatches + of N examples. + + Inputs: + - W: A numpy array of shape (D, C) containing weights. + - X: A numpy array of shape (N, D) containing a minibatch of data. + - y: A numpy array of shape (N,) containing training labels; y[i] = c means + that X[i] has label c, where 0 <= c < C. 
- reg: (float) regularization strength + + Returns a tuple of: + - loss as single float + - gradient with respect to weights W; an array of same shape as W + """ + dW = np.zeros(W.shape) # initialize the gradient as zero + + # compute the loss and the gradient + num_classes = W.shape[1] + num_train = X.shape[0] + loss = 0.0 + for i in xrange(num_train): + scores = X[i].dot(W) + correct_class_score = scores[y[i]] + for j in xrange(num_classes): + if j == y[i]: + continue + margin = scores[j] - correct_class_score + 1 # note delta = 1 + if margin > 0: + loss += margin + + # Right now the loss is a sum over all training examples, but we want it + # to be an average instead so we divide by num_train. + loss /= num_train + + # Add regularization to the loss. + loss += 0.5 * reg * np.sum(W * W) + + ############################################################################# + # TODO: # + # Compute the gradient of the loss function and store it in dW. # + # Rather than first computing the loss and then computing the derivative, # + # it may be simpler to compute the derivative at the same time that the # + # loss is being computed. As a result you may need to modify some of the # + # code above to compute the gradient. # + ############################################################################# + + + return loss, dW + + +def svm_loss_vectorized(W, X, y, reg): + """ + Structured SVM loss function, vectorized implementation. + + Inputs and outputs are the same as svm_loss_naive. + """ + loss = 0.0 + dW = np.zeros(W.shape) # initialize the gradient as zero + + ############################################################################# + # TODO: # + # Implement a vectorized version of the structured SVM loss, storing the # + # result in loss. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + + ############################################################################# + # TODO: # + # Implement a vectorized version of the gradient for the structured SVM # + # loss, storing the result in dW. # + # # + # Hint: Instead of computing the gradient from scratch, it may be easier # + # to reuse some of the intermediate values that you used to compute the # + # loss. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + return loss, dW diff --git a/assignments2016/assignment1/cs231n/classifiers/neural_net.py b/assignments2016/assignment1/cs231n/classifiers/neural_net.py new file mode 100644 index 00000000..94bbcd05 --- /dev/null +++ b/assignments2016/assignment1/cs231n/classifiers/neural_net.py @@ -0,0 +1,218 @@ +import numpy as np +import matplotlib.pyplot as plt + + +class TwoLayerNet(object): + """ + A two-layer fully-connected neural network. The net has an input dimension of + N, a hidden layer dimension of H, and performs classification over C classes. + We train the network with a softmax loss function and L2 regularization on the + weight matrices. The network uses a ReLU nonlinearity after the first fully + connected layer.
+ + In other words, the network has the following architecture: + + input - fully connected layer - ReLU - fully connected layer - softmax + + The outputs of the second fully-connected layer are the scores for each class. + """ + + def __init__(self, input_size, hidden_size, output_size, std=1e-4): + """ + Initialize the model. Weights are initialized to small random values and + biases are initialized to zero. Weights and biases are stored in the + variable self.params, which is a dictionary with the following keys: + + W1: First layer weights; has shape (D, H) + b1: First layer biases; has shape (H,) + W2: Second layer weights; has shape (H, C) + b2: Second layer biases; has shape (C,) + + Inputs: + - input_size: The dimension D of the input data. + - hidden_size: The number of neurons H in the hidden layer. + - output_size: The number of classes C. + """ + self.params = {} + self.params['W1'] = std * np.random.randn(input_size, hidden_size) + self.params['b1'] = np.zeros(hidden_size) + self.params['W2'] = std * np.random.randn(hidden_size, output_size) + self.params['b2'] = np.zeros(output_size) + + def loss(self, X, y=None, reg=0.0): + """ + Compute the loss and gradients for a two layer fully connected neural + network. + + Inputs: + - X: Input data of shape (N, D). Each X[i] is a training sample. + - y: Vector of training labels. y[i] is the label for X[i], and each y[i] is + an integer in the range 0 <= y[i] < C. This parameter is optional; if it + is not passed then we only return scores, and if it is passed then we + instead return the loss and gradients. + - reg: Regularization strength. + + Returns: + If y is None, return a matrix scores of shape (N, C) where scores[i, c] is + the score for class c on input X[i]. + + If y is not None, instead return a tuple of: + - loss: Loss (data loss and regularization loss) for this batch of training + samples. + - grads: Dictionary mapping parameter names to gradients of those parameters + with respect to the loss function; has the same keys as self.params. + """ + # Unpack variables from the params dictionary + W1, b1 = self.params['W1'], self.params['b1'] + W2, b2 = self.params['W2'], self.params['b2'] + N, D = X.shape + + # Compute the forward pass + scores = None + ############################################################################# + # TODO: Perform the forward pass, computing the class scores for the input. # + # Store the result in the scores variable, which should be an array of # + # shape (N, C). # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + # If the targets are not given then jump out, we're done + if y is None: + return scores + + # Compute the loss + loss = None + ############################################################################# + # TODO: Finish the forward pass, and compute the loss. This should include # + # both the data loss and L2 regularization for W1 and W2. Store the result # + # in the variable loss, which should be a scalar. Use the Softmax # + # classifier loss. 
So that your results match ours, multiply the # + # regularization loss by 0.5 # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + # Backward pass: compute gradients + grads = {} + ############################################################################# + # TODO: Compute the backward pass, computing the derivatives of the weights # + # and biases. Store the results in the grads dictionary. For example, # + # grads['W1'] should store the gradient on W1, and be a matrix of same size # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + return loss, grads + + def train(self, X, y, X_val, y_val, + learning_rate=1e-3, learning_rate_decay=0.95, + reg=1e-5, num_iters=100, + batch_size=200, verbose=False): + """ + Train this neural network using stochastic gradient descent. + + Inputs: + - X: A numpy array of shape (N, D) giving training data. + - y: A numpy array of shape (N,) giving training labels; y[i] = c means that + X[i] has label c, where 0 <= c < C. + - X_val: A numpy array of shape (N_val, D) giving validation data. + - y_val: A numpy array of shape (N_val,) giving validation labels. + - learning_rate: Scalar giving learning rate for optimization. + - learning_rate_decay: Scalar giving factor used to decay the learning rate + after each epoch. + - reg: Scalar giving regularization strength. + - num_iters: Number of steps to take when optimizing. + - batch_size: Number of training examples to use per step. + - verbose: boolean; if true print progress during optimization. + """ + num_train = X.shape[0] + iterations_per_epoch = max(num_train / batch_size, 1) + + # Use SGD to optimize the parameters in self.params + loss_history = [] + train_acc_history = [] + val_acc_history = [] + + for it in xrange(num_iters): + X_batch = None + y_batch = None + + ######################################################################### + # TODO: Create a random minibatch of training data and labels, storing # + # them in X_batch and y_batch respectively. # + ######################################################################### + pass + ######################################################################### + # END OF YOUR CODE # + ######################################################################### + + # Compute loss and gradients using the current minibatch + loss, grads = self.loss(X_batch, y=y_batch, reg=reg) + loss_history.append(loss) + + ######################################################################### + # TODO: Use the gradients in the grads dictionary to update the # + # parameters of the network (stored in the dictionary self.params) # + # using stochastic gradient descent. You'll need to use the gradients # + # stored in the grads dictionary defined above.
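+        # A plain SGD update would look something like this (a sketch; it
+        # assumes grads has the same keys and shapes as self.params):
+        #   for p in self.params:
+        #     self.params[p] -= learning_rate * grads[p]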
# + ######################################################################### + pass + ######################################################################### + # END OF YOUR CODE # + ######################################################################### + + if verbose and it % 100 == 0: + print 'iteration %d / %d: loss %f' % (it, num_iters, loss) + + # Every epoch, check train and val accuracy and decay learning rate. + if it % iterations_per_epoch == 0: + # Check accuracy + train_acc = (self.predict(X_batch) == y_batch).mean() + val_acc = (self.predict(X_val) == y_val).mean() + train_acc_history.append(train_acc) + val_acc_history.append(val_acc) + + # Decay learning rate + learning_rate *= learning_rate_decay + + return { + 'loss_history': loss_history, + 'train_acc_history': train_acc_history, + 'val_acc_history': val_acc_history, + } + + def predict(self, X): + """ + Use the trained weights of this two-layer network to predict labels for + data points. For each data point we predict scores for each of the C + classes, and assign each data point to the class with the highest score. + + Inputs: + - X: A numpy array of shape (N, D) giving N D-dimensional data points to + classify. + + Returns: + - y_pred: A numpy array of shape (N,) giving predicted labels for each of + the elements of X. For all i, y_pred[i] = c means that X[i] is predicted + to have class c, where 0 <= c < C. + """ + y_pred = None + + ########################################################################### + # TODO: Implement this function; it should be VERY simple! # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + + return y_pred + + diff --git a/assignments2016/assignment1/cs231n/classifiers/softmax.py b/assignments2016/assignment1/cs231n/classifiers/softmax.py new file mode 100644 index 00000000..edddcfac --- /dev/null +++ b/assignments2016/assignment1/cs231n/classifiers/softmax.py @@ -0,0 +1,62 @@ +import numpy as np +from random import shuffle + +def softmax_loss_naive(W, X, y, reg): + """ + Softmax loss function, naive implementation (with loops) + + Inputs have dimension D, there are C classes, and we operate on minibatches + of N examples. + + Inputs: + - W: A numpy array of shape (D, C) containing weights. + - X: A numpy array of shape (N, D) containing a minibatch of data. + - y: A numpy array of shape (N,) containing training labels; y[i] = c means + that X[i] has label c, where 0 <= c < C. + - reg: (float) regularization strength + + Returns a tuple of: + - loss as single float + - gradient with respect to weights W; an array of same shape as W + """ + # Initialize the loss and gradient to zero. + loss = 0.0 + dW = np.zeros_like(W) + + ############################################################################# + # TODO: Compute the softmax loss and its gradient using explicit loops. # + # Store the loss in loss and the gradient in dW. If you are not careful # + # here, it is easy to run into numeric instability. Don't forget the # + # regularization! 
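+  # A common way to avoid the numeric instability is to shift the scores
+  # before exponentiating (a sketch, assuming scores = X[i].dot(W)):
+  #   scores -= np.max(scores)  # largest score becomes 0
+  #   probs = np.exp(scores) / np.sum(np.exp(scores))
+  #   loss += -np.log(probs[y[i]])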
# + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + return loss, dW + + +def softmax_loss_vectorized(W, X, y, reg): + """ + Softmax loss function, vectorized version. + + Inputs and outputs are the same as softmax_loss_naive. + """ + # Initialize the loss and gradient to zero. + loss = 0.0 + dW = np.zeros_like(W) + + ############################################################################# + # TODO: Compute the softmax loss and its gradient using no explicit loops. # + # Store the loss in loss and the gradient in dW. If you are not careful # + # here, it is easy to run into numeric instability. Don't forget the # + # regularization! # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + return loss, dW + diff --git a/assignments2016/assignment1/cs231n/data_utils.py b/assignments2016/assignment1/cs231n/data_utils.py new file mode 100644 index 00000000..9158da4d --- /dev/null +++ b/assignments2016/assignment1/cs231n/data_utils.py @@ -0,0 +1,158 @@ +import cPickle as pickle +import numpy as np +import os +from scipy.misc import imread + +def load_CIFAR_batch(filename): + """ load single batch of cifar """ + with open(filename, 'rb') as f: + datadict = pickle.load(f) + X = datadict['data'] + Y = datadict['labels'] + X = X.reshape(10000, 3, 32, 32).transpose(0,2,3,1).astype("float") + Y = np.array(Y) + return X, Y + +def load_CIFAR10(ROOT): + """ load all of cifar """ + xs = [] + ys = [] + for b in range(1,6): + f = os.path.join(ROOT, 'data_batch_%d' % (b, )) + X, Y = load_CIFAR_batch(f) + xs.append(X) + ys.append(Y) + Xtr = np.concatenate(xs) + Ytr = np.concatenate(ys) + del X, Y + Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch')) + return Xtr, Ytr, Xte, Yte + +def load_tiny_imagenet(path, dtype=np.float32): + """ + Load TinyImageNet. Each of TinyImageNet-100-A, TinyImageNet-100-B, and + TinyImageNet-200 have the same directory structure, so this can be used + to load any of them. + + Inputs: + - path: String giving path to the directory to load. + - dtype: numpy datatype used to load the data. + + Returns: A tuple of + - class_names: A list where class_names[i] is a list of strings giving the + WordNet names for class i in the loaded dataset. + - X_train: (N_tr, 3, 64, 64) array of training images + - y_train: (N_tr,) array of training labels + - X_val: (N_val, 3, 64, 64) array of validation images + - y_val: (N_val,) array of validation labels + - X_test: (N_test, 3, 64, 64) array of testing images. + - y_test: (N_test,) array of test labels; if test labels are not available + (such as in student code) then y_test will be None. 
+ """ + # First load wnids + with open(os.path.join(path, 'wnids.txt'), 'r') as f: + wnids = [x.strip() for x in f] + + # Map wnids to integer labels + wnid_to_label = {wnid: i for i, wnid in enumerate(wnids)} + + # Use words.txt to get names for each class + with open(os.path.join(path, 'words.txt'), 'r') as f: + wnid_to_words = dict(line.split('\t') for line in f) + for wnid, words in wnid_to_words.iteritems(): + wnid_to_words[wnid] = [w.strip() for w in words.split(',')] + class_names = [wnid_to_words[wnid] for wnid in wnids] + + # Next load training data. + X_train = [] + y_train = [] + for i, wnid in enumerate(wnids): + if (i + 1) % 20 == 0: + print 'loading training data for synset %d / %d' % (i + 1, len(wnids)) + # To figure out the filenames we need to open the boxes file + boxes_file = os.path.join(path, 'train', wnid, '%s_boxes.txt' % wnid) + with open(boxes_file, 'r') as f: + filenames = [x.split('\t')[0] for x in f] + num_images = len(filenames) + + X_train_block = np.zeros((num_images, 3, 64, 64), dtype=dtype) + y_train_block = wnid_to_label[wnid] * np.ones(num_images, dtype=np.int64) + for j, img_file in enumerate(filenames): + img_file = os.path.join(path, 'train', wnid, 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + ## grayscale file + img.shape = (64, 64, 1) + X_train_block[j] = img.transpose(2, 0, 1) + X_train.append(X_train_block) + y_train.append(y_train_block) + + # We need to concatenate all training data + X_train = np.concatenate(X_train, axis=0) + y_train = np.concatenate(y_train, axis=0) + + # Next load validation data + with open(os.path.join(path, 'val', 'val_annotations.txt'), 'r') as f: + img_files = [] + val_wnids = [] + for line in f: + img_file, wnid = line.split('\t')[:2] + img_files.append(img_file) + val_wnids.append(wnid) + num_val = len(img_files) + y_val = np.array([wnid_to_label[wnid] for wnid in val_wnids]) + X_val = np.zeros((num_val, 3, 64, 64), dtype=dtype) + for i, img_file in enumerate(img_files): + img_file = os.path.join(path, 'val', 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + img.shape = (64, 64, 1) + X_val[i] = img.transpose(2, 0, 1) + + # Next load test images + # Students won't have test labels, so we need to iterate over files in the + # images directory. + img_files = os.listdir(os.path.join(path, 'test', 'images')) + X_test = np.zeros((len(img_files), 3, 64, 64), dtype=dtype) + for i, img_file in enumerate(img_files): + img_file = os.path.join(path, 'test', 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + img.shape = (64, 64, 1) + X_test[i] = img.transpose(2, 0, 1) + + y_test = None + y_test_file = os.path.join(path, 'test', 'test_annotations.txt') + if os.path.isfile(y_test_file): + with open(y_test_file, 'r') as f: + img_file_to_wnid = {} + for line in f: + line = line.split('\t') + img_file_to_wnid[line[0]] = line[1] + y_test = [wnid_to_label[img_file_to_wnid[img_file]] for img_file in img_files] + y_test = np.array(y_test) + + return class_names, X_train, y_train, X_val, y_val, X_test, y_test + + +def load_models(models_dir): + """ + Load saved models from disk. This will attempt to unpickle all files in a + directory; any files that give errors on unpickling (such as README.txt) will + be skipped. + + Inputs: + - models_dir: String giving the path to a directory containing model files. + Each model file is a pickled dictionary with a 'model' field. + + Returns: + A dictionary mapping model file names to models. 
+ """ + models = {} + for model_file in os.listdir(models_dir): + with open(os.path.join(models_dir, model_file), 'rb') as f: + try: + models[model_file] = pickle.load(f)['model'] + except pickle.UnpicklingError: + continue + return models diff --git a/assignments2016/assignment1/cs231n/datasets/.gitignore b/assignments2016/assignment1/cs231n/datasets/.gitignore new file mode 100644 index 00000000..0232c3ab --- /dev/null +++ b/assignments2016/assignment1/cs231n/datasets/.gitignore @@ -0,0 +1,4 @@ +cifar-10-batches-py/* +tiny-imagenet-100-A* +tiny-imagenet-100-B* +tiny-100-A-pretrained/* diff --git a/assignments2016/assignment1/cs231n/datasets/get_datasets.sh b/assignments2016/assignment1/cs231n/datasets/get_datasets.sh new file mode 100755 index 00000000..0dd93621 --- /dev/null +++ b/assignments2016/assignment1/cs231n/datasets/get_datasets.sh @@ -0,0 +1,4 @@ +# Get CIFAR10 +wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz +tar -xzvf cifar-10-python.tar.gz +rm cifar-10-python.tar.gz diff --git a/assignments2016/assignment1/cs231n/features.py b/assignments2016/assignment1/cs231n/features.py new file mode 100644 index 00000000..fdf40372 --- /dev/null +++ b/assignments2016/assignment1/cs231n/features.py @@ -0,0 +1,148 @@ +import matplotlib +import numpy as np +from scipy.ndimage import uniform_filter + + +def extract_features(imgs, feature_fns, verbose=False): + """ + Given pixel data for images and several feature functions that can operate on + single images, apply all feature functions to all images, concatenating the + feature vectors for each image and storing the features for all images in + a single matrix. + + Inputs: + - imgs: N x H X W X C array of pixel data for N images. + - feature_fns: List of k feature functions. The ith feature function should + take as input an H x W x D array and return a (one-dimensional) array of + length F_i. + - verbose: Boolean; if true, print progress. + + Returns: + An array of shape (N, F_1 + ... + F_k) where each column is the concatenation + of all features for a single image. + """ + num_images = imgs.shape[0] + if num_images == 0: + return np.array([]) + + # Use the first image to determine feature dimensions + feature_dims = [] + first_image_features = [] + for feature_fn in feature_fns: + feats = feature_fn(imgs[0].squeeze()) + assert len(feats.shape) == 1, 'Feature functions must be one-dimensional' + feature_dims.append(feats.size) + first_image_features.append(feats) + + # Now that we know the dimensions of the features, we can allocate a single + # big array to store all features as columns. + total_feature_dim = sum(feature_dims) + imgs_features = np.zeros((num_images, total_feature_dim)) + imgs_features[0] = np.hstack(first_image_features).T + + # Extract features for the rest of the images. 
+ for i in xrange(1, num_images): + idx = 0 + for feature_fn, feature_dim in zip(feature_fns, feature_dims): + next_idx = idx + feature_dim + imgs_features[i, idx:next_idx] = feature_fn(imgs[i].squeeze()) + idx = next_idx + if verbose and i % 1000 == 0: + print 'Done extracting features for %d / %d images' % (i, num_images) + + return imgs_features + + +def rgb2gray(rgb): + """Convert RGB image to grayscale + + Parameters: + rgb : RGB image + + Returns: + gray : grayscale image + + """ + return np.dot(rgb[...,:3], [0.299, 0.587, 0.114]) # standard luma weights + + +def hog_feature(im): + """Compute Histogram of Gradient (HOG) feature for an image + + Modified from skimage.feature.hog + http://pydoc.net/Python/scikits-image/0.4.2/skimage.feature.hog + + Reference: + Histograms of Oriented Gradients for Human Detection + Navneet Dalal and Bill Triggs, CVPR 2005 + + Parameters: + im : an input grayscale or rgb image + + Returns: + feat: Histogram of Gradient (HOG) feature + + """ + + # convert rgb to grayscale if needed + if im.ndim == 3: + image = rgb2gray(im) + else: + image = np.atleast_2d(im) + + sx, sy = image.shape # image size + orientations = 9 # number of gradient bins + cx, cy = (8, 8) # pixels per cell + + gx = np.zeros(image.shape) + gy = np.zeros(image.shape) + gx[:, :-1] = np.diff(image, n=1, axis=1) # compute gradient on x-direction + gy[:-1, :] = np.diff(image, n=1, axis=0) # compute gradient on y-direction + grad_mag = np.sqrt(gx ** 2 + gy ** 2) # gradient magnitude + grad_ori = np.arctan2(gy, (gx + 1e-15)) * (180 / np.pi) + 90 # gradient orientation + + n_cellsx = int(np.floor(sx / cx)) # number of cells in x + n_cellsy = int(np.floor(sy / cy)) # number of cells in y + # compute orientations integral images + orientation_histogram = np.zeros((n_cellsx, n_cellsy, orientations)) + for i in range(orientations): + # create new integral image for this orientation + # isolate orientations in this range + temp_ori = np.where(grad_ori < 180 / orientations * (i + 1), + grad_ori, 0) + temp_ori = np.where(grad_ori >= 180 / orientations * i, + temp_ori, 0) + # select magnitudes for those orientations + cond2 = temp_ori > 0 + temp_mag = np.where(cond2, grad_mag, 0) + orientation_histogram[:,:,i] = uniform_filter(temp_mag, size=(cx, cy))[cx/2::cx, cy/2::cy].T + + return orientation_histogram.ravel() + + +def color_histogram_hsv(im, nbin=10, xmin=0, xmax=255, normalized=True): + """ + Compute color histogram for an image using hue. + + Inputs: + - im: H x W x C array of pixel data for an RGB image. + - nbin: Number of histogram bins. (default: 10) + - xmin: Minimum pixel value (default: 0) + - xmax: Maximum pixel value (default: 255) + - normalized: Whether to normalize the histogram (default: True) + + Returns: + 1D vector of length nbin giving the color histogram over the hue of the + input image.
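+  Note: the RGB input is converted to HSV, the hue channel is rescaled to
+  [0, xmax], and the resulting histogram is weighted by the bin widths.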
+ """ + ndim = im.ndim + bins = np.linspace(xmin, xmax, nbin+1) + hsv = matplotlib.colors.rgb_to_hsv(im/xmax) * xmax + imhist, bin_edges = np.histogram(hsv[:,:,0], bins=bins, density=normalized) + imhist = imhist * np.diff(bin_edges) + + # return histogram + return imhist + + +pass diff --git a/assignments2016/assignment1/cs231n/gradient_check.py b/assignments2016/assignment1/cs231n/gradient_check.py new file mode 100644 index 00000000..2d6b1f62 --- /dev/null +++ b/assignments2016/assignment1/cs231n/gradient_check.py @@ -0,0 +1,124 @@ +import numpy as np +from random import randrange + +def eval_numerical_gradient(f, x, verbose=True, h=0.00001): + """ + a naive implementation of numerical gradient of f at x + - f should be a function that takes a single argument + - x is the point (numpy array) to evaluate the gradient at + """ + + fx = f(x) # evaluate function value at original point + grad = np.zeros_like(x) + # iterate over all indexes in x + it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite']) + while not it.finished: + + # evaluate function at x+h + ix = it.multi_index + oldval = x[ix] + x[ix] = oldval + h # increment by h + fxph = f(x) # evalute f(x + h) + x[ix] = oldval - h + fxmh = f(x) # evaluate f(x - h) + x[ix] = oldval # restore + + # compute the partial derivative with centered formula + grad[ix] = (fxph - fxmh) / (2 * h) # the slope + if verbose: + print ix, grad[ix] + it.iternext() # step to next dimension + + return grad + + +def eval_numerical_gradient_array(f, x, df, h=1e-5): + """ + Evaluate a numeric gradient for a function that accepts a numpy + array and returns a numpy array. + """ + grad = np.zeros_like(x) + it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite']) + while not it.finished: + ix = it.multi_index + + oldval = x[ix] + x[ix] = oldval + h + pos = f(x).copy() + x[ix] = oldval - h + neg = f(x).copy() + x[ix] = oldval + + grad[ix] = np.sum((pos - neg) * df) / (2 * h) + it.iternext() + return grad + + +def eval_numerical_gradient_blobs(f, inputs, output, h=1e-5): + """ + Compute numeric gradients for a function that operates on input + and output blobs. + + We assume that f accepts several input blobs as arguments, followed by a blob + into which outputs will be written. For example, f might be called like this: + + f(x, w, out) + + where x and w are input Blobs, and the result of f will be written to out. + + Inputs: + - f: function + - inputs: tuple of input blobs + - output: output blob + - h: step size + """ + numeric_diffs = [] + for input_blob in inputs: + diff = np.zeros_like(input_blob.diffs) + it = np.nditer(input_blob.vals, flags=['multi_index'], + op_flags=['readwrite']) + while not it.finished: + idx = it.multi_index + orig = input_blob.vals[idx] + + input_blob.vals[idx] = orig + h + f(*(inputs + (output,))) + pos = np.copy(output.vals) + input_blob.vals[idx] = orig - h + f(*(inputs + (output,))) + neg = np.copy(output.vals) + input_blob.vals[idx] = orig + + diff[idx] = np.sum((pos - neg) * output.diffs) / (2.0 * h) + + it.iternext() + numeric_diffs.append(diff) + return numeric_diffs + + +def eval_numerical_gradient_net(net, inputs, output, h=1e-5): + return eval_numerical_gradient_blobs(lambda *args: net.forward(), + inputs, output, h=h) + + +def grad_check_sparse(f, x, analytic_grad, num_checks=10, h=1e-5): + """ + sample a few random elements and only return numerical + in this dimensions. 
+ """ + + for i in xrange(num_checks): + ix = tuple([randrange(m) for m in x.shape]) + + oldval = x[ix] + x[ix] = oldval + h # increment by h + fxph = f(x) # evaluate f(x + h) + x[ix] = oldval - h # increment by h + fxmh = f(x) # evaluate f(x - h) + x[ix] = oldval # reset + + grad_numerical = (fxph - fxmh) / (2 * h) + grad_analytic = analytic_grad[ix] + rel_error = abs(grad_numerical - grad_analytic) / (abs(grad_numerical) + abs(grad_analytic)) + print 'numerical: %f analytic: %f, relative error: %e' % (grad_numerical, grad_analytic, rel_error) + diff --git a/assignments2016/assignment1/cs231n/vis_utils.py b/assignments2016/assignment1/cs231n/vis_utils.py new file mode 100644 index 00000000..8d04473f --- /dev/null +++ b/assignments2016/assignment1/cs231n/vis_utils.py @@ -0,0 +1,73 @@ +from math import sqrt, ceil +import numpy as np + +def visualize_grid(Xs, ubound=255.0, padding=1): + """ + Reshape a 4D tensor of image data to a grid for easy visualization. + + Inputs: + - Xs: Data of shape (N, H, W, C) + - ubound: Output grid will have values scaled to the range [0, ubound] + - padding: The number of blank pixels between elements of the grid + """ + (N, H, W, C) = Xs.shape + grid_size = int(ceil(sqrt(N))) + grid_height = H * grid_size + padding * (grid_size - 1) + grid_width = W * grid_size + padding * (grid_size - 1) + grid = np.zeros((grid_height, grid_width, C)) + next_idx = 0 + y0, y1 = 0, H + for y in xrange(grid_size): + x0, x1 = 0, W + for x in xrange(grid_size): + if next_idx < N: + img = Xs[next_idx] + low, high = np.min(img), np.max(img) + grid[y0:y1, x0:x1] = ubound * (img - low) / (high - low) + # grid[y0:y1, x0:x1] = Xs[next_idx] + next_idx += 1 + x0 += W + padding + x1 += W + padding + y0 += H + padding + y1 += H + padding + # grid_max = np.max(grid) + # grid_min = np.min(grid) + # grid = ubound * (grid - grid_min) / (grid_max - grid_min) + return grid + +def vis_grid(Xs): + """ visualize a grid of images """ + (N, H, W, C) = Xs.shape + A = int(ceil(sqrt(N))) + G = np.ones((A*H+A, A*W+A, C), Xs.dtype) + G *= np.min(Xs) + n = 0 + for y in range(A): + for x in range(A): + if n < N: + G[y*H+y:(y+1)*H+y, x*W+x:(x+1)*W+x, :] = Xs[n,:,:,:] + n += 1 + # normalize to [0,1] + maxg = G.max() + ming = G.min() + G = (G - ming)/(maxg-ming) + return G + +def vis_nn(rows): + """ visualize array of arrays of images """ + N = len(rows) + D = len(rows[0]) + H,W,C = rows[0][0].shape + Xs = rows[0][0] + G = np.ones((N*H+N, D*W+D, C), Xs.dtype) + for y in range(N): + for x in range(D): + G[y*H+y:(y+1)*H+y, x*W+x:(x+1)*W+x, :] = rows[y][x] + # normalize to [0,1] + maxg = G.max() + ming = G.min() + G = (G - ming)/(maxg-ming) + return G + + + diff --git a/assignments2016/assignment1/features.ipynb b/assignments2016/assignment1/features.ipynb new file mode 100644 index 00000000..7e0177e0 --- /dev/null +++ b/assignments2016/assignment1/features.ipynb @@ -0,0 +1,332 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# 이미지 특징 연습\n", + "*이 워크시트를 완성하고 제출하세요. (출력물과 워크시트에 포함되지 않은 코드들을 포함해서) 더 자세한 정보는 코스 웹사이트인 [숙제 페이지](http://vision.stanford.edu/teaching/cs231n/assignments.html)에서 볼 수 있습니다.*\n", + "\n", + "우리는 입력된 이미지의 픽셀에 선형 분류기를 학습시켜 이미지 분류 작업에 적절한 성능을 얻을 수 있음을 알고있습니다.\n", + "이번 연습에서 우리는 단순 픽셀을 계산하기 위해 단순 픽셀(화소)이 아닌 특징을 통해 선형 분류기를 훈련시켜 우리의 분류 성능을 향상시킬 수 있음을 보일 것입니다.\n", + "\n", + "이번 연습을 위한 모든 해야할 작업들은 이 notebook에서 수행됩니다." 
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "import random\n",
+    "import numpy as np\n",
+    "from cs231n.data_utils import load_CIFAR10\n",
+    "import matplotlib.pyplot as plt\n",
+    "%matplotlib inline\n",
+    "plt.rcParams['figure.figsize'] = (10.0, 8.0) # 기본 그래프 크기 설정\n",
+    "plt.rcParams['image.interpolation'] = 'nearest'\n",
+    "plt.rcParams['image.cmap'] = 'gray'\n",
+    "\n",
+    "# auto-reloading을 위한 외부 모듈\n",
+    "# http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython를 보세요.\n",
+    "%load_ext autoreload\n",
+    "%autoreload 2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 데이터 불러오기\n",
+    "이전 연습에서처럼, 우리는 CIFAR-10 데이터를 불러올 것입니다."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from cs231n.features import color_histogram_hsv, hog_feature\n",
+    "\n",
+    "def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000):\n",
+    "  # CIFAR-10 데이터를 불러옵니다.\n",
+    "  cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n",
+    "  X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n",
+    "  \n",
+    "  # 데이터 표본 추출\n",
+    "  mask = range(num_training, num_training + num_validation)\n",
+    "  X_val = X_train[mask]\n",
+    "  y_val = y_train[mask]\n",
+    "  mask = range(num_training)\n",
+    "  X_train = X_train[mask]\n",
+    "  y_train = y_train[mask]\n",
+    "  mask = range(num_test)\n",
+    "  X_test = X_test[mask]\n",
+    "  y_test = y_test[mask]\n",
+    "\n",
+    "  return X_train, y_train, X_val, y_val, X_test, y_test\n",
+    "\n",
+    "X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 특징 추출하기\n",
+    "각 이미지마다 방향 그라디언트 히스토그램(HOG)과 HSV 색 공간의 색상(hue) 채널을 사용한 색상 히스토그램을 모두 계산할 것입니다. 각 이미지의 HOG 특징 벡터와 색상 히스토그램 특징 벡터를 이어 붙여 최종 특징 벡터를 만듭니다.\n",
+    "\n",
+    "대략적으로 말하면, HOG는 색상 정보를 무시하면서 이미지의 질감을 포착하고, 색상 히스토그램은 질감을 무시하면서 입력된 이미지의 색상을 나타냅니다. 결과적으로 두 가지를 함께 사용하면 한 가지만 사용하는 것보다 더 효과적으로 작동할 것으로 기대합니다. 이 가정을 검증해 보는 것은 보너스 단계에서 수행할 만한 좋은 과제입니다.\n",
+    "\n",
+    "`hog_feature`와 `color_histogram_hsv` 함수는 둘 다 하나의 이미지를 받아 그 이미지의 특징 벡터를 반환합니다. extract_features 함수는 이미지 집합과 특징 함수들의 목록을 받아 각 이미지에 각각의 특징 함수를 적용하고, 각 행이 하나의 이미지에 대한 모든 특징 벡터의 연결인 배열에 결과를 저장합니다."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from cs231n.features import *\n",
+    "\n",
+    "num_color_bins = 10 # Number of bins in the color histogram\n",
+    "feature_fns = [hog_feature, lambda img: color_histogram_hsv(img, nbin=num_color_bins)]\n",
+    "X_train_feats = extract_features(X_train, feature_fns, verbose=True)\n",
+    "X_val_feats = extract_features(X_val, feature_fns)\n",
+    "X_test_feats = extract_features(X_test, feature_fns)\n",
+    "\n",
+    "# 전처리: 평균 특징 빼기\n",
+    "mean_feat = np.mean(X_train_feats, axis=0, keepdims=True)\n",
+    "X_train_feats -= mean_feat\n",
+    "X_val_feats -= mean_feat\n",
+    "X_test_feats -= mean_feat\n",
+    "\n",
+    "# 전처리: 표준편차로 나누기. 
이것은 각 특징이 대략 비슷한 스케일을 갖도록 보장합니다.\n",
+    "std_feat = np.std(X_train_feats, axis=0, keepdims=True)\n",
+    "X_train_feats /= std_feat\n",
+    "X_val_feats /= std_feat\n",
+    "X_test_feats /= std_feat\n",
+    "\n",
+    "# 전처리: bias 차원 추가\n",
+    "X_train_feats = np.hstack([X_train_feats, np.ones((X_train_feats.shape[0], 1))])\n",
+    "X_val_feats = np.hstack([X_val_feats, np.ones((X_val_feats.shape[0], 1))])\n",
+    "X_test_feats = np.hstack([X_test_feats, np.ones((X_test_feats.shape[0], 1))])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## SVM을 특징에 대해서 훈련\n",
+    "이번 과제에서 작성한 멀티클래스 SVM 코드를 사용하여 위에서 추출된 특징으로 SVM을 훈련합니다.\n",
+    "이 방법은 단순 픽셀로 SVM을 훈련시키는 것보다 더 좋은 결과를 얻을 수 있습니다."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# Validation을 사용하여 학습률과 정규화 강도를 조정합니다.\n",
+    "\n",
+    "from cs231n.classifiers.linear_classifier import LinearSVM\n",
+    "\n",
+    "learning_rates = [1e-9, 1e-8, 1e-7]\n",
+    "regularization_strengths = [1e5, 1e6, 1e7]\n",
+    "\n",
+    "results = {}\n",
+    "best_val = -1\n",
+    "best_svm = None\n",
+    "\n",
+    "pass\n",
+    "######################################################################################\n",
+    "# TODO:                                                                              #\n",
+    "# Validation을 사용하여 학습률과 정규화 강도를 조정합니다.                            #\n",
+    "# 이것은 SVM에서 했던 검증과 동일해야 합니다.                                         #\n",
+    "# 가장 잘 훈련된 분류기를 best_svm에 저장하세요.                                      #\n",
+    "# 색상 히스토그램의 bin 개수를 바꿔 가며 실험해 보고 싶을 수도 있습니다.               #\n",
+    "# 신중하게 튜닝하면 validation 세트에서 0.44에 근접한 정확도를 얻을 수 있을 것입니다.  #\n",
+    "######################################################################################\n",
+    "\n",
+    "pass\n",
+    "######################################################################################\n",
+    "#                                    코드의 끝                                       #\n",
+    "######################################################################################\n",
+    "\n",
+    "# 결과를 출력합니다.\n",
+    "for lr, reg in sorted(results):\n",
+    "  train_accuracy, val_accuracy = results[(lr, reg)]\n",
+    "  print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n",
+    "                lr, reg, train_accuracy, val_accuracy)\n",
+    "  \n",
+    "print 'best validation accuracy achieved during cross-validation: %f' % best_val"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# 테스트 세트로 당신이 훈련시킨 SVM을 평가합니다.\n",
+    "y_test_pred = best_svm.predict(X_test_feats)\n",
+    "test_accuracy = np.mean(y_test == y_test_pred)\n",
+    "print test_accuracy"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# 알고리즘이 어떻게 작동하는지에 대한 직관을 얻는 중요한 방법 하나는\n",
+    "# 알고리즘이 저지르는 실수를 시각화하는 것입니다.\n",
+    "# 이 시각화에서는 현재 시스템이 잘못 분류한 이미지의 예제들을 보여줍니다.\n",
+    "# 예를 들어 첫 번째 열은 실제로는 \"plane\"이 아니지만 시스템이 \"plane\"으로 분류한 이미지를 보여줍니다.\n",
+    "\n",
+    "examples_per_class = 8\n",
+    "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n",
+    "for cls, cls_name in enumerate(classes):\n",
+    "  idxs = np.where((y_test != cls) & (y_test_pred == cls))[0]\n",
+    "  idxs = np.random.choice(idxs, examples_per_class, replace=False)\n",
+    "  for i, idx in enumerate(idxs):\n",
+    "    plt.subplot(examples_per_class, len(classes), i * len(classes) + cls + 1)\n",
+    "    plt.imshow(X_test[idx].astype('uint8'))\n",
+    "    plt.axis('off')\n",
+    "    if i == 0:\n",
+    "      plt.title(cls_name)\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 연습문제 1:\n",
+    " 잘못 분류된 결과에 
대해 설명해 보세요. 납득이 되나요?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 이미지 특징에 대한 신경망\n",
+    "이번 과제의 앞부분에서 단순 픽셀에 2-계층 신경망을 학습시키면 선형 분류기보다 성능이 더 향상됨을 보았습니다. 그리고 이 notebook에서는 선형 분류기를 이미지 픽셀에 바로 적용하는 것보다 이미지에서 추출한 특징(feature)에 적용하는 것이 더 좋은 성능을 얻는다는 것을 확인했습니다.\n",
+    "\n",
+    "완성도를 위해, 이미지 특징에 대한 신경망 또한 학습시켜 보아야 합니다. 이 접근법은 이전의 모든 방법보다 더 뛰어날 것입니다: 테스트 세트에 대해 55% 이상의 분류 정확도를 쉽게 달성할 수 있어야 합니다; 우리의 최고 모델은 60%의 분류 정확도를 달성했습니다."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "print X_train_feats.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from cs231n.classifiers.neural_net import TwoLayerNet\n",
+    "\n",
+    "input_dim = X_train_feats.shape[1]\n",
+    "hidden_dim = 500\n",
+    "num_classes = 10\n",
+    "\n",
+    "net = TwoLayerNet(input_dim, hidden_dim, num_classes)\n",
+    "best_net = None\n",
+    "\n",
+    "################################################################################\n",
+    "# TODO:                                                                        #\n",
+    "# 이미지 특징으로 2-계층 신경망 학습시키기.                                     #\n",
+    "# 이전 섹션처럼 다양한 변수들을 교차검증하기.                                    #\n",
+    "# 최고의 모델을 best_net 변수에 저장하기.                                       #\n",
+    "################################################################################\n",
+    "pass\n",
+    "################################################################################\n",
+    "#                                  코드의 끝                                   #\n",
+    "################################################################################"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# 당신의 신경망 분류기를 테스트 세트로 실행시켜 보세요.\n",
+    "# 55% 이상의 정확도를 얻을 수 있어야 합니다.\n",
+    "\n",
+    "test_acc = (net.predict(X_test_feats) == y_test).mean()\n",
+    "print test_acc"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 보너스: 당신만의 특징을 디자인해 보세요!\n",
+    "\n",
+    "간단한 이미지 특징이 분류기의 성능을 향상시킬 수 있음을 배웠습니다. 지금까지 HOG와 색상 히스토그램을 시도해 봤지만, 다른 종류의 특징은 분류 성능을 더 향상시킬 수도 있습니다.\n",
+    "\n",
+    "보너스 포인트를 위해, 새로운 종류의 특징을 디자인하고 구현하여 CIFAR-10의 이미지 분류에 사용해 보세요. 당신의 특징이 어떻게 작동하고, 왜 그 특징이 이미지 분류에 효과적으로 작동할 것이라 생각했는지 설명해 보세요. 이 notebook에서 구현해 보고, 모든 hyperparameter를 교차 검증하고, HOG + 색상 히스토그램 기준선(baseline)과 성능을 비교해 보세요."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 보너스: 뭔가 더 해보세요!\n",
+    "이번 과제에서 제공된 자료와 코드를 사용하여 흥미로운 도전을 해보세요. 과제를 하면서 다른 의문점이 생겼나요? 과제를 하면서 머리에 참신한 생각이 떠올랐나요? 당신을 보여줄 수 있는 기회입니다!"
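위의 첫 번째 보너스에 대한 출발점이 될 수 있는 한 가지 예로, 이미지가 [0, 255] 범위의 H x W x 3 배열이라는 가정 아래의 간단한 스케치를 덧붙입니다. 아래의 `gray_histogram`은 스타터 코드에 없는, 설명을 위해 가정한 함수입니다.

~~~python
import numpy as np

def gray_histogram(im, nbin=16):
  # 표준 luma 가중치로 grayscale로 변환한 뒤, 거친 밝기 히스토그램을 계산합니다.
  gray = np.dot(im[..., :3], [0.299, 0.587, 0.114])
  hist, _ = np.histogram(gray, bins=np.linspace(0, 255, nbin + 1), density=True)
  return hist

im = np.random.rand(32, 32, 3) * 255  # 임의의 테스트 이미지
print(gray_histogram(im).shape)  # (16,)
~~~

`hog_feature`, `color_histogram_hsv`와 같은 방식으로 `feature_fns` 목록에 추가해 사용할 수 있습니다.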
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.5.1"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/assignments2016/assignment1/frameworkpython b/assignments2016/assignment1/frameworkpython
new file mode 100755
index 00000000..a0fa5517
--- /dev/null
+++ b/assignments2016/assignment1/frameworkpython
@@ -0,0 +1,13 @@
+#!/bin/bash
+
+# what real Python executable to use
+PYVER=2.7
+PATHTOPYTHON=/usr/local/bin/
+PYTHON=${PATHTOPYTHON}python${PYVER}
+
+# find the root of the virtualenv, it should be the parent of the dir this script is in
+ENV=`$PYTHON -c "import os; print os.path.abspath(os.path.join(os.path.dirname(\"$0\"), '..'))"`
+
+# now run Python with the virtualenv set as Python's HOME
+export PYTHONHOME=$ENV
+exec $PYTHON "$@"
diff --git a/assignments2016/assignment1/knn.ipynb b/assignments2016/assignment1/knn.ipynb
new file mode 100644
index 00000000..00b0ec4b
--- /dev/null
+++ b/assignments2016/assignment1/knn.ipynb
@@ -0,0 +1,458 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# K-Nearest Neighbor (kNN) 연습\n",
+    "\n",
+    "*이 워크시트를 완성하고 제출하세요. (출력물과 워크시트에 포함되지 않은 코드들을 포함해서) 더 자세한 정보는 코스 웹사이트인 [숙제 페이지](http://vision.stanford.edu/teaching/cs231n/assignments.html)에서 볼 수 있습니다.*\n",
+    "\n",
+    "kNN 분류기는 다음 두 단계로 구성됩니다.\n",
+    "\n",
+    "- 학습 중에, 분류기는 훈련 데이터를 받아 단순히 기억합니다.\n",
+    "- 테스트 중에, kNN은 모든 테스트 이미지를 모든 훈련 이미지와 비교하여, 가장 유사한 k개의 훈련 예제의 레이블을 가져와 분류합니다.\n",
+    "- k 값은 교차 검증으로 결정합니다.\n",
+    "\n",
+    "이번 연습에서 우리는 이러한 단계들을 구현하면서\n",
+    "기본적인 이미지 분류 pipeline과 교차검증을 이해하고, 효율적인 벡터화된 코드를 작성하는 방법을 알아봅니다."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# 이 notebook을 위해 몇 가지 설정 코드를 실행하세요.\n",
+    "\n",
+    "import random\n",
+    "import numpy as np\n",
+    "from cs231n.data_utils import load_CIFAR10\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "# matplotlib figure들을 새 창에 띄우지 않고 이 notebook 안에 표시하기 위한 약간의 마법입니다.\n",
+    "%matplotlib inline\n",
+    "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
+    "plt.rcParams['image.interpolation'] = 'nearest'\n",
+    "plt.rcParams['image.cmap'] = 'gray'\n",
+    "\n",
+    "# 이 notebook이 외부 파이썬 모듈을 다시 불러오기 위한 코드입니다.\n",
+    "# 다음 링크를 확인해 보세요. 
http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
+    "%load_ext autoreload\n",
+    "%autoreload 2"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# CIFAR-10 데이터를 불러옵니다.\n",
+    "cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n",
+    "X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n",
+    "\n",
+    "# sanity 체크로서 학습 데이터와 테스트 데이터의 크기를 출력합니다.\n",
+    "print 'Training data shape: ', X_train.shape\n",
+    "print 'Training labels shape: ', y_train.shape\n",
+    "print 'Test data shape: ', X_test.shape\n",
+    "print 'Test labels shape: ', y_test.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# 데이터셋에서 몇 가지 예제를 시각화합니다.\n",
+    "# 각 class마다 약간의 학습 이미지를 보여줍니다.\n",
+    "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n",
+    "num_classes = len(classes)\n",
+    "samples_per_class = 7\n",
+    "for y, cls in enumerate(classes):\n",
+    "  idxs = np.flatnonzero(y_train == y)\n",
+    "  idxs = np.random.choice(idxs, samples_per_class, replace=False)\n",
+    "  for i, idx in enumerate(idxs):\n",
+    "    plt_idx = i * num_classes + y + 1\n",
+    "    plt.subplot(samples_per_class, num_classes, plt_idx)\n",
+    "    plt.imshow(X_train[idx].astype('uint8'))\n",
+    "    plt.axis('off')\n",
+    "    if i == 0:\n",
+    "      plt.title(cls)\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# 이 연습에서 더 효율적인 코드 실행을 위해 데이터를 표본 추출합니다.\n",
+    "num_training = 5000\n",
+    "mask = range(num_training)\n",
+    "X_train = X_train[mask]\n",
+    "y_train = y_train[mask]\n",
+    "\n",
+    "num_test = 500\n",
+    "mask = range(num_test)\n",
+    "X_test = X_test[mask]\n",
+    "y_test = y_test[mask]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# 이미지 데이터를 행으로 변형시킵니다.\n",
+    "X_train = np.reshape(X_train, (X_train.shape[0], -1))\n",
+    "X_test = np.reshape(X_test, (X_test.shape[0], -1))\n",
+    "print X_train.shape, X_test.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from cs231n.classifiers import KNearestNeighbor\n",
+    "\n",
+    "# kNN 분류기를 생성합니다.\n",
+    "# kNN 분류기를 학습시킬 때 분류기는 단순히 데이터를 기억하고\n",
+    "# 더 이상의 처리를 하지 않는다는 것을 기억하세요.\n",
+    "classifier = KNearestNeighbor()\n",
+    "classifier.train(X_train, y_train)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "이제 테스트 데이터를 kNN 분류기로 분류해 볼 것입니다.\n",
+    "이 과정은 두 단계로 나눌 수 있습니다:\n",
+    "\n",
+    "1. 먼저 모든 테스트 예제와 모든 훈련 예제 사이의 거리를 계산해야 합니다.\n",
+    "2. 이 거리들이 주어지면, 각 테스트 예제마다 가장 가까운 k개의 훈련 예제를 찾고\n",
+    "그 레이블들이 투표하게 합니다.\n",
+    "\n",
+    "모든 테스트 예제와 학습 예제 사이의 거리 행렬을 계산하는 것부터 시작해 봅시다. **Ntr** 학습 예제와 **Nte** 테스트 예제가 있을 때, 각 (i, j) 요소가 i번째 테스트 예제와 j번째 훈련 예제 사이의 거리를 나타내는 **Nte x Ntr** 행렬을 결과로 얻을 수 있습니다.\n",
+    "\n",
+    "\n",
+    "먼저 `cs231n/classifiers/k_nearest_neighbor.py`를 열고 각 (테스트, 학습) 예제 쌍의 거리를 계산하는 (매우 비효율적인) 이중 반복문을 사용한 `compute_distances_two_loops`를 구현해 보세요."
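구현에 들어가기 전에 계산의 형태를 미리 보면 도움이 될 수 있습니다. 아래는 전개식 ||a - b||^2 = ||a||^2 - 2a·b + ||b||^2 을 이용해 **Nte x Ntr** 거리 행렬을 완전히 벡터화하여 계산하는 간단한 스케치로, 뒤에서 구현하게 될 `compute_distances_no_loops`에 쓸 수 있는 표준적인 방법 중 하나입니다. numpy만 가정하며, 임의의 배열이 CIFAR-10 행렬을 대신합니다.

~~~python
import numpy as np

X_train = np.random.rand(6, 10)   # (Ntr, D)
X_test = np.random.rand(4, 10)    # (Nte, D)

# ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2 전개식으로 모든 쌍의 거리를 한 번에 계산
test_sq = np.sum(X_test ** 2, axis=1)[:, np.newaxis]    # (Nte, 1)
train_sq = np.sum(X_train ** 2, axis=1)[np.newaxis, :]  # (1, Ntr)
cross = X_test.dot(X_train.T)                           # (Nte, Ntr)
dists = np.sqrt(np.maximum(test_sq - 2 * cross + train_sq, 0))  # 수치 오차로 인한 음수 방지

# broadcasting으로 계산한 기준 답과 일치하는지 확인
ref = np.sqrt(((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2))
print(np.allclose(dists, ref))  # True
~~~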
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# cs231n/classifiers/k_nearest_neighbor.py를 열고\n",
+    "# compute_distances_two_loops를 구현해 보세요.\n",
+    "\n",
+    "# 구현을 테스트해 보세요.\n",
+    "dists = classifier.compute_distances_two_loops(X_test)\n",
+    "print dists.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# 거리 행렬을 시각화할 수 있습니다: 각 행은 하나의 테스트 예제와 모든 훈련 예제 사이의 거리입니다.\n",
+    "plt.imshow(dists, interpolation='none')\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**연습문제 #1** 일부 행, 열이 더 밝게 가시화된 거리 행렬의 구조화된 패턴에 주목하세요. (기본 색상에서 검은색은 낮은 거리를 나타내는 반면, 흰색은 높은 거리를 나타내는 것에 주목하세요.)\n",
+    "\n",
+    "- 뚜렷하게 밝은 행의 데이터가 그렇게 표시된 원인은 무엇일까요?\n",
+    "- 열은 어떤 원인 때문에 그럴까요?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**당신의 답**: *여기에 쓰세요*\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# 이제 predict_labels를 구현해 보고 아래의 코드를 실행해 보세요.\n",
+    "# k = 1 을 사용합니다. (가장 가까운 이웃으로)\n",
+    "y_test_pred = classifier.predict_labels(dists, k=1)\n",
+    "\n",
+    "# 예측 예제의 정확도를 계산하고 출력하세요.\n",
+    "num_correct = np.sum(y_test_pred == y_test)\n",
+    "accuracy = float(num_correct) / num_test\n",
+    "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "우리는 대략 `27%` 정도의 정확도를 예상합니다. 이제 `k = 5` 같은 좀 더 큰 `k`에 대해서도 실행해 보세요."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "y_test_pred = classifier.predict_labels(dists, k=5)\n",
+    "num_correct = np.sum(y_test_pred == y_test)\n",
+    "accuracy = float(num_correct) / num_test\n",
+    "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`k = 1`보다 약간 더 좋은 성능을 기대할 수 있습니다."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# 이제 부분 벡터화와 단일 반복문을 사용하여 거리 행렬의 계산 속도를 높일 수 있습니다.\n",
+    "# compute_distances_one_loop를 구현해 보고 아래의 코드를 실행해 보세요.\n",
+    "dists_one = classifier.compute_distances_one_loop(X_test)\n",
+    "\n",
+    "# 벡터화된 구현이 맞다는 것을 보장하기 위해,\n",
+    "# naive한 구현과 결과가 일치하는지 확인해야 합니다.\n",
+    "# 두 행렬의 유사 여부를 결정하는 방법은 여러 가지가 있습니다.\n",
+    "# 가장 단순한 방법은 Frobenius norm입니다.\n",
+    "# 두 행렬 차이의 Frobenius norm은 모든 원소의 차이를 제곱하여 합한 값의 제곱근입니다.\n",
+    "# 다른 말로 하면, 두 행렬의 차이를 벡터로 변형하고 그 유클리드 거리를 계산한 것입니다.\n",
+    "difference = np.linalg.norm(dists - dists_one, ord='fro')\n",
+    "print 'Difference was: %f' % (difference, )\n",
+    "if difference < 0.001:\n",
+    "  print 'Good! The distance matrices are the same'\n",
+    "else:\n",
+    "  print 'Uh-oh! The distance matrices are different'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# 이제 compute_distances_no_loops 안의 완전히 벡터화된 버전을 구현하고 실행합니다.\n",
+    "dists_two = classifier.compute_distances_no_loops(X_test)\n",
+    "\n",
+    "# 거리 행렬이 우리가 전에 계산한 것과 일치하는지 확인합니다.\n",
+    "difference = np.linalg.norm(dists - dists_two, ord='fro')\n",
+    "print 'Difference was: %f' % (difference, )\n",
+    "if difference < 0.001:\n",
+    "  print 'Good! The distance matrices are the same'\n",
+    "else:\n",
+    "  print 'Uh-oh! 
The distance matrices are different'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "# 구현한 것들이 얼마나 빠른지 비교해 봅시다.\n",
+    "def time_function(f, *args):\n",
+    "  \"\"\"\n",
+    "  Call a function f with args and return the time (in seconds) that it took to execute.\n",
+    "  \"\"\"\n",
+    "  import time\n",
+    "  tic = time.time()\n",
+    "  f(*args)\n",
+    "  toc = time.time()\n",
+    "  return toc - tic\n",
+    "\n",
+    "two_loop_time = time_function(classifier.compute_distances_two_loops, X_test)\n",
+    "print 'Two loop version took %f seconds' % two_loop_time\n",
+    "\n",
+    "one_loop_time = time_function(classifier.compute_distances_one_loop, X_test)\n",
+    "print 'One loop version took %f seconds' % one_loop_time\n",
+    "\n",
+    "no_loop_time = time_function(classifier.compute_distances_no_loops, X_test)\n",
+    "print 'No loop version took %f seconds' % no_loop_time\n",
+    "\n",
+    "# 완전히 벡터화된 구현이 훨씬 더 빠르다는 것을 볼 수 있습니다."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 교차검증\n",
+    "\n",
+    "우리는 k-Nearest Neighbor 분류기를 구현했지만 k = 5라는 값을 임의로 정했습니다. 이제 이 hyperparameter의 최선의 값을 교차검증으로 결정할 것입니다."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "num_folds = 5\n",
+    "k_choices = [1, 3, 5, 8, 10, 12, 15, 20, 50, 100]\n",
+    "\n",
+    "X_train_folds = []\n",
+    "y_train_folds = []\n",
+    "####################################################################################\n",
+    "# TODO:                                                                            #\n",
+    "# 훈련 데이터를 폴드(fold)로 나눕니다.                                              #\n",
+    "# 나눈 후에, X_train_folds와 y_train_folds는 각각 길이 num_folds의 리스트가 되어야   #\n",
+    "# 하며, y_train_folds[i]는 X_train_folds[i]의 점들에 대한 레이블 벡터입니다.         #\n",
+    "# 힌트: numpy의 array_split 함수를 살펴보세요.                                      #\n",
+    "####################################################################################\n",
+    "pass\n",
+    "################################################################################\n",
+    "#                                 코드의 끝                                    #\n",
+    "################################################################################\n",
+    "\n",
+    "# 이 사전(dictionary)은 각기 다른 k 값으로 교차 검증을 실행하여 얻은 정확도를 담습니다.\n",
+    "# k_to_accuracies[k]는 'num_folds' 길이의 리스트로\n",
+    "# 각기 다른 k 값을 사용할 때의 정확도를 담고 있습니다.\n",
+    "k_to_accuracies = {}\n",
+    "\n",
+    "\n",
+    "####################################################################################\n",
+    "# TODO:                                                                            #\n",
+    "# 최고의 k 값을 찾기 위해 k-fold 교차 검증을 수행합니다.                             #\n",
+    "# 가능한 각 k에 대해서, k-nearest-neighbor 알고리즘을 num_folds회 실행합니다.        #\n",
+    "# 각 경우마다 한 폴드를 제외한 모든 폴드를 훈련 데이터로,                            #\n",
+    "# 남은 한 폴드를 검증 데이터로 사용합니다. 
#\n", + "####################################################################################\n", + "pass\n", + "################################################################################\n", + "# 코드의 끝 #\n", + "################################################################################\n", + "\n", + "# 계산된 정확도를 출력합니다.\n", + "for k in sorted(k_to_accuracies):\n", + " for accuracy in k_to_accuracies[k]:\n", + " print 'k = %d, accuracy = %f' % (k, accuracy)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# 원시 관측 플롯\n", + "for k in k_choices:\n", + " accuracies = k_to_accuracies[k]\n", + " plt.scatter([k] * len(accuracies), accuracies)\n", + "\n", + "# 표준편차에 해당하는 오차 막대와 추세선을 그립니다.\n", + "accuracies_mean = np.array([np.mean(v) for k,v in sorted(k_to_accuracies.items())])\n", + "accuracies_std = np.array([np.std(v) for k,v in sorted(k_to_accuracies.items())])\n", + "plt.errorbar(k_choices, accuracies_mean, yerr=accuracies_std)\n", + "plt.title('Cross-validation on k')\n", + "plt.xlabel('k')\n", + "plt.ylabel('Cross-validation accuracy')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# 위의 교차검증 결과에 기반해서 최적의 k를 선택하고 모든 학습 데이터를 \n", + "# 이용하여 분류기를 재학습 시키고 테스트 데이터를 이용해 테스트 해봅니다.\n", + "# 테스트데이터에 대해서 28%이상의 정확도를 얻을 수 있어야 합니다.\n", + "best_k = 1\n", + "\n", + "classifier = KNearestNeighbor()\n", + "classifier.train(X_train, y_train)\n", + "y_test_pred = classifier.predict(X_test, k=best_k)\n", + "\n", + "# 정확도를 계산하고 출력합니다.\n", + "num_correct = np.sum(y_test_pred == y_test)\n", + "accuracy = float(num_correct) / num_test\n", + "print 'Got %d / %d correct => accuracy: %f' % (num_correct, num_test, accuracy)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.1" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/assignments2016/assignment1/requirements.txt b/assignments2016/assignment1/requirements.txt new file mode 100644 index 00000000..13111380 --- /dev/null +++ b/assignments2016/assignment1/requirements.txt @@ -0,0 +1,46 @@ +Jinja2==2.8 +MarkupSafe==0.23 +Pillow==3.0.0 +Pygments==2.0.2 +appnope==0.1.0 +backports-abc==0.4 +backports.ssl-match-hostname==3.5.0.1 +certifi==2015.11.20.1 +cycler==0.9.0 +decorator==4.0.6 +functools32==3.2.3-2 +gnureadline==6.3.3 +ipykernel==4.2.2 +ipython==4.0.1 +ipython-genutils==0.1.0 +ipywidgets==4.1.1 +jsonschema==2.5.1 +jupyter==1.0.0 +jupyter-client==4.1.1 +jupyter-console==4.0.3 +jupyter-core==4.0.6 +matplotlib==1.5.0 +mistune==0.7.1 +nbconvert==4.1.0 +nbformat==4.0.1 +notebook==4.0.6 +numpy==1.10.4 +path.py==8.1.2 +pexpect==4.0.1 +pickleshare==0.5 +ptyprocess==0.5 +pyparsing==2.0.7 +python-dateutil==2.4.2 +pytz==2015.7 +pyzmq==15.1.0 +qtconsole==4.1.1 +scipy==0.16.1 +simplegeneric==0.8.1 +singledispatch==3.4.0.3 +six==1.10.0 +terminado==0.5 +tornado==4.3 +traitlets==4.0.0 +wsgiref==0.1.2 +jupyter==1.0.0 +pillow==3.1.0 diff --git a/assignments2016/assignment1/softmax.ipynb b/assignments2016/assignment1/softmax.ipynb new file mode 100644 index 00000000..1d364dc4 --- /dev/null +++ b/assignments2016/assignment1/softmax.ipynb @@ -0,0 
+1,301 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Softmax 연습\n",
+    "\n",
+    "*이 워크시트를 완성하고 제출하세요. (출력물과 워크시트에 포함되지 않은 코드들을 포함해서) 더 자세한 정보는 코스 웹사이트인 [숙제 페이지](http://vision.stanford.edu/teaching/cs231n/assignments.html)에서 볼 수 있습니다.*\n",
+    "\n",
+    "이번 연습은 SVM 연습과 유사합니다. 아래와 같은 것들을 하게 됩니다.\n",
+    "\n",
+    "- Softmax 분류기를 위한 완전히 벡터화된 **손실 함수**를 구현합니다.\n",
+    "- **분석적 그라디언트(analytic gradient)**의 완전히 벡터화된 표현식을 구현합니다.\n",
+    "- 구현한 것을 **수치 그라디언트(numerical gradient)로 체크**합니다.\n",
+    "- 검증 셋을 이용해 **학습률과 정규화 강도를 튜닝**합니다.\n",
+    "- **SGD**를 사용해 손실 함수를 **최적화**합니다.\n",
+    "- 최종 학습된 가중치를 **시각화**합니다."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "import random\n",
+    "import numpy as np\n",
+    "from cs231n.data_utils import load_CIFAR10\n",
+    "import matplotlib.pyplot as plt\n",
+    "%matplotlib inline\n",
+    "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
+    "plt.rcParams['image.interpolation'] = 'nearest'\n",
+    "plt.rcParams['image.cmap'] = 'gray'\n",
+    "\n",
+    "# 외부 모듈의 auto-reloading을 위해 아래 링크를 확인하세요.\n",
+    "# http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
+    "%load_ext autoreload\n",
+    "%autoreload 2"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000, num_dev=500):\n",
+    "  \"\"\"\n",
+    "  CIFAR-10 데이터셋을 디스크에서 불러와 선형 분류기에 사용할 수 있도록 전처리를\n",
+    "  수행합니다. SVM에서 사용했던 방법과 같지만 하나의 함수로 압축되어 있습니다.\n",
+    "  \"\"\"\n",
+    "  # 원시 CIFAR-10 데이터를 불러옵니다.\n",
+    "  cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n",
+    "  X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n",
+    "  \n",
+    "  # 데이터에서 표본을 추출합니다.\n",
+    "  mask = range(num_training, num_training + num_validation)\n",
+    "  X_val = X_train[mask]\n",
+    "  y_val = y_train[mask]\n",
+    "  mask = range(num_training)\n",
+    "  X_train = X_train[mask]\n",
+    "  y_train = y_train[mask]\n",
+    "  mask = range(num_test)\n",
+    "  X_test = X_test[mask]\n",
+    "  y_test = y_test[mask]\n",
+    "  mask = np.random.choice(num_training, num_dev, replace=False)\n",
+    "  X_dev = X_train[mask]\n",
+    "  y_dev = y_train[mask]\n",
+    "  \n",
+    "  # 전처리: 이미지 데이터를 행으로 변형합니다.\n",
+    "  X_train = np.reshape(X_train, (X_train.shape[0], -1))\n",
+    "  X_val = np.reshape(X_val, (X_val.shape[0], -1))\n",
+    "  X_test = np.reshape(X_test, (X_test.shape[0], -1))\n",
+    "  X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))\n",
+    "  \n",
+    "  # 데이터 정규화: 평균 이미지 빼기\n",
+    "  mean_image = np.mean(X_train, axis = 0)\n",
+    "  X_train -= mean_image\n",
+    "  X_val -= mean_image\n",
+    "  X_test -= mean_image\n",
+    "  X_dev -= mean_image\n",
+    "  \n",
+    "  # bias 차원을 추가하고 열로 변형합니다.\n",
+    "  X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])\n",
+    "  X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])\n",
+    "  X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])\n",
+    "  X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])\n",
+    "  \n",
+    "  return X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev\n",
+    "\n",
+    "\n",
+    "# 위 함수를 우리 데이터로 실행해 봅니다.\n",
+    "X_train, y_train, X_val, y_val, X_test, y_test, X_dev, y_dev = get_CIFAR10_data()\n",
+    "print 'Train data shape: ', X_train.shape\n",
+    "print 'Train labels shape: ', y_train.shape\n",
+    "print 'Validation data shape: ', X_val.shape\n",
+    "print 'Validation labels shape: ', y_val.shape\n",
+    "print 'Test data shape: ', X_test.shape\n",
+    "print 'Test labels shape: ', 
y_test.shape\n", + "print 'dev data shape: ', X_dev.shape\n", + "print 'dev labels shape: ', y_dev.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Softmax 분류기\n", + "\n", + "**cs231n/classifiers/softmax.py**에 이번 섹션에 필요한 코드가 적혀있습니다.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# 먼저 중첩 루프를 사용해 softmax 손실 함수를 구현하세요.\n", + "# cs231n/calssifiers/softmax.py 를 열고 softmax_loss_naive 함수를 구현하세요.\n", + "\n", + "from cs231n.classifiers.softmax import softmax_loss_naive\n", + "import time\n", + "\n", + "# 랜덤 softmax 가중치 배열을 만들고 손실을 계산하는데 사용합니다.\n", + "W = np.random.randn(3073, 10) * 0.0001\n", + "loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)\n", + "\n", + "# As a rough sanity check, our loss should be something close to -log(0.1).\n", + "print 'loss: %f' % loss\n", + "print 'sanity check: %f' % (-np.log(0.1))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 연습문제 1:\n", + "왜 손실이 -log(0.1)로 근사되는지 이유를 간단히 서술하세요.\n", + "\n", + "**당신의 답:** *여기에 쓰세요*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# softmax_loss_naived의 구현을 완성하고 중첩 루프를 이용한 버전을 구현해 보세요.\n", + "loss, grad = softmax_loss_naive(W, X_dev, y_dev, 0.0)\n", + "\n", + "# SVM에서 했던 것 처럼, 수치 요소를 디버깅 툴처럼 체크해보세요.\n", + "# The numeric gradient should be close to the analytic gradient.\n", + "from cs231n.gradient_check import grad_check_sparse\n", + "f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 0.0)[0]\n", + "grad_numerical = grad_check_sparse(f, W, grad, 10)\n", + "\n", + "# SVM에서처럼, 정규화를 이용해 다른 요소를 체크해보세요.\n", + "loss, grad = softmax_loss_naive(W, X_dev, y_dev, 1e2)\n", + "f = lambda w: softmax_loss_naive(w, X_dev, y_dev, 1e2)[0]\n", + "grad_numerical = grad_check_sparse(f, W, grad, 10)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# 이제 간단하게 구현된 softmax 손실함수와 요소와 soft_max_loss_vectorized에 구현된 벡터화된 버전이 있습니다.\n", + "# 이 두가지 버전은 같은 결과를 낼 것이지만 벡터화된 버전이 좀 더 빠를것 입니다.\n", + "tic = time.time()\n", + "loss_naive, grad_naive = softmax_loss_naive(W, X_dev, y_dev, 0.00001)\n", + "toc = time.time()\n", + "print 'naive loss: %e computed in %fs' % (loss_naive, toc - tic)\n", + "\n", + "from cs231n.classifiers.softmax import softmax_loss_vectorized\n", + "tic = time.time()\n", + "loss_vectorized, grad_vectorized = softmax_loss_vectorized(W, X_dev, y_dev, 0.00001)\n", + "toc = time.time()\n", + "print 'vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)\n", + "\n", + "# ASVM에서 했던것 처럼, Frobenius 방법을 사용해 두 버전의 요소를 비교할 것입니다.\n", + "grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')\n", + "print 'Loss difference: %f' % np.abs(loss_naive - loss_vectorized)\n", + "print 'Gradient difference: %f' % grad_difference" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# 검증셋을 이용하여 hyperparameters(정규화 강도와 학습률)를 튜닝하세요.\n", + "# 다른 범위에 대해 학습률과 정규화 강도를 실험해 보세요.\n", + "# r검증셋에 대해 0.35 이상의 분류 정확도를 얻어야 합니다.\n", + "from cs231n.classifiers import Softmax\n", + "results = {}\n", + "best_val = -1\n", + "best_softmax = None\n", + "learning_rates = [1e-7, 5e-7]\n", + "regularization_strengths = [5e4, 1e8]\n", + "\n", + "################################################################################\n", + "# TODO: 
#\n", + "# 검증셋을 이용해 학습률과 정규화 강도를 설정합니다. #\n", + "# 이것은 SVM에서의 검증과 같아야합니다; #\n", + "# 가장 잘 학습된 softmax 분류기를 best_softmax에 저장하세요. #\n", + "################################################################################\n", + "pass\n", + "################################################################################\n", + "# 코드의 끝 #\n", + "################################################################################\n", + " \n", + "# 결과를 출력합니다\n", + "for lr, reg in sorted(results):\n", + " train_accuracy, val_accuracy = results[(lr, reg)]\n", + " print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n", + " lr, reg, train_accuracy, val_accuracy)\n", + " \n", + "print 'best validation accuracy achieved during cross-validation: %f' % best_val" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# 테스트 셋으로 평가해 봅니다.\n", + "# 테스트 셋에서 최고의 softmax를 평가해 봅니다.\n", + "y_test_pred = best_softmax.predict(X_test)\n", + "test_accuracy = np.mean(y_test == y_test_pred)\n", + "print 'softmax on raw pixels final test set accuracy: %f' % (test_accuracy, )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# 각 클래스에 대한 학습 된 가중치를 시각화\n", + "w = best_softmax.W[:-1,:] # strip out the bias\n", + "w = w.reshape(32, 32, 3, 10)\n", + "\n", + "w_min, w_max = np.min(w), np.max(w)\n", + "\n", + "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", + "for i in xrange(10):\n", + " plt.subplot(2, 5, i + 1)\n", + " \n", + " # 가중치를 0과 255사이로 재조정\n", + " wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)\n", + " plt.imshow(wimg.astype('uint8'))\n", + " plt.axis('off')\n", + " plt.title(classes[i])" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.5.1" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/assignments2016/assignment1/start_ipython_osx.sh b/assignments2016/assignment1/start_ipython_osx.sh new file mode 100755 index 00000000..4815b001 --- /dev/null +++ b/assignments2016/assignment1/start_ipython_osx.sh @@ -0,0 +1,4 @@ +# Assume the virtualenv is called .env + +cp frameworkpython .env/bin +.env/bin/frameworkpython -m IPython notebook diff --git a/assignments2016/assignment1/svm.ipynb b/assignments2016/assignment1/svm.ipynb new file mode 100644 index 00000000..ef6331f7 --- /dev/null +++ b/assignments2016/assignment1/svm.ipynb @@ -0,0 +1,568 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Multiclass Support Vector Machine exercise\n", + "\n", + "*Complete and hand in this completed worksheet (including its outputs and any supporting code outside of the worksheet) with your assignment submission. 
For more details see the [assignments page](http://vision.stanford.edu/teaching/cs231n/assignments.html) on the course website.*\n", + "\n", + "In this exercise you will:\n", + " \n", + "- implement a fully-vectorized **loss function** for the SVM\n", + "- implement the fully-vectorized expression for its **analytic gradient**\n", + "- **check your implementation** using numerical gradient\n", + "- use a validation set to **tune the learning rate and regularization** strength\n", + "- **optimize** the loss function with **SGD**\n", + "- **visualize** the final learned weights\n" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Run some setup code for this notebook.\n", + "\n", + "import random\n", + "import numpy as np\n", + "from cs231n.data_utils import load_CIFAR10\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# This is a bit of magic to make matplotlib figures appear inline in the\n", + "# notebook rather than in a new window.\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# Some more magic so that the notebook will reload external python modules;\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## CIFAR-10 Data Loading and Preprocessing" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Load the raw CIFAR-10 data.\n", + "cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n", + "X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n", + "\n", + "# As a sanity check, we print out the size of the training and test data.\n", + "print 'Training data shape: ', X_train.shape\n", + "print 'Training labels shape: ', y_train.shape\n", + "print 'Test data shape: ', X_test.shape\n", + "print 'Test labels shape: ', y_test.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Visualize some examples from the dataset.\n", + "# We show a few examples of training images from each class.\n", + "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n", + "num_classes = len(classes)\n", + "samples_per_class = 7\n", + "for y, cls in enumerate(classes):\n", + " idxs = np.flatnonzero(y_train == y)\n", + " idxs = np.random.choice(idxs, samples_per_class, replace=False)\n", + " for i, idx in enumerate(idxs):\n", + " plt_idx = i * num_classes + y + 1\n", + " plt.subplot(samples_per_class, num_classes, plt_idx)\n", + " plt.imshow(X_train[idx].astype('uint8'))\n", + " plt.axis('off')\n", + " if i == 0:\n", + " plt.title(cls)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Split the data into train, val, and test sets. 
In addition we will\n", + "# create a small development set as a subset of the training data;\n", + "# we can use this for development so our code runs faster.\n", + "num_training = 49000\n", + "num_validation = 1000\n", + "num_test = 1000\n", + "num_dev = 500\n", + "\n", + "# Our validation set will be num_validation points from the original\n", + "# training set.\n", + "mask = range(num_training, num_training + num_validation)\n", + "X_val = X_train[mask]\n", + "y_val = y_train[mask]\n", + "\n", + "# Our training set will be the first num_train points from the original\n", + "# training set.\n", + "mask = range(num_training)\n", + "X_train = X_train[mask]\n", + "y_train = y_train[mask]\n", + "\n", + "# We will also make a development set, which is a small subset of\n", + "# the training set.\n", + "mask = np.random.choice(num_training, num_dev, replace=False)\n", + "X_dev = X_train[mask]\n", + "y_dev = y_train[mask]\n", + "\n", + "# We use the first num_test points of the original test set as our\n", + "# test set.\n", + "mask = range(num_test)\n", + "X_test = X_test[mask]\n", + "y_test = y_test[mask]\n", + "\n", + "print 'Train data shape: ', X_train.shape\n", + "print 'Train labels shape: ', y_train.shape\n", + "print 'Validation data shape: ', X_val.shape\n", + "print 'Validation labels shape: ', y_val.shape\n", + "print 'Test data shape: ', X_test.shape\n", + "print 'Test labels shape: ', y_test.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Preprocessing: reshape the image data into rows\n", + "X_train = np.reshape(X_train, (X_train.shape[0], -1))\n", + "X_val = np.reshape(X_val, (X_val.shape[0], -1))\n", + "X_test = np.reshape(X_test, (X_test.shape[0], -1))\n", + "X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))\n", + "\n", + "# As a sanity check, print out the shapes of the data\n", + "print 'Training data shape: ', X_train.shape\n", + "print 'Validation data shape: ', X_val.shape\n", + "print 'Test data shape: ', X_test.shape\n", + "print 'dev data shape: ', X_dev.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Preprocessing: subtract the mean image\n", + "# first: compute the image mean based on the training data\n", + "mean_image = np.mean(X_train, axis=0)\n", + "print mean_image[:10] # print a few of the elements\n", + "plt.figure(figsize=(4,4))\n", + "plt.imshow(mean_image.reshape((32,32,3)).astype('uint8')) # visualize the mean image\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# second: subtract the mean image from train and test data\n", + "X_train -= mean_image\n", + "X_val -= mean_image\n", + "X_test -= mean_image\n", + "X_dev -= mean_image" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# third: append the bias dimension of ones (i.e. 
bias trick) so that our SVM\n",
+    "# only has to worry about optimizing a single weight matrix W.\n",
+    "X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])\n",
+    "X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])\n",
+    "X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])\n",
+    "X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])\n",
+    "\n",
+    "print X_train.shape, X_val.shape, X_test.shape, X_dev.shape"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "## SVM Classifier\n",
+    "\n",
+    "Your code for this section will all be written inside **cs231n/classifiers/linear_svm.py**. \n",
+    "\n",
+    "As you can see, we have prefilled the function `svm_loss_naive` which uses for loops to evaluate the multiclass SVM loss function. "
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "# Evaluate the naive implementation of the loss we provided for you:\n",
+    "from cs231n.classifiers.linear_svm import svm_loss_naive\n",
+    "import time\n",
+    "\n",
+    "# generate a random SVM weight matrix of small numbers\n",
+    "W = np.random.randn(3073, 10) * 0.0001 \n",
+    "\n",
+    "loss, grad = svm_loss_naive(W, X_dev, y_dev, 0.00001)\n",
+    "print 'loss: %f' % (loss, )"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "The `grad` returned from the function above is right now all zero. Derive the gradient for the SVM cost function and implement it inline inside the function `svm_loss_naive`. You will find it helpful to interleave your new code inside the existing function.\n",
+    "\n",
+    "To check that you have implemented the gradient correctly, you can numerically estimate the gradient of the loss function and compare the numeric estimate to the gradient that you computed. We have provided code that does this for you:"
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "# Once you've implemented the gradient, recompute it with the code below\n",
+    "# and gradient check it with the function we provided for you\n",
+    "\n",
+    "# Compute the loss and its gradient at W.\n",
+    "loss, grad = svm_loss_naive(W, X_dev, y_dev, 0.0)\n",
+    "\n",
+    "# Numerically compute the gradient along several randomly chosen dimensions, and\n",
+    "# compare them with your analytically computed gradient. The numbers should match\n",
+    "# almost exactly along all dimensions.\n",
+    "from cs231n.gradient_check import grad_check_sparse\n",
+    "f = lambda w: svm_loss_naive(w, X_dev, y_dev, 0.0)[0]\n",
+    "grad_numerical = grad_check_sparse(f, W, grad)\n",
+    "\n",
+    "# do the gradient check once again with regularization turned on\n",
+    "# you didn't forget the regularization gradient did you?\n",
+    "loss, grad = svm_loss_naive(W, X_dev, y_dev, 1e2)\n",
+    "f = lambda w: svm_loss_naive(w, X_dev, y_dev, 1e2)[0]\n",
+    "grad_numerical = grad_check_sparse(f, W, grad)"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "### Inline Question 1:\n",
+    "It is possible that once in a while a dimension in the gradcheck will not match exactly. What could such a discrepancy be caused by? Is it a reason for concern? What is a simple example in one dimension where a gradient check could fail? 
*Hint: the SVM loss function is not strictly speaking differentiable*\n", + "\n", + "**Your Answer:** *fill this in.*" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Next implement the function svm_loss_vectorized; for now only compute the loss;\n", + "# we will implement the gradient in a moment.\n", + "tic = time.time()\n", + "loss_naive, grad_naive = svm_loss_naive(W, X_dev, y_dev, 0.00001)\n", + "toc = time.time()\n", + "print 'Naive loss: %e computed in %fs' % (loss_naive, toc - tic)\n", + "\n", + "from cs231n.classifiers.linear_svm import svm_loss_vectorized\n", + "tic = time.time()\n", + "loss_vectorized, _ = svm_loss_vectorized(W, X_dev, y_dev, 0.00001)\n", + "toc = time.time()\n", + "print 'Vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic)\n", + "\n", + "# The losses should match but your vectorized implementation should be much faster.\n", + "print 'difference: %f' % (loss_naive - loss_vectorized)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Complete the implementation of svm_loss_vectorized, and compute the gradient\n", + "# of the loss function in a vectorized way.\n", + "\n", + "# The naive implementation and the vectorized implementation should match, but\n", + "# the vectorized version should still be much faster.\n", + "tic = time.time()\n", + "_, grad_naive = svm_loss_naive(W, X_dev, y_dev, 0.00001)\n", + "toc = time.time()\n", + "print 'Naive loss and gradient: computed in %fs' % (toc - tic)\n", + "\n", + "tic = time.time()\n", + "_, grad_vectorized = svm_loss_vectorized(W, X_dev, y_dev, 0.00001)\n", + "toc = time.time()\n", + "print 'Vectorized loss and gradient: computed in %fs' % (toc - tic)\n", + "\n", + "# The loss is a single number, so it is easy to compare the values computed\n", + "# by the two implementations. The gradient on the other hand is a matrix, so\n", + "# we use the Frobenius norm to compare them.\n", + "difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')\n", + "print 'difference: %f' % difference" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "### Stochastic Gradient Descent\n", + "\n", + "We now have vectorized and efficient expressions for the loss, the gradient and our gradient matches the numerical gradient. We are therefore ready to do SGD to minimize the loss." 
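As a side note, the core of the minibatch SGD loop that `LinearClassifier.train` asks for is small: sample a batch, evaluate the loss and gradient on it, and step against the gradient. Below is a minimal sketch on a hypothetical quadratic toy loss standing in for the SVM loss; none of the names here come from the starter code.

~~~python
import numpy as np

def toy_loss(W, X_batch, y_batch):
  # quadratic stand-in for the SVM loss: 0.5 * mean squared error
  diff = X_batch.dot(W) - y_batch
  loss = 0.5 * np.mean(diff ** 2)
  grad = X_batch.T.dot(diff) / X_batch.shape[0]
  return loss, grad

X = np.random.randn(200, 5)
y = X.dot(np.arange(5.0))  # targets generated from a known weight vector
W = np.zeros(5)
num_iters, batch_size, learning_rate = 300, 32, 1e-1

for it in range(num_iters):
  idx = np.random.choice(X.shape[0], batch_size)  # sample a minibatch (with replacement)
  loss, grad = toy_loss(W, X[idx], y[idx])
  W -= learning_rate * grad                       # SGD parameter update

print(loss)  # should be close to zero once W approaches [0, 1, 2, 3, 4]
~~~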
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# In the file linear_classifier.py, implement SGD in the function\n", + "# LinearClassifier.train() and then run it with the code below.\n", + "from cs231n.classifiers import LinearSVM\n", + "svm = LinearSVM()\n", + "tic = time.time()\n", + "loss_hist = svm.train(X_train, y_train, learning_rate=1e-7, reg=5e4,\n", + " num_iters=1500, verbose=True)\n", + "toc = time.time()\n", + "print 'That took %fs' % (toc - tic)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# A useful debugging strategy is to plot the loss as a function of\n", + "# iteration number:\n", + "plt.plot(loss_hist)\n", + "plt.xlabel('Iteration number')\n", + "plt.ylabel('Loss value')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Write the LinearSVM.predict function and evaluate the performance on both the\n", + "# training and validation set\n", + "y_train_pred = svm.predict(X_train)\n", + "print 'training accuracy: %f' % (np.mean(y_train == y_train_pred), )\n", + "y_val_pred = svm.predict(X_val)\n", + "print 'validation accuracy: %f' % (np.mean(y_val == y_val_pred), )" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Use the validation set to tune hyperparameters (regularization strength and\n", + "# learning rate). You should experiment with different ranges for the learning\n", + "# rates and regularization strengths; if you are careful you should be able to\n", + "# get a classification accuracy of about 0.4 on the validation set.\n", + "learning_rates = [1e-7, 5e-5]\n", + "regularization_strengths = [5e4, 1e5]\n", + "\n", + "# results is dictionary mapping tuples of the form\n", + "# (learning_rate, regularization_strength) to tuples of the form\n", + "# (training_accuracy, validation_accuracy). The accuracy is simply the fraction\n", + "# of data points that are correctly classified.\n", + "results = {}\n", + "best_val = -1 # The highest validation accuracy that we have seen so far.\n", + "best_svm = None # The LinearSVM object that achieved the highest validation rate.\n", + "\n", + "################################################################################\n", + "# TODO: #\n", + "# Write code that chooses the best hyperparameters by tuning on the validation #\n", + "# set. For each combination of hyperparameters, train a linear SVM on the #\n", + "# training set, compute its accuracy on the training and validation sets, and #\n", + "# store these numbers in the results dictionary. In addition, store the best #\n", + "# validation accuracy in best_val and the LinearSVM object that achieves this #\n", + "# accuracy in best_svm. #\n", + "# #\n", + "# Hint: You should use a small value for num_iters as you develop your #\n", + "# validation code so that the SVMs don't take much time to train; once you are #\n", + "# confident that your validation code works, you should rerun the validation #\n", + "# code with a larger value for num_iters. 
#\n",
+ "################################################################################\n",
+ "pass\n",
+ "################################################################################\n",
+ "# END OF YOUR CODE #\n",
+ "################################################################################\n",
+ " \n",
+ "# Print out results.\n",
+ "for lr, reg in sorted(results):\n",
+ " train_accuracy, val_accuracy = results[(lr, reg)]\n",
+ " print 'lr %e reg %e train accuracy: %f val accuracy: %f' % (\n",
+ " lr, reg, train_accuracy, val_accuracy)\n",
+ " \n",
+ "print 'best validation accuracy achieved during cross-validation: %f' % best_val"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Visualize the cross-validation results\n",
+ "import math\n",
+ "x_scatter = [math.log10(x[0]) for x in results]\n",
+ "y_scatter = [math.log10(x[1]) for x in results]\n",
+ "\n",
+ "# plot training accuracy\n",
+ "marker_size = 100\n",
+ "colors = [results[x][0] for x in results]\n",
+ "plt.subplot(2, 1, 1)\n",
+ "plt.scatter(x_scatter, y_scatter, marker_size, c=colors)\n",
+ "plt.colorbar()\n",
+ "plt.xlabel('log learning rate')\n",
+ "plt.ylabel('log regularization strength')\n",
+ "plt.title('CIFAR-10 training accuracy')\n",
+ "\n",
+ "# plot validation accuracy\n",
+ "colors = [results[x][1] for x in results] # default size of markers is 20\n",
+ "plt.subplot(2, 1, 2)\n",
+ "plt.scatter(x_scatter, y_scatter, marker_size, c=colors)\n",
+ "plt.colorbar()\n",
+ "plt.xlabel('log learning rate')\n",
+ "plt.ylabel('log regularization strength')\n",
+ "plt.title('CIFAR-10 validation accuracy')\n",
+ "plt.show()"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Evaluate the best svm on test set\n",
+ "y_test_pred = best_svm.predict(X_test)\n",
+ "test_accuracy = np.mean(y_test == y_test_pred)\n",
+ "print 'linear SVM on raw pixels final test set accuracy: %f' % test_accuracy"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Visualize the learned weights for each class.\n",
+ "# Depending on your choice of learning rate and regularization strength, these may\n",
+ "# or may not be nice to look at.\n",
+ "w = best_svm.W[:-1,:] # strip out the bias\n",
+ "w = w.reshape(32, 32, 3, 10)\n",
+ "w_min, w_max = np.min(w), np.max(w)\n",
+ "classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']\n",
+ "for i in xrange(10):\n",
+ " plt.subplot(2, 5, i + 1)\n",
+ " \n",
+ " # Rescale the weights to be between 0 and 255\n",
+ " wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)\n",
+ " plt.imshow(wimg.astype('uint8'))\n",
+ " plt.axis('off')\n",
+ " plt.title(classes[i])"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "### Inline question 2:\n",
+ "Describe what your visualized SVM weights look like, and offer a brief explanation for why they look the way they do.\n",
+ "\n",
+ "**Your answer:** *fill this in*"
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 2",
+ "name": "python2",
+ "language": "python"
+ },
+ "language_info": {
+ "mimetype": "text/x-python",
+ "nbconvert_exporter": "python",
+ "name": "python",
+ "file_extension": ".py",
+ "version": "2.7.9",
+ "pygments_lexer": "ipython2",
+ "codemirror_mode": {
+ "version": 2,
+ "name": "ipython"
+ }
+ }
+ }
+}
\ No newline at end of file
diff --git a/assignments2016/assignment1/two_layer_net.ipynb b/assignments2016/assignment1/two_layer_net.ipynb
new file mode 100644
index 00000000..3e7eb844
--- /dev/null
+++ b/assignments2016/assignment1/two_layer_net.ipynb
@@ -0,0 +1,454 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 뉴럴 네트워크의 구현\n",
+ "이번에 우리는 완전 연결 레이어(fully-connected layer)로 뉴럴 네트워크를 만들어 분류를 수행하고, CIFAR-10 데이터셋으로 테스트해 볼 것입니다."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# 설정\n",
+ "\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "\n",
+ "from cs231n.classifiers.neural_net import TwoLayerNet\n",
+ "\n",
+ "%matplotlib inline\n",
+ "plt.rcParams['figure.figsize'] = (10.0, 8.0) # 기본 그래프 사이즈 설정\n",
+ "plt.rcParams['image.interpolation'] = 'nearest'\n",
+ "plt.rcParams['image.cmap'] = 'gray'\n",
+ "\n",
+ "# 외부 모듈 자동 불러오기\n",
+ "# 참고. http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
+ "%load_ext autoreload\n",
+ "%autoreload 2\n",
+ "\n",
+ "def rel_error(x, y):\n",
+ " \"\"\" returns relative error \"\"\"\n",
+ " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "우리는 네트워크의 인스턴스를 나타내기 위해 `cs231n/classifiers/neural_net.py` 파일의 `TwoLayerNet` 클래스를 사용할 것입니다. 네트워크 파라메터는 인스턴스 변수 `self.params`에 저장되는데, 키는 파라메터 이름 문자열이고 값은 numpy 배열입니다. 아래에서 구현 확인에 사용할 toy 데이터와 toy 모델을 초기화합니다."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# 작은 net을 만들고 toy 데이터로 구현을 체크해 봅니다.\n",
+ "# 실험을 재현할 수 있도록 랜덤 시드를 설정한다는 점에 주의하세요.\n",
+ "\n",
+ "input_size = 4\n",
+ "hidden_size = 10\n",
+ "num_classes = 3\n",
+ "num_inputs = 5\n",
+ "\n",
+ "def init_toy_model():\n",
+ " np.random.seed(0)\n",
+ " return TwoLayerNet(input_size, hidden_size, num_classes, std=1e-1)\n",
+ "\n",
+ "def init_toy_data():\n",
+ " np.random.seed(1)\n",
+ " X = 10 * np.random.randn(num_inputs, input_size)\n",
+ " y = np.array([0, 1, 2, 2, 1])\n",
+ " return X, y\n",
+ "\n",
+ "net = init_toy_model()\n",
+ "X, y = init_toy_data()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Forward pass: 점수 계산하기\n",
+ "`cs231n/classifiers/neural_net.py` 파일을 열고 `TwoLayerNet.loss` 메서드를 확인해 보세요. 이 함수는 SVM과 Softmax에서 작성했던 손실 함수와 매우 유사합니다: 데이터와 가중치를 받아 클래스 점수, 손실(loss), 파라메터에 대한 그라디언트를 계산합니다.\n",
+ "\n",
+ "가중치와 bias를 사용하여 모든 입력에 대한 점수를 계산하는, forward pass의 첫 번째 부분을 구현하세요."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "scores = net.loss(X)\n",
+ "print 'Your scores:'\n",
+ "print scores\n",
+ "print\n",
+ "print 'correct scores:'\n",
+ "correct_scores = np.asarray([\n",
+ " [-0.81233741, -1.27654624, -0.70335995],\n",
+ " [-0.17129677, -1.18803311, -0.47310444],\n",
+ " [-0.51590475, -1.01354314, -0.8504215 ],\n",
+ " [-0.15419291, -0.48629638, -0.52901952],\n",
+ " [-0.00618733, -0.12435261, -0.15226949]])\n",
+ "print correct_scores\n",
+ "print\n",
+ "\n",
+ "# 차이가 매우 작을 것입니다. 우리는 <1e-7 정도 나왔습니다.\n",
+ "print 'Difference between your scores and correct scores:'\n",
+ "print np.sum(np.abs(scores - correct_scores))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Forward pass: 손실 계산하기\n",
+ "같은 함수에서, 데이터 손실과 regularization 손실을 계산하는 두 번째 부분을 구현해 봅시다."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "loss, _ = net.loss(X, y, reg=0.1)\n",
+ "correct_loss = 1.30378789133\n",
+ "\n",
+ "# 차이가 매우 작을 것입니다. 우리는 1e-12보다 작은 값을 얻었습니다.\n",
+ "print 'Difference between your loss and correct loss:'\n",
+ "print np.sum(np.abs(loss - correct_loss))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Backward pass\n",
+ "함수의 남은 부분을 구현하세요. 변수 W1, b1, W2, b2에 대한 손실 함수의 그라디언트를 계산해야 합니다. 정확한 forward pass를 구현했다면, 이제 수치 그라디언트 체크(numeric gradient check)로 backward pass를 디버그할 수 있을 것입니다."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "from cs231n.gradient_check import eval_numerical_gradient\n",
+ "\n",
+ "# 수치 그라디언트 체크로 backward pass 구현을 검증합니다.\n",
+ "# 구현이 맞았다면, 각 W1, W2, b1, b2에 대해 수치(numeric) 그라디언트와 해석적(analytic) 그라디언트가 1e-8 이하로 차이날 것입니다.\n",
+ "\n",
+ "loss, grads = net.loss(X, y, reg=0.1)\n",
+ "\n",
+ "# 모두 해봐야 1e-8 이하 정도일 것입니다.\n",
+ "for param_name in grads:\n",
+ " f = lambda W: net.loss(X, y, reg=0.1)[0]\n",
+ " param_grad_num = eval_numerical_gradient(f, net.params[param_name], verbose=False)\n",
+ " print '%s max relative error: %e' % (param_name, rel_error(param_grad_num, grads[param_name]))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 네트워크 학습\n",
+ "네트워크를 학습시키기 위해 SVM, Softmax 분류기와 비슷하게 stochastic gradient descent(SGD)를 사용할 것입니다. `TwoLayerNet.train` 함수를 보고 비어 있는 부분을 채워 넣어 학습 프로시저를 구현해 보세요. SVM과 Softmax 분류기에서 사용한 학습 과정과 매우 비슷할 것입니다. 또한 네트워크가 학습되는 동안 주기적으로 예측을 수행해서 정확도를 추적할 수 있도록 `TwoLayerNet.predict`도 구현해야 합니다.\n",
+ "\n",
+ "메서드를 구현한 후에, 아래 코드를 실행시켜 toy 데이터로 two-layer 네트워크를 학습시켜 보세요. 최종 학습 손실은 0.2 미만이어야 합니다."
+ ]
+ },
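+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "*(옮긴이 주: 아래는 학습 루프의 한 반복(iteration)이 어떤 모양일지 감을 잡기 위한 최소한의 스케치입니다. 실제 구현은 `TwoLayerNet.train`의 골격을 따라야 하며, 여기 나오는 변수 이름들은 설명을 위해 임의로 정한 것입니다.)*\n",
+ "\n",
+ "```python\n",
+ "# 1) 미니배치 샘플링 -> 2) 손실/그라디언트 계산 -> 3) 파라메터 업데이트\n",
+ "batch_idx = np.random.choice(num_train, batch_size)\n",
+ "X_batch, y_batch = X[batch_idx], y[batch_idx]\n",
+ "loss, grads = self.loss(X_batch, y=y_batch, reg=reg)\n",
+ "for param_name in self.params:\n",
+ "    # 그라디언트의 반대 방향으로 한 걸음 이동합니다.\n",
+ "    self.params[param_name] -= learning_rate * grads[param_name]\n",
+ "```"
+ ]
+ },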
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "net = init_toy_model()\n",
+ "stats = net.train(X, y, X, y,\n",
+ " learning_rate=1e-1, reg=1e-5,\n",
+ " num_iters=100, verbose=False)\n",
+ "\n",
+ "print 'Final training loss: ', stats['loss_history'][-1]\n",
+ "\n",
+ "# 손실 기록 그래프\n",
+ "plt.plot(stats['loss_history'])\n",
+ "plt.xlabel('iteration')\n",
+ "plt.ylabel('training loss')\n",
+ "plt.title('Training Loss history')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 데이터 불러오기\n",
+ "이제 그라디언트 체크를 통과하고 toy 데이터에서 잘 작동하는 two-layer 네트워크를 구현했으니,\n",
+ "실제 데이터셋으로 분류기를 학습시키기 위해 우리가 좋아하는 CIFAR-10 데이터를 불러올 시간입니다."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "from cs231n.data_utils import load_CIFAR10\n",
+ "\n",
+ "def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000):\n",
+ " \"\"\"\n",
+ " CIFAR-10 데이터셋을 디스크에서 불러와서 two-layer 신경망 분류기를 위한 사전 작업을\n",
+ " 수행합니다. 우리가 SVM에서 했던 작업과 비슷하지만 하나의 함수로 축약되어 있습니다.\n",
+ " \"\"\"\n",
+ " # 원본 CIFAR-10 데이터를 불러옵니다.\n",
+ " cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'\n",
+ " X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)\n",
+ " \n",
+ " # 데이터 서브샘플링\n",
+ " mask = range(num_training, num_training + num_validation)\n",
+ " X_val = X_train[mask]\n",
+ " y_val = y_train[mask]\n",
+ " mask = range(num_training)\n",
+ " X_train = X_train[mask]\n",
+ " y_train = y_train[mask]\n",
+ " mask = range(num_test)\n",
+ " X_test = X_test[mask]\n",
+ " y_test = y_test[mask]\n",
+ "\n",
+ " # 데이터 정규화: 평균 이미지를 뺍니다.\n",
+ " mean_image = np.mean(X_train, axis=0)\n",
+ " X_train -= mean_image\n",
+ " X_val -= mean_image\n",
+ " X_test -= mean_image\n",
+ "\n",
+ " # 데이터를 행(row) 벡터로 변형시킵니다.\n",
+ " X_train = X_train.reshape(num_training, -1)\n",
+ " X_val = X_val.reshape(num_validation, -1)\n",
+ " X_test = X_test.reshape(num_test, -1)\n",
+ "\n",
+ " return X_train, y_train, X_val, y_val, X_test, y_test\n",
+ "\n",
+ "\n",
+ "# 데이터를 얻기 위해 위의 함수를 호출합니다.\n",
+ "X_train, y_train, X_val, y_val, X_test, y_test = get_CIFAR10_data()\n",
+ "print 'Train data shape: ', X_train.shape\n",
+ "print 'Train labels shape: ', y_train.shape\n",
+ "print 'Validation data shape: ', X_val.shape\n",
+ "print 'Validation labels shape: ', y_val.shape\n",
+ "print 'Test data shape: ', X_test.shape\n",
+ "print 'Test labels shape: ', y_test.shape"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 망 학습시키기\n",
+ "네트워크를 학습시키기 위해 모멘텀을 적용한 SGD를 사용합니다. 추가적으로, 최적화가 진행됨에 따라 학습률(learning rate)이 지수적으로 감소하도록 조절합니다;\n",
+ "각 epoch이 끝날 때마다 학습률에 decay rate를 곱해서 줄입니다."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "input_size = 32 * 32 * 3\n",
+ "hidden_size = 50\n",
+ "num_classes = 10\n",
+ "net = TwoLayerNet(input_size, hidden_size, num_classes)\n",
+ "\n",
+ "# 망 학습시키기\n",
+ "stats = net.train(X_train, y_train, X_val, y_val,\n",
+ " num_iters=1000, batch_size=200,\n",
+ " learning_rate=1e-4, learning_rate_decay=0.95,\n",
+ " reg=0.5, verbose=True)\n",
+ "\n",
+ "# 검증 셋에 대한 정확도 확인하기\n",
+ "val_acc = (net.predict(X_val) == y_val).mean()\n",
+ "print 'Validation accuracy: ', val_acc\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 디버그\n",
+ "위에서 제공한 기본 파라메터로는 0.29 정도의 검증 정확도를 얻을 수 있을 것입니다. 별로 좋지 않죠.\n",
+ "\n",
+ "무엇이 문제인지 통찰을 얻기 위한 한 가지 전략은, 최적화가 진행되는 동안의 손실 함수 값과 학습/검증 셋에 대한 정확도를 그래프로 그려 보는 것입니다.\n",
+ "\n",
+ "다른 전략은 네트워크의 첫 레이어가 학습한 가중치를 시각화해 보는 것입니다. 시각 데이터를 학습한 뉴럴 네트워크의 첫 레이어 가중치는 대부분 시각화했을 때 눈에 띄는 구조를 갖습니다."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# 손실 함수와 학습 / 검증 정확도 그래프\n",
+ "plt.subplot(2, 1, 1)\n",
+ "plt.plot(stats['loss_history'])\n",
+ "plt.title('Loss history')\n",
+ "plt.xlabel('Iteration')\n",
+ "plt.ylabel('Loss')\n",
+ "\n",
+ "plt.subplot(2, 1, 2)\n",
+ "plt.plot(stats['train_acc_history'], label='train')\n",
+ "plt.plot(stats['val_acc_history'], label='val')\n",
+ "plt.title('Classification accuracy history')\n",
+ "plt.xlabel('Epoch')\n",
+ "plt.ylabel('Classification accuracy')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "from cs231n.vis_utils import visualize_grid\n",
+ "\n",
+ "# 네트워크 가중치 시각화\n",
+ "\n",
+ "def show_net_weights(net):\n",
+ " W1 = net.params['W1']\n",
+ " W1 = W1.reshape(32, 32, 3, -1).transpose(3, 0, 1, 2)\n",
+ " plt.imshow(visualize_grid(W1, padding=3).astype('uint8'))\n",
+ " plt.gca().axis('off')\n",
+ " plt.show()\n",
+ "\n",
+ "show_net_weights(net)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# hyperparameters 튜닝하기\n",
+ "\n",
+ "**무엇이 문제인가?** 위의 시각화를 보면 손실이 거의 선형적으로 감소하고 있는데, 이는 학습률이 너무 낮을 수 있음을 시사합니다. 게다가 학습 정확도와 검증 정확도 사이에 차이가 없다는 것은 모델의 용량(capacity)이 작다는 뜻이므로, 용량을 키울 필요가 있습니다. 반면에 매우 큰 모델을 사용하면 overfitting이 일어날 수 있는데, 이 경우에는 학습 정확도와 검증 정확도 사이에 매우 큰 차이가 나게 됩니다.\n",
+ "\n",
+ "**튜닝**. hyperparameters를 튜닝하고 이들이 최종 성능에 어떤 영향을 끼치는지에 대한 직관을 얻으려면 많은 연습이 필요합니다. 아래에서는 hidden layer 크기, 학습률, 학습 epoch 수, 정규화 강도 등 다양한 hyperparameter 값들로 실험해 보세요. 학습률 decay를 튜닝하는 것도 고려할 수 있지만, 아마 기본 값이 가장 좋은 성능을 낼 것입니다.\n",
+ "\n",
+ "**결과 예측하기**. 검증 셋에 대한 분류 정확도 48% 이상을 목표로 삼도록 합시다. 우리의 가장 좋은 네트워크는 검증 셋에 대해 52% 이상의 정확도를 얻었습니다.\n",
+ "\n",
+ "**실험**: 이 연습의 목표는 fully-connected 뉴럴 네트워크로 CIFAR-10에서 가능한 한 좋은 결과를 얻는 것입니다. 테스트 세트에서 52%를 넘는 정확도 1%마다 추가 보너스 점수를 얻을 수 있습니다. 자신만의 기법을 사용했다면 그 내용을 적어 주세요. (예: PCA로 차원 줄이기, dropout 추가하기, solver에 특징 추가하기 등)"
+ ]
+ },
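+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "*(옮긴이 주: 아래는 자동 탐색 코드가 어떤 형태일지 감을 잡기 위한 스케치입니다. 탐색 범위와 반복 횟수는 임의로 정한 가정일 뿐이니 자유롭게 바꾸세요.)*\n",
+ "\n",
+ "```python\n",
+ "# 로그 스케일에서 hyperparameter를 무작위로 뽑아 검증 정확도가 가장 높은 모델을 기억해 둡니다.\n",
+ "best_val_acc = -1\n",
+ "for _ in range(20):\n",
+ "    lr = 10 ** np.random.uniform(-5, -3)       # 학습률\n",
+ "    reg = 10 ** np.random.uniform(-2, 1)       # 정규화 강도\n",
+ "    hidden = np.random.choice([50, 100, 150])  # hidden layer 크기\n",
+ "    net = TwoLayerNet(input_size, hidden, num_classes)\n",
+ "    net.train(X_train, y_train, X_val, y_val, num_iters=1000,\n",
+ "              learning_rate=lr, learning_rate_decay=0.95, reg=reg)\n",
+ "    val_acc = (net.predict(X_val) == y_val).mean()\n",
+ "    if val_acc > best_val_acc:\n",
+ "        best_val_acc, best_net = val_acc, net\n",
+ "```"
+ ]
+ },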
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "best_net = None # 여기에 가장 좋은 모델을 저장하세요.\n",
+ "\n",
+ "#################################################################################\n",
+ "# TODO: 검증 셋을 이용하여 hyperparameters를 튜닝하세요. 가장 잘 학습된 모델은 best_net에 저장 #\n",
+ "# 하세요. #\n",
+ "# #\n",
+ "# 위에서 사용한 것과 비슷한 시각화를 사용하면 디버그하는 데 도움이 될 것입니다; #\n",
+ "# 잘 튜닝된 네트워크의 시각화는 위에서 본 잘 학습되지 않은 네트워크와 뚜렷한 질적 차이를 #\n",
+ "# 보일 것입니다. #\n",
+ "# #\n",
+ "# 손으로 hyperparameters를 미세조정하는 것도 재밌을 수 있지만, 이전 연습에서 했던 것처럼 #\n",
+ "# 가능한 hyperparameters를 자동으로 찾는 코드를 작성하는 것이 더 유용하다는 것을 알게 될 #\n",
+ "# 것입니다. #\n",
+ "#################################################################################\n",
+ "pass\n",
+ "#################################################################################\n",
+ "# 코드의 끝 #\n",
+ "#################################################################################"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "# 가장 좋은 네트워크의 가중치를 시각화합니다.\n",
+ "show_net_weights(best_net)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# 테스트 세트로 실행하기\n",
+ "실험이 끝났다면, 최종으로 학습된 네트워크를 테스트 세트로 실행해 봅니다; 48% 이상의 정확도를 얻어야 합니다.\n",
+ "\n",
+ "**52%를 넘는 정확도 1%마다 추가 점수를 얻을 수 있습니다.**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "collapsed": false
+ },
+ "outputs": [],
+ "source": [
+ "test_acc = (best_net.predict(X_test) == y_test).mean()\n",
+ "print 'Test accuracy: ', test_acc"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 2",
+ "language": "python",
+ "name": "python2"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 2
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython2",
+ "version": "2.7.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/assignments2016/assignment2.md b/assignments2016/assignment2.md
index aedc082f..77349fad 100644
--- a/assignments2016/assignment2.md
+++ b/assignments2016/assignment2.md
@@ -4,131 +4,74 @@ mathjax: true
 permalink: assignments2016/assignment2/
 ---
-In this assignment you will practice writing backpropagation code, and training
-Neural Networks and Convolutional Neural Networks. The goals of this assignment
-are as follows:
-
-- understand **Neural Networks** and how they are arranged in layered
-  architectures
-- understand and be able to implement (vectorized) **backpropagation**
-- implement various **update rules** used to optimize Neural Networks
-- implement **batch normalization** for training deep networks
-- implement **dropout** to regularize networks
-- effectively **cross-validate** and find the best hyperparameters for Neural
-  Network architecture
-- understand the architecture of **Convolutional Neural Networks** and train
-  gain experience with training these models on data
-
-## Setup
-You can work on the assignment in one of two ways: locally on your own machine,
-or on a virtual machine through Terminal.com.
-
-### Working in the cloud on Terminal
-
-Terminal has created a separate subdomain to serve our class,
-[www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com). Register
-your account there. The Assignment 2 snapshot can then be found [HERE](https://www.stanfordterminalcloud.com/snapshot/6c95ca2c9866a962964ede3ea5813d4c2410ba48d92cf8d11a93fbb13e08b76a). If you are
-registered in the class you can contact the TA (see Piazza for more information)
-to request Terminal credits for use on the assignment. Once you boot up the
-snapshot everything will be installed for you, and you will be ready to start on
-your assignment right away. We have written a small tutorial on Terminal
-[here](/terminal-tutorial).
-
-### Working locally
-Get the code as a zip file
-[here](http://vision.stanford.edu/teaching/cs231n/winter1516_assignment2.zip). 
-As for the dependencies:
-
-**[Option 1] Use Anaconda:**
-The preferred approach for installing all the assignment dependencies is to use
-[Anaconda](https://www.continuum.io/downloads), which is a Python distribution
-that includes many of the most popular Python packages for science, math,
-engineering and data analysis. Once you install it you can skip all mentions of
-requirements and you are ready to go directly to working on the assignment.
-
-**[Option 2] Manual install, virtual environment:**
-If you do not want to use Anaconda and want to go with a more manual and risky
-installation route you will likely want to create a
-[virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/)
-for the project. If you choose not to use a virtual environment, it is up to you
-to make sure that all dependencies for the code are installed globally on your
-machine. To set up a virtual environment, run the following:
-
-```bash
-cd assignment2
-sudo pip install virtualenv # This may already be installed
-virtualenv .env # Create a virtual environment
-source .env/bin/activate # Activate the virtual environment
-pip install -r requirements.txt # Install dependencies
+이번 숙제에서 여러분은 backpropagation 코드를 작성하는 법을 연습하고, 기본 형태의 뉴럴 네트워크(신경망)와 컨볼루션 신경망을 학습해 볼 것입니다. 이번 숙제의 목표는 다음과 같습니다.
+
+- **뉴럴 네트워크(신경망)** 에 대해 이해하고 레이어가 있는 구조가 어떻게 배치되어 있는지 이해하기
+- **backpropagation** 에 대해 이해하고 (벡터화된) 코드로 구현하기
+- 뉴럴 네트워크를 학습시키는 데 필요한 여러 가지 **업데이트 규칙** 구현하기
+- 딥 뉴럴 네트워크를 학습하는 데 필요한 **batch normalization** 구현하기
+- 네트워크를 regularization 할 때 필요한 **dropout** 구현하기
+- 효과적인 **교차 검증(cross validation)** 을 통해 뉴럴 네트워크 구조에서 사용되는 여러 가지 hyperparameter 들의 최적값 찾기
+- **컨볼루션 신경망** 구조에 대해 이해하고 이 모델들을 실제 데이터에 학습해 보는 것을 경험하기
+
+## 설치
+여러분은 다음 두 가지 방법으로 숙제를 시작할 수 있습니다: Terminal.com을 이용한 가상 환경 또는 로컬 환경.
+
+### Terminal에서의 가상 환경
+Terminal에는 우리의 수업을 위한 서브도메인이 만들어져 있습니다. [www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com) 계정을 등록하세요. 이번 숙제에 대한 스냅샷은 [여기](https://www.stanfordterminalcloud.com/snapshot/6c95ca2c9866a962964ede3ea5813d4c2410ba48d92cf8d11a93fbb13e08b76a)에서 찾아볼 수 있습니다. 만약 수업에 등록되었다면, TA(see Piazza for more information)에게 이 수업을 위한 Terminal 크레딧을 요청할 수 있습니다. 처음 스냅샷을 실행시키면, 수업을 위한 모든 것이 설치되어 있어서 바로 숙제를 시작할 수 있습니다. [여기](/terminal-tutorial)에 Terminal을 위한 간단한 튜토리얼을 작성해 뒀습니다.
+
+### 로컬 환경
+[여기](http://vision.stanford.edu/teaching/cs231n/winter1516_assignment2.zip)에서 압축파일을 다운받고 다음을 따르세요.
+
+**[선택 1] Use Anaconda:**
+과학, 수학, 공학, 데이터 분석을 위한 대부분의 주요 패키지들을 담고 있는 [Anaconda](https://www.continuum.io/downloads)를 사용하여 설치하는 것이 권장되는 방법입니다. 설치가 다 되면 모든 요구사항(dependency)을 건너뛰고 바로 숙제를 시작해도 좋습니다.
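+
+*(역주: 예를 들어 Anaconda가 이미 설치되어 있다면, 아래와 같이 과제용 환경을 만들 수 있습니다. 환경 이름 `cs231n`은 설명을 위해 임의로 정한 것입니다.)*
+
+~~~bash
+conda create -n cs231n python=2.7 numpy matplotlib  # 과제용 환경 생성 (예시)
+source activate cs231n                              # 환경 활성화
+~~~
+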
+**[선택 2] 수동 설치, virtual environment:**
+만약 Anaconda 대신 좀 더 일반적이면서 까다로운 방법을 택하고 싶다면 이번 과제를 위한 [virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/)를 만들 수 있습니다. 만약 virtual environment를 사용하지 않는다면, 코드에 필요한 모든 dependency가 컴퓨터에 전역으로(globally) 설치되어 있는지 직접 확인해야 합니다. virtual environment의 설정은 아래를 참조하세요.
+
+~~~bash
+cd assignment2
+sudo pip install virtualenv # 아마 먼저 설치되어 있을 겁니다.
+virtualenv .env # virtual environment를 만듭니다.
+source .env/bin/activate # virtual environment를 활성화합니다.
+pip install -r requirements.txt # dependency들을 설치합니다.
 # Work on the assignment for a while ...
-deactivate # Exit the virtual environment
-```
+deactivate # virtual environment를 종료합니다.
+~~~
-**Download data:**
-Once you have the starter code, you will need to download the CIFAR-10 dataset.
-Run the following from the `assignment2` directory:
+**데이터셋 다운로드:**
+숙제를 시작하기 전에 먼저 CIFAR-10 dataset를 다운로드해야 합니다. 아래 코드를 `assignment2` 폴더에서 실행하세요:
-```bash
+~~~bash
 cd cs231n/datasets
 ./get_datasets.sh
-```
+~~~
-**Compile the Cython extension:** Convolutional Neural Networks require a very
-efficient implementation. We have implemented of the functionality using
-[Cython](http://cython.org/); you will need to compile the Cython extension
-before you can run the code. From the `cs231n` directory, run the following
-command:
+**Cython extension 컴파일하기:** 컨볼루션 신경망은 매우 효율적인 구현을 필요로 합니다. 이 숙제를 위해서 [Cython](http://cython.org/)을 활용하여 여러 기능들을 구현해 놓았는데, 이를 위해 코드를 돌리기 전에 Cython extension을 컴파일해야 합니다. `cs231n` 디렉토리에서 아래 명령어를 실행하세요:
-```bash
-python setup.py build_ext --inplace
-```
-
-**Start IPython:**
-After you have the CIFAR-10 data, you should start the IPython notebook server
-from the `assignment2` directory. If you are unfamiliar with IPython, you should
-read our [IPython tutorial](/ipython-tutorial).
+~~~bash
+python setup.py build_ext --inplace
+~~~
-**NOTE:** If you are working in a virtual environment on OSX, you may encounter
-errors with matplotlib due to the
-[issues described here](http://matplotlib.org/faq/virtualenv_faq.html).
-You can work around this issue by starting the IPython server using the
-`start_ipython_osx.sh` script from the `assignment2` directory; the script
-assumes that your virtual environment is named `.env`.
+**IPython 시작:**
+CIFAR-10 data를 받았다면, `assignment2` 폴더에서 IPython notebook server를 시작할 수 있습니다. IPython에 친숙하지 않다면 작성해 둔 [IPython tutorial](/ipython-tutorial)을 읽어보는 것을 권장합니다.
+**NOTE:** OSX에서 virtual environment를 실행하면, matplotlib 에러가 날 수 있습니다([이 문제에 관한 이슈](http://matplotlib.org/faq/virtualenv_faq.html)). IPython 서버를 `assignment2` 폴더의 `start_ipython_osx.sh`로 실행하면 이 문제를 피해갈 수 있습니다; 이 스크립트는 virtual environment가 `.env`라고 되어 있다고 가정하고 작성되었습니다.
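+
+*(역주: IPython notebook 서버는 보통 아래와 같이 실행합니다. 최신 배포판을 쓰고 있다면 `jupyter notebook` 명령을 대신 사용할 수도 있습니다.)*
+
+~~~bash
+cd assignment2
+ipython notebook
+~~~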
-### Submitting your work:
-Whether you work on the assignment locally or using Terminal, once you are done
-working run the `collectSubmission.sh` script; this will produce a file called
-`assignment2.zip`. Upload this file under the Assignments tab on
-[the coursework](https://coursework.stanford.edu/portal/site/W15-CS-231N-01/)
-page for the course.
+### 과제 제출:
+로컬 환경이나 Terminal에 상관없이, 이번 숙제를 마쳤다면 `collectSubmission.sh` 스크립트를 실행하세요. 이 스크립트는 `assignment2.zip` 파일을 만듭니다. 이 파일을 [the coursework](https://coursework.stanford.edu/portal/site/W16-CS-231N-01/)에 업로드하세요.
-
-### Q1: Fully-connected Neural Network (30 points)
-The IPython notebook `FullyConnectedNets.ipynb` will introduce you to our
-modular layer design, and then use those layers to implement fully-connected
-networks of arbitrary depth. To optimize these models you will implement several
-popular update rules.
+### Q1: Fully-connected 뉴럴 네트워크 (30 points)
+`FullyConnectedNets.ipynb` IPython notebook 파일에서 모듈화된 레이어 디자인을 소개하고, 이 레이어들을 이용해서 임의의 깊이를 갖는 fully-connected 네트워크를 구현할 것입니다. 이 모델들을 최적화하기 위해서 자주 사용되는 여러 가지 업데이트 규칙들을 구현해야 할 것입니다.
 
 ### Q2: Batch Normalization (30 points)
-In the IPython notebook `BatchNormalization.ipynb` you will implement batch
-normalization, and use it to train deep fully-connected networks.
+`BatchNormalization.ipynb` IPython notebook 파일에서는 batch normalization 을 구현하고, 이를 사용하여 깊은(deep) fully-connected 네트워크를 학습할 것입니다.
 
 ### Q3: Dropout (10 points)
-The IPython notebook `Dropout.ipynb` will help you implement Dropout and explore
-its effects on model generalization.
-
-### Q4: ConvNet on CIFAR-10 (30 points)
-In the IPython Notebook `ConvolutionalNetworks.ipynb` you will implement several
-new layers that are commonly used in convolutional networks. You will train a
-(shallow) convolutional network on CIFAR-10, and it will then be up to you to
-train the best network that you can.
-
-### Q5: Do something extra! (up to +10 points)
-In the process of training your network, you should feel free to implement
-anything that you want to get better performance. You can modify the solver,
-implement additional layers, use different types of regularization, use an
-ensemble of models, or anything else that comes to mind. If you implement these
-or other ideas not covered in the assignment then you will be awarded some bonus
-points.
+`Dropout.ipynb` IPython notebook 파일에서는 Dropout을 구현하고, 이것이 모델의 일반화 성능에 어떤 영향을 미치는지 살펴볼 것입니다.
+
+### Q4: CIFAR-10 에서의 컨볼루션 신경망 (30 points)
+`ConvolutionalNetworks.ipynb` IPython notebook 파일에서는 컨볼루션 신경망에서 흔히 사용되는 여러 새로운 레이어들을 구현할 것입니다. 먼저 CIFAR-10 데이터셋에 대해 (얕은, 깊지 않은, 작은 규모의) 컨볼루션 신경망을 학습하고, 이후에는 가능한 한 최선의 노력을 다해서 최고의 성능을 뽑아내보길 바랍니다.
+
+### Q5: 추가 과제: 뭔가 더 해보세요! (up to +10 points)
+네트워크를 학습하는 과정 속에서, 더 좋은 성능을 위해 필요한 것이 있다면 얼마든지 추가적으로 구현하기 바랍니다. 최적화 기법(solver)을 바꿔도 좋고, 추가적인 레이어를 구현하거나, 다른 종류의 regularization을 사용하거나, 모델 ensemble 등 생각나는 모든 것을 시도해 보세요. 이번 숙제에서 다루지 않은 새로운 아이디어를 구현한다면 추가 점수를 받을 수 있을 것입니다.
diff --git a/assignments2016/assignment2/.gitignore b/assignments2016/assignment2/.gitignore
new file mode 100644
index 00000000..b0611d38
--- /dev/null
+++ b/assignments2016/assignment2/.gitignore
@@ -0,0 +1,3 @@
+*.swp
+*.pyc
+.env/*
diff --git a/assignments2016/assignment2/BatchNormalization.ipynb b/assignments2016/assignment2/BatchNormalization.ipynb
new file mode 100644
index 00000000..c0ca1d51
--- /dev/null
+++ b/assignments2016/assignment2/BatchNormalization.ipynb
@@ -0,0 +1,516 @@
+{
+ "nbformat_minor": 0,
+ "nbformat": 4,
+ "cells": [
+ {
+ "source": [
+ "# Batch Normalization\n",
+ "One way to make deep networks easier to train is to use more sophisticated optimization procedures such as SGD+momentum, RMSProp, or Adam. Another strategy is to change the architecture of the network to make it easier to train. One idea along these lines is batch normalization, which was recently proposed by [3].\n",
+ "\n",
+ "The idea is relatively straightforward. Machine learning methods tend to work better when their input data consists of uncorrelated features with zero mean and unit variance. When training a neural network, we can preprocess the data before feeding it to the network to explicitly decorrelate its features; this will ensure that the first layer of the network sees data that follows a nice distribution. However, even if we preprocess the input data, the activations at deeper layers of the network will likely no longer be decorrelated and will no longer have zero mean or unit variance, since they are output from earlier layers in the network. Even worse, during the training process the distribution of features at each layer of the network will shift as the weights of each layer are updated.\n",
+ "\n",
+ "The authors of [3] hypothesize that the shifting distribution of features inside deep neural networks may make training deep networks more difficult. To overcome this problem, [3] proposes to insert batch normalization layers into the network. At training time, a batch normalization layer uses a minibatch of data to estimate the mean and standard deviation of each feature. These estimated means and standard deviations are then used to center and normalize the features of the minibatch. 
A running average of these means and standard deviations is kept during training, and at test time these running averages are used to center and normalize features.\n", + "\n", + "It is possible that this normalization strategy could reduce the representational power of the network, since it may sometimes be optimal for certain layers to have features that are not zero-mean or unit variance. To this end, the batch normalization layer includes learnable shift and scale parameters for each feature dimension.\n", + "\n", + "[3] Sergey Ioffe and Christian Szegedy, \"Batch Normalization: Accelerating Deep Network Training by Reducing\n", + "Internal Covariate Shift\", ICML 2015." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# As usual, a bit of setup\n", + "\n", + "import time\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from cs231n.classifiers.fc_net import *\n", + "from cs231n.data_utils import get_CIFAR10_data\n", + "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n", + "from cs231n.solver import Solver\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Load the (preprocessed) CIFAR10 data.\n", + "\n", + "data = get_CIFAR10_data()\n", + "for k, v in data.iteritems():\n", + " print '%s: ' % k, v.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Batch normalization: Forward\n", + "In the file `cs231n/layers.py`, implement the batch normalization forward pass in the function `batchnorm_forward`. Once you have done so, run the following to test your implementation." 
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Check the training-time forward pass by checking means and variances\n", + "# of features both before and after batch normalization\n", + "\n", + "# Simulate the forward pass for a two-layer network\n", + "N, D1, D2, D3 = 200, 50, 60, 3\n", + "X = np.random.randn(N, D1)\n", + "W1 = np.random.randn(D1, D2)\n", + "W2 = np.random.randn(D2, D3)\n", + "a = np.maximum(0, X.dot(W1)).dot(W2)\n", + "\n", + "print 'Before batch normalization:'\n", + "print ' means: ', a.mean(axis=0)\n", + "print ' stds: ', a.std(axis=0)\n", + "\n", + "# Means should be close to zero and stds close to one\n", + "print 'After batch normalization (gamma=1, beta=0)'\n", + "a_norm, _ = batchnorm_forward(a, np.ones(D3), np.zeros(D3), {'mode': 'train'})\n", + "print ' mean: ', a_norm.mean(axis=0)\n", + "print ' std: ', a_norm.std(axis=0)\n", + "\n", + "# Now means should be close to beta and stds close to gamma\n", + "gamma = np.asarray([1.0, 2.0, 3.0])\n", + "beta = np.asarray([11.0, 12.0, 13.0])\n", + "a_norm, _ = batchnorm_forward(a, gamma, beta, {'mode': 'train'})\n", + "print 'After batch normalization (nontrivial gamma, beta)'\n", + "print ' means: ', a_norm.mean(axis=0)\n", + "print ' stds: ', a_norm.std(axis=0)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Check the test-time forward pass by running the training-time\n", + "# forward pass many times to warm up the running averages, and then\n", + "# checking the means and variances of activations after a test-time\n", + "# forward pass.\n", + "\n", + "N, D1, D2, D3 = 200, 50, 60, 3\n", + "W1 = np.random.randn(D1, D2)\n", + "W2 = np.random.randn(D2, D3)\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "gamma = np.ones(D3)\n", + "beta = np.zeros(D3)\n", + "for t in xrange(50):\n", + " X = np.random.randn(N, D1)\n", + " a = np.maximum(0, X.dot(W1)).dot(W2)\n", + " batchnorm_forward(a, gamma, beta, bn_param)\n", + "bn_param['mode'] = 'test'\n", + "X = np.random.randn(N, D1)\n", + "a = np.maximum(0, X.dot(W1)).dot(W2)\n", + "a_norm, _ = batchnorm_forward(a, gamma, beta, bn_param)\n", + "\n", + "# Means should be close to zero and stds close to one, but will be\n", + "# noisier than training-time forward passes.\n", + "print 'After batch normalization (test-time):'\n", + "print ' means: ', a_norm.mean(axis=0)\n", + "print ' stds: ', a_norm.std(axis=0)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Batch Normalization: backward\n", + "Now implement the backward pass for batch normalization in the function `batchnorm_backward`.\n", + "\n", + "To derive the backward pass you should write out the computation graph for batch normalization and backprop through each of the intermediate nodes. Some intermediates may have multiple outgoing branches; make sure to sum gradients across these branches in the backward pass.\n", + "\n", + "Once you have finished, run the following to numerically check your backward pass." 
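+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "source": [
+ "As a rough illustration (a sketch, not the required implementation), the last few backward steps might look like the code below; `x_hat` and `gamma` are assumed to have been stored in `cache` during the forward pass:\n",
+ "\n",
+ "```python\n",
+ "# Gradients for the learnable parameters; sum over the batch dimension\n",
+ "dgamma = np.sum(dout * x_hat, axis=0)  # scale parameter\n",
+ "dbeta = np.sum(dout, axis=0)           # shift parameter\n",
+ "dx_hat = dout * gamma\n",
+ "# ... continue backprop through the normalization, variance, and mean nodes,\n",
+ "# summing gradients wherever a value feeds multiple branches.\n",
+ "```"
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },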
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Gradient check batchnorm backward pass\n",
+ "\n",
+ "N, D = 4, 5\n",
+ "x = 5 * np.random.randn(N, D) + 12\n",
+ "gamma = np.random.randn(D)\n",
+ "beta = np.random.randn(D)\n",
+ "dout = np.random.randn(N, D)\n",
+ "\n",
+ "bn_param = {'mode': 'train'}\n",
+ "fx = lambda x: batchnorm_forward(x, gamma, beta, bn_param)[0]\n",
+ "fg = lambda a: batchnorm_forward(x, gamma, beta, bn_param)[0]\n",
+ "fb = lambda b: batchnorm_forward(x, gamma, beta, bn_param)[0]\n",
+ "\n",
+ "dx_num = eval_numerical_gradient_array(fx, x, dout)\n",
+ "da_num = eval_numerical_gradient_array(fg, gamma, dout)\n",
+ "db_num = eval_numerical_gradient_array(fb, beta, dout)\n",
+ "\n",
+ "_, cache = batchnorm_forward(x, gamma, beta, bn_param)\n",
+ "dx, dgamma, dbeta = batchnorm_backward(dout, cache)\n",
+ "print 'dx error: ', rel_error(dx_num, dx)\n",
+ "print 'dgamma error: ', rel_error(da_num, dgamma)\n",
+ "print 'dbeta error: ', rel_error(db_num, dbeta)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "## Batch Normalization: alternative backward\n",
+ "In class we talked about two different implementations for the sigmoid backward pass. One strategy is to write out a computation graph composed of simple operations and backprop through all intermediate values. Another strategy is to work out the derivatives on paper. For the sigmoid function, it turns out that you can derive a very simple formula for the backward pass by simplifying gradients on paper.\n",
+ "\n",
+ "Surprisingly, it turns out that you can also derive a simple expression for the batch normalization backward pass if you work out derivatives on paper and simplify. After doing so, implement the simplified batch normalization backward pass in the function `batchnorm_backward_alt` and compare the two implementations by running the following. Your two implementations should compute nearly identical results, but the alternative implementation should be a bit faster.\n",
+ "\n",
+ "NOTE: You can still complete the rest of the assignment if you don't figure this part out, so don't worry too much if you can't get it."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "N, D = 100, 500\n",
+ "x = 5 * np.random.randn(N, D) + 12\n",
+ "gamma = np.random.randn(D)\n",
+ "beta = np.random.randn(D)\n",
+ "dout = np.random.randn(N, D)\n",
+ "\n",
+ "bn_param = {'mode': 'train'}\n",
+ "out, cache = batchnorm_forward(x, gamma, beta, bn_param)\n",
+ "\n",
+ "t1 = time.time()\n",
+ "dx1, dgamma1, dbeta1 = batchnorm_backward(dout, cache)\n",
+ "t2 = time.time()\n",
+ "dx2, dgamma2, dbeta2 = batchnorm_backward_alt(dout, cache)\n",
+ "t3 = time.time()\n",
+ "\n",
+ "print 'dx difference: ', rel_error(dx1, dx2)\n",
+ "print 'dgamma difference: ', rel_error(dgamma1, dgamma2)\n",
+ "print 'dbeta difference: ', rel_error(dbeta1, dbeta2)\n",
+ "print 'speedup: %.2fx' % ((t2 - t1) / (t3 - t2))"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "## Fully Connected Nets with Batch Normalization\n",
+ "Now that you have a working implementation for batch normalization, go back to your `FullyConnectedNet` in the file `cs231n/classifiers/fc_net.py`. 
Modify your implementation to add batch normalization.\n", + "\n", + "Concretely, when the flag `use_batchnorm` is `True` in the constructor, you should insert a batch normalization layer before each ReLU nonlinearity. The outputs from the last layer of the network should not be normalized. Once you are done, run the following to gradient-check your implementation.\n", + "\n", + "HINT: You might find it useful to define an additional helper layer similar to those in the file `cs231n/layer_utils.py`. If you decide to do so, do it in the file `cs231n/classifiers/fc_net.py`." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, D, H1, H2, C = 2, 15, 20, 30, 10\n", + "X = np.random.randn(N, D)\n", + "y = np.random.randint(C, size=(N,))\n", + "\n", + "for reg in [0, 3.14]:\n", + " print 'Running check with reg = ', reg\n", + " model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,\n", + " reg=reg, weight_scale=5e-2, dtype=np.float64,\n", + " use_batchnorm=True)\n", + "\n", + " loss, grads = model.loss(X, y)\n", + " print 'Initial loss: ', loss\n", + "\n", + " for name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)\n", + " print '%s relative error: %.2e' % (name, rel_error(grad_num, grads[name]))\n", + " if reg == 0: print" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Batchnorm for deep networks\n", + "Run the following to train a six-layer network on a subset of 1000 training examples both with and without batch normalization." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Try training a very deep net with batchnorm\n", + "hidden_dims = [100, 100, 100, 100, 100]\n", + "\n", + "num_train = 1000\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "weight_scale = 2e-2\n", + "bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=True)\n", + "model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=False)\n", + "\n", + "bn_solver = Solver(bn_model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=True, print_every=200)\n", + "bn_solver.train()\n", + "\n", + "solver = Solver(model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=True, print_every=200)\n", + "solver.train()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "Run the following to visualize the results from two networks trained above. You should find that using batch normalization helps the network to converge much faster." 
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "plt.subplot(3, 1, 1)\n",
+ "plt.title('Training loss')\n",
+ "plt.xlabel('Iteration')\n",
+ "\n",
+ "plt.subplot(3, 1, 2)\n",
+ "plt.title('Training accuracy')\n",
+ "plt.xlabel('Epoch')\n",
+ "\n",
+ "plt.subplot(3, 1, 3)\n",
+ "plt.title('Validation accuracy')\n",
+ "plt.xlabel('Epoch')\n",
+ "\n",
+ "plt.subplot(3, 1, 1)\n",
+ "plt.plot(solver.loss_history, 'o', label='baseline')\n",
+ "plt.plot(bn_solver.loss_history, 'o', label='batchnorm')\n",
+ "\n",
+ "plt.subplot(3, 1, 2)\n",
+ "plt.plot(solver.train_acc_history, '-o', label='baseline')\n",
+ "plt.plot(bn_solver.train_acc_history, '-o', label='batchnorm')\n",
+ "\n",
+ "plt.subplot(3, 1, 3)\n",
+ "plt.plot(solver.val_acc_history, '-o', label='baseline')\n",
+ "plt.plot(bn_solver.val_acc_history, '-o', label='batchnorm')\n",
+ " \n",
+ "for i in [1, 2, 3]:\n",
+ " plt.subplot(3, 1, i)\n",
+ " plt.legend(loc='upper center', ncol=4)\n",
+ "plt.gcf().set_size_inches(15, 15)\n",
+ "plt.show()"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# Batch normalization and initialization\n",
+ "We will now run a small experiment to study the interaction of batch normalization and weight initialization.\n",
+ "\n",
+ "The first cell will train 8-layer networks both with and without batch normalization using different scales for weight initialization. The second cell will plot training accuracy, validation set accuracy, and training loss as a function of the weight initialization scale."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Try training a very deep net with batchnorm\n",
+ "hidden_dims = [50, 50, 50, 50, 50, 50, 50]\n",
+ "\n",
+ "num_train = 1000\n",
+ "small_data = {\n",
+ " 'X_train': data['X_train'][:num_train],\n",
+ " 'y_train': data['y_train'][:num_train],\n",
+ " 'X_val': data['X_val'],\n",
+ " 'y_val': data['y_val'],\n",
+ "}\n",
+ "\n",
+ "bn_solvers = {}\n",
+ "solvers = {}\n",
+ "weight_scales = np.logspace(-4, 0, num=20)\n",
+ "for i, weight_scale in enumerate(weight_scales):\n",
+ " print 'Running weight scale %d / %d' % (i + 1, len(weight_scales))\n",
+ " bn_model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=True)\n",
+ " model = FullyConnectedNet(hidden_dims, weight_scale=weight_scale, use_batchnorm=False)\n",
+ "\n",
+ " bn_solver = Solver(bn_model, small_data,\n",
+ " num_epochs=10, batch_size=50,\n",
+ " update_rule='adam',\n",
+ " optim_config={\n",
+ " 'learning_rate': 1e-3,\n",
+ " },\n",
+ " verbose=False, print_every=200)\n",
+ " bn_solver.train()\n",
+ " bn_solvers[weight_scale] = bn_solver\n",
+ "\n",
+ " solver = Solver(model, small_data,\n",
+ " num_epochs=10, batch_size=50,\n",
+ " update_rule='adam',\n",
+ " optim_config={\n",
+ " 'learning_rate': 1e-3,\n",
+ " },\n",
+ " verbose=False, print_every=200)\n",
+ " solver.train()\n",
+ " solvers[weight_scale] = solver"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Plot results of weight scale experiment\n",
+ "best_train_accs, bn_best_train_accs = [], []\n",
+ "best_val_accs, bn_best_val_accs = [], []\n",
+ "final_train_loss, bn_final_train_loss = [], []\n",
+ "\n",
+ "for ws in weight_scales:\n",
+ " best_train_accs.append(max(solvers[ws].train_acc_history))\n",
+ " 
bn_best_train_accs.append(max(bn_solvers[ws].train_acc_history))\n", + " \n", + " best_val_accs.append(max(solvers[ws].val_acc_history))\n", + " bn_best_val_accs.append(max(bn_solvers[ws].val_acc_history))\n", + " \n", + " final_train_loss.append(np.mean(solvers[ws].loss_history[-100:]))\n", + " bn_final_train_loss.append(np.mean(bn_solvers[ws].loss_history[-100:]))\n", + " \n", + "plt.subplot(3, 1, 1)\n", + "plt.title('Best val accuracy vs weight initialization scale')\n", + "plt.xlabel('Weight initialization scale')\n", + "plt.ylabel('Best val accuracy')\n", + "plt.semilogx(weight_scales, best_val_accs, '-o', label='baseline')\n", + "plt.semilogx(weight_scales, bn_best_val_accs, '-o', label='batchnorm')\n", + "plt.legend(ncol=2, loc='lower right')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "plt.title('Best train accuracy vs weight initialization scale')\n", + "plt.xlabel('Weight initialization scale')\n", + "plt.ylabel('Best training accuracy')\n", + "plt.semilogx(weight_scales, best_train_accs, '-o', label='baseline')\n", + "plt.semilogx(weight_scales, bn_best_train_accs, '-o', label='batchnorm')\n", + "plt.legend()\n", + "\n", + "plt.subplot(3, 1, 3)\n", + "plt.title('Final training loss vs weight initialization scale')\n", + "plt.xlabel('Weight initialization scale')\n", + "plt.ylabel('Final training loss')\n", + "plt.semilogx(weight_scales, final_train_loss, '-o', label='baseline')\n", + "plt.semilogx(weight_scales, bn_final_train_loss, '-o', label='batchnorm')\n", + "plt.legend()\n", + "\n", + "plt.gcf().set_size_inches(10, 15)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Question:\n", + "Describe the results of this experiment, and try to give a reason why the experiment gave the results that it did." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "# Answer:\n" + ], + "cell_type": "markdown", + "metadata": {} + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.6", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/assignments2016/assignment2/ConvolutionalNetworks.ipynb b/assignments2016/assignment2/ConvolutionalNetworks.ipynb new file mode 100644 index 00000000..d57dd2ec --- /dev/null +++ b/assignments2016/assignment2/ConvolutionalNetworks.ipynb @@ -0,0 +1,869 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Convolutional Networks\n", + "So far we have worked with deep fully-connected networks, using them to explore different optimization strategies and network architectures. Fully-connected networks are a good testbed for experimentation because they are very computationally efficient, but in practice all state-of-the-art results use convolutional networks instead.\n", + "\n", + "First you will implement several layer types that are used in convolutional networks. You will then use these layers to train a convolutional network on the CIFAR-10 dataset." 
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# As usual, a bit of setup\n",
+ "\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "from cs231n.classifiers.cnn import *\n",
+ "from cs231n.data_utils import get_CIFAR10_data\n",
+ "from cs231n.gradient_check import eval_numerical_gradient_array, eval_numerical_gradient\n",
+ "from cs231n.layers import *\n",
+ "from cs231n.fast_layers import *\n",
+ "from cs231n.solver import Solver\n",
+ "\n",
+ "%matplotlib inline\n",
+ "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n",
+ "plt.rcParams['image.interpolation'] = 'nearest'\n",
+ "plt.rcParams['image.cmap'] = 'gray'\n",
+ "\n",
+ "# for auto-reloading external modules\n",
+ "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n",
+ "%load_ext autoreload\n",
+ "%autoreload 2\n",
+ "\n",
+ "def rel_error(x, y):\n",
+ " \"\"\" returns relative error \"\"\"\n",
+ " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "# Load the (preprocessed) CIFAR10 data.\n",
+ "\n",
+ "data = get_CIFAR10_data()\n",
+ "for k, v in data.iteritems():\n",
+ " print '%s: ' % k, v.shape"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# Convolution: Naive forward pass\n",
+ "The core of a convolutional network is the convolution operation. In the file `cs231n/layers.py`, implement the forward pass for the convolution layer in the function `conv_forward_naive`. \n",
+ "\n",
+ "You don't have to worry too much about efficiency at this point; just write the code in whatever way you find most clear.\n",
+ "\n",
+ "You can test your implementation by running the following:"
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
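+ {
+ "source": [
+ "A quick orientation before you start (a reminder, not part of the assignment API): with input height `H`, filter height `HH`, padding `pad`, and stride `stride`, the output height is `1 + (H + 2 * pad - HH) / stride`, and likewise for the width. These names are only used here for illustration:\n",
+ "\n",
+ "```python\n",
+ "# For the test case below: H = W = 4, HH = WW = 4, pad = 1, stride = 2\n",
+ "H_out = 1 + (4 + 2 * 1 - 4) / 2  # = 2, matching the 2x2 maps in correct_out below\n",
+ "```"
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },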
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "x_shape = (2, 3, 4, 4)\n",
+ "w_shape = (3, 3, 4, 4)\n",
+ "x = np.linspace(-0.1, 0.5, num=np.prod(x_shape)).reshape(x_shape)\n",
+ "w = np.linspace(-0.2, 0.3, num=np.prod(w_shape)).reshape(w_shape)\n",
+ "b = np.linspace(-0.1, 0.2, num=3)\n",
+ "\n",
+ "conv_param = {'stride': 2, 'pad': 1}\n",
+ "out, _ = conv_forward_naive(x, w, b, conv_param)\n",
+ "correct_out = np.array([[[[[-0.08759809, -0.10987781],\n",
+ " [-0.18387192, -0.2109216 ]],\n",
+ " [[ 0.21027089, 0.21661097],\n",
+ " [ 0.22847626, 0.23004637]],\n",
+ " [[ 0.50813986, 0.54309974],\n",
+ " [ 0.64082444, 0.67101435]]],\n",
+ " [[[-0.98053589, -1.03143541],\n",
+ " [-1.19128892, -1.24695841]],\n",
+ " [[ 0.69108355, 0.66880383],\n",
+ " [ 0.59480972, 0.56776003]],\n",
+ " [[ 2.36270298, 2.36904306],\n",
+ " [ 2.38090835, 2.38247847]]]]])\n",
+ "\n",
+ "# Compare your output to ours; difference should be around 1e-8\n",
+ "print 'Testing conv_forward_naive'\n",
+ "print 'difference: ', rel_error(out, correct_out)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# Aside: Image processing via convolutions\n",
+ "\n",
+ "As a fun way to both check your implementation and gain a better understanding of the type of operation that convolutional layers can perform, we will set up an input containing two images and manually set up filters that perform common image processing operations (grayscale conversion and edge detection). The convolution forward pass will apply these operations to each of the input images. We can then visualize the results as a sanity check."
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "from scipy.misc import imread, imresize\n",
+ "\n",
+ "kitten, puppy = imread('kitten.jpg'), imread('puppy.jpg')\n",
+ "# kitten is wide, and puppy is already square\n",
+ "d = kitten.shape[1] - kitten.shape[0]\n",
+ "kitten_cropped = kitten[:, d/2:-d/2, :]\n",
+ "\n",
+ "img_size = 200 # Make this smaller if it runs too slow\n",
+ "x = np.zeros((2, 3, img_size, img_size))\n",
+ "x[0, :, :, :] = imresize(puppy, (img_size, img_size)).transpose((2, 0, 1))\n",
+ "x[1, :, :, :] = imresize(kitten_cropped, (img_size, img_size)).transpose((2, 0, 1))\n",
+ "\n",
+ "# Set up convolutional weights holding 2 filters, each 3x3\n",
+ "w = np.zeros((2, 3, 3, 3))\n",
+ "\n",
+ "# The first filter converts the image to grayscale.\n",
+ "# Set up the red, green, and blue channels of the filter.\n",
+ "w[0, 0, :, :] = [[0, 0, 0], [0, 0.3, 0], [0, 0, 0]]\n",
+ "w[0, 1, :, :] = [[0, 0, 0], [0, 0.6, 0], [0, 0, 0]]\n",
+ "w[0, 2, :, :] = [[0, 0, 0], [0, 0.1, 0], [0, 0, 0]]\n",
+ "\n",
+ "# Second filter detects horizontal edges in the blue channel.\n",
+ "w[1, 2, :, :] = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]\n",
+ "\n",
+ "# Vector of biases. We don't need any bias for the grayscale\n",
+ "# filter, but for the edge detection filter we want to add 128\n",
+ "# to each output so that nothing is negative.\n",
+ "b = np.array([0, 128])\n",
+ "\n",
+ "# Compute the result of convolving each input in x with each filter in w,\n",
+ "# offsetting by b, and storing the results in out.\n",
+ "out, _ = conv_forward_naive(x, w, b, {'stride': 1, 'pad': 1})\n",
+ "\n",
+ "def imshow_noax(img, normalize=True):\n",
+ " \"\"\" Tiny helper to show images as uint8 and remove axis labels \"\"\"\n",
+ " if normalize:\n",
+ " img_max, img_min = np.max(img), np.min(img)\n",
+ " img = 255.0 * (img - img_min) / (img_max - img_min)\n",
+ " plt.imshow(img.astype('uint8'))\n",
+ " plt.gca().axis('off')\n",
+ "\n",
+ "# Show the original images and the results of the conv operation\n",
+ "plt.subplot(2, 3, 1)\n",
+ "imshow_noax(puppy, normalize=False)\n",
+ "plt.title('Original image')\n",
+ "plt.subplot(2, 3, 2)\n",
+ "imshow_noax(out[0, 0])\n",
+ "plt.title('Grayscale')\n",
+ "plt.subplot(2, 3, 3)\n",
+ "imshow_noax(out[0, 1])\n",
+ "plt.title('Edges')\n",
+ "plt.subplot(2, 3, 4)\n",
+ "imshow_noax(kitten_cropped, normalize=False)\n",
+ "plt.subplot(2, 3, 5)\n",
+ "imshow_noax(out[1, 0])\n",
+ "plt.subplot(2, 3, 6)\n",
+ "imshow_noax(out[1, 1])\n",
+ "plt.show()"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# Convolution: Naive backward pass\n",
+ "Implement the backward pass for the convolution operation in the function `conv_backward_naive` in the file `cs231n/layers.py`. Again, you don't need to worry too much about computational efficiency.\n",
+ "\n",
+ "When you are done, run the following to check your backward pass with a numeric gradient check."
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "x = np.random.randn(4, 3, 5, 5)\n", + "w = np.random.randn(2, 3, 3, 3)\n", + "b = np.random.randn(2,)\n", + "dout = np.random.randn(4, 2, 5, 5)\n", + "conv_param = {'stride': 1, 'pad': 1}\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: conv_forward_naive(x, w, b, conv_param)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: conv_forward_naive(x, w, b, conv_param)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: conv_forward_naive(x, w, b, conv_param)[0], b, dout)\n", + "\n", + "out, cache = conv_forward_naive(x, w, b, conv_param)\n", + "dx, dw, db = conv_backward_naive(dout, cache)\n", + "\n", + "# Your errors should be around 1e-9'\n", + "print 'Testing conv_backward_naive function'\n", + "print 'dx error: ', rel_error(dx, dx_num)\n", + "print 'dw error: ', rel_error(dw, dw_num)\n", + "print 'db error: ', rel_error(db, db_num)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Max pooling: Naive forward\n", + "Implement the forward pass for the max-pooling operation in the function `max_pool_forward_naive` in the file `cs231n/layers.py`. Again, don't worry too much about computational efficiency.\n", + "\n", + "Check your implementation by running the following:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "x_shape = (2, 3, 4, 4)\n", + "x = np.linspace(-0.3, 0.4, num=np.prod(x_shape)).reshape(x_shape)\n", + "pool_param = {'pool_width': 2, 'pool_height': 2, 'stride': 2}\n", + "\n", + "out, _ = max_pool_forward_naive(x, pool_param)\n", + "\n", + "correct_out = np.array([[[[-0.26315789, -0.24842105],\n", + " [-0.20421053, -0.18947368]],\n", + " [[-0.14526316, -0.13052632],\n", + " [-0.08631579, -0.07157895]],\n", + " [[-0.02736842, -0.01263158],\n", + " [ 0.03157895, 0.04631579]]],\n", + " [[[ 0.09052632, 0.10526316],\n", + " [ 0.14947368, 0.16421053]],\n", + " [[ 0.20842105, 0.22315789],\n", + " [ 0.26736842, 0.28210526]],\n", + " [[ 0.32631579, 0.34105263],\n", + " [ 0.38526316, 0.4 ]]]])\n", + "\n", + "# Compare your output with ours. Difference should be around 1e-8.\n", + "print 'Testing max_pool_forward_naive function:'\n", + "print 'difference: ', rel_error(out, correct_out)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Max pooling: Naive backward\n", + "Implement the backward pass for the max-pooling operation in the function `max_pool_backward_naive` in the file `cs231n/layers.py`. 
You don't need to worry about computational efficiency.\n",
+ "\n",
+ "Check your implementation with numeric gradient checking by running the following:"
+ ],
+ "cell_type": "markdown",
+ "metadata": {}
+ },
+ {
+ "execution_count": null,
+ "cell_type": "code",
+ "source": [
+ "x = np.random.randn(3, 2, 8, 8)\n",
+ "dout = np.random.randn(3, 2, 4, 4)\n",
+ "pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}\n",
+ "\n",
+ "dx_num = eval_numerical_gradient_array(lambda x: max_pool_forward_naive(x, pool_param)[0], x, dout)\n",
+ "\n",
+ "out, cache = max_pool_forward_naive(x, pool_param)\n",
+ "dx = max_pool_backward_naive(dout, cache)\n",
+ "\n",
+ "# Your error should be around 1e-12\n",
+ "print 'Testing max_pool_backward_naive function:'\n",
+ "print 'dx error: ', rel_error(dx, dx_num)"
+ ],
+ "outputs": [],
+ "metadata": {
+ "collapsed": false
+ }
+ },
+ {
+ "source": [
+ "# Fast layers\n",
+ "Making convolution and pooling layers fast can be challenging. To spare you the pain, we've provided fast implementations of the forward and backward passes for convolution and pooling layers in the file `cs231n/fast_layers.py`.\n",
+ "\n",
+ "The fast convolution implementation depends on a Cython extension; to compile it you need to run the following from the `cs231n` directory:\n",
+ "\n",
+ "```bash\n",
+ "python setup.py build_ext --inplace\n",
+ "```\n",
+ "\n",
+ "The API for the fast versions of the convolution and pooling layers is exactly the same as the naive versions that you implemented above: the forward pass receives data, weights, and parameters and produces outputs and a cache object; the backward pass receives upstream derivatives and the cache object and produces gradients with respect to the data and weights.\n",
+ "\n",
+ "**NOTE:** The fast implementation for pooling will only perform optimally if the pooling regions are non-overlapping and tile the input. 
If these conditions are not met then the fast pooling implementation will not be much faster than the naive implementation.\n", + "\n", + "You can compare the performance of the naive and fast versions of these layers by running the following:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.fast_layers import conv_forward_fast, conv_backward_fast\n", + "from time import time\n", + "\n", + "x = np.random.randn(100, 3, 31, 31)\n", + "w = np.random.randn(25, 3, 3, 3)\n", + "b = np.random.randn(25,)\n", + "dout = np.random.randn(100, 25, 16, 16)\n", + "conv_param = {'stride': 2, 'pad': 1}\n", + "\n", + "t0 = time()\n", + "out_naive, cache_naive = conv_forward_naive(x, w, b, conv_param)\n", + "t1 = time()\n", + "out_fast, cache_fast = conv_forward_fast(x, w, b, conv_param)\n", + "t2 = time()\n", + "\n", + "print 'Testing conv_forward_fast:'\n", + "print 'Naive: %fs' % (t1 - t0)\n", + "print 'Fast: %fs' % (t2 - t1)\n", + "print 'Speedup: %fx' % ((t1 - t0) / (t2 - t1))\n", + "print 'Difference: ', rel_error(out_naive, out_fast)\n", + "\n", + "t0 = time()\n", + "dx_naive, dw_naive, db_naive = conv_backward_naive(dout, cache_naive)\n", + "t1 = time()\n", + "dx_fast, dw_fast, db_fast = conv_backward_fast(dout, cache_fast)\n", + "t2 = time()\n", + "\n", + "print '\\nTesting conv_backward_fast:'\n", + "print 'Naive: %fs' % (t1 - t0)\n", + "print 'Fast: %fs' % (t2 - t1)\n", + "print 'Speedup: %fx' % ((t1 - t0) / (t2 - t1))\n", + "print 'dx difference: ', rel_error(dx_naive, dx_fast)\n", + "print 'dw difference: ', rel_error(dw_naive, dw_fast)\n", + "print 'db difference: ', rel_error(db_naive, db_fast)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.fast_layers import max_pool_forward_fast, max_pool_backward_fast\n", + "\n", + "x = np.random.randn(100, 3, 32, 32)\n", + "dout = np.random.randn(100, 3, 16, 16)\n", + "pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}\n", + "\n", + "t0 = time()\n", + "out_naive, cache_naive = max_pool_forward_naive(x, pool_param)\n", + "t1 = time()\n", + "out_fast, cache_fast = max_pool_forward_fast(x, pool_param)\n", + "t2 = time()\n", + "\n", + "print 'Testing pool_forward_fast:'\n", + "print 'Naive: %fs' % (t1 - t0)\n", + "print 'fast: %fs' % (t2 - t1)\n", + "print 'speedup: %fx' % ((t1 - t0) / (t2 - t1))\n", + "print 'difference: ', rel_error(out_naive, out_fast)\n", + "\n", + "t0 = time()\n", + "dx_naive = max_pool_backward_naive(dout, cache_naive)\n", + "t1 = time()\n", + "dx_fast = max_pool_backward_fast(dout, cache_fast)\n", + "t2 = time()\n", + "\n", + "print '\\nTesting pool_backward_fast:'\n", + "print 'Naive: %fs' % (t1 - t0)\n", + "print 'speedup: %fx' % ((t1 - t0) / (t2 - t1))\n", + "print 'dx difference: ', rel_error(dx_naive, dx_fast)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Convolutional \"sandwich\" layers\n", + "Previously we introduced the concept of \"sandwich\" layers that combine multiple operations into commonly used patterns. In the file `cs231n/layer_utils.py` you will find sandwich layers that implement a few commonly used patterns for convolutional networks." 
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.layer_utils import conv_relu_pool_forward, conv_relu_pool_backward\n", + "\n", + "x = np.random.randn(2, 3, 16, 16)\n", + "w = np.random.randn(3, 3, 3, 3)\n", + "b = np.random.randn(3,)\n", + "dout = np.random.randn(2, 3, 8, 8)\n", + "conv_param = {'stride': 1, 'pad': 1}\n", + "pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}\n", + "\n", + "out, cache = conv_relu_pool_forward(x, w, b, conv_param, pool_param)\n", + "dx, dw, db = conv_relu_pool_backward(dout, cache)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: conv_relu_pool_forward(x, w, b, conv_param, pool_param)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: conv_relu_pool_forward(x, w, b, conv_param, pool_param)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: conv_relu_pool_forward(x, w, b, conv_param, pool_param)[0], b, dout)\n", + "\n", + "print 'Testing conv_relu_pool'\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "print 'dw error: ', rel_error(dw_num, dw)\n", + "print 'db error: ', rel_error(db_num, db)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.layer_utils import conv_relu_forward, conv_relu_backward\n", + "\n", + "x = np.random.randn(2, 3, 8, 8)\n", + "w = np.random.randn(3, 3, 3, 3)\n", + "b = np.random.randn(3,)\n", + "dout = np.random.randn(2, 3, 8, 8)\n", + "conv_param = {'stride': 1, 'pad': 1}\n", + "\n", + "out, cache = conv_relu_forward(x, w, b, conv_param)\n", + "dx, dw, db = conv_relu_backward(dout, cache)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: conv_relu_forward(x, w, b, conv_param)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: conv_relu_forward(x, w, b, conv_param)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: conv_relu_forward(x, w, b, conv_param)[0], b, dout)\n", + "\n", + "print 'Testing conv_relu:'\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "print 'dw error: ', rel_error(dw_num, dw)\n", + "print 'db error: ', rel_error(db_num, db)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Three-layer ConvNet\n", + "Now that you have implemented all the necessary layers, we can put them together into a simple convolutional network.\n", + "\n", + "Open the file `cs231n/cnn.py` and complete the implementation of the `ThreeLayerConvNet` class. Run the following cells to help you debug:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "## Sanity check loss\n", + "After you build a new network, one of the first things you should do is sanity check the loss. When we use the softmax loss, we expect the loss for random weights (and no regularization) to be about `log(C)` for `C` classes. When we add regularization this should go up." 
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "model = ThreeLayerConvNet()\n", + "\n", + "N = 50\n", + "X = np.random.randn(N, 3, 32, 32)\n", + "y = np.random.randint(10, size=N)\n", + "\n", + "loss, grads = model.loss(X, y)\n", + "print 'Initial loss (no regularization): ', loss\n", + "\n", + "model.reg = 0.5\n", + "loss, grads = model.loss(X, y)\n", + "print 'Initial loss (with regularization): ', loss" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Gradient check\n", + "After the loss looks reasonable, use numeric gradient checking to make sure that your backward pass is correct. When you use numeric gradient checking you should use a small amount of artifical data and a small number of neurons at each layer." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "num_inputs = 2\n", + "input_dim = (3, 16, 16)\n", + "reg = 0.0\n", + "num_classes = 10\n", + "X = np.random.randn(num_inputs, *input_dim)\n", + "y = np.random.randint(num_classes, size=num_inputs)\n", + "\n", + "model = ThreeLayerConvNet(num_filters=3, filter_size=3,\n", + " input_dim=input_dim, hidden_dim=7,\n", + " dtype=np.float64)\n", + "loss, grads = model.loss(X, y)\n", + "for param_name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " param_grad_num = eval_numerical_gradient(f, model.params[param_name], verbose=False, h=1e-6)\n", + " e = rel_error(param_grad_num, grads[param_name])\n", + " print '%s max relative error: %e' % (param_name, rel_error(param_grad_num, grads[param_name]))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Overfit small data\n", + "A nice trick is to train your model with just a few training samples. You should be able to overfit small datasets, which will result in very high training accuracy and comparatively low validation accuracy." 
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "num_train = 100\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "model = ThreeLayerConvNet(weight_scale=1e-2)\n", + "\n", + "solver = Solver(model, small_data,\n", + " num_epochs=10, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=True, print_every=1)\n", + "solver.train()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "Plotting the loss, training accuracy, and validation accuracy should show clear overfitting:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "plt.subplot(2, 1, 1)\n", + "plt.plot(solver.loss_history, 'o')\n", + "plt.xlabel('iteration')\n", + "plt.ylabel('loss')\n", + "\n", + "plt.subplot(2, 1, 2)\n", + "plt.plot(solver.train_acc_history, '-o')\n", + "plt.plot(solver.val_acc_history, '-o')\n", + "plt.legend(['train', 'val'], loc='upper left')\n", + "plt.xlabel('epoch')\n", + "plt.ylabel('accuracy')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Train the net\n", + "By training the three-layer convolutional network for one epoch, you should achieve greater than 40% accuracy on the training set:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "model = ThreeLayerConvNet(weight_scale=0.001, hidden_dim=500, reg=0.001)\n", + "\n", + "solver = Solver(model, data,\n", + " num_epochs=1, batch_size=50,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 1e-3,\n", + " },\n", + " verbose=True, print_every=20)\n", + "solver.train()" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false + } + }, + { + "source": [ + "## Visualize Filters\n", + "You can visualize the first-layer convolutional filters from the trained network by running the following:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.vis_utils import visualize_grid\n", + "\n", + "grid = visualize_grid(model.params['W1'].transpose(0, 2, 3, 1))\n", + "plt.imshow(grid.astype('uint8'))\n", + "plt.axis('off')\n", + "plt.gcf().set_size_inches(5, 5)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Spatial Batch Normalization\n", + "We already saw that batch normalization is a very useful technique for training deep fully-connected networks. Batch normalization can also be used for convolutional networks, but we need to tweak it a bit; the modification will be called \"spatial batch normalization.\"\n", + "\n", + "Normally batch-normalization accepts inputs of shape `(N, D)` and produces outputs of shape `(N, D)`, where we normalize across the minibatch dimension `N`. 
For data coming from convolutional layers, batch normalization needs to accept inputs of shape `(N, C, H, W)` and produce outputs of shape `(N, C, H, W)` where the `N` dimension gives the minibatch size and the `(H, W)` dimensions give the spatial size of the feature map.\n", + "\n", + "If the feature map was produced using convolutions, then we expect the statistics of each feature channel to be relatively consistent both between different images and different locations within the same image. Therefore spatial batch normalization computes a mean and variance for each of the `C` feature channels by computing statistics over both the minibatch dimension `N` and the spatial dimensions `H` and `W`." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "## Spatial batch normalization: forward\n", + "\n", + "In the file `cs231n/layers.py`, implement the forward pass for spatial batch normalization in the function `spatial_batchnorm_forward`. Check your implementation by running the following:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Check the training-time forward pass by checking means and variances\n", + "# of features both before and after spatial batch normalization\n", + "\n", + "N, C, H, W = 2, 3, 4, 5\n", + "x = 4 * np.random.randn(N, C, H, W) + 10\n", + "\n", + "print 'Before spatial batch normalization:'\n", + "print ' Shape: ', x.shape\n", + "print ' Means: ', x.mean(axis=(0, 2, 3))\n", + "print ' Stds: ', x.std(axis=(0, 2, 3))\n", + "\n", + "# Means should be close to zero and stds close to one\n", + "gamma, beta = np.ones(C), np.zeros(C)\n", + "bn_param = {'mode': 'train'}\n", + "out, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n", + "print 'After spatial batch normalization:'\n", + "print ' Shape: ', out.shape\n", + "print ' Means: ', out.mean(axis=(0, 2, 3))\n", + "print ' Stds: ', out.std(axis=(0, 2, 3))\n", + "\n", + "# Means should be close to beta and stds close to gamma\n", + "gamma, beta = np.asarray([3, 4, 5]), np.asarray([6, 7, 8])\n", + "out, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n", + "print 'After spatial batch normalization (nontrivial gamma, beta):'\n", + "print ' Shape: ', out.shape\n", + "print ' Means: ', out.mean(axis=(0, 2, 3))\n", + "print ' Stds: ', out.std(axis=(0, 2, 3))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Check the test-time forward pass by running the training-time\n", + "# forward pass many times to warm up the running averages, and then\n", + "# checking the means and variances of activations after a test-time\n", + "# forward pass.\n", + "\n", + "N, C, H, W = 10, 4, 11, 12\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "gamma = np.ones(C)\n", + "beta = np.zeros(C)\n", + "for t in xrange(50):\n", + " x = 2.3 * np.random.randn(N, C, H, W) + 13\n", + " spatial_batchnorm_forward(x, gamma, beta, bn_param)\n", + "bn_param['mode'] = 'test'\n", + "x = 2.3 * np.random.randn(N, C, H, W) + 13\n", + "a_norm, _ = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n", + "\n", + "# Means should be close to zero and stds close to one, but will be\n", + "# noisier than training-time forward passes.\n", + "print 'After spatial batch normalization (test-time):'\n", + "print ' means: ', a_norm.mean(axis=(0, 2, 3))\n", + "print ' stds: ', a_norm.std(axis=(0, 2, 3))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + },
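One common way to implement the forward pass (a sketch, assuming the vanilla `batchnorm_forward` you wrote earlier, with its usual `(N, D)` API) is to fold the spatial dimensions into the batch dimension so each of the `C` channels is normalized over `N * H * W` values; the backward pass can apply the same reshape to `dout` and call `batchnorm_backward`:

```python
# Sketch: spatial batchnorm by reusing vanilla batchnorm over (N*H*W, C).
def spatial_batchnorm_forward_sketch(x, gamma, beta, bn_param):
    N, C, H, W = x.shape
    # Move channels last, then flatten batch and spatial dims together
    x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)
    out_flat, cache = batchnorm_forward(x_flat, gamma, beta, bn_param)
    # Undo the reshape to recover the (N, C, H, W) layout
    out = out_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
    return out, cache
```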
+ { + "source": [ + "## Spatial batch normalization: backward\n", + "In the file `cs231n/layers.py`, implement the backward pass for spatial batch normalization in the function `spatial_batchnorm_backward`. Run the following to check your implementation using a numeric gradient check:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, C, H, W = 2, 3, 4, 5\n", + "x = 5 * np.random.randn(N, C, H, W) + 12\n", + "gamma = np.random.randn(C)\n", + "beta = np.random.randn(C)\n", + "dout = np.random.randn(N, C, H, W)\n", + "\n", + "bn_param = {'mode': 'train'}\n", + "fx = lambda x: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]\n", + "fg = lambda a: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]\n", + "fb = lambda b: spatial_batchnorm_forward(x, gamma, beta, bn_param)[0]\n", + "\n", + "dx_num = eval_numerical_gradient_array(fx, x, dout)\n", + "da_num = eval_numerical_gradient_array(fg, gamma, dout)\n", + "db_num = eval_numerical_gradient_array(fb, beta, dout)\n", + "\n", + "_, cache = spatial_batchnorm_forward(x, gamma, beta, bn_param)\n", + "dx, dgamma, dbeta = spatial_batchnorm_backward(dout, cache)\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "print 'dgamma error: ', rel_error(da_num, dgamma)\n", + "print 'dbeta error: ', rel_error(db_num, dbeta)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Experiment!\n", + "Experiment and try to get the best performance that you can on CIFAR-10 using a ConvNet. Here are some ideas to get you started:\n", + "\n", + "### Things you should try:\n", + "- Filter size: Above we used 7x7; this makes pretty pictures but smaller filters may be more efficient\n", + "- Number of filters: Above we used 32 filters. Do more or fewer do better?\n", + "- Batch normalization: Try adding spatial batch normalization after convolution layers and vanilla batch normalization aafter affine layers. Do your networks train faster?\n", + "- Network architecture: The network above has two layers of trainable parameters. Can you do better with a deeper network? You can implement alternative architectures in the file `cs231n/classifiers/convnet.py`. Some good architectures to try include:\n", + " - [conv-relu-pool]xN - conv - relu - [affine]xM - [softmax or SVM]\n", + " - [conv-relu-pool]XN - [affine]XM - [softmax or SVM]\n", + " - [conv-relu-conv-relu-pool]xN - [affine]xM - [softmax or SVM]\n", + "\n", + "### Tips for training\n", + "For each network architecture that you try, you should tune the learning rate and regularization strength. When doing this there are a couple important things to keep in mind:\n", + "\n", + "- If the parameters are working well, you should see improvement within a few hundred iterations\n", + "- Remember the course-to-fine approach for hyperparameter tuning: start by testing a large range of hyperparameters for just a few training iterations to find the combinations of parameters that are working at all.\n", + "- Once you have found some sets of parameters that seem to work, search more finely around these parameters. You may need to train for more epochs.\n", + "\n", + "### Going above and beyond\n", + "If you are feeling adventurous there are many other features you can implement to try and improve your performance. 
You are **not required** to implement any of these; however they would be good things to try for extra credit.\n", + "\n", + "- Alternative update steps: For the assignment we implemented SGD+momentum, RMSprop, and Adam; you could try alternatives like AdaGrad or AdaDelta.\n", + "- Alternative activation functions such as leaky ReLU, parametric ReLU, or MaxOut.\n", + "- Model ensembles\n", + "- Data augmentation\n", + "\n", + "If you do decide to implement something extra, clearly describe it in the \"Extra Credit Description\" cell below.\n", + "\n", + "### What we expect\n", + "At the very least, you should be able to train a ConvNet that gets at least 65% accuracy on the validation set. This is just a lower bound - if you are careful it should be possible to get accuracies much higher than that! Extra credit points will be awarded for particularly high-scoring models or unique approaches.\n", + "\n", + "You should use the space below to experiment and train your network. The final cell in this notebook should contain the training, validation, and test set accuracies for your final trained network. In this notebook you should also write an explanation of what you did, any additional features that you implemented, and any visualizations or graphs that you make in the process of training and evaluating your network.\n", + "\n", + "Have fun and happy training!" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Train a really good model on CIFAR-10" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "# Extra Credit Description\n", + "If you implement any additional features for extra credit, clearly describe them here with pointers to any code in this or other files if applicable." + ], + "cell_type": "markdown", + "metadata": {} + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.6", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/assignments2016/assignment2/Dropout.ipynb b/assignments2016/assignment2/Dropout.ipynb new file mode 100644 index 00000000..98050908 --- /dev/null +++ b/assignments2016/assignment2/Dropout.ipynb @@ -0,0 +1,275 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Dropout\n", + "Dropout [1] is a technique for regularizing neural networks by randomly setting some features to zero during the forward pass. In this exercise you will implement a dropout layer and modify your fully-connected network to optionally use dropout.\n", + "\n", + "[1] Geoffrey E. 
Hinton et al, \"Improving neural networks by preventing co-adaptation of feature detectors\", arXiv 2012" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# As usual, a bit of setup\n", + "\n", + "import time\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from cs231n.classifiers.fc_net import *\n", + "from cs231n.data_utils import get_CIFAR10_data\n", + "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n", + "from cs231n.solver import Solver\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Load the (preprocessed) CIFAR10 data.\n", + "\n", + "data = get_CIFAR10_data()\n", + "for k, v in data.iteritems():\n", + " print '%s: ' % k, v.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Dropout forward pass\n", + "In the file `cs231n/layers.py`, implement the forward pass for dropout. Since dropout behaves differently during training and testing, make sure to implement the operation for both modes.\n", + "\n", + "Once you have done so, run the cell below to test your implementation." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "x = np.random.randn(500, 500) + 10\n", + "\n", + "for p in [0.3, 0.6, 0.75]:\n", + " out, _ = dropout_forward(x, {'mode': 'train', 'p': p})\n", + " out_test, _ = dropout_forward(x, {'mode': 'test', 'p': p})\n", + "\n", + " print 'Running tests with p = ', p\n", + " print 'Mean of input: ', x.mean()\n", + " print 'Mean of train-time output: ', out.mean()\n", + " print 'Mean of test-time output: ', out_test.mean()\n", + " print 'Fraction of train-time output set to zero: ', (out == 0).mean()\n", + " print 'Fraction of test-time output set to zero: ', (out_test == 0).mean()\n", + " print" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Dropout backward pass\n", + "In the file `cs231n/layers.py`, implement the backward pass for dropout. After doing so, run the following cell to numerically gradient-check your implementation." 
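As one possible sketch of the pair (using the "inverted dropout" convention, where the scaling happens at training time so test time is a plain pass-through; note that conventions differ on whether `p` means the drop or the keep probability, so follow the docstring in `cs231n/layers.py` — the sketch below treats `p` as the keep probability):

```python
import numpy as np

def dropout_forward_sketch(x, dropout_param):
    p, mode = dropout_param['p'], dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])
    if mode == 'train':
        mask = (np.random.rand(*x.shape) < p) / p  # scale now, not at test time
        out = x * mask
    else:
        mask = None
        out = x  # inverted dropout: test time is the identity
    return out, (dropout_param, mask)

def dropout_backward_sketch(dout, cache):
    dropout_param, mask = cache
    if dropout_param['mode'] == 'train':
        return dout * mask  # gradient flows only through kept units
    return dout
```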
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "x = np.random.randn(10, 10) + 10\n", + "dout = np.random.randn(*x.shape)\n", + "\n", + "dropout_param = {'mode': 'train', 'p': 0.8, 'seed': 123}\n", + "out, cache = dropout_forward(x, dropout_param)\n", + "dx = dropout_backward(dout, cache)\n", + "dx_num = eval_numerical_gradient_array(lambda xx: dropout_forward(xx, dropout_param)[0], x, dout)\n", + "\n", + "print 'dx relative error: ', rel_error(dx, dx_num)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Fully-connected nets with Dropout\n", + "In the file `cs231n/classifiers/fc_net.py`, modify your implementation to use dropout. Specificially, if the constructor the the net receives a nonzero value for the `dropout` parameter, then the net should add dropout immediately after every ReLU nonlinearity. After doing so, run the following to numerically gradient-check your implementation." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, D, H1, H2, C = 2, 15, 20, 30, 10\n", + "X = np.random.randn(N, D)\n", + "y = np.random.randint(C, size=(N,))\n", + "\n", + "for dropout in [0, 0.25, 0.5]:\n", + " print 'Running check with dropout = ', dropout\n", + " model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,\n", + " weight_scale=5e-2, dtype=np.float64,\n", + " dropout=dropout, seed=123)\n", + "\n", + " loss, grads = model.loss(X, y)\n", + " print 'Initial loss: ', loss\n", + "\n", + " for name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)\n", + " print '%s relative error: %.2e' % (name, rel_error(grad_num, grads[name]))\n", + " print" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Regularization experiment\n", + "As an experiment, we will train a pair of two-layer networks on 500 training examples: one will use no dropout, and one will use a dropout probability of 0.75. We will then visualize the training and validation accuracies of the two networks over time." 
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Train two identical nets, one with dropout and one without\n", + "\n", + "num_train = 500\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "solvers = {}\n", + "dropout_choices = [0, 0.75]\n", + "for dropout in dropout_choices:\n", + " model = FullyConnectedNet([500], dropout=dropout)\n", + " print dropout\n", + "\n", + " solver = Solver(model, small_data,\n", + " num_epochs=25, batch_size=100,\n", + " update_rule='adam',\n", + " optim_config={\n", + " 'learning_rate': 5e-4,\n", + " },\n", + " verbose=True, print_every=100)\n", + " solver.train()\n", + " solvers[dropout] = solver" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Plot train and validation accuracies of the two models\n", + "\n", + "train_accs = []\n", + "val_accs = []\n", + "for dropout in dropout_choices:\n", + " solver = solvers[dropout]\n", + " train_accs.append(solver.train_acc_history[-1])\n", + " val_accs.append(solver.val_acc_history[-1])\n", + "\n", + "plt.subplot(3, 1, 1)\n", + "for dropout in dropout_choices:\n", + " plt.plot(solvers[dropout].train_acc_history, 'o', label='%.2f dropout' % dropout)\n", + "plt.title('Train accuracy')\n", + "plt.xlabel('Epoch')\n", + "plt.ylabel('Accuracy')\n", + "plt.legend(ncol=2, loc='lower right')\n", + " \n", + "plt.subplot(3, 1, 2)\n", + "for dropout in dropout_choices:\n", + " plt.plot(solvers[dropout].val_acc_history, 'o', label='%.2f dropout' % dropout)\n", + "plt.title('Val accuracy')\n", + "plt.xlabel('Epoch')\n", + "plt.ylabel('Accuracy')\n", + "plt.legend(ncol=2, loc='lower right')\n", + "\n", + "plt.gcf().set_size_inches(15, 15)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Question\n", + "Explain what you see in this experiment. What does it suggest about dropout?" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "# Answer\n" + ], + "cell_type": "markdown", + "metadata": {} + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.6", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/assignments2016/assignment2/FullyConnectedNets.ipynb b/assignments2016/assignment2/FullyConnectedNets.ipynb new file mode 100644 index 00000000..bf7cefdd --- /dev/null +++ b/assignments2016/assignment2/FullyConnectedNets.ipynb @@ -0,0 +1,941 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Fully-Connected Neural Nets\n", + "In the previous homework you implemented a fully-connected two-layer neural network on CIFAR-10. The implementation was simple but not very modular since the loss and gradient were computed in a single monolithic function. This is manageable for a simple two-layer network, but would become impractical as we move to bigger models. 
Ideally we want to build networks using a more modular design so that we can implement different layer types in isolation and then snap them together into models with different architectures.\n", + "\n", + "In this exercise we will implement fully-connected networks using a more modular approach. For each layer we will implement a `forward` and a `backward` function. The `forward` function will receive inputs, weights, and other parameters and will return both an output and a `cache` object storing data needed for the backward pass, like this:\n", + "\n", + "```python\n", + "def layer_forward(x, w):\n", + " \"\"\" Receive inputs x and weights w \"\"\"\n", + " # Do some computations ...\n", + " z = # ... some intermediate value\n", + " # Do some more computations ...\n", + " out = # the output\n", + " \n", + " cache = (x, w, z, out) # Values we need to compute gradients\n", + " \n", + " return out, cache\n", + "```\n", + "\n", + "The backward pass will receive upstream derivatives and the `cache` object, and will return gradients with respect to the inputs and weights, like this:\n", + "\n", + "```python\n", + "def layer_backward(dout, cache):\n", + " \"\"\"\n", + " Receive derivative of loss with respect to outputs and cache,\n", + " and compute derivative with respect to inputs.\n", + " \"\"\"\n", + " # Unpack cache values\n", + " x, w, z, out = cache\n", + " \n", + " # Use values in cache to compute derivatives\n", + " dx = # Derivative of loss with respect to x\n", + " dw = # Derivative of loss with respect to w\n", + " \n", + " return dx, dw\n", + "```\n", + "\n", + "After implementing a bunch of layers this way, we will be able to easily combine them to build classifiers with different architectures.\n", + "\n", + "In addition to implementing fully-connected networks of arbitrary depth, we will also explore different update rules for optimization, and introduce Dropout as a regularizer and Batch Normalization as a tool to more efficiently optimize deep networks.\n", + " " + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# As usual, a bit of setup\n", + "\n", + "import time\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from cs231n.classifiers.fc_net import *\n", + "from cs231n.data_utils import get_CIFAR10_data\n", + "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n", + "from cs231n.solver import Solver\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Load the (preprocessed) CIFAR10 data.\n", + "\n", + "data = get_CIFAR10_data()\n", + "for k, v in data.iteritems():\n", + " print '%s: ' % k, v.shape" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Affine layer: forward\n", + "Open the file `cs231n/layers.py` and implement the `affine_forward` function.\n", +
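The heart of it (a sketch only; note the input `x` can have shape `(N, d_1, ..., d_k)` and must be flattened to two dimensions before the matrix multiply):

```python
# Sketch of the affine forward pass: flatten each example to a row
# vector, then apply a single matrix multiply plus bias.
def affine_forward_sketch(x, w, b):
    out = x.reshape(x.shape[0], -1).dot(w) + b
    return out, (x, w, b)  # cache inputs for the backward pass
```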
"\n", + "Once you are done you can test your implementaion by running the following:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Test the affine_forward function\n", + "\n", + "num_inputs = 2\n", + "input_shape = (4, 5, 6)\n", + "output_dim = 3\n", + "\n", + "input_size = num_inputs * np.prod(input_shape)\n", + "weight_size = output_dim * np.prod(input_shape)\n", + "\n", + "x = np.linspace(-0.1, 0.5, num=input_size).reshape(num_inputs, *input_shape)\n", + "w = np.linspace(-0.2, 0.3, num=weight_size).reshape(np.prod(input_shape), output_dim)\n", + "b = np.linspace(-0.3, 0.1, num=output_dim)\n", + "\n", + "out, _ = affine_forward(x, w, b)\n", + "correct_out = np.array([[ 1.49834967, 1.70660132, 1.91485297],\n", + " [ 3.25553199, 3.5141327, 3.77273342]])\n", + "\n", + "# Compare your output with ours. The error should be around 1e-9.\n", + "print 'Testing affine_forward function:'\n", + "print 'difference: ', rel_error(out, correct_out)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Affine layer: backward\n", + "Now implement the `affine_backward` function and test your implementation using numeric gradient checking." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Test the affine_backward function\n", + "\n", + "x = np.random.randn(10, 2, 3)\n", + "w = np.random.randn(6, 5)\n", + "b = np.random.randn(5)\n", + "dout = np.random.randn(10, 5)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: affine_forward(x, w, b)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: affine_forward(x, w, b)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: affine_forward(x, w, b)[0], b, dout)\n", + "\n", + "_, cache = affine_forward(x, w, b)\n", + "dx, dw, db = affine_backward(dout, cache)\n", + "\n", + "# The error should be around 1e-10\n", + "print 'Testing affine_backward function:'\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "print 'dw error: ', rel_error(dw_num, dw)\n", + "print 'db error: ', rel_error(db_num, db)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# ReLU layer: forward\n", + "Implement the forward pass for the ReLU activation function in the `relu_forward` function and test your implementation using the following:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Test the relu_forward function\n", + "\n", + "x = np.linspace(-0.5, 0.5, num=12).reshape(3, 4)\n", + "\n", + "out, _ = relu_forward(x)\n", + "correct_out = np.array([[ 0., 0., 0., 0., ],\n", + " [ 0., 0., 0.04545455, 0.13636364,],\n", + " [ 0.22727273, 0.31818182, 0.40909091, 0.5, ]])\n", + "\n", + "# Compare your output with ours. 
The error should be around 1e-8\n", + "print 'Testing relu_forward function:'\n", + "print 'difference: ', rel_error(out, correct_out)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# ReLU layer: backward\n", + "Now implement the backward pass for the ReLU activation function in the `relu_backward` function and test your implementation using numeric gradient checking:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "x = np.random.randn(10, 10)\n", + "dout = np.random.randn(*x.shape)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: relu_forward(x)[0], x, dout)\n", + "\n", + "_, cache = relu_forward(x)\n", + "dx = relu_backward(dout, cache)\n", + "\n", + "# The error should be around 1e-12\n", + "print 'Testing relu_backward function:'\n", + "print 'dx error: ', rel_error(dx_num, dx)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# \"Sandwich\" layers\n", + "There are some common patterns of layers that are frequently used in neural nets. For example, affine layers are frequently followed by a ReLU nonlinearity. To make these common patterns easy, we define several convenience layers in the file `cs231n/layer_utils.py`.\n", + "\n", + "For now take a look at the `affine_relu_forward` and `affine_relu_backward` functions, and run the following to numerically gradient check the backward pass:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.layer_utils import affine_relu_forward, affine_relu_backward\n", + "\n", + "x = np.random.randn(2, 3, 4)\n", + "w = np.random.randn(12, 10)\n", + "b = np.random.randn(10)\n", + "dout = np.random.randn(2, 10)\n", + "\n", + "out, cache = affine_relu_forward(x, w, b)\n", + "dx, dw, db = affine_relu_backward(dout, cache)\n", + "\n", + "dx_num = eval_numerical_gradient_array(lambda x: affine_relu_forward(x, w, b)[0], x, dout)\n", + "dw_num = eval_numerical_gradient_array(lambda w: affine_relu_forward(x, w, b)[0], w, dout)\n", + "db_num = eval_numerical_gradient_array(lambda b: affine_relu_forward(x, w, b)[0], b, dout)\n", + "\n", + "print 'Testing affine_relu_forward:'\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "print 'dw error: ', rel_error(dw_num, dw)\n", + "print 'db error: ', rel_error(db_num, db)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Loss layers: Softmax and SVM\n", + "You implemented these loss functions in the last assignment, so we'll give them to you for free here. You should still make sure you understand how they work by looking at the implementations in `cs231n/layers.py`.\n", + "\n", + "You can make sure that the implementations are correct by running the following:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "num_classes, num_inputs = 10, 50\n", + "x = 0.001 * np.random.randn(num_inputs, num_classes)\n", + "y = np.random.randint(num_classes, size=num_inputs)\n", + "\n", + "dx_num = eval_numerical_gradient(lambda x: svm_loss(x, y)[0], x, verbose=False)\n", + "loss, dx = svm_loss(x, y)\n", + "\n", + "# Test svm_loss function. 
Loss should be around 9 and dx error should be 1e-9\n", + "print 'Testing svm_loss:'\n", + "print 'loss: ', loss\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "\n", + "dx_num = eval_numerical_gradient(lambda x: softmax_loss(x, y)[0], x, verbose=False)\n", + "loss, dx = softmax_loss(x, y)\n", + "\n", + "# Test softmax_loss function. Loss should be 2.3 and dx error should be 1e-8\n", + "print '\\nTesting softmax_loss:'\n", + "print 'loss: ', loss\n", + "print 'dx error: ', rel_error(dx_num, dx)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Two-layer network\n", + "In the previous assignment you implemented a two-layer neural network in a single monolithic class. Now that you have implemented modular versions of the necessary layers, you will reimplement the two layer network using these modular implementations.\n", + "\n", + "Open the file `cs231n/classifiers/fc_net.py` and complete the implementation of the `TwoLayerNet` class. This class will serve as a model for the other networks you will implement in this assignment, so read through it to make sure you understand the API. You can run the cell below to test your implementation." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, D, H, C = 3, 5, 50, 7\n", + "X = np.random.randn(N, D)\n", + "y = np.random.randint(C, size=N)\n", + "\n", + "std = 1e-2\n", + "model = TwoLayerNet(input_dim=D, hidden_dim=H, num_classes=C, weight_scale=std)\n", + "\n", + "print 'Testing initialization ... '\n", + "W1_std = abs(model.params['W1'].std() - std)\n", + "b1 = model.params['b1']\n", + "W2_std = abs(model.params['W2'].std() - std)\n", + "b2 = model.params['b2']\n", + "assert W1_std < std / 10, 'First layer weights do not seem right'\n", + "assert np.all(b1 == 0), 'First layer biases do not seem right'\n", + "assert W2_std < std / 10, 'Second layer weights do not seem right'\n", + "assert np.all(b2 == 0), 'Second layer biases do not seem right'\n", + "\n", + "print 'Testing test-time forward pass ... 
'\n", + "model.params['W1'] = np.linspace(-0.7, 0.3, num=D*H).reshape(D, H)\n", + "model.params['b1'] = np.linspace(-0.1, 0.9, num=H)\n", + "model.params['W2'] = np.linspace(-0.3, 0.4, num=H*C).reshape(H, C)\n", + "model.params['b2'] = np.linspace(-0.9, 0.1, num=C)\n", + "X = np.linspace(-5.5, 4.5, num=N*D).reshape(D, N).T\n", + "scores = model.loss(X)\n", + "correct_scores = np.asarray(\n", + " [[11.53165108, 12.2917344, 13.05181771, 13.81190102, 14.57198434, 15.33206765, 16.09215096],\n", + " [12.05769098, 12.74614105, 13.43459113, 14.1230412, 14.81149128, 15.49994135, 16.18839143],\n", + " [12.58373087, 13.20054771, 13.81736455, 14.43418138, 15.05099822, 15.66781506, 16.2846319 ]])\n", + "scores_diff = np.abs(scores - correct_scores).sum()\n", + "assert scores_diff < 1e-6, 'Problem with test-time forward pass'\n", + "\n", + "print 'Testing training loss (no regularization)'\n", + "y = np.asarray([0, 5, 1])\n", + "loss, grads = model.loss(X, y)\n", + "correct_loss = 3.4702243556\n", + "assert abs(loss - correct_loss) < 1e-10, 'Problem with training-time loss'\n", + "\n", + "model.reg = 1.0\n", + "loss, grads = model.loss(X, y)\n", + "correct_loss = 26.5948426952\n", + "assert abs(loss - correct_loss) < 1e-10, 'Problem with regularization loss'\n", + "\n", + "for reg in [0.0, 0.7]:\n", + " print 'Running numeric gradient check with reg = ', reg\n", + " model.reg = reg\n", + " loss, grads = model.loss(X, y)\n", + "\n", + " for name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " grad_num = eval_numerical_gradient(f, model.params[name], verbose=False)\n", + " print '%s relative error: %.2e' % (name, rel_error(grad_num, grads[name]))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Solver\n", + "In the previous assignment, the logic for training models was coupled to the models themselves. Following a more modular design, for this assignment we have split the logic for training models into a separate class.\n", + "\n", + "Open the file `cs231n/solver.py` and read through it to familiarize yourself with the API. After doing so, use a `Solver` instance to train a `TwoLayerNet` that achieves at least `50%` accuracy on the validation set." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "model = TwoLayerNet()\n", + "solver = None\n", + "\n", + "##############################################################################\n", + "# TODO: Use a Solver instance to train a TwoLayerNet that achieves at least #\n", + "# 50% accuracy on the validation set. 
#\n", + "##############################################################################\n", + "pass\n", + "##############################################################################\n", + "# END OF YOUR CODE #\n", + "##############################################################################" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Run this cell to visualize training loss and train / val accuracy\n", + "\n", + "plt.subplot(2, 1, 1)\n", + "plt.title('Training loss')\n", + "plt.plot(solver.loss_history, 'o')\n", + "plt.xlabel('Iteration')\n", + "\n", + "plt.subplot(2, 1, 2)\n", + "plt.title('Accuracy')\n", + "plt.plot(solver.train_acc_history, '-o', label='train')\n", + "plt.plot(solver.val_acc_history, '-o', label='val')\n", + "plt.plot([0.5] * len(solver.val_acc_history), 'k--')\n", + "plt.xlabel('Epoch')\n", + "plt.legend(loc='lower right')\n", + "plt.gcf().set_size_inches(15, 12)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Multilayer network\n", + "Next you will implement a fully-connected network with an arbitrary number of hidden layers.\n", + "\n", + "Read through the `FullyConnectedNet` class in the file `cs231n/classifiers/fc_net.py`.\n", + "\n", + "Implement the initialization, the forward pass, and the backward pass. For the moment don't worry about implementing dropout or batch normalization; we will add those features soon." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "## Initial loss and gradient check" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "As a sanity check, run the following to check the initial loss and to gradient check the network both with and without regularization. Do the initial losses seem reasonable?\n", + "\n", + "For gradient checking, you should expect to see errors around 1e-6 or less." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, D, H1, H2, C = 2, 15, 20, 30, 10\n", + "X = np.random.randn(N, D)\n", + "y = np.random.randint(C, size=(N,))\n", + "\n", + "for reg in [0, 3.14]:\n", + " print 'Running check with reg = ', reg\n", + " model = FullyConnectedNet([H1, H2], input_dim=D, num_classes=C,\n", + " reg=reg, weight_scale=5e-2, dtype=np.float64)\n", + "\n", + " loss, grads = model.loss(X, y)\n", + " print 'Initial loss: ', loss\n", + "\n", + " for name in sorted(grads):\n", + " f = lambda _: model.loss(X, y)[0]\n", + " grad_num = eval_numerical_gradient(f, model.params[name], verbose=False, h=1e-5)\n", + " print '%s relative error: %.2e' % (name, rel_error(grad_num, grads[name]))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "As another sanity check, make sure you can overfit a small dataset of 50 images. First we will try a three-layer network with 100 units in each hidden layer. You will need to tweak the learning rate and initialization scale, but you should be able to overfit and achieve 100% training accuracy within 20 epochs." 
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# TODO: Use a three-layer Net to overfit 50 training examples.\n", + "\n", + "num_train = 50\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "weight_scale = 1e-2\n", + "learning_rate = 1e-4\n", + "model = FullyConnectedNet([100, 100],\n", + " weight_scale=weight_scale, dtype=np.float64)\n", + "solver = Solver(model, small_data,\n", + " print_every=10, num_epochs=20, batch_size=25,\n", + " update_rule='sgd',\n", + " optim_config={\n", + " 'learning_rate': learning_rate,\n", + " }\n", + " )\n", + "solver.train()\n", + "\n", + "plt.plot(solver.loss_history, 'o')\n", + "plt.title('Training loss history')\n", + "plt.xlabel('Iteration')\n", + "plt.ylabel('Training loss')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false + } + }, + { + "source": [ + "Now try to use a five-layer network with 100 units on each layer to overfit 50 training examples. Again you will have to adjust the learning rate and weight initialization, but you should be able to achieve 100% training accuracy within 20 epochs." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# TODO: Use a five-layer Net to overfit 50 training examples.\n", + "\n", + "num_train = 50\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "learning_rate = 1e-3\n", + "weight_scale = 1e-5\n", + "model = FullyConnectedNet([100, 100, 100, 100],\n", + " weight_scale=weight_scale, dtype=np.float64)\n", + "solver = Solver(model, small_data,\n", + " print_every=10, num_epochs=20, batch_size=25,\n", + " update_rule='sgd',\n", + " optim_config={\n", + " 'learning_rate': learning_rate,\n", + " }\n", + " )\n", + "solver.train()\n", + "\n", + "plt.plot(solver.loss_history, 'o')\n", + "plt.title('Training loss history')\n", + "plt.xlabel('Iteration')\n", + "plt.ylabel('Training loss')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Inline question: \n", + "Did you notice anything about the comparative difficulty of training the three-layer net vs training the five layer net?\n", + "\n", + "# Answer:\n", + "[FILL THIS IN]\n" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "# Update rules\n", + "So far we have used vanilla stochastic gradient descent (SGD) as our update rule. More sophisticated update rules can make it easier to train deep networks. We will implement a few of the most commonly used update rules and compare them to vanilla SGD." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "# SGD+Momentum\n", + "Stochastic gradient descent with momentum is a widely used update rule that tends to make deep networks converge faster than vanilla stochstic gradient descent.\n", + "\n", + "Open the file `cs231n/optim.py` and read the documentation at the top of the file to make sure you understand the API. Implement the SGD+momentum update rule in the function `sgd_momentum` and run the following to check your implementation. You should see errors less than 1e-8." 
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.optim import sgd_momentum\n", + "\n", + "N, D = 4, 5\n", + "w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)\n", + "dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)\n", + "v = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)\n", + "\n", + "config = {'learning_rate': 1e-3, 'velocity': v}\n", + "next_w, _ = sgd_momentum(w, dw, config=config)\n", + "\n", + "expected_next_w = np.asarray([\n", + " [ 0.1406, 0.20738947, 0.27417895, 0.34096842, 0.40775789],\n", + " [ 0.47454737, 0.54133684, 0.60812632, 0.67491579, 0.74170526],\n", + " [ 0.80849474, 0.87528421, 0.94207368, 1.00886316, 1.07565263],\n", + " [ 1.14244211, 1.20923158, 1.27602105, 1.34281053, 1.4096 ]])\n", + "expected_velocity = np.asarray([\n", + " [ 0.5406, 0.55475789, 0.56891579, 0.58307368, 0.59723158],\n", + " [ 0.61138947, 0.62554737, 0.63970526, 0.65386316, 0.66802105],\n", + " [ 0.68217895, 0.69633684, 0.71049474, 0.72465263, 0.73881053],\n", + " [ 0.75296842, 0.76712632, 0.78128421, 0.79544211, 0.8096 ]])\n", + "\n", + "print 'next_w error: ', rel_error(next_w, expected_next_w)\n", + "print 'velocity error: ', rel_error(expected_velocity, config['velocity'])" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "Once you have done so, run the following to train a six-layer network with both SGD and SGD+momentum. You should see the SGD+momentum update rule converge faster." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "num_train = 4000\n", + "small_data = {\n", + " 'X_train': data['X_train'][:num_train],\n", + " 'y_train': data['y_train'][:num_train],\n", + " 'X_val': data['X_val'],\n", + " 'y_val': data['y_val'],\n", + "}\n", + "\n", + "solvers = {}\n", + "\n", + "for update_rule in ['sgd', 'sgd_momentum']:\n", + " print 'running with ', update_rule\n", + " model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=5e-2)\n", + "\n", + " solver = Solver(model, small_data,\n", + " num_epochs=5, batch_size=100,\n", + " update_rule=update_rule,\n", + " optim_config={\n", + " 'learning_rate': 1e-2,\n", + " },\n", + " verbose=True)\n", + " solvers[update_rule] = solver\n", + " solver.train()\n", + " print\n", + "\n", + "plt.subplot(3, 1, 1)\n", + "plt.title('Training loss')\n", + "plt.xlabel('Iteration')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "plt.title('Training accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "plt.subplot(3, 1, 3)\n", + "plt.title('Validation accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "for update_rule, solver in solvers.iteritems():\n", + " plt.subplot(3, 1, 1)\n", + " plt.plot(solver.loss_history, 'o', label=update_rule)\n", + " \n", + " plt.subplot(3, 1, 2)\n", + " plt.plot(solver.train_acc_history, '-o', label=update_rule)\n", + "\n", + " plt.subplot(3, 1, 3)\n", + " plt.plot(solver.val_acc_history, '-o', label=update_rule)\n", + " \n", + "for i in [1, 2, 3]:\n", + " plt.subplot(3, 1, i)\n", + " plt.legend(loc='upper center', ncol=4)\n", + "plt.gcf().set_size_inches(15, 15)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false + } + }, + { + "source": [ + "# RMSProp and Adam\n", + "RMSProp [1] and Adam [2] are update rules that set per-parameter learning rates by using a running average of the second moments of gradients.\n", + "\n", + "In the file `cs231n/optim.py`, implement the RMSProp update 
rule in the `rmsprop` function and implement the Adam update rule in the `adam` function, and check your implementations using the tests below.\n", + "\n", + "[1] Tijmen Tieleman and Geoffrey Hinton. \"Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude.\" COURSERA: Neural Networks for Machine Learning 4 (2012).\n", + "\n", + "[2] Diederik Kingma and Jimmy Ba, \"Adam: A Method for Stochastic Optimization\", ICLR 2015." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Test RMSProp implementation; you should see errors less than 1e-7\n", + "from cs231n.optim import rmsprop\n", + "\n", + "N, D = 4, 5\n", + "w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)\n", + "dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)\n", + "cache = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)\n", + "\n", + "config = {'learning_rate': 1e-2, 'cache': cache}\n", + "next_w, _ = rmsprop(w, dw, config=config)\n", + "\n", + "expected_next_w = np.asarray([\n", + " [-0.39223849, -0.34037513, -0.28849239, -0.23659121, -0.18467247],\n", + " [-0.132737, -0.08078555, -0.02881884, 0.02316247, 0.07515774],\n", + " [ 0.12716641, 0.17918792, 0.23122175, 0.28326742, 0.33532447],\n", + " [ 0.38739248, 0.43947102, 0.49155973, 0.54365823, 0.59576619]])\n", + "expected_cache = np.asarray([\n", + " [ 0.5976, 0.6126277, 0.6277108, 0.64284931, 0.65804321],\n", + " [ 0.67329252, 0.68859723, 0.70395734, 0.71937285, 0.73484377],\n", + " [ 0.75037008, 0.7659518, 0.78158892, 0.79728144, 0.81302936],\n", + " [ 0.82883269, 0.84469141, 0.86060554, 0.87657507, 0.8926 ]])\n", + "\n", + "print 'next_w error: ', rel_error(expected_next_w, next_w)\n", + "print 'cache error: ', rel_error(expected_cache, config['cache'])" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Test Adam implementation; you should see errors around 1e-7 or less\n", + "from cs231n.optim import adam\n", + "\n", + "N, D = 4, 5\n", + "w = np.linspace(-0.4, 0.6, num=N*D).reshape(N, D)\n", + "dw = np.linspace(-0.6, 0.4, num=N*D).reshape(N, D)\n", + "m = np.linspace(0.6, 0.9, num=N*D).reshape(N, D)\n", + "v = np.linspace(0.7, 0.5, num=N*D).reshape(N, D)\n", + "\n", + "config = {'learning_rate': 1e-2, 'm': m, 'v': v, 't': 5}\n", + "next_w, _ = adam(w, dw, config=config)\n", + "\n", + "expected_next_w = np.asarray([\n", + " [-0.40094747, -0.34836187, -0.29577703, -0.24319299, -0.19060977],\n", + " [-0.1380274, -0.08544591, -0.03286534, 0.01971428, 0.0722929],\n", + " [ 0.1248705, 0.17744702, 0.23002243, 0.28259667, 0.33516969],\n", + " [ 0.38774145, 0.44031188, 0.49288093, 0.54544852, 0.59801459]])\n", + "expected_v = np.asarray([\n", + " [ 0.69966, 0.68908382, 0.67851319, 0.66794809, 0.65738853,],\n", + " [ 0.64683452, 0.63628604, 0.6257431, 0.61520571, 0.60467385,],\n", + " [ 0.59414753, 0.58362676, 0.57311152, 0.56260183, 0.55209767,],\n", + " [ 0.54159906, 0.53110598, 0.52061845, 0.51013645, 0.49966, ]])\n", + "expected_m = np.asarray([\n", + " [ 0.48, 0.49947368, 0.51894737, 0.53842105, 0.55789474],\n", + " [ 0.57736842, 0.59684211, 0.61631579, 0.63578947, 0.65526316],\n", + " [ 0.67473684, 0.69421053, 0.71368421, 0.73315789, 0.75263158],\n", + " [ 0.77210526, 0.79157895, 0.81105263, 0.83052632, 0.85 ]])\n", + "\n", + "print 'next_w error: ', rel_error(expected_next_w, next_w)\n", + "print 'v error: ', rel_error(expected_v, config['v'])\n", + "print 'm error: ', 
rel_error(expected_m, config['m'])" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "Once you have debugged your RMSProp and Adam implementations, run the following to train a pair of deep networks using these new update rules:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "learning_rates = {'rmsprop': 1e-4, 'adam': 1e-3}\n", + "for update_rule in ['adam', 'rmsprop']:\n", + " print 'running with ', update_rule\n", + " model = FullyConnectedNet([100, 100, 100, 100, 100], weight_scale=5e-2)\n", + "\n", + " solver = Solver(model, small_data,\n", + " num_epochs=5, batch_size=100,\n", + " update_rule=update_rule,\n", + " optim_config={\n", + " 'learning_rate': learning_rates[update_rule]\n", + " },\n", + " verbose=True)\n", + " solvers[update_rule] = solver\n", + " solver.train()\n", + " print\n", + "\n", + "plt.subplot(3, 1, 1)\n", + "plt.title('Training loss')\n", + "plt.xlabel('Iteration')\n", + "\n", + "plt.subplot(3, 1, 2)\n", + "plt.title('Training accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "plt.subplot(3, 1, 3)\n", + "plt.title('Validation accuracy')\n", + "plt.xlabel('Epoch')\n", + "\n", + "for update_rule, solver in solvers.iteritems():\n", + " plt.subplot(3, 1, 1)\n", + " plt.plot(solver.loss_history, 'o', label=update_rule)\n", + " \n", + " plt.subplot(3, 1, 2)\n", + " plt.plot(solver.train_acc_history, '-o', label=update_rule)\n", + "\n", + " plt.subplot(3, 1, 3)\n", + " plt.plot(solver.val_acc_history, '-o', label=update_rule)\n", + " \n", + "for i in [1, 2, 3]:\n", + " plt.subplot(3, 1, i)\n", + " plt.legend(loc='upper center', ncol=4)\n", + "plt.gcf().set_size_inches(15, 15)\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Train a good model!\n", + "Train the best fully-connected model that you can on CIFAR-10, storing your best model in the `best_model` variable. We require you to get at least 50% accuracy on the validation set using a fully-connected net.\n", + "\n", + "If you are careful it should be possible to get accuracies above 55%, but we don't require it for this part and won't assign extra credit for doing so. Later in the assignment we will ask you to train the best convolutional network that you can on CIFAR-10, and we would prefer that you spend your effort working on convolutional nets rather than fully-connected nets.\n", + "\n", + "You might find it useful to complete the `BatchNormalization.ipynb` and `Dropout.ipynb` notebooks before completing this part, since those techniques can help you train powerful models." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "best_model = None\n", + "################################################################################\n", + "# TODO: Train the best FullyConnectedNet that you can on CIFAR-10. You might #\n", + "# find batch normalization and dropout useful. Store your best model in the #\n", + "# best_model variable.
#\n", + "################################################################################\n", + "pass\n", + "################################################################################\n", + "# END OF YOUR CODE #\n", + "################################################################################" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false + } + }, + { + "source": [ + "# Test you model\n", + "Run your best model on the validation and test sets. You should achieve above 50% accuracy on the validation set." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "y_test_pred = np.argmax(best_model.loss(X_test), axis=1)\n", + "y_val_pred = np.argmax(best_model.loss(X_val), axis=1)\n", + "print 'Validation set accuracy: ', (y_val_pred == y_val).mean()\n", + "print 'Test set accuracy: ', (y_test_pred == y_test).mean()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.6", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/assignments2016/assignment2/README.md b/assignments2016/assignment2/README.md new file mode 100644 index 00000000..2392c9f2 --- /dev/null +++ b/assignments2016/assignment2/README.md @@ -0,0 +1,128 @@ +In this assignment you will practice writing backpropagation code, and training +Neural Networks and Convolutional Neural Networks. The goals of this assignment +are as follows: + +- understand **Neural Networks** and how they are arranged in layered + architectures +- understand and be able to implement (vectorized) **backpropagation** +- implement various **update rules** used to optimize Neural Networks +- implement **batch normalization** for training deep networks +- implement **dropout** to regularize networks +- effectively **cross-validate** and find the best hyperparameters for Neural + Network architecture +- understand the architecture of **Convolutional Neural Networks** and train + gain experience with training these models on data + +## Setup +You can work on the assignment in one of two ways: locally on your own machine, +or on a virtual machine through Terminal.com. + +### Working in the cloud on Terminal + +Terminal has created a separate subdomain to serve our class, +[www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com). Register +your account there. The Assignment 2 snapshot can then be found HERE. If you are +registered in the class you can contact the TA (see Piazza for more information) +to request Terminal credits for use on the assignment. Once you boot up the +snapshot everything will be installed for you, and you will be ready to start on +your assignment right away. We have written a small tutorial on Terminal +[here](http://cs231n.github.io/terminal-tutorial/). + +### Working locally +Get the code as a zip file +[here](http://vision.stanford.edu/teaching/cs231n/winter1516_assignment2.zip). 
+ As for the dependencies: + +**[Option 1] Use Anaconda:** +The preferred approach for installing all the assignment dependencies is to use +[Anaconda](https://www.continuum.io/downloads), which is a Python distribution +that includes many of the most popular Python packages for science, math, +engineering and data analysis. Once you install it you can skip all mentions of +requirements and you are ready to go directly to working on the assignment. + +**[Option 2] Manual install, virtual environment:** +If you do not want to use Anaconda and want to go with a more manual and risky +installation route you will likely want to create a +[virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/) +for the project. If you choose not to use a virtual environment, it is up to you +to make sure that all dependencies for the code are installed globally on your +machine. To set up a virtual environment, run the following: + +```bash +cd assignment2 +sudo pip install virtualenv # This may already be installed +virtualenv .env # Create a virtual environment +source .env/bin/activate # Activate the virtual environment +pip install -r requirements.txt # Install dependencies +# Work on the assignment for a while ... +deactivate # Exit the virtual environment +``` + +**Download data:** +Once you have the starter code, you will need to download the CIFAR-10 dataset. +Run the following from the `assignment2` directory: + +```bash +cd cs231n/datasets +./get_datasets.sh +``` + +**Compile the Cython extension:** Convolutional Neural Networks require a very +efficient implementation. We have implemented much of the functionality using +[Cython](http://cython.org/); you will need to compile the Cython extension +before you can run the code. From the `cs231n` directory, run the following +command: + +```bash +python setup.py build_ext --inplace +``` + +**Start IPython:** +After you have the CIFAR-10 data, you should start the IPython notebook server +from the `assignment2` directory. If you are unfamiliar with IPython, you should +read our [IPython tutorial](http://cs231n.github.io/ipython-tutorial/). + +**NOTE:** If you are working in a virtual environment on OSX, you may encounter +errors with matplotlib due to the +[issues described here](http://matplotlib.org/faq/virtualenv_faq.html). +You can work around this issue by starting the IPython server using the +`start_ipython_osx.sh` script from the `assignment2` directory; the script +assumes that your virtual environment is named `.env`. + + +### Submitting your work: +Whether you work on the assignment locally or using Terminal, once you are done +working run the `collectSubmission.sh` script; this will produce a file called +`assignment2.zip`. Upload this file to your dropbox on +[the coursework](https://coursework.stanford.edu/portal/site/W15-CS-231N-01/) +page for the course. + + +### Q1: Fully-connected Neural Network (30 points) +The IPython notebook `FullyConnectedNets.ipynb` will introduce you to our +modular layer design, and then use those layers to implement fully-connected +networks of arbitrary depth. To optimize these models you will implement several +popular update rules. + +### Q2: Batch Normalization (30 points) +In the IPython notebook `BatchNormalization.ipynb` you will implement batch +normalization, and use it to train deep fully-connected networks. + +### Q3: Dropout (10 points) +The IPython notebook `Dropout.ipynb` will help you implement Dropout and explore +its effects on model generalization. 
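+ 
+ For reference while working through Q1, here is a minimal sketch of the two
+ update rules exercised in `FullyConnectedNets.ipynb` above (RMSProp [1] and
+ Adam [2]). It assumes the `config`-dict convention from the notebook's tests;
+ the default hyperparameter values and the point at which `t` is incremented
+ are assumptions, so defer to the starter code in `cs231n/optim.py`:
+ 
+ ```python
+ import numpy as np
+ 
+ def rmsprop_sketch(w, dw, config):
+     # Moving average of squared gradients; the decay_rate and epsilon
+     # defaults are assumed here, not taken from the starter code.
+     decay_rate = config.setdefault('decay_rate', 0.99)
+     eps = config.setdefault('epsilon', 1e-8)
+     config['cache'] = decay_rate * config['cache'] + (1 - decay_rate) * dw ** 2
+     next_w = w - config['learning_rate'] * dw / (np.sqrt(config['cache']) + eps)
+     return next_w, config
+ 
+ def adam_sketch(w, dw, config):
+     # Bias-corrected first and second moment estimates, assuming t is
+     # incremented before the correction (one common convention).
+     beta1 = config.setdefault('beta1', 0.9)
+     beta2 = config.setdefault('beta2', 0.999)
+     eps = config.setdefault('epsilon', 1e-8)
+     config['t'] += 1
+     config['m'] = beta1 * config['m'] + (1 - beta1) * dw
+     config['v'] = beta2 * config['v'] + (1 - beta2) * dw ** 2
+     m_hat = config['m'] / (1 - beta1 ** config['t'])
+     v_hat = config['v'] / (1 - beta2 ** config['t'])
+     next_w = w - config['learning_rate'] * m_hat / (np.sqrt(v_hat) + eps)
+     return next_w, config
+ ```
+ 
+ Both rules scale each parameter's step by a running estimate of its squared
+ gradient; Adam additionally keeps a bias-corrected first moment of the
+ gradient, which acts like momentum.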
+ +### Q4: ConvNet on CIFAR-10 (30 points) +In the IPython Notebook `ConvolutionalNetworks.ipynb` you will implement several +new layers that are commonly used in convolutional networks. You will train a +(shallow) convolutional network on CIFAR-10, and it will then be up to you to +train the best network that you can. + +### Q5: Do something extra! (up to +10 points) +In the process of training your network, you should feel free to implement +anything that you want to get better performance. You can modify the solver, +implement additional layers, use different types of regularization, use an +ensemble of models, or anything else that comes to mind. If you implement these +or other ideas not covered in the assignment then you will be awarded some bonus +points. + diff --git a/assignments2016/assignment2/collectSubmission.sh b/assignments2016/assignment2/collectSubmission.sh new file mode 100755 index 00000000..f189c6bc --- /dev/null +++ b/assignments2016/assignment2/collectSubmission.sh @@ -0,0 +1,2 @@ +rm -f assignment2.zip +zip -r assignment2.zip . -x "*.git*" "*cs231n/datasets*" "*.ipynb_checkpoints*" "*README.md" "*collectSubmission.sh" "*requirements.txt" ".env/*" "*.pyc" "*cs231n/build/*" diff --git a/assignments2016/assignment2/cs231n/.gitignore b/assignments2016/assignment2/cs231n/.gitignore new file mode 100644 index 00000000..fbb42c24 --- /dev/null +++ b/assignments2016/assignment2/cs231n/.gitignore @@ -0,0 +1,3 @@ +build/* +im2col_cython.c +im2col_cython.so diff --git a/assignments2016/assignment2/cs231n/__init__.py b/assignments2016/assignment2/cs231n/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/assignments2016/assignment2/cs231n/classifiers/__init__.py b/assignments2016/assignment2/cs231n/classifiers/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/assignments2016/assignment2/cs231n/classifiers/cnn.py b/assignments2016/assignment2/cs231n/classifiers/cnn.py new file mode 100644 index 00000000..9646c31f --- /dev/null +++ b/assignments2016/assignment2/cs231n/classifiers/cnn.py @@ -0,0 +1,105 @@ +import numpy as np + +from cs231n.layers import * +from cs231n.fast_layers import * +from cs231n.layer_utils import * + + +class ThreeLayerConvNet(object): + """ + A three-layer convolutional network with the following architecture: + + conv - relu - 2x2 max pool - affine - relu - affine - softmax + + The network operates on minibatches of data that have shape (N, C, H, W) + consisting of N images, each with height H and width W and with C input + channels. + """ + + def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7, + hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0, + dtype=np.float32): + """ + Initialize a new network. + + Inputs: + - input_dim: Tuple (C, H, W) giving size of input data + - num_filters: Number of filters to use in the convolutional layer + - filter_size: Size of filters to use in the convolutional layer + - hidden_dim: Number of units to use in the fully-connected hidden layer + - num_classes: Number of scores to produce from the final affine layer. + - weight_scale: Scalar giving standard deviation for random initialization + of weights. + - reg: Scalar giving L2 regularization strength + - dtype: numpy datatype to use for computation. + """ + self.params = {} + self.reg = reg + self.dtype = dtype + + ############################################################################ + # TODO: Initialize weights and biases for the three-layer convolutional # + # network. 
Weights should be initialized from a Gaussian with standard # + # deviation equal to weight_scale; biases should be initialized to zero. # + # All weights and biases should be stored in the dictionary self.params. # + # Store weights and biases for the convolutional layer using the keys 'W1' # + # and 'b1'; use keys 'W2' and 'b2' for the weights and biases of the # + # hidden affine layer, and keys 'W3' and 'b3' for the weights and biases # + # of the output affine layer. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + for k, v in self.params.iteritems(): + self.params[k] = v.astype(dtype) + + + def loss(self, X, y=None): + """ + Evaluate loss and gradient for the three-layer convolutional network. + + Input / output: Same API as TwoLayerNet in fc_net.py. + """ + W1, b1 = self.params['W1'], self.params['b1'] + W2, b2 = self.params['W2'], self.params['b2'] + W3, b3 = self.params['W3'], self.params['b3'] + + # pass conv_param to the forward pass for the convolutional layer + filter_size = W1.shape[2] + conv_param = {'stride': 1, 'pad': (filter_size - 1) / 2} + + # pass pool_param to the forward pass for the max-pooling layer + pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2} + + scores = None + ############################################################################ + # TODO: Implement the forward pass for the three-layer convolutional net, # + # computing the class scores for X and storing them in the scores # + # variable. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + if y is None: + return scores + + loss, grads = 0, {} + ############################################################################ + # TODO: Implement the backward pass for the three-layer convolutional net, # + # storing the loss and gradients in the loss and grads variables. Compute # + # data loss using softmax, and make sure that grads[k] holds the gradients # + # for self.params[k]. Don't forget to add L2 regularization! # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + return loss, grads + + +pass diff --git a/assignments2016/assignment2/cs231n/classifiers/fc_net.py b/assignments2016/assignment2/cs231n/classifiers/fc_net.py new file mode 100644 index 00000000..8f933636 --- /dev/null +++ b/assignments2016/assignment2/cs231n/classifiers/fc_net.py @@ -0,0 +1,250 @@ +import numpy as np + +from cs231n.layers import * +from cs231n.layer_utils import * + + +class TwoLayerNet(object): + """ + A two-layer fully-connected neural network with ReLU nonlinearity and + softmax loss that uses a modular layer design. We assume an input dimension + of D, a hidden dimension of H, and perform classification over C classes. + + The architecture should be affine - relu - affine - softmax. + + Note that this class does not implement gradient descent; instead, it + will interact with a separate Solver object that is responsible for running + optimization. 
+ + The learnable parameters of the model are stored in the dictionary + self.params that maps parameter names to numpy arrays. + """ + + def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10, + weight_scale=1e-3, reg=0.0): + """ + Initialize a new network. + + Inputs: + - input_dim: An integer giving the size of the input + - hidden_dim: An integer giving the size of the hidden layer + - num_classes: An integer giving the number of classes to classify + - weight_scale: Scalar giving the standard deviation for random + initialization of the weights. + - reg: Scalar giving L2 regularization strength. + """ + self.params = {} + self.reg = reg + + ############################################################################ + # TODO: Initialize the weights and biases of the two-layer net. Weights # + # should be initialized from a Gaussian with standard deviation equal to # + # weight_scale, and biases should be initialized to zero. All weights and # + # biases should be stored in the dictionary self.params, with first layer # + # weights and biases using the keys 'W1' and 'b1' and second layer weights # + # and biases using the keys 'W2' and 'b2'. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + + def loss(self, X, y=None): + """ + Compute loss and gradient for a minibatch of data. + + Inputs: + - X: Array of input data of shape (N, d_1, ..., d_k) + - y: Array of labels, of shape (N,). y[i] gives the label for X[i]. + + Returns: + If y is None, then run a test-time forward pass of the model and return: + - scores: Array of shape (N, C) giving classification scores, where + scores[i, c] is the classification score for X[i] and class c. + + If y is not None, then run a training-time forward and backward pass and + return a tuple of: + - loss: Scalar value giving the loss + - grads: Dictionary with the same keys as self.params, mapping parameter + names to gradients of the loss with respect to those parameters. + """ + scores = None + ############################################################################ + # TODO: Implement the forward pass for the two-layer net, computing the # + # class scores for X and storing them in the scores variable. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + # If y is None then we are in test mode so just return scores + if y is None: + return scores + + loss, grads = 0, {} + ############################################################################ + # TODO: Implement the backward pass for the two-layer net. Store the loss # + # in the loss variable and gradients in the grads dictionary. Compute data # + # loss using softmax, and make sure that grads[k] holds the gradients for # + # self.params[k]. Don't forget to add L2 regularization! # + # # + # NOTE: To ensure that your implementation matches ours and you pass the # + # automated tests, make sure that your L2 regularization includes a factor # + # of 0.5 to simplify the expression for the gradient. 
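+ # (Illustrative note, not part of the starter code: with this 0.5 factor #
+ # the penalty 0.5 * self.reg * (np.sum(W1**2) + np.sum(W2**2)) contributes #
+ # simply self.reg * W1 and self.reg * W2 to the weight gradients.) #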
# + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + return loss, grads + + +class FullyConnectedNet(object): + """ + A fully-connected neural network with an arbitrary number of hidden layers, + ReLU nonlinearities, and a softmax loss function. This will also implement + dropout and batch normalization as options. For a network with L layers, + the architecture will be + + {affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax + + where batch normalization and dropout are optional, and the {...} block is + repeated L - 1 times. + + Similar to the TwoLayerNet above, learnable parameters are stored in the + self.params dictionary and will be learned using the Solver class. + """ + + def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10, + dropout=0, use_batchnorm=False, reg=0.0, + weight_scale=1e-2, dtype=np.float32, seed=None): + """ + Initialize a new FullyConnectedNet. + + Inputs: + - hidden_dims: A list of integers giving the size of each hidden layer. + - input_dim: An integer giving the size of the input. + - num_classes: An integer giving the number of classes to classify. + - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=0 then + the network should not use dropout at all. + - use_batchnorm: Whether or not the network should use batch normalization. + - reg: Scalar giving L2 regularization strength. + - weight_scale: Scalar giving the standard deviation for random + initialization of the weights. + - dtype: A numpy datatype object; all computations will be performed using + this datatype. float32 is faster but less accurate, so you should use + float64 for numeric gradient checking. + - seed: If not None, then pass this random seed to the dropout layers. This + will make the dropout layers deterministic so we can gradient check the + model. + """ + self.use_batchnorm = use_batchnorm + self.use_dropout = dropout > 0 + self.reg = reg + self.num_layers = 1 + len(hidden_dims) + self.dtype = dtype + self.params = {} + + ############################################################################ + # TODO: Initialize the parameters of the network, storing all values in # + # the self.params dictionary. Store weights and biases for the first layer # + # in W1 and b1; for the second layer use W2 and b2, etc. Weights should be # + # initialized from a normal distribution with standard deviation equal to # + # weight_scale and biases should be initialized to zero. # + # # + # When using batch normalization, store scale and shift parameters for the # + # first layer in gamma1 and beta1; for the second layer use gamma2 and # + # beta2, etc. Scale parameters should be initialized to one and shift # + # parameters should be initialized to zero. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + # When using dropout we need to pass a dropout_param dictionary to each + # dropout layer so that the layer knows the dropout probability and the mode + # (train / test). You can pass the same dropout_param to each dropout layer. 
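+ # (Illustrative sketch, not part of the starter code: the dropout layers
+ # consuming this dropout_param typically use "inverted" dropout. Reading p
+ # as the probability of dropping a unit, as the dropout=0 convention above
+ # suggests, a train-time forward pass could look like
+ #   mask = (np.random.rand(*x.shape) >= p) / (1.0 - p)
+ #   out = x * mask
+ # while at test time out = x, so no rescaling is needed at test time.)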
+ self.dropout_param = {} + if self.use_dropout: + self.dropout_param = {'mode': 'train', 'p': dropout} + if seed is not None: + self.dropout_param['seed'] = seed + + # With batch normalization we need to keep track of running means and + # variances, so we need to pass a special bn_param object to each batch + # normalization layer. You should pass self.bn_params[0] to the forward pass + # of the first batch normalization layer, self.bn_params[1] to the forward + # pass of the second batch normalization layer, etc. + self.bn_params = [] + if self.use_batchnorm: + self.bn_params = [{'mode': 'train'} for i in xrange(self.num_layers - 1)] + + # Cast all parameters to the correct datatype + for k, v in self.params.iteritems(): + self.params[k] = v.astype(dtype) + + + def loss(self, X, y=None): + """ + Compute loss and gradient for the fully-connected net. + + Input / output: Same as TwoLayerNet above. + """ + X = X.astype(self.dtype) + mode = 'test' if y is None else 'train' + + # Set train/test mode for batchnorm params and dropout param since they + # behave differently during training and testing. + if self.dropout_param is not None: + self.dropout_param['mode'] = mode + if self.use_batchnorm: + for bn_param in self.bn_params: + bn_param['mode'] = mode + + scores = None + ############################################################################ + # TODO: Implement the forward pass for the fully-connected net, computing # + # the class scores for X and storing them in the scores variable. # + # # + # When using dropout, you'll need to pass self.dropout_param to each # + # dropout forward pass. # + # # + # When using batch normalization, you'll need to pass self.bn_params[0] to # + # the forward pass for the first batch normalization layer, pass # + # self.bn_params[1] to the forward pass for the second batch normalization # + # layer, etc. # + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + # If test mode return early + if mode == 'test': + return scores + + loss, grads = 0.0, {} + ############################################################################ + # TODO: Implement the backward pass for the fully-connected net. Store the # + # loss in the loss variable and gradients in the grads dictionary. Compute # + # data loss using softmax, and make sure that grads[k] holds the gradients # + # for self.params[k]. Don't forget to add L2 regularization! # + # # + # When using batch normalization, you don't need to regularize the scale # + # and shift parameters. # + # # + # NOTE: To ensure that your implementation matches ours and you pass the # + # automated tests, make sure that your L2 regularization includes a factor # + # of 0.5 to simplify the expression for the gradient. 
# + ############################################################################ + pass + ############################################################################ + # END OF YOUR CODE # + ############################################################################ + + return loss, grads diff --git a/assignments2016/assignment2/cs231n/data_utils.py b/assignments2016/assignment2/cs231n/data_utils.py new file mode 100644 index 00000000..a4740ea9 --- /dev/null +++ b/assignments2016/assignment2/cs231n/data_utils.py @@ -0,0 +1,199 @@ +import cPickle as pickle +import numpy as np +import os +from scipy.misc import imread + +def load_CIFAR_batch(filename): + """ load single batch of cifar """ + with open(filename, 'rb') as f: + datadict = pickle.load(f) + X = datadict['data'] + Y = datadict['labels'] + X = X.reshape(10000, 3, 32, 32).transpose(0,2,3,1).astype("float") + Y = np.array(Y) + return X, Y + +def load_CIFAR10(ROOT): + """ load all of cifar """ + xs = [] + ys = [] + for b in range(1,6): + f = os.path.join(ROOT, 'data_batch_%d' % (b, )) + X, Y = load_CIFAR_batch(f) + xs.append(X) + ys.append(Y) + Xtr = np.concatenate(xs) + Ytr = np.concatenate(ys) + del X, Y + Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch')) + return Xtr, Ytr, Xte, Yte + + +def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000): + """ + Load the CIFAR-10 dataset from disk and perform preprocessing to prepare + it for classifiers. These are the same steps as we used for the SVM, but + condensed to a single function. + """ + # Load the raw CIFAR-10 data + cifar10_dir = 'cs231n/datasets/cifar-10-batches-py' + X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir) + + # Subsample the data + mask = range(num_training, num_training + num_validation) + X_val = X_train[mask] + y_val = y_train[mask] + mask = range(num_training) + X_train = X_train[mask] + y_train = y_train[mask] + mask = range(num_test) + X_test = X_test[mask] + y_test = y_test[mask] + + # Normalize the data: subtract the mean image + mean_image = np.mean(X_train, axis=0) + X_train -= mean_image + X_val -= mean_image + X_test -= mean_image + + # Transpose so that channels come first + X_train = X_train.transpose(0, 3, 1, 2).copy() + X_val = X_val.transpose(0, 3, 1, 2).copy() + X_test = X_test.transpose(0, 3, 1, 2).copy() + + # Package data into a dictionary + return { + 'X_train': X_train, 'y_train': y_train, + 'X_val': X_val, 'y_val': y_val, + 'X_test': X_test, 'y_test': y_test, + } + + +def load_tiny_imagenet(path, dtype=np.float32): + """ + Load TinyImageNet. Each of TinyImageNet-100-A, TinyImageNet-100-B, and + TinyImageNet-200 has the same directory structure, so this can be used + to load any of them. + + Inputs: + - path: String giving path to the directory to load. + - dtype: numpy datatype used to load the data. + + Returns: A tuple of + - class_names: A list where class_names[i] is a list of strings giving the + WordNet names for class i in the loaded dataset. + - X_train: (N_tr, 3, 64, 64) array of training images + - y_train: (N_tr,) array of training labels + - X_val: (N_val, 3, 64, 64) array of validation images + - y_val: (N_val,) array of validation labels + - X_test: (N_test, 3, 64, 64) array of testing images. + - y_test: (N_test,) array of test labels; if test labels are not available + (such as in student code) then y_test will be None. 
+ """ + # First load wnids + with open(os.path.join(path, 'wnids.txt'), 'r') as f: + wnids = [x.strip() for x in f] + + # Map wnids to integer labels + wnid_to_label = {wnid: i for i, wnid in enumerate(wnids)} + + # Use words.txt to get names for each class + with open(os.path.join(path, 'words.txt'), 'r') as f: + wnid_to_words = dict(line.split('\t') for line in f) + for wnid, words in wnid_to_words.iteritems(): + wnid_to_words[wnid] = [w.strip() for w in words.split(',')] + class_names = [wnid_to_words[wnid] for wnid in wnids] + + # Next load training data. + X_train = [] + y_train = [] + for i, wnid in enumerate(wnids): + if (i + 1) % 20 == 0: + print 'loading training data for synset %d / %d' % (i + 1, len(wnids)) + # To figure out the filenames we need to open the boxes file + boxes_file = os.path.join(path, 'train', wnid, '%s_boxes.txt' % wnid) + with open(boxes_file, 'r') as f: + filenames = [x.split('\t')[0] for x in f] + num_images = len(filenames) + + X_train_block = np.zeros((num_images, 3, 64, 64), dtype=dtype) + y_train_block = wnid_to_label[wnid] * np.ones(num_images, dtype=np.int64) + for j, img_file in enumerate(filenames): + img_file = os.path.join(path, 'train', wnid, 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + ## grayscale file + img.shape = (64, 64, 1) + X_train_block[j] = img.transpose(2, 0, 1) + X_train.append(X_train_block) + y_train.append(y_train_block) + + # We need to concatenate all training data + X_train = np.concatenate(X_train, axis=0) + y_train = np.concatenate(y_train, axis=0) + + # Next load validation data + with open(os.path.join(path, 'val', 'val_annotations.txt'), 'r') as f: + img_files = [] + val_wnids = [] + for line in f: + img_file, wnid = line.split('\t')[:2] + img_files.append(img_file) + val_wnids.append(wnid) + num_val = len(img_files) + y_val = np.array([wnid_to_label[wnid] for wnid in val_wnids]) + X_val = np.zeros((num_val, 3, 64, 64), dtype=dtype) + for i, img_file in enumerate(img_files): + img_file = os.path.join(path, 'val', 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + img.shape = (64, 64, 1) + X_val[i] = img.transpose(2, 0, 1) + + # Next load test images + # Students won't have test labels, so we need to iterate over files in the + # images directory. + img_files = os.listdir(os.path.join(path, 'test', 'images')) + X_test = np.zeros((len(img_files), 3, 64, 64), dtype=dtype) + for i, img_file in enumerate(img_files): + img_file = os.path.join(path, 'test', 'images', img_file) + img = imread(img_file) + if img.ndim == 2: + img.shape = (64, 64, 1) + X_test[i] = img.transpose(2, 0, 1) + + y_test = None + y_test_file = os.path.join(path, 'test', 'test_annotations.txt') + if os.path.isfile(y_test_file): + with open(y_test_file, 'r') as f: + img_file_to_wnid = {} + for line in f: + line = line.split('\t') + img_file_to_wnid[line[0]] = line[1] + y_test = [wnid_to_label[img_file_to_wnid[img_file]] for img_file in img_files] + y_test = np.array(y_test) + + return class_names, X_train, y_train, X_val, y_val, X_test, y_test + + +def load_models(models_dir): + """ + Load saved models from disk. This will attempt to unpickle all files in a + directory; any files that give errors on unpickling (such as README.txt) will + be skipped. + + Inputs: + - models_dir: String giving the path to a directory containing model files. + Each model file is a pickled dictionary with a 'model' field. + + Returns: + A dictionary mapping model file names to models. 
+ """ + models = {} + for model_file in os.listdir(models_dir): + with open(os.path.join(models_dir, model_file), 'rb') as f: + try: + models[model_file] = pickle.load(f)['model'] + except pickle.UnpicklingError: + continue + return models diff --git a/assignments2016/assignment2/cs231n/datasets/.gitignore b/assignments2016/assignment2/cs231n/datasets/.gitignore new file mode 100644 index 00000000..0232c3ab --- /dev/null +++ b/assignments2016/assignment2/cs231n/datasets/.gitignore @@ -0,0 +1,4 @@ +cifar-10-batches-py/* +tiny-imagenet-100-A* +tiny-imagenet-100-B* +tiny-100-A-pretrained/* diff --git a/assignments2016/assignment2/cs231n/datasets/get_datasets.sh b/assignments2016/assignment2/cs231n/datasets/get_datasets.sh new file mode 100755 index 00000000..0dd93621 --- /dev/null +++ b/assignments2016/assignment2/cs231n/datasets/get_datasets.sh @@ -0,0 +1,4 @@ +# Get CIFAR10 +wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz +tar -xzvf cifar-10-python.tar.gz +rm cifar-10-python.tar.gz diff --git a/assignments2016/assignment2/cs231n/fast_layers.py b/assignments2016/assignment2/cs231n/fast_layers.py new file mode 100644 index 00000000..2ac8dfb0 --- /dev/null +++ b/assignments2016/assignment2/cs231n/fast_layers.py @@ -0,0 +1,270 @@ +import numpy as np +try: + from cs231n.im2col_cython import col2im_cython, im2col_cython + from cs231n.im2col_cython import col2im_6d_cython +except ImportError: + print 'run the following from the cs231n directory and try again:' + print 'python setup.py build_ext --inplace' + print 'You may also need to restart your iPython kernel' + +from cs231n.im2col import * + + +def conv_forward_im2col(x, w, b, conv_param): + """ + A fast implementation of the forward pass for a convolutional layer + based on im2col and col2im. 
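+ 
+ (Worked example, added for illustration: with stride 1 and pad 1, a 3x3
+ filter over a 32x32 input gives out_height = (32 + 2*1 - 3) / 1 + 1 = 32,
+ so the spatial size is preserved.)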
+ """ + N, C, H, W = x.shape + num_filters, _, filter_height, filter_width = w.shape + stride, pad = conv_param['stride'], conv_param['pad'] + + # Check dimensions + assert (W + 2 * pad - filter_width) % stride == 0, 'width does not work' + assert (H + 2 * pad - filter_height) % stride == 0, 'height does not work' + + # Create output + out_height = (H + 2 * pad - filter_height) / stride + 1 + out_width = (W + 2 * pad - filter_width) / stride + 1 + out = np.zeros((N, num_filters, out_height, out_width), dtype=x.dtype) + + # x_cols = im2col_indices(x, w.shape[2], w.shape[3], pad, stride) + x_cols = im2col_cython(x, w.shape[2], w.shape[3], pad, stride) + res = w.reshape((w.shape[0], -1)).dot(x_cols) + b.reshape(-1, 1) + + out = res.reshape(w.shape[0], out.shape[2], out.shape[3], x.shape[0]) + out = out.transpose(3, 0, 1, 2) + + cache = (x, w, b, conv_param, x_cols) + return out, cache + + +def conv_forward_strides(x, w, b, conv_param): + N, C, H, W = x.shape + F, _, HH, WW = w.shape + stride, pad = conv_param['stride'], conv_param['pad'] + + # Check dimensions + assert (W + 2 * pad - WW) % stride == 0, 'width does not work' + assert (H + 2 * pad - HH) % stride == 0, 'height does not work' + + # Pad the input + p = pad + x_padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), mode='constant') + + # Figure out output dimensions + H += 2 * pad + W += 2 * pad + out_h = (H - HH) / stride + 1 + out_w = (W - WW) / stride + 1 + + # Perform an im2col operation by picking clever strides + shape = (C, HH, WW, N, out_h, out_w) + strides = (H * W, W, 1, C * H * W, stride * W, stride) + strides = x.itemsize * np.array(strides) + x_stride = np.lib.stride_tricks.as_strided(x_padded, + shape=shape, strides=strides) + x_cols = np.ascontiguousarray(x_stride) + x_cols.shape = (C * HH * WW, N * out_h * out_w) + + # Now all our convolutions are a big matrix multiply + res = w.reshape(F, -1).dot(x_cols) + b.reshape(-1, 1) + + # Reshape the output + res.shape = (F, N, out_h, out_w) + out = res.transpose(1, 0, 2, 3) + + # Be nice and return a contiguous array + # The old version of conv_forward_fast doesn't do this, so for a fair + # comparison we won't either + out = np.ascontiguousarray(out) + + cache = (x, w, b, conv_param, x_cols) + return out, cache + + +def conv_backward_strides(dout, cache): + x, w, b, conv_param, x_cols = cache + stride, pad = conv_param['stride'], conv_param['pad'] + + N, C, H, W = x.shape + F, _, HH, WW = w.shape + _, _, out_h, out_w = dout.shape + + db = np.sum(dout, axis=(0, 2, 3)) + + dout_reshaped = dout.transpose(1, 0, 2, 3).reshape(F, -1) + dw = dout_reshaped.dot(x_cols.T).reshape(w.shape) + + dx_cols = w.reshape(F, -1).T.dot(dout_reshaped) + dx_cols.shape = (C, HH, WW, N, out_h, out_w) + dx = col2im_6d_cython(dx_cols, N, C, H, W, HH, WW, pad, stride) + + return dx, dw, db + + +def conv_backward_im2col(dout, cache): + """ + A fast implementation of the backward pass for a convolutional layer + based on im2col and col2im. 
+ """ + x, w, b, conv_param, x_cols = cache + stride, pad = conv_param['stride'], conv_param['pad'] + + db = np.sum(dout, axis=(0, 2, 3)) + + num_filters, _, filter_height, filter_width = w.shape + dout_reshaped = dout.transpose(1, 2, 3, 0).reshape(num_filters, -1) + dw = dout_reshaped.dot(x_cols.T).reshape(w.shape) + + dx_cols = w.reshape(num_filters, -1).T.dot(dout_reshaped) + # dx = col2im_indices(dx_cols, x.shape, filter_height, filter_width, pad, stride) + dx = col2im_cython(dx_cols, x.shape[0], x.shape[1], x.shape[2], x.shape[3], + filter_height, filter_width, pad, stride) + + return dx, dw, db + + +conv_forward_fast = conv_forward_strides +conv_backward_fast = conv_backward_strides + + +def max_pool_forward_fast(x, pool_param): + """ + A fast implementation of the forward pass for a max pooling layer. + + This chooses between the reshape method and the im2col method. If the pooling + regions are square and tile the input image, then we can use the reshape + method which is very fast. Otherwise we fall back on the im2col method, which + is not much faster than the naive method. + """ + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + + same_size = pool_height == pool_width == stride + tiles = H % pool_height == 0 and W % pool_width == 0 + if same_size and tiles: + out, reshape_cache = max_pool_forward_reshape(x, pool_param) + cache = ('reshape', reshape_cache) + else: + out, im2col_cache = max_pool_forward_im2col(x, pool_param) + cache = ('im2col', im2col_cache) + return out, cache + + +def max_pool_backward_fast(dout, cache): + """ + A fast implementation of the backward pass for a max pooling layer. + + This switches between the reshape method and the im2col method depending on + which method was used to generate the cache. + """ + method, real_cache = cache + if method == 'reshape': + return max_pool_backward_reshape(dout, real_cache) + elif method == 'im2col': + return max_pool_backward_im2col(dout, real_cache) + else: + raise ValueError('Unrecognized method "%s"' % method) + + +def max_pool_forward_reshape(x, pool_param): + """ + A fast implementation of the forward pass for the max pooling layer that uses + some clever reshaping. + + This can only be used for square pooling regions that tile the input. + """ + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + assert pool_height == pool_width == stride, 'Invalid pool params' + assert H % pool_height == 0 + assert W % pool_width == 0 + x_reshaped = x.reshape(N, C, H / pool_height, pool_height, + W / pool_width, pool_width) + out = x_reshaped.max(axis=3).max(axis=4) + + cache = (x, x_reshaped, out) + return out, cache + + +def max_pool_backward_reshape(dout, cache): + """ + A fast implementation of the backward pass for the max pooling layer that + uses some clever broadcasting and reshaping. + + This can only be used if the forward pass was computed using + max_pool_forward_reshape. + + NOTE: If there are multiple argmaxes, this method will assign gradient to + ALL argmax elements of the input rather than picking one. In this case the + gradient will actually be incorrect. However this is unlikely to occur in + practice, so it shouldn't matter much. One possible solution is to split the + upstream gradient equally among all argmax elements; this should result in a + valid subgradient. 
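+ (For instance, if two inputs in a pooling window tie for the maximum, each
+ would receive half of the upstream gradient under that scheme.)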
This is what the division by the mask count below does; it carries a + significant performance penalty (about 40% slower than skipping it), but it + keeps the gradient a valid subgradient when ties occur. + """ + x, x_reshaped, out = cache + + dx_reshaped = np.zeros_like(x_reshaped) + out_newaxis = out[:, :, :, np.newaxis, :, np.newaxis] + mask = (x_reshaped == out_newaxis) + dout_newaxis = dout[:, :, :, np.newaxis, :, np.newaxis] + dout_broadcast, _ = np.broadcast_arrays(dout_newaxis, dx_reshaped) + dx_reshaped[mask] = dout_broadcast[mask] + dx_reshaped /= np.sum(mask, axis=(3, 5), keepdims=True) + dx = dx_reshaped.reshape(x.shape) + + return dx + + +def max_pool_forward_im2col(x, pool_param): + """ + An implementation of the forward pass for max pooling based on im2col. + + This isn't much faster than the naive version, so it should be avoided if + possible. + """ + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + + assert (H - pool_height) % stride == 0, 'Invalid height' + assert (W - pool_width) % stride == 0, 'Invalid width' + + out_height = (H - pool_height) / stride + 1 + out_width = (W - pool_width) / stride + 1 + + x_split = x.reshape(N * C, 1, H, W) + x_cols = im2col_indices(x_split, pool_height, pool_width, padding=0, stride=stride) + x_cols_argmax = np.argmax(x_cols, axis=0) + x_cols_max = x_cols[x_cols_argmax, np.arange(x_cols.shape[1])] + out = x_cols_max.reshape(out_height, out_width, N, C).transpose(2, 3, 0, 1) + + cache = (x, x_cols, x_cols_argmax, pool_param) + return out, cache + + +def max_pool_backward_im2col(dout, cache): + """ + An implementation of the backward pass for max pooling based on im2col. + + This isn't much faster than the naive version, so it should be avoided if + possible. 
+ """ + x, x_cols, x_cols_argmax, pool_param = cache + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + + dout_reshaped = dout.transpose(2, 3, 0, 1).flatten() + dx_cols = np.zeros_like(x_cols) + dx_cols[x_cols_argmax, np.arange(dx_cols.shape[1])] = dout_reshaped + dx = col2im_indices(dx_cols, (N * C, 1, H, W), pool_height, pool_width, + padding=0, stride=stride) + dx = dx.reshape(x.shape) + + return dx diff --git a/assignments2016/assignment2/cs231n/gradient_check.py b/assignments2016/assignment2/cs231n/gradient_check.py new file mode 100644 index 00000000..2d6b1f62 --- /dev/null +++ b/assignments2016/assignment2/cs231n/gradient_check.py @@ -0,0 +1,124 @@ +import numpy as np +from random import randrange + +def eval_numerical_gradient(f, x, verbose=True, h=0.00001): + """ + a naive implementation of numerical gradient of f at x + - f should be a function that takes a single argument + - x is the point (numpy array) to evaluate the gradient at + """ + + fx = f(x) # evaluate function value at original point + grad = np.zeros_like(x) + # iterate over all indexes in x + it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite']) + while not it.finished: + + # evaluate function at x+h + ix = it.multi_index + oldval = x[ix] + x[ix] = oldval + h # increment by h + fxph = f(x) # evaluate f(x + h) + x[ix] = oldval - h + fxmh = f(x) # evaluate f(x - h) + x[ix] = oldval # restore + + # compute the partial derivative with centered formula + grad[ix] = (fxph - fxmh) / (2 * h) # the slope + if verbose: + print ix, grad[ix] + it.iternext() # step to next dimension + + return grad + + +def eval_numerical_gradient_array(f, x, df, h=1e-5): + """ + Evaluate a numeric gradient for a function that accepts a numpy + array and returns a numpy array. + """ + grad = np.zeros_like(x) + it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite']) + while not it.finished: + ix = it.multi_index + + oldval = x[ix] + x[ix] = oldval + h + pos = f(x).copy() + x[ix] = oldval - h + neg = f(x).copy() + x[ix] = oldval + + grad[ix] = np.sum((pos - neg) * df) / (2 * h) + it.iternext() + return grad + + +def eval_numerical_gradient_blobs(f, inputs, output, h=1e-5): + """ + Compute numeric gradients for a function that operates on input + and output blobs. + + We assume that f accepts several input blobs as arguments, followed by a blob + into which outputs will be written. For example, f might be called like this: + + f(x, w, out) + + where x and w are input Blobs, and the result of f will be written to out. 
+ + Inputs: + - f: function + - inputs: tuple of input blobs + - output: output blob + - h: step size + """ + numeric_diffs = [] + for input_blob in inputs: + diff = np.zeros_like(input_blob.diffs) + it = np.nditer(input_blob.vals, flags=['multi_index'], + op_flags=['readwrite']) + while not it.finished: + idx = it.multi_index + orig = input_blob.vals[idx] + + input_blob.vals[idx] = orig + h + f(*(inputs + (output,))) + pos = np.copy(output.vals) + input_blob.vals[idx] = orig - h + f(*(inputs + (output,))) + neg = np.copy(output.vals) + input_blob.vals[idx] = orig + + diff[idx] = np.sum((pos - neg) * output.diffs) / (2.0 * h) + + it.iternext() + numeric_diffs.append(diff) + return numeric_diffs + + +def eval_numerical_gradient_net(net, inputs, output, h=1e-5): + return eval_numerical_gradient_blobs(lambda *args: net.forward(), + inputs, output, h=h) + + +def grad_check_sparse(f, x, analytic_grad, num_checks=10, h=1e-5): + """ + Sample a few random elements and only check the numerical gradient + in those dimensions. + """ + + for i in xrange(num_checks): + ix = tuple([randrange(m) for m in x.shape]) + + oldval = x[ix] + x[ix] = oldval + h # increment by h + fxph = f(x) # evaluate f(x + h) + x[ix] = oldval - h # decrement by h + fxmh = f(x) # evaluate f(x - h) + x[ix] = oldval # reset + + grad_numerical = (fxph - fxmh) / (2 * h) + grad_analytic = analytic_grad[ix] + rel_error = abs(grad_numerical - grad_analytic) / (abs(grad_numerical) + abs(grad_analytic)) + print 'numerical: %f analytic: %f, relative error: %e' % (grad_numerical, grad_analytic, rel_error) + diff --git a/assignments2016/assignment2/cs231n/im2col.py b/assignments2016/assignment2/cs231n/im2col.py new file mode 100644 index 00000000..1942eab6 --- /dev/null +++ b/assignments2016/assignment2/cs231n/im2col.py @@ -0,0 +1,55 @@ +import numpy as np + + +def get_im2col_indices(x_shape, field_height, field_width, padding=1, stride=1): + # First figure out what the size of the output should be + N, C, H, W = x_shape + assert (H + 2 * padding - field_height) % stride == 0 + assert (W + 2 * padding - field_width) % stride == 0 + out_height = (H + 2 * padding - field_height) / stride + 1 + out_width = (W + 2 * padding - field_width) / stride + 1 + + i0 = np.repeat(np.arange(field_height), field_width) + i0 = np.tile(i0, C) + i1 = stride * np.repeat(np.arange(out_height), out_width) + j0 = np.tile(np.arange(field_width), field_height * C) + j1 = stride * np.tile(np.arange(out_width), out_height) + i = i0.reshape(-1, 1) + i1.reshape(1, -1) + j = j0.reshape(-1, 1) + j1.reshape(1, -1) + + k = np.repeat(np.arange(C), field_height * field_width).reshape(-1, 1) + + return (k, i, j) + + +def im2col_indices(x, field_height, field_width, padding=1, stride=1): + """ An implementation of im2col based on some fancy indexing """ + # Zero-pad the input + p = padding + x_padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), mode='constant') + + k, i, j = get_im2col_indices(x.shape, field_height, field_width, padding, + stride) + + cols = x_padded[:, k, i, j] + C = x.shape[1] + cols = cols.transpose(1, 2, 0).reshape(field_height * field_width * C, -1) + return cols + + +def col2im_indices(cols, x_shape, field_height=3, field_width=3, padding=1, + stride=1): + """ An implementation of col2im based on fancy indexing and np.add.at """ + N, C, H, W = x_shape + H_padded, W_padded = H + 2 * padding, W + 2 * padding + x_padded = np.zeros((N, C, H_padded, W_padded), dtype=cols.dtype) + k, i, j = get_im2col_indices(x_shape, field_height, field_width, padding, + stride) + 
cols_reshaped = cols.reshape(C * field_height * field_width, -1, N) + cols_reshaped = cols_reshaped.transpose(2, 0, 1) + np.add.at(x_padded, (slice(None), k, i, j), cols_reshaped) + if padding == 0: + return x_padded + return x_padded[:, :, padding:-padding, padding:-padding] + +pass diff --git a/assignments2016/assignment2/cs231n/im2col_cython.pyx b/assignments2016/assignment2/cs231n/im2col_cython.pyx new file mode 100644 index 00000000..d6e33c6f --- /dev/null +++ b/assignments2016/assignment2/cs231n/im2col_cython.pyx @@ -0,0 +1,121 @@ +import numpy as np +cimport numpy as np +cimport cython + +# DTYPE = np.float64 +# ctypedef np.float64_t DTYPE_t + +ctypedef fused DTYPE_t: + np.float32_t + np.float64_t + +def im2col_cython(np.ndarray[DTYPE_t, ndim=4] x, int field_height, + int field_width, int padding, int stride): + cdef int N = x.shape[0] + cdef int C = x.shape[1] + cdef int H = x.shape[2] + cdef int W = x.shape[3] + + cdef int HH = (H + 2 * padding - field_height) / stride + 1 + cdef int WW = (W + 2 * padding - field_width) / stride + 1 + + cdef int p = padding + cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.pad(x, + ((0, 0), (0, 0), (p, p), (p, p)), mode='constant') + + cdef np.ndarray[DTYPE_t, ndim=2] cols = np.zeros( + (C * field_height * field_width, N * HH * WW), + dtype=x.dtype) + + # Moving the inner loop to a C function with no bounds checking works, but does + # not seem to help performance in any measurable way. + + im2col_cython_inner(cols, x_padded, N, C, H, W, HH, WW, + field_height, field_width, padding, stride) + return cols + + +@cython.boundscheck(False) +cdef int im2col_cython_inner(np.ndarray[DTYPE_t, ndim=2] cols, + np.ndarray[DTYPE_t, ndim=4] x_padded, + int N, int C, int H, int W, int HH, int WW, + int field_height, int field_width, int padding, int stride) except? -1: + cdef int c, ii, jj, row, yy, xx, i, col + + for c in range(C): + for yy in range(HH): + for xx in range(WW): + for ii in range(field_height): + for jj in range(field_width): + row = c * field_width * field_height + ii * field_height + jj + for i in range(N): + col = yy * WW * N + xx * N + i + cols[row, col] = x_padded[i, c, stride * yy + ii, stride * xx + jj] + + + +def col2im_cython(np.ndarray[DTYPE_t, ndim=2] cols, int N, int C, int H, int W, + int field_height, int field_width, int padding, int stride): + cdef np.ndarray x = np.empty((N, C, H, W), dtype=cols.dtype) + cdef int HH = (H + 2 * padding - field_height) / stride + 1 + cdef int WW = (W + 2 * padding - field_width) / stride + 1 + cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.zeros((N, C, H + 2 * padding, W + 2 * padding), + dtype=cols.dtype) + + # Moving the inner loop to a C-function with no bounds checking improves + # performance quite a bit for col2im. + col2im_cython_inner(cols, x_padded, N, C, H, W, HH, WW, + field_height, field_width, padding, stride) + if padding > 0: + return x_padded[:, :, padding:-padding, padding:-padding] + return x_padded + + +@cython.boundscheck(False) +cdef int col2im_cython_inner(np.ndarray[DTYPE_t, ndim=2] cols, + np.ndarray[DTYPE_t, ndim=4] x_padded, + int N, int C, int H, int W, int HH, int WW, + int field_height, int field_width, int padding, int stride) except? 
-1: + cdef int c, ii, jj, row, yy, xx, i, col + + for c in range(C): + for ii in range(field_height): + for jj in range(field_width): + row = c * field_width * field_height + ii * field_height + jj + for yy in range(HH): + for xx in range(WW): + for i in range(N): + col = yy * WW * N + xx * N + i + x_padded[i, c, stride * yy + ii, stride * xx + jj] += cols[row, col] + + +@cython.boundscheck(False) +@cython.wraparound(False) +cdef col2im_6d_cython_inner(np.ndarray[DTYPE_t, ndim=6] cols, + np.ndarray[DTYPE_t, ndim=4] x_padded, + int N, int C, int H, int W, int HH, int WW, + int out_h, int out_w, int pad, int stride): + + cdef int c, hh, ww, n, h, w + for n in range(N): + for c in range(C): + for hh in range(HH): + for ww in range(WW): + for h in range(out_h): + for w in range(out_w): + x_padded[n, c, stride * h + hh, stride * w + ww] += cols[c, hh, ww, n, h, w] + + +def col2im_6d_cython(np.ndarray[DTYPE_t, ndim=6] cols, int N, int C, int H, int W, + int HH, int WW, int pad, int stride): + cdef np.ndarray x = np.empty((N, C, H, W), dtype=cols.dtype) + cdef int out_h = (H + 2 * pad - HH) / stride + 1 + cdef int out_w = (W + 2 * pad - WW) / stride + 1 + cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.zeros((N, C, H + 2 * pad, W + 2 * pad), + dtype=cols.dtype) + + col2im_6d_cython_inner(cols, x_padded, N, C, H, W, HH, WW, out_h, out_w, pad, stride) + + if pad > 0: + return x_padded[:, :, pad:-pad, pad:-pad] + return x_padded diff --git a/assignments2016/assignment2/cs231n/layer_utils.py b/assignments2016/assignment2/cs231n/layer_utils.py new file mode 100644 index 00000000..c4989618 --- /dev/null +++ b/assignments2016/assignment2/cs231n/layer_utils.py @@ -0,0 +1,93 @@ +from cs231n.layers import * +from cs231n.fast_layers import * + + +def affine_relu_forward(x, w, b): + """ + Convenience layer that performs an affine transform followed by a ReLU + + Inputs: + - x: Input to the affine layer + - w, b: Weights for the affine layer + + Returns a tuple of: + - out: Output from the ReLU + - cache: Object to give to the backward pass + """ + a, fc_cache = affine_forward(x, w, b) + out, relu_cache = relu_forward(a) + cache = (fc_cache, relu_cache) + return out, cache + + +def affine_relu_backward(dout, cache): + """ + Backward pass for the affine-relu convenience layer + """ + fc_cache, relu_cache = cache + da = relu_backward(dout, relu_cache) + dx, dw, db = affine_backward(da, fc_cache) + return dx, dw, db + + +pass + + +def conv_relu_forward(x, w, b, conv_param): + """ + A convenience layer that performs a convolution followed by a ReLU. + + Inputs: + - x: Input to the convolutional layer + - w, b, conv_param: Weights and parameters for the convolutional layer + + Returns a tuple of: + - out: Output from the ReLU + - cache: Object to give to the backward pass + """ + a, conv_cache = conv_forward_fast(x, w, b, conv_param) + out, relu_cache = relu_forward(a) + cache = (conv_cache, relu_cache) + return out, cache + + +def conv_relu_backward(dout, cache): + """ + Backward pass for the conv-relu convenience layer. + """ + conv_cache, relu_cache = cache + da = relu_backward(dout, relu_cache) + dx, dw, db = conv_backward_fast(da, conv_cache) + return dx, dw, db + + +def conv_relu_pool_forward(x, w, b, conv_param, pool_param): + """ + Convenience layer that performs a convolution, a ReLU, and a pool. 
+ + Inputs: + - x: Input to the convolutional layer + - w, b, conv_param: Weights and parameters for the convolutional layer + - pool_param: Parameters for the pooling layer + + Returns a tuple of: + - out: Output from the pooling layer + - cache: Object to give to the backward pass + """ + a, conv_cache = conv_forward_fast(x, w, b, conv_param) + s, relu_cache = relu_forward(a) + out, pool_cache = max_pool_forward_fast(s, pool_param) + cache = (conv_cache, relu_cache, pool_cache) + return out, cache + + +def conv_relu_pool_backward(dout, cache): + """ + Backward pass for the conv-relu-pool convenience layer + """ + conv_cache, relu_cache, pool_cache = cache + ds = max_pool_backward_fast(dout, pool_cache) + da = relu_backward(ds, relu_cache) + dx, dw, db = conv_backward_fast(da, conv_cache) + return dx, dw, db + diff --git a/assignments2016/assignment2/cs231n/layers.py b/assignments2016/assignment2/cs231n/layers.py new file mode 100644 index 00000000..3a716cf2 --- /dev/null +++ b/assignments2016/assignment2/cs231n/layers.py @@ -0,0 +1,554 @@ +import numpy as np + + +def affine_forward(x, w, b): + """ + Computes the forward pass for an affine (fully-connected) layer. + + The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N + examples, where each example x[i] has shape (d_1, ..., d_k). We will + reshape each input into a vector of dimension D = d_1 * ... * d_k, and + then transform it to an output vector of dimension M. + + Inputs: + - x: A numpy array containing input data, of shape (N, d_1, ..., d_k) + - w: A numpy array of weights, of shape (D, M) + - b: A numpy array of biases, of shape (M,) + + Returns a tuple of: + - out: output, of shape (N, M) + - cache: (x, w, b) + """ + out = None + ############################################################################# + # TODO: Implement the affine forward pass. Store the result in out. You # + # will need to reshape the input into rows. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + cache = (x, w, b) + return out, cache + + +def affine_backward(dout, cache): + """ + Computes the backward pass for an affine layer. + + Inputs: + - dout: Upstream derivative, of shape (N, M) + - cache: Tuple of: + - x: Input data, of shape (N, d_1, ... d_k) + - w: Weights, of shape (D, M) + + Returns a tuple of: + - dx: Gradient with respect to x, of shape (N, d1, ..., d_k) + - dw: Gradient with respect to w, of shape (D, M) + - db: Gradient with respect to b, of shape (M,) + """ + x, w, b = cache + dx, dw, db = None, None, None + ############################################################################# + # TODO: Implement the affine backward pass. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + return dx, dw, db + + +def relu_forward(x): + """ + Computes the forward pass for a layer of rectified linear units (ReLUs). + + Input: + - x: Inputs, of any shape + + Returns a tuple of: + - out: Output, of the same shape as x + - cache: x + """ + out = None + ############################################################################# + # TODO: Implement the ReLU forward pass. 
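+ # (Sketch, not part of the starter code: an elementwise #
+ # out = np.maximum(0, x) is one way to do it.) #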
# + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + cache = x + return out, cache + + +def relu_backward(dout, cache): + """ + Computes the backward pass for a layer of rectified linear units (ReLUs). + + Input: + - dout: Upstream derivatives, of any shape + - cache: Input x, of same shape as dout + + Returns: + - dx: Gradient with respect to x + """ + dx, x = None, cache + ############################################################################# + # TODO: Implement the ReLU backward pass. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + return dx + + +def batchnorm_forward(x, gamma, beta, bn_param): + """ + Forward pass for batch normalization. + + During training the sample mean and (uncorrected) sample variance are + computed from minibatch statistics and used to normalize the incoming data. + During training we also keep an exponentially decaying running mean of the mean + and variance of each feature, and these averages are used to normalize data + at test-time. + + At each timestep we update the running averages for mean and variance using + an exponential decay based on the momentum parameter: + + running_mean = momentum * running_mean + (1 - momentum) * sample_mean + running_var = momentum * running_var + (1 - momentum) * sample_var + + Note that the batch normalization paper suggests a different test-time + behavior: they compute sample mean and variance for each feature using a + large number of training images rather than using a running average. For + this implementation we have chosen to use running averages instead since + they do not require an additional estimation step; the torch7 implementation + of batch normalization also uses running averages. + + Input: + - x: Data of shape (N, D) + - gamma: Scale parameter of shape (D,) + - beta: Shift paremeter of shape (D,) + - bn_param: Dictionary with the following keys: + - mode: 'train' or 'test'; required + - eps: Constant for numeric stability + - momentum: Constant for running mean / variance. + - running_mean: Array of shape (D,) giving running mean of features + - running_var Array of shape (D,) giving running variance of features + + Returns a tuple of: + - out: of shape (N, D) + - cache: A tuple of values needed in the backward pass + """ + mode = bn_param['mode'] + eps = bn_param.get('eps', 1e-5) + momentum = bn_param.get('momentum', 0.9) + + N, D = x.shape + running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype)) + running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype)) + + out, cache = None, None + if mode == 'train': + ############################################################################# + # TODO: Implement the training-time forward pass for batch normalization. # + # Use minibatch statistics to compute the mean and variance, use these # + # statistics to normalize the incoming data, and scale and shift the # + # normalized data using gamma and beta. # + # # + # You should store the output in the variable out. Any intermediates that # + # you need for the backward pass should be stored in the cache variable. 
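+    # One possible sketch of the normalization step (variable names are
+    # illustrative, not required; x has shape (N, D)):
+    #
+    #   sample_mean = x.mean(axis=0)
+    #   sample_var = x.var(axis=0)
+    #   x_hat = (x - sample_mean) / np.sqrt(sample_var + eps)
+    #   out = gamma * x_hat + beta
+    #   cache = (x_hat, gamma, sample_var, eps)  # whatever backward needs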
# + # # + # You should also use your computed sample mean and variance together with # + # the momentum variable to update the running mean and running variance, # + # storing your result in the running_mean and running_var variables. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + elif mode == 'test': + ############################################################################# + # TODO: Implement the test-time forward pass for batch normalization. Use # + # the running mean and variance to normalize the incoming data, then scale # + # and shift the normalized data using gamma and beta. Store the result in # + # the out variable. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + else: + raise ValueError('Invalid forward batchnorm mode "%s"' % mode) + + # Store the updated running means back into bn_param + bn_param['running_mean'] = running_mean + bn_param['running_var'] = running_var + + return out, cache + + +def batchnorm_backward(dout, cache): + """ + Backward pass for batch normalization. + + For this implementation, you should write out a computation graph for + batch normalization on paper and propagate gradients backward through + intermediate nodes. + + Inputs: + - dout: Upstream derivatives, of shape (N, D) + - cache: Variable of intermediates from batchnorm_forward. + + Returns a tuple of: + - dx: Gradient with respect to inputs x, of shape (N, D) + - dgamma: Gradient with respect to scale parameter gamma, of shape (D,) + - dbeta: Gradient with respect to shift parameter beta, of shape (D,) + """ + dx, dgamma, dbeta = None, None, None + ############################################################################# + # TODO: Implement the backward pass for batch normalization. Store the # + # results in the dx, dgamma, and dbeta variables. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + return dx, dgamma, dbeta + + +def batchnorm_backward_alt(dout, cache): + """ + Alternative backward pass for batch normalization. + + For this implementation you should work out the derivatives for the batch + normalizaton backward pass on paper and simplify as much as possible. You + should be able to derive a simple expression for the backward pass. + + Note: This implementation should expect to receive the same cache variable + as batchnorm_backward, but might not use all of the values in the cache. + + Inputs / outputs: Same as batchnorm_backward + """ + dx, dgamma, dbeta = None, None, None + ############################################################################# + # TODO: Implement the backward pass for batch normalization. Store the # + # results in the dx, dgamma, and dbeta variables. # + # # + # After computing the gradient with respect to the centered inputs, you # + # should be able to compute gradients with respect to the inputs in a # + # single statement; our implementation fits on a single 80-character line. 
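+  # For reference, one common simplified form derived on paper (names follow
+  # the illustrative sketch in batchnorm_forward; adapt to your own cache):
+  #
+  #   dbeta = dout.sum(axis=0)
+  #   dgamma = (x_hat * dout).sum(axis=0)
+  #   dxhat = dout * gamma
+  #   dx = dxhat - dxhat.mean(axis=0) - x_hat * (dxhat * x_hat).mean(axis=0)
+  #   dx /= np.sqrt(sample_var + eps)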
# + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + return dx, dgamma, dbeta + + +def dropout_forward(x, dropout_param): + """ + Performs the forward pass for (inverted) dropout. + + Inputs: + - x: Input data, of any shape + - dropout_param: A dictionary with the following keys: + - p: Dropout parameter. We drop each neuron output with probability p. + - mode: 'test' or 'train'. If the mode is train, then perform dropout; + if the mode is test, then just return the input. + - seed: Seed for the random number generator. Passing seed makes this + function deterministic, which is needed for gradient checking but not in + real networks. + + Outputs: + - out: Array of the same shape as x. + - cache: A tuple (dropout_param, mask). In training mode, mask is the dropout + mask that was used to multiply the input; in test mode, mask is None. + """ + p, mode = dropout_param['p'], dropout_param['mode'] + if 'seed' in dropout_param: + np.random.seed(dropout_param['seed']) + + mask = None + out = None + + if mode == 'train': + ########################################################################### + # TODO: Implement the training phase forward pass for inverted dropout. # + # Store the dropout mask in the mask variable. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + elif mode == 'test': + ########################################################################### + # TODO: Implement the test phase forward pass for inverted dropout. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + + cache = (dropout_param, mask) + out = out.astype(x.dtype, copy=False) + + return out, cache + + +def dropout_backward(dout, cache): + """ + Perform the backward pass for (inverted) dropout. + + Inputs: + - dout: Upstream derivatives, of any shape + - cache: (dropout_param, mask) from dropout_forward. + """ + dropout_param, mask = cache + mode = dropout_param['mode'] + + dx = None + if mode == 'train': + ########################################################################### + # TODO: Implement the training phase backward pass for inverted dropout. # + ########################################################################### + pass + ########################################################################### + # END OF YOUR CODE # + ########################################################################### + elif mode == 'test': + dx = dout + return dx + + +def conv_forward_naive(x, w, b, conv_param): + """ + A naive implementation of the forward pass for a convolutional layer. + + The input consists of N data points, each with C channels, height H and width + W. We convolve each input with F different filters, where each filter spans + all C channels and has height HH and width HH. 
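+  As a concrete check of the output-shape formula given below: with
+  H = W = 32, HH = WW = 5, stride = 1, and pad = 2, the output spatial size
+  is 1 + (32 + 2 * 2 - 5) / 1 = 32, i.e. the spatial size is preserved.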
+ + Input: + - x: Input data of shape (N, C, H, W) + - w: Filter weights of shape (F, C, HH, WW) + - b: Biases, of shape (F,) + - conv_param: A dictionary with the following keys: + - 'stride': The number of pixels between adjacent receptive fields in the + horizontal and vertical directions. + - 'pad': The number of pixels that will be used to zero-pad the input. + + Returns a tuple of: + - out: Output data, of shape (N, F, H', W') where H' and W' are given by + H' = 1 + (H + 2 * pad - HH) / stride + W' = 1 + (W + 2 * pad - WW) / stride + - cache: (x, w, b, conv_param) + """ + out = None + ############################################################################# + # TODO: Implement the convolutional forward pass. # + # Hint: you can use the function np.pad for padding. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + cache = (x, w, b, conv_param) + return out, cache + + +def conv_backward_naive(dout, cache): + """ + A naive implementation of the backward pass for a convolutional layer. + + Inputs: + - dout: Upstream derivatives. + - cache: A tuple of (x, w, b, conv_param) as in conv_forward_naive + + Returns a tuple of: + - dx: Gradient with respect to x + - dw: Gradient with respect to w + - db: Gradient with respect to b + """ + dx, dw, db = None, None, None + ############################################################################# + # TODO: Implement the convolutional backward pass. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + return dx, dw, db + + +def max_pool_forward_naive(x, pool_param): + """ + A naive implementation of the forward pass for a max pooling layer. + + Inputs: + - x: Input data, of shape (N, C, H, W) + - pool_param: dictionary with the following keys: + - 'pool_height': The height of each pooling region + - 'pool_width': The width of each pooling region + - 'stride': The distance between adjacent pooling regions + + Returns a tuple of: + - out: Output data + - cache: (x, pool_param) + """ + out = None + ############################################################################# + # TODO: Implement the max pooling forward pass # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + cache = (x, pool_param) + return out, cache + + +def max_pool_backward_naive(dout, cache): + """ + A naive implementation of the backward pass for a max pooling layer. + + Inputs: + - dout: Upstream derivatives + - cache: A tuple of (x, pool_param) as in the forward pass. 
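+  Note: the upstream gradient is routed only to the input element that
+  achieved the max in each pooling region; argmax-based implementations
+  typically send it to the first maximal element in case of ties.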
+ + Returns: + - dx: Gradient with respect to x + """ + dx = None + ############################################################################# + # TODO: Implement the max pooling backward pass # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + return dx + + +def spatial_batchnorm_forward(x, gamma, beta, bn_param): + """ + Computes the forward pass for spatial batch normalization. + + Inputs: + - x: Input data of shape (N, C, H, W) + - gamma: Scale parameter, of shape (C,) + - beta: Shift parameter, of shape (C,) + - bn_param: Dictionary with the following keys: + - mode: 'train' or 'test'; required + - eps: Constant for numeric stability + - momentum: Constant for running mean / variance. momentum=0 means that + old information is discarded completely at every time step, while + momentum=1 means that new information is never incorporated. The + default of momentum=0.9 should work well in most situations. + - running_mean: Array of shape (D,) giving running mean of features + - running_var Array of shape (D,) giving running variance of features + + Returns a tuple of: + - out: Output data, of shape (N, C, H, W) + - cache: Values needed for the backward pass + """ + out, cache = None, None + + ############################################################################# + # TODO: Implement the forward pass for spatial batch normalization. # + # # + # HINT: You can implement spatial batch normalization using the vanilla # + # version of batch normalization defined above. Your implementation should # + # be very short; ours is less than five lines. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + return out, cache + + +def spatial_batchnorm_backward(dout, cache): + """ + Computes the backward pass for spatial batch normalization. + + Inputs: + - dout: Upstream derivatives, of shape (N, C, H, W) + - cache: Values from the forward pass + + Returns a tuple of: + - dx: Gradient with respect to inputs, of shape (N, C, H, W) + - dgamma: Gradient with respect to scale parameter, of shape (C,) + - dbeta: Gradient with respect to shift parameter, of shape (C,) + """ + dx, dgamma, dbeta = None, None, None + + ############################################################################# + # TODO: Implement the backward pass for spatial batch normalization. # + # # + # HINT: You can implement spatial batch normalization using the vanilla # + # version of batch normalization defined above. Your implementation should # + # be very short; ours is less than five lines. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + return dx, dgamma, dbeta + + +def svm_loss(x, y): + """ + Computes the loss and gradient using for multiclass SVM classification. + + Inputs: + - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class + for the ith input. 
+ - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and + 0 <= y[i] < C + + Returns a tuple of: + - loss: Scalar giving the loss + - dx: Gradient of the loss with respect to x + """ + N = x.shape[0] + correct_class_scores = x[np.arange(N), y] + margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0) + margins[np.arange(N), y] = 0 + loss = np.sum(margins) / N + num_pos = np.sum(margins > 0, axis=1) + dx = np.zeros_like(x) + dx[margins > 0] = 1 + dx[np.arange(N), y] -= num_pos + dx /= N + return loss, dx + + +def softmax_loss(x, y): + """ + Computes the loss and gradient for softmax classification. + + Inputs: + - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class + for the ith input. + - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and + 0 <= y[i] < C + + Returns a tuple of: + - loss: Scalar giving the loss + - dx: Gradient of the loss with respect to x + """ + probs = np.exp(x - np.max(x, axis=1, keepdims=True)) + probs /= np.sum(probs, axis=1, keepdims=True) + N = x.shape[0] + loss = -np.sum(np.log(probs[np.arange(N), y])) / N + dx = probs.copy() + dx[np.arange(N), y] -= 1 + dx /= N + return loss, dx diff --git a/assignments2016/assignment2/cs231n/optim.py b/assignments2016/assignment2/cs231n/optim.py new file mode 100644 index 00000000..ee84a73b --- /dev/null +++ b/assignments2016/assignment2/cs231n/optim.py @@ -0,0 +1,149 @@ +import numpy as np + +""" +This file implements various first-order update rules that are commonly used for +training neural networks. Each update rule accepts current weights and the +gradient of the loss with respect to those weights and produces the next set of +weights. Each update rule has the same interface: + +def update(w, dw, config=None): + +Inputs: + - w: A numpy array giving the current weights. + - dw: A numpy array of the same shape as w giving the gradient of the + loss with respect to w. + - config: A dictionary containing hyperparameter values such as learning rate, + momentum, etc. If the update rule requires caching values over many + iterations, then config will also hold these cached values. + +Returns: + - next_w: The next point after the update. + - config: The config dictionary to be passed to the next iteration of the + update rule. + +NOTE: For most update rules, the default learning rate will probably not perform +well; however the default values of the other hyperparameters should work well +for a variety of different problems. + +For efficiency, update rules may perform in-place updates, mutating w and +setting next_w equal to w. +""" + + +def sgd(w, dw, config=None): + """ + Performs vanilla stochastic gradient descent. + + config format: + - learning_rate: Scalar learning rate. + """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-2) + + w -= config['learning_rate'] * dw + return w, config + + +def sgd_momentum(w, dw, config=None): + """ + Performs stochastic gradient descent with momentum. + + config format: + - learning_rate: Scalar learning rate. + - momentum: Scalar between 0 and 1 giving the momentum value. + Setting momentum = 0 reduces to sgd. + - velocity: A numpy array of the same shape as w and dw used to store a moving + average of the gradients. 
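+
+  For reference, one standard formulation of the update (sign conventions
+  vary across references) is:
+
+    v = momentum * v - learning_rate * dw
+    next_w = w + v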
+ """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-2) + config.setdefault('momentum', 0.9) + v = config.get('velocity', np.zeros_like(w)) + + next_w = None + ############################################################################# + # TODO: Implement the momentum update formula. Store the updated value in # + # the next_w variable. You should also use and update the velocity v. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + config['velocity'] = v + + return next_w, config + + + +def rmsprop(x, dx, config=None): + """ + Uses the RMSProp update rule, which uses a moving average of squared gradient + values to set adaptive per-parameter learning rates. + + config format: + - learning_rate: Scalar learning rate. + - decay_rate: Scalar between 0 and 1 giving the decay rate for the squared + gradient cache. + - epsilon: Small scalar used for smoothing to avoid dividing by zero. + - cache: Moving average of second moments of gradients. + """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-2) + config.setdefault('decay_rate', 0.99) + config.setdefault('epsilon', 1e-8) + config.setdefault('cache', np.zeros_like(x)) + + next_x = None + ############################################################################# + # TODO: Implement the RMSprop update formula, storing the next value of x # + # in the next_x variable. Don't forget to update cache value stored in # + # config['cache']. # + ############################################################################# + pass + ############################################################################# + # END OF YOUR CODE # + ############################################################################# + + return next_x, config + + +def adam(x, dx, config=None): + """ + Uses the Adam update rule, which incorporates moving averages of both the + gradient and its square and a bias correction term. + + config format: + - learning_rate: Scalar learning rate. + - beta1: Decay rate for moving average of first moment of gradient. + - beta2: Decay rate for moving average of second moment of gradient. + - epsilon: Small scalar used for smoothing to avoid dividing by zero. + - m: Moving average of gradient. + - v: Moving average of squared gradient. + - t: Iteration number. + """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-3) + config.setdefault('beta1', 0.9) + config.setdefault('beta2', 0.999) + config.setdefault('epsilon', 1e-8) + config.setdefault('m', np.zeros_like(x)) + config.setdefault('v', np.zeros_like(x)) + config.setdefault('t', 0) + + next_x = None + ############################################################################# + # TODO: Implement the Adam update formula, storing the next value of x in # + # the next_x variable. Don't forget to update the m, v, and t variables # + # stored in config. 
+  #############################################################################
+  pass
+  #############################################################################
+  #                             END OF YOUR CODE                              #
+  #############################################################################
+
+  return next_x, config
+
+
+
+
diff --git a/assignments2016/assignment2/cs231n/setup.py b/assignments2016/assignment2/cs231n/setup.py
new file mode 100644
index 00000000..9a2e6ca0
--- /dev/null
+++ b/assignments2016/assignment2/cs231n/setup.py
@@ -0,0 +1,14 @@
+from distutils.core import setup
+from distutils.extension import Extension
+from Cython.Build import cythonize
+import numpy
+
+extensions = [
+  Extension('im2col_cython', ['im2col_cython.pyx'],
+            include_dirs = [numpy.get_include()]
+  ),
+]
+
+setup(
+    ext_modules = cythonize(extensions),
+)
diff --git a/assignments2016/assignment2/cs231n/solver.py b/assignments2016/assignment2/cs231n/solver.py
new file mode 100644
index 00000000..02f2726c
--- /dev/null
+++ b/assignments2016/assignment2/cs231n/solver.py
@@ -0,0 +1,266 @@
+import numpy as np
+
+from cs231n import optim
+
+
+class Solver(object):
+  """
+  A Solver encapsulates all the logic necessary for training classification
+  models. The Solver performs stochastic gradient descent using different
+  update rules defined in optim.py.
+
+  The solver accepts both training and validation data and labels so it can
+  periodically check classification accuracy on both training and validation
+  data to watch out for overfitting.
+
+  To train a model, you will first construct a Solver instance, passing the
+  model, dataset, and various options (learning rate, batch size, etc) to the
+  constructor. You will then call the train() method to run the optimization
+  procedure and train the model.
+
+  After the train() method returns, model.params will contain the parameters
+  that performed best on the validation set over the course of training.
+  In addition, the instance variable solver.loss_history will contain a list
+  of all losses encountered during training and the instance variables
+  solver.train_acc_history and solver.val_acc_history will be lists containing
+  the accuracies of the model on the training and validation set at each epoch.
+
+  Example usage might look something like this:
+
+  data = {
+    'X_train': # training data
+    'y_train': # training labels
+    'X_val': # validation data
+    'y_val': # validation labels
+  }
+  model = MyAwesomeModel(hidden_size=100, reg=10)
+  solver = Solver(model, data,
+                  update_rule='sgd',
+                  optim_config={
+                    'learning_rate': 1e-3,
+                  },
+                  lr_decay=0.95,
+                  num_epochs=10, batch_size=100,
+                  print_every=100)
+  solver.train()
+
+
+  A Solver works on a model object that must conform to the following API:
+
+  - model.params must be a dictionary mapping string parameter names to numpy
+    arrays containing parameter values.
+
+  - model.loss(X, y) must be a function that computes training-time loss and
+    gradients, and test-time classification scores, with the following inputs
+    and outputs:
+
+    Inputs:
+    - X: Array giving a minibatch of input data of shape (N, d_1, ..., d_k)
+    - y: Array of labels, of shape (N,) giving labels for X where y[i] is the
+      label for X[i].
+
+    Returns:
+    If y is None, run a test-time forward pass and return:
+    - scores: Array of shape (N, C) giving classification scores for X where
+      scores[i, c] gives the score of class c for X[i].
+ + If y is not None, run a training time forward and backward pass and return + a tuple of: + - loss: Scalar giving the loss + - grads: Dictionary with the same keys as self.params mapping parameter + names to gradients of the loss with respect to those parameters. + """ + + def __init__(self, model, data, **kwargs): + """ + Construct a new Solver instance. + + Required arguments: + - model: A model object conforming to the API described above + - data: A dictionary of training and validation data with the following: + 'X_train': Array of shape (N_train, d_1, ..., d_k) giving training images + 'X_val': Array of shape (N_val, d_1, ..., d_k) giving validation images + 'y_train': Array of shape (N_train,) giving labels for training images + 'y_val': Array of shape (N_val,) giving labels for validation images + + Optional arguments: + - update_rule: A string giving the name of an update rule in optim.py. + Default is 'sgd'. + - optim_config: A dictionary containing hyperparameters that will be + passed to the chosen update rule. Each update rule requires different + hyperparameters (see optim.py) but all update rules require a + 'learning_rate' parameter so that should always be present. + - lr_decay: A scalar for learning rate decay; after each epoch the learning + rate is multiplied by this value. + - batch_size: Size of minibatches used to compute loss and gradient during + training. + - num_epochs: The number of epochs to run for during training. + - print_every: Integer; training losses will be printed every print_every + iterations. + - verbose: Boolean; if set to false then no output will be printed during + training. + """ + self.model = model + self.X_train = data['X_train'] + self.y_train = data['y_train'] + self.X_val = data['X_val'] + self.y_val = data['y_val'] + + # Unpack keyword arguments + self.update_rule = kwargs.pop('update_rule', 'sgd') + self.optim_config = kwargs.pop('optim_config', {}) + self.lr_decay = kwargs.pop('lr_decay', 1.0) + self.batch_size = kwargs.pop('batch_size', 100) + self.num_epochs = kwargs.pop('num_epochs', 10) + + self.print_every = kwargs.pop('print_every', 10) + self.verbose = kwargs.pop('verbose', True) + + # Throw an error if there are extra keyword arguments + if len(kwargs) > 0: + extra = ', '.join('"%s"' % k for k in kwargs.keys()) + raise ValueError('Unrecognized arguments %s' % extra) + + # Make sure the update rule exists, then replace the string + # name with the actual function + if not hasattr(optim, self.update_rule): + raise ValueError('Invalid update_rule "%s"' % self.update_rule) + self.update_rule = getattr(optim, self.update_rule) + + self._reset() + + + def _reset(self): + """ + Set up some book-keeping variables for optimization. Don't call this + manually. + """ + # Set up some variables for book-keeping + self.epoch = 0 + self.best_val_acc = 0 + self.best_params = {} + self.loss_history = [] + self.train_acc_history = [] + self.val_acc_history = [] + + # Make a deep copy of the optim_config for each parameter + self.optim_configs = {} + for p in self.model.params: + d = {k: v for k, v in self.optim_config.iteritems()} + self.optim_configs[p] = d + + + def _step(self): + """ + Make a single gradient update. This is called by train() and should not + be called manually. 
+ """ + # Make a minibatch of training data + num_train = self.X_train.shape[0] + batch_mask = np.random.choice(num_train, self.batch_size) + X_batch = self.X_train[batch_mask] + y_batch = self.y_train[batch_mask] + + # Compute loss and gradient + loss, grads = self.model.loss(X_batch, y_batch) + self.loss_history.append(loss) + + # Perform a parameter update + for p, w in self.model.params.iteritems(): + dw = grads[p] + config = self.optim_configs[p] + next_w, next_config = self.update_rule(w, dw, config) + self.model.params[p] = next_w + self.optim_configs[p] = next_config + + + def check_accuracy(self, X, y, num_samples=None, batch_size=100): + """ + Check accuracy of the model on the provided data. + + Inputs: + - X: Array of data, of shape (N, d_1, ..., d_k) + - y: Array of labels, of shape (N,) + - num_samples: If not None, subsample the data and only test the model + on num_samples datapoints. + - batch_size: Split X and y into batches of this size to avoid using too + much memory. + + Returns: + - acc: Scalar giving the fraction of instances that were correctly + classified by the model. + """ + + # Maybe subsample the data + N = X.shape[0] + if num_samples is not None and N > num_samples: + mask = np.random.choice(N, num_samples) + N = num_samples + X = X[mask] + y = y[mask] + + # Compute predictions in batches + num_batches = N / batch_size + if N % batch_size != 0: + num_batches += 1 + y_pred = [] + for i in xrange(num_batches): + start = i * batch_size + end = (i + 1) * batch_size + scores = self.model.loss(X[start:end]) + y_pred.append(np.argmax(scores, axis=1)) + y_pred = np.hstack(y_pred) + acc = np.mean(y_pred == y) + + return acc + + + def train(self): + """ + Run optimization to train the model. + """ + num_train = self.X_train.shape[0] + iterations_per_epoch = max(num_train / self.batch_size, 1) + num_iterations = self.num_epochs * iterations_per_epoch + + for t in xrange(num_iterations): + self._step() + + # Maybe print training loss + if self.verbose and t % self.print_every == 0: + print '(Iteration %d / %d) loss: %f' % ( + t + 1, num_iterations, self.loss_history[-1]) + + # At the end of every epoch, increment the epoch counter and decay the + # learning rate. + epoch_end = (t + 1) % iterations_per_epoch == 0 + if epoch_end: + self.epoch += 1 + for k in self.optim_configs: + self.optim_configs[k]['learning_rate'] *= self.lr_decay + + # Check train and val accuracy on the first iteration, the last + # iteration, and at the end of each epoch. 
+ first_it = (t == 0) + last_it = (t == num_iterations + 1) + if first_it or last_it or epoch_end: + train_acc = self.check_accuracy(self.X_train, self.y_train, + num_samples=1000) + val_acc = self.check_accuracy(self.X_val, self.y_val) + self.train_acc_history.append(train_acc) + self.val_acc_history.append(val_acc) + + if self.verbose: + print '(Epoch %d / %d) train acc: %f; val_acc: %f' % ( + self.epoch, self.num_epochs, train_acc, val_acc) + + # Keep track of the best model + if val_acc > self.best_val_acc: + self.best_val_acc = val_acc + self.best_params = {} + for k, v in self.model.params.iteritems(): + self.best_params[k] = v.copy() + + # At the end of training swap the best params into the model + self.model.params = self.best_params + diff --git a/assignments2016/assignment2/cs231n/vis_utils.py b/assignments2016/assignment2/cs231n/vis_utils.py new file mode 100644 index 00000000..8d04473f --- /dev/null +++ b/assignments2016/assignment2/cs231n/vis_utils.py @@ -0,0 +1,73 @@ +from math import sqrt, ceil +import numpy as np + +def visualize_grid(Xs, ubound=255.0, padding=1): + """ + Reshape a 4D tensor of image data to a grid for easy visualization. + + Inputs: + - Xs: Data of shape (N, H, W, C) + - ubound: Output grid will have values scaled to the range [0, ubound] + - padding: The number of blank pixels between elements of the grid + """ + (N, H, W, C) = Xs.shape + grid_size = int(ceil(sqrt(N))) + grid_height = H * grid_size + padding * (grid_size - 1) + grid_width = W * grid_size + padding * (grid_size - 1) + grid = np.zeros((grid_height, grid_width, C)) + next_idx = 0 + y0, y1 = 0, H + for y in xrange(grid_size): + x0, x1 = 0, W + for x in xrange(grid_size): + if next_idx < N: + img = Xs[next_idx] + low, high = np.min(img), np.max(img) + grid[y0:y1, x0:x1] = ubound * (img - low) / (high - low) + # grid[y0:y1, x0:x1] = Xs[next_idx] + next_idx += 1 + x0 += W + padding + x1 += W + padding + y0 += H + padding + y1 += H + padding + # grid_max = np.max(grid) + # grid_min = np.min(grid) + # grid = ubound * (grid - grid_min) / (grid_max - grid_min) + return grid + +def vis_grid(Xs): + """ visualize a grid of images """ + (N, H, W, C) = Xs.shape + A = int(ceil(sqrt(N))) + G = np.ones((A*H+A, A*W+A, C), Xs.dtype) + G *= np.min(Xs) + n = 0 + for y in range(A): + for x in range(A): + if n < N: + G[y*H+y:(y+1)*H+y, x*W+x:(x+1)*W+x, :] = Xs[n,:,:,:] + n += 1 + # normalize to [0,1] + maxg = G.max() + ming = G.min() + G = (G - ming)/(maxg-ming) + return G + +def vis_nn(rows): + """ visualize array of arrays of images """ + N = len(rows) + D = len(rows[0]) + H,W,C = rows[0][0].shape + Xs = rows[0][0] + G = np.ones((N*H+N, D*W+D, C), Xs.dtype) + for y in range(N): + for x in range(D): + G[y*H+y:(y+1)*H+y, x*W+x:(x+1)*W+x, :] = rows[y][x] + # normalize to [0,1] + maxg = G.max() + ming = G.min() + G = (G - ming)/(maxg-ming) + return G + + + diff --git a/assignments2016/assignment2/frameworkpython b/assignments2016/assignment2/frameworkpython new file mode 100755 index 00000000..a0fa5517 --- /dev/null +++ b/assignments2016/assignment2/frameworkpython @@ -0,0 +1,13 @@ +#!/bin/bash + +# what real Python executable to use +PYVER=2.7 +PATHTOPYTHON=/usr/local/bin/ +PYTHON=${PATHTOPYTHON}python${PYVER} + +# find the root of the virtualenv, it should be the parent of the dir this script is in +ENV=`$PYTHON -c "import os; print os.path.abspath(os.path.join(os.path.dirname(\"$0\"), '..'))"` + +# now run Python with the virtualenv set as Python's HOME +export PYTHONHOME=$ENV +exec $PYTHON "$@" diff --git 
a/assignments2016/assignment2/kitten.jpg b/assignments2016/assignment2/kitten.jpg new file mode 100644 index 00000000..e421ec1d Binary files /dev/null and b/assignments2016/assignment2/kitten.jpg differ diff --git a/assignments2016/assignment2/puppy.jpg b/assignments2016/assignment2/puppy.jpg new file mode 100644 index 00000000..3cc12347 Binary files /dev/null and b/assignments2016/assignment2/puppy.jpg differ diff --git a/assignments2016/assignment2/requirements.txt b/assignments2016/assignment2/requirements.txt new file mode 100644 index 00000000..3e6c302d --- /dev/null +++ b/assignments2016/assignment2/requirements.txt @@ -0,0 +1,46 @@ +Cython==0.23.4 +Jinja2==2.8 +MarkupSafe==0.23 +Pillow==3.0.0 +Pygments==2.0.2 +appnope==0.1.0 +argparse==1.2.1 +backports-abc==0.4 +backports.ssl-match-hostname==3.5.0.1 +certifi==2015.11.20.1 +cycler==0.9.0 +decorator==4.0.6 +functools32==3.2.3-2 +gnureadline==6.3.3 +ipykernel==4.2.2 +ipython==4.0.1 +ipython-genutils==0.1.0 +ipywidgets==4.1.1 +jsonschema==2.5.1 +jupyter==1.0.0 +jupyter-client==4.1.1 +jupyter-console==4.0.3 +jupyter-core==4.0.6 +matplotlib==1.5.0 +mistune==0.7.1 +nbconvert==4.1.0 +nbformat==4.0.1 +notebook==4.0.6 +numpy==1.10.4 +path.py==8.1.2 +pexpect==4.0.1 +pickleshare==0.5 +ptyprocess==0.5 +pyparsing==2.0.7 +python-dateutil==2.4.2 +pytz==2015.7 +pyzmq==15.1.0 +qtconsole==4.1.1 +scipy==0.16.1 +simplegeneric==0.8.1 +singledispatch==3.4.0.3 +six==1.10.0 +terminado==0.5 +tornado==4.3 +traitlets==4.0.0 +wsgiref==0.1.2 diff --git a/assignments2016/assignment2/start_ipython_osx.sh b/assignments2016/assignment2/start_ipython_osx.sh new file mode 100755 index 00000000..4815b001 --- /dev/null +++ b/assignments2016/assignment2/start_ipython_osx.sh @@ -0,0 +1,4 @@ +# Assume the virtualenv is called .env + +cp frameworkpython .env/bin +.env/bin/frameworkpython -m IPython notebook diff --git a/assignments2016/assignment3.md b/assignments2016/assignment3.md index 8b7e7aad..230ad3b0 100644 --- a/assignments2016/assignment3.md +++ b/assignments2016/assignment3.md @@ -4,51 +4,34 @@ mathjax: true permalink: assignments2016/assignment3/ --- -In this assignment you will implement recurrent networks, and apply them to image captioning on Microsoft COCO. We will also introduce the TinyImageNet dataset, and use a pretrained model on this dataset to explore different applications of image gradients. +이번 과제에서는 회귀신경망(Recurrent Neural Network, RNN)을 구현하고, Microsoft COCO 데이터셋의 이미지 캡셔닝(captionint) 문제에 적용해볼 것입니다. 또한, TinyImageNet 데이터셋을 소개하고, 이 데이터셋에 대해 미리 학습된 모델을 사용하여 이미지 그라디언트에 대한 다양한 어플리케이션에 대해 알아볼 것입니다. -The goals of this assignment are as follows: +이번 과제의 목표는 다음과 같습니다. -- Understand the architecture of *recurrent neural networks (RNNs)* and how they operate on sequences by sharing weights over time -- Understand the difference between vanilla RNNs and Long-Short Term Memory (LSTM) RNNs -- Understand how to sample from an RNN at test-time -- Understand how to combine convolutional neural nets and recurrent nets to implement an image captioning system -- Understand how a trained convolutional network can be used to compute gradients with respect to the input image -- Implement and different applications of image gradients, including saliency maps, fooling images, class visualizations, feature inversion, and DeepDream. 
+- *회귀신경망(Recurrent Neural Network, RNN)* 구조에 대해 이해하고 시간축 상에서 파라미터 값을 공유하면서 어떻게 시퀀스 데이터에 대해 동작하는지 이해하기 +- 기본 RNN 구조와 Long-Short Term Memory (LSTM) RNN 구조의 차이점 이해하기 +- 테스트 시 RNN에서 어떻게 샘플을 뽑는지 이해하기 +- 이미지 캡셔닝 시스템을 구현하기 위해 컨볼루션 신경망(CNN)과 회귀신경망(RNN)을 결합하는 방법 이해하기 +- 학습된 CNN이 입력 이미지에 대한 그라디언트를 계산할 때 어떻게 활용되는지 이해하기 +- 이미지 그라디언트의 여러 가지 응용법들 구현하기 (saliency 맵, 모델 속이기, 클래스 시각화, 특징 추출의 역과정, DeepDream 등 포함) -## Setup -You can work on the assignment in one of two ways: locally on your own machine, -or on a virtual machine through Terminal.com. +## 설치 +다음 두가지 방법으로 숙제를 시작할 수 있습니다: Terminal.com을 이용한 가상 환경 또는 로컬 환경. -### Working in the cloud on Terminal +### Terminal에서의 가상 환경. +Terminal에는 우리의 수업을 위한 서브도메인이 만들어져 있습니다. [www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com) 계정을 등록하세요. 이번 숙제에 대한 스냅샷은 [여기](https://www.stanfordterminalcloud.com/snapshot/49f5a1ea15dc424aec19155b3398784d57c55045435315ce4f8b96b62819ef65)에서 찾아볼 수 있습니다. 만약 수업에 등록되었다면, TA(see Piazza for more information)에게 이 수업을 위한 Terminal 예산을 요구할 수 있습니다. 처음 스냅샷을 실행시키면, 수업을 위한 모든 것이 설치되어 있어서 바로 숙제를 시작할 수 있습니다. [여기](/terminal-tutorial)에 Terminal을 위한 간단한 튜토리얼을 작성해 뒀습니다. -Terminal has created a separate subdomain to serve our class, -[www.stanfordterminalcloud.com](https://www.stanfordterminalcloud.com). Register -your account there. The Assignment 3 snapshot can then be found [HERE](https://www.stanfordterminalcloud.com/snapshot/29054ca27bc2e8bda888709ba3d9dd07a172cbbf0824152aac49b14a018ffbe5). -If you are registered in the class you can contact the TA (see Piazza for more -information) to request Terminal credits for use on the assignment. Once you -boot up the snapshot everything will be installed for you, and you will be ready to start on your assignment right away. We have written a small tutorial on Terminal [here](/terminal-tutorial). - -### Working locally -Get the code as a zip file -[here](http://cs231n.stanford.edu/winter1516_assignment3.zip). -As for the dependencies: +### 로컬 환경 +[여기](http://cs231n.stanford.edu/winter1516_assignment3.zip)에서 압축파일을 다운받으세요. +Dependency 관련: **[Option 1] Use Anaconda:** -The preferred approach for installing all the assignment dependencies is to use -[Anaconda](https://www.continuum.io/downloads), which is a Python distribution -that includes many of the most popular Python packages for science, math, -engineering and data analysis. Once you install it you can skip all mentions of -requirements and you are ready to go directly to working on the assignment. - -**[Option 2] Manual install, virtual environment:** -If you do not want to use Anaconda and want to go with a more manual and risky -installation route you will likely want to create a -[virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/) -for the project. If you choose not to use a virtual environment, it is up to you -to make sure that all dependencies for the code are installed globally on your -machine. To set up a virtual environment, run the following: - -```bash +과학, 수학, 공학, 데이터 분석을 위한 대부분의 주요 패키지들을 담고있는 [Anaconda](https://www.continuum.io/downloads)를 사용하여 설치하는 것이 흔히 사용하는 방법입니다. 설치가 다 되면 모든 요구사항(dependency)을 넘기고 바로 숙제를 시작해도 좋습니다. + +**[Option 2] 수동 설치, virtual environment:** +만약 Anaconda 대신 좀 더 일반적이면서 까다로운 방법을 택하고 싶다면 이번 과제를 위한 [virtual environment](http://docs.python-guide.org/en/latest/dev/virtualenvs/)를 만들 수 있습니다. 만약 virtual environment를 사용하지 않는다면 모든 코드가 컴퓨터에 전역적으로 종속되게 설치됩니다. Virtual environment의 설정은 아래를 참조하세요. 
+
+~~~bash
cd assignment3
sudo pip install virtualenv # This may already be installed
virtualenv .env # Create a virtual environment
@@ -56,71 +39,50 @@ source .env/bin/activate # Activate the virtual environment
pip install -r requirements.txt # Install dependencies
# Work on the assignment for a while ...
deactivate # Exit the virtual environment
-```
+~~~

-**Download data:**
-Once you have the starter code, you will need to download the processed MS-COCO dataset, the TinyImageNet dataset, and the pretrained TinyImageNet model. Run the following from the `assignment3` directory:
+**데이터셋 다운로드:**
+시작 코드를 받은 후, 전처리 과정이 수행된 MS-COCO 데이터셋, TinyImageNet 데이터셋, 미리 학습된 TinyImageNet 모델을 다운받아야 합니다. `assignment3` 디렉토리에서 다음 명령어를 입력하세요.

-```bash
+~~~bash
cd cs231n/datasets
./get_coco_captioning.sh
./get_tiny_imagenet_a.sh
./get_pretrained_model.sh
-```
+~~~

-**Compile the Cython extension:** Convolutional Neural Networks require a very
-efficient implementation. We have implemented of the functionality using
-[Cython](http://cython.org/); you will need to compile the Cython extension
-before you can run the code. From the `cs231n` directory, run the following
-command:
+**Cython extension 컴파일하기:** 컨볼루션 신경망은 매우 효율적인 구현을 필요로 합니다. 이 숙제를 위해서 [Cython](http://cython.org/)을 활용하여 여러 기능들을 구현해 놓았는데, 이를 위해 코드를 돌리기 전에 Cython extension을 컴파일해 주어야 합니다. `cs231n` 디렉토리에서 아래 명령어를 실행하세요:

-```bash
+~~~bash
python setup.py build_ext --inplace
-```
-
-**Start IPython:**
-After you have the data, you should start the IPython notebook server
-from the `assignment3` directory. If you are unfamiliar with IPython, you should
-read our [IPython tutorial](/ipython-tutorial).
-
-**NOTE:** If you are working in a virtual environment on OSX, you may encounter
-errors with matplotlib due to the
-[issues described here](http://matplotlib.org/faq/virtualenv_faq.html).
-You can work around this issue by starting the IPython server using the
-`start_ipython_osx.sh` script from the `assignment3` directory; the script
-assumes that your virtual environment is named `.env`.
-
-
-### Submitting your work:
-Whether you work on the assignment locally or using Terminal, once you are done
-working run the `collectSubmission.sh` script; this will produce a file called
-`assignment3.zip`. Upload this file under the Assignments tab on
-[the coursework](https://coursework.stanford.edu/portal/site/W15-CS-231N-01/)
-page for the course.
-
-
-### Q1: Image Captioning with Vanilla RNNs (40 points)
-The IPython notebook `RNN_Captioning.ipynb` will walk you through the
-implementation of an image captioning system on MS-COCO using vanilla recurrent
-networks.
-
-### Q2: Image Captioning with LSTMs (35 points)
-The IPython notebook `LSTM_Captioning.ipynb` will walk you through the
-implementation of Long-Short Term Memory (LSTM) RNNs, and apply them to image
-captioning on MS-COCO.
-
-### Q3: Image Gradients: Saliency maps and Fooling Images (10 points)
-The IPython notebook `ImageGradients.ipynb` will introduce the TinyImageNet
-dataset. You will use a pretrained model on this dataset to compute gradients
-with respect to the image, and use them to produce saliency maps and fooling
-images.
-
-### Q4: Image Generation: Classes, Inversion, DeepDream (15 points)
-In the IPython notebook `ImageGeneration.ipynb` you will use the pretrained
-TinyImageNet model to generate images. In particular you will generate
-class visualizations and implement feature inversion and DeepDream.
-
-### Q5: Do something extra!
(up to +10 points) -Given the components of the assignment, try to do something cool. Maybe there is -some way to generate images that we did not implement in the assignment? +~~~ + +**IPython 시작:** +데이터를 모두 다운받은 뒤, `assignment3`에서 IPython notebook 서버를 시작해야 합니다. IPython에 익숙하지 않다면 [IPython tutorial](/ipython-tutorial)을 먼저 읽어보는 것을 권장합니다. + +**NOTE:** OSX에서 virtual environment를 실행하면, matplotlib 에러가 날 수 있습니다([이 문제에 관한 이슈](http://matplotlib.org/faq/virtualenv_faq.html)). IPython 서버를 `assignment3`폴더의 `start_ipython_osx.sh`로 실행하면 이 문제를 피해갈 수 있습니다; 이 스크립트는 virtual environment가 `.env`라고 되어있다고 가정하고 작성되었습니다. + + +### 과제 제출: +로컬 환경이나 Terminal에서 숙제를 마쳤다면 `collectSubmission.sh`스크립트를 실행하세요. 이 스크립트는 `assignment3.zip`파일을 만듭니다. 이 파일을 [the coursework](https://coursework.stanford.edu/portal/site/W15-CS-231N-01/) 페이지의 Assignments 탭 아래에 업로드하세요. + +### Q1: 기본 RNN 구조로 이미지 캡셔닝 구현 (40 points) +IPython notebook `RNN_Captioning.ipynb`에서 기본 RNN 구조를 사용하여 MS COCO 데이터셋에서 이미지 캡셔닝 시스템을 구현하는 방법을 설명합니다. + +### Q2: LSTM 구조로 이미지 캡셔닝 구현 (35 points) +IPython notebook `LSTM_Captioning.ipynb`에서 Long-Short Term Memory (LSTM) RNN 구조의 구현에 대해 설명하고, 이를 MS COCO 데이터셋의 이미지 캡셔닝 문제에 적용해 봅니다. + +### Q3: 이미지 그라디언트: Saliency 맵과 Fooling Images (10 points) +IPython notebook `ImageGradients.ipynb`에서 TinyImageNet 데이터셋을 소개합니다. 이 데이터셋에 대해 미리 학습된 모델(pretrained model)을 활용하여 이미지에 대한 그라디언트를 계산하고, 이를 사용해서 saliency 맵과 fooling image들을 생성하는 법에 대해 설명합니다. + +### Q4: 이미지 생성: 클래스, 역 과정(Inversion), DeepDream (15 points) +IPython notebook `ImageGeneration.ipynb`에서는 미리 학습된 TinyImageNet 모델을 활용하여 이미지를 생성해볼 것입니다. 특히, 클래스들을 시각화 해보고 특징(feature) 추출의 역과정과 DeepDream을 구현할 것입니다. + +### Q5: 추가 과제: 뭔가 더 해보세요! (+10 points) +이번 과제에서 제공된 것들을 활용해서 무언가 멋있는 것들을 시도해볼 수 있을 것입니다. 과제에서 구현하지 않은 다른 방식으로 이미지들을 생성하는 방법이 있을 수도 있어요! + +--- +

+번역: 최명섭 (myungsub) +

diff --git a/assignments2016/assignment3/.gitignore b/assignments2016/assignment3/.gitignore new file mode 100644 index 00000000..b0611d38 --- /dev/null +++ b/assignments2016/assignment3/.gitignore @@ -0,0 +1,3 @@ +*.swp +*.pyc +.env/* diff --git a/assignments2016/assignment3/ImageGeneration.ipynb b/assignments2016/assignment3/ImageGeneration.ipynb new file mode 100644 index 00000000..24747ae5 --- /dev/null +++ b/assignments2016/assignment3/ImageGeneration.ipynb @@ -0,0 +1,511 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Image Generation\n", + "In this notebook we will continue our exploration of image gradients using the deep model that was pretrained on TinyImageNet. We will explore various ways of using these image gradients to generate images. We will implement class visualizations, feature inversion, and DeepDream." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# As usual, a bit of setup\n", + "\n", + "import time, os, json\n", + "import numpy as np\n", + "from scipy.misc import imread, imresize\n", + "import matplotlib.pyplot as plt\n", + "\n", + "from cs231n.classifiers.pretrained_cnn import PretrainedCNN\n", + "from cs231n.data_utils import load_tiny_imagenet\n", + "from cs231n.image_utils import blur_image, deprocess_image, preprocess_image\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# TinyImageNet and pretrained model\n", + "As in the previous notebook, load the TinyImageNet dataset and the pretrained model." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "data = load_tiny_imagenet('cs231n/datasets/tiny-imagenet-100-A', subtract_mean=True)\n", + "model = PretrainedCNN(h5_file='cs231n/datasets/pretrained_model.h5')" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + " # Class visualization\n", + "By starting with a random noise image and performing gradient ascent on a target class, we can generate an image that the network will recognize as the target class. This idea was first presented in [1]; [2] extended this idea by suggesting several regularization techniques that can improve the quality of the generated image.\n", + "\n", + "Concretely, let $I$ be an image and let $y$ be a target class. Let $s_y(I)$ be the score that a convolutional network assigns to the image $I$ for class $y$; note that these are raw unnormalized scores, not class probabilities. We wish to generate an image $I^*$ that achieves a high score for the class $y$ by solving the problem\n", + "\n", + "$$\n", + "I^* = \\arg\\max_I s_y(I) + R(I)\n", + "$$\n", + "\n", + "where $R$ is a (possibly implicit) regularizer. We can solve this optimization problem using gradient descent, computing gradients with respect to the generated image. 
We will use (explicit) L2 regularization of the form\n",
+    "\n",
+    "$$\n",
+    "R(I) = \\lambda \\|I\\|_2^2\n",
+    "$$\n",
+    "\n",
+    "and implicit regularization as suggested by [2] by periodically blurring the generated image. We can solve this problem using gradient ascent on the generated image.\n",
+    "\n",
+    "In the cell below, complete the implementation of the `create_class_visualization` function.\n",
+    "\n",
+    "[1] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. \"Deep Inside Convolutional Networks: Visualising\n",
+    "Image Classification Models and Saliency Maps\", ICLR Workshop 2014.\n",
+    "\n",
+    "[2] Yosinski et al, \"Understanding Neural Networks Through Deep Visualization\", ICML 2015 Deep Learning Workshop"
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "def create_class_visualization(target_y, model, **kwargs):\n",
+    "  \"\"\"\n",
+    "  Perform optimization over the image to generate class visualizations.\n",
+    "  \n",
+    "  Inputs:\n",
+    "  - target_y: Integer in the range [0, 100) giving the target class\n",
+    "  - model: A PretrainedCNN that will be used for generation\n",
+    "  \n",
+    "  Keyword arguments:\n",
+    "  - learning_rate: Floating point number giving the learning rate\n",
+    "  - blur_every: An integer; how often to blur the image as a regularizer\n",
+    "  - l2_reg: Floating point number giving L2 regularization strength on the image;\n",
+    "    this is lambda in the equation above.\n",
+    "  - max_jitter: How much random jitter to add to the image as regularization\n",
+    "  - num_iterations: How many iterations to run for\n",
+    "  - show_every: How often to show the image\n",
+    "  \"\"\"\n",
+    "  \n",
+    "  learning_rate = kwargs.pop('learning_rate', 10000)\n",
+    "  blur_every = kwargs.pop('blur_every', 1)\n",
+    "  l2_reg = kwargs.pop('l2_reg', 1e-6)\n",
+    "  max_jitter = kwargs.pop('max_jitter', 4)\n",
+    "  num_iterations = kwargs.pop('num_iterations', 100)\n",
+    "  show_every = kwargs.pop('show_every', 25)\n",
+    "  \n",
+    "  X = np.random.randn(1, 3, 64, 64)\n",
+    "  for t in xrange(num_iterations):\n",
+    "    # As a regularizer, add random jitter to the image\n",
+    "    ox, oy = np.random.randint(-max_jitter, max_jitter+1, 2)\n",
+    "    X = np.roll(np.roll(X, ox, -1), oy, -2)\n",
+    "\n",
+    "    dX = None\n",
+    "    ############################################################################\n",
+    "    # TODO: Compute the image gradient dX of the image with respect to the     #\n",
+    "    # target_y class score. This should be similar to the fooling images. Also #\n",
+    "    # add L2 regularization to dX and update the image X using the image       #\n",
+    "    # gradient and the learning rate.                                          #\n",
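+    "    # One possible sketch (names are illustrative; assumes model.forward and\n",
+    "    # model.backward behave as in the earlier notebooks):\n",
+    "    #   scores, cache = model.forward(X)\n",
+    "    #   dscores = np.zeros_like(scores); dscores[0, target_y] = 1\n",
+    "    #   dX, _ = model.backward(dscores, cache)\n",
+    "    #   dX -= 2 * l2_reg * X        # gradient of the L2 penalty under ascent\n",
+    "    #   X += learning_rate * dX     # ascent step on the class score\n",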
#\n", + " ############################################################################\n", + " pass\n", + " ############################################################################\n", + " # END OF YOUR CODE #\n", + " ############################################################################\n", + " \n", + " # Undo the jitter\n", + " X = np.roll(np.roll(X, -ox, -1), -oy, -2)\n", + " \n", + " # As a regularizer, clip the image\n", + " X = np.clip(X, -data['mean_image'], 255.0 - data['mean_image'])\n", + " \n", + " # As a regularizer, periodically blur the image\n", + " if t % blur_every == 0:\n", + " X = blur_image(X)\n", + " \n", + " # Periodically show the image\n", + " if t % show_every == 0:\n", + " plt.imshow(deprocess_image(X, data['mean_image']))\n", + " plt.gcf().set_size_inches(3, 3)\n", + " plt.axis('off')\n", + " plt.show()\n", + " return X" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "You can use the code above to generate some cool images! An example is shown below. Try to generate a cool-looking image. If you want you can try to implement the other regularization schemes from Yosinski et al, but it isn't required." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "target_y = 43 # Tarantula\n", + "print data['class_names'][target_y]\n", + "X = create_class_visualization(target_y, model, show_every=25)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Feature Inversion\n", + "In an attempt to understand the types of features that convolutional networks learn to recognize, a recent paper [1] attempts to reconstruct an image from its feature representation. We can easily implement this idea using image gradients from the pretrained network.\n", + "\n", + "Concretely, given a image $I$, let $\\phi_\\ell(I)$ be the activations at layer $\\ell$ of the convolutional network $\\phi$. We wish to find an image $I^*$ with a similar feature representation as $I$ at layer $\\ell$ of the network $\\phi$ by solving the optimization problem\n", + "\n", + "$$\n", + "I^* = \\arg\\min_{I'} \\|\\phi_\\ell(I) - \\phi_\\ell(I')\\|_2^2 + R(I')\n", + "$$\n", + "\n", + "where $\\|\\cdot\\|_2^2$ is the squared Euclidean norm. As above, $R$ is a (possibly implicit) regularizer. We can solve this optimization problem using gradient descent, computing gradients with respect to the generated image. 
We will use (explicit) L2 regularization of the form\n",
+    "\n",
+    "$$\n",
+    "R(I') = \\lambda \\|I'\\|_2^2\n",
+    "$$\n",
+    "\n",
+    "together with implicit regularization by periodically blurring the image, as recommended by [2].\n",
+    "\n",
+    "Implement this method in the function below.\n",
+    "\n",
+    "[1] Aravindh Mahendran, Andrea Vedaldi, \"Understanding Deep Image Representations by Inverting them\", CVPR 2015\n",
+    "\n",
+    "[2] Yosinski et al, \"Understanding Neural Networks Through Deep Visualization\", ICML 2015 Deep Learning Workshop"
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "def invert_features(target_feats, layer, model, **kwargs):\n",
+    "  \"\"\"\n",
+    "  Perform feature inversion in the style of Mahendran and Vedaldi 2015, using\n",
+    "  L2 regularization and periodic blurring.\n",
+    "  \n",
+    "  Inputs:\n",
+    "  - target_feats: Image features of the target image, of shape (1, C, H, W);\n",
+    "    we will try to generate an image that matches these features\n",
+    "  - layer: The index of the layer from which the features were extracted\n",
+    "  - model: A PretrainedCNN that was used to extract features\n",
+    "  \n",
+    "  Keyword arguments:\n",
+    "  - learning_rate: The learning rate to use for gradient descent\n",
+    "  - num_iterations: The number of iterations to use for gradient descent\n",
+    "  - l2_reg: The strength of L2 regularization to use; this is lambda in the\n",
+    "    equation above.\n",
+    "  - blur_every: How often to blur the image as implicit regularization; set\n",
+    "    to 0 to disable blurring.\n",
+    "  - show_every: How often to show the generated image; set to 0 to disable\n",
+    "    showing intermediate results.\n",
+    "  \n",
+    "  Returns:\n",
+    "  - X: Generated image of shape (1, 3, 64, 64) that matches the target features.\n",
+    "  \"\"\"\n",
+    "  learning_rate = kwargs.pop('learning_rate', 10000)\n",
+    "  num_iterations = kwargs.pop('num_iterations', 500)\n",
+    "  l2_reg = kwargs.pop('l2_reg', 1e-7)\n",
+    "  blur_every = kwargs.pop('blur_every', 1)\n",
+    "  show_every = kwargs.pop('show_every', 50)\n",
+    "  \n",
+    "  X = np.random.randn(1, 3, 64, 64)\n",
+    "  for t in xrange(num_iterations):\n",
+    "    ############################################################################\n",
+    "    # TODO: Compute the image gradient dX of the reconstruction loss with      #\n",
+    "    # respect to the image. You should include L2 regularization penalizing    #\n",
+    "    # large pixel values in the generated image using the l2_reg parameter;    #\n",
+    "    # then update the generated image using the learning_rate from above.      #\n",
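+    "    # One possible sketch (illustrative; same model API assumptions as in\n",
+    "    # the class visualization cell above):\n",
+    "    #   feats, cache = model.forward(X, end=layer)\n",
+    "    #   dfeats = 2.0 * (feats - target_feats)   # grad of the squared L2 loss\n",
+    "    #   dX, _ = model.backward(dfeats, cache)\n",
+    "    #   dX += 2 * l2_reg * X\n",
+    "    #   X -= learning_rate * dX                 # descent step on the loss\n",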
#\n", + " ############################################################################\n", + " pass\n", + " ############################################################################\n", + " # END OF YOUR CODE #\n", + " ############################################################################\n", + " \n", + " # As a regularizer, clip the image\n", + " X = np.clip(X, -data['mean_image'], 255.0 - data['mean_image'])\n", + " \n", + " # As a regularizer, periodically blur the image\n", + " if (blur_every > 0) and t % blur_every == 0:\n", + " X = blur_image(X)\n", + "\n", + " if (show_every > 0) and (t % show_every == 0 or t + 1 == num_iterations):\n", + " plt.imshow(deprocess_image(X, data['mean_image']))\n", + " plt.gcf().set_size_inches(3, 3)\n", + " plt.axis('off')\n", + " plt.title('t = %d' % t)\n", + " plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "### Shallow feature reconstruction\n", + "After implementing the feature inversion above, run the following cell to try and reconstruct features from the fourth convolutional layer of the pretrained model. You should be able to reconstruct the features using the provided optimization parameters." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "filename = 'kitten.jpg'\n", + "layer = 3 # layers start from 0 so these are features after 4 convolutions\n", + "img = imresize(imread(filename), (64, 64))\n", + "\n", + "plt.imshow(img)\n", + "plt.gcf().set_size_inches(3, 3)\n", + "plt.title('Original image')\n", + "plt.axis('off')\n", + "plt.show()\n", + "\n", + "# Preprocess the image before passing it to the network:\n", + "# subtract the mean, add a dimension, etc\n", + "img_pre = preprocess_image(img, data['mean_image'])\n", + "\n", + "# Extract features from the image\n", + "feats, _ = model.forward(img_pre, end=layer)\n", + "\n", + "# Invert the features\n", + "kwargs = {\n", + " 'num_iterations': 400,\n", + " 'learning_rate': 5000,\n", + " 'l2_reg': 1e-8,\n", + " 'show_every': 100,\n", + " 'blur_every': 10,\n", + "}\n", + "X = invert_features(feats, layer, model, **kwargs)" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false + } + }, + { + "source": [ + "### Deep feature reconstruction\n", + "Reconstructing images using features from deeper layers of the network tends to give interesting results. In the cell below, try to reconstruct the best image you can by inverting the features after 7 layers of convolutions. You will need to play with the hyperparameters to try and get a good result.\n", + "\n", + "HINT: If you read the paper by Mahendran and Vedaldi, you'll see that reconstructions from deep features tend not to look much like the original image, so you shouldn't expect the results to look like the reconstruction above. You should be able to get an image that shows some discernable structure within 1000 iterations." 
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "filename = 'kitten.jpg'\n", + "layer = 6 # layers start from 0 so these are features after 7 convolutions\n", + "img = imresize(imread(filename), (64, 64))\n", + "\n", + "plt.imshow(img)\n", + "plt.gcf().set_size_inches(3, 3)\n", + "plt.title('Original image')\n", + "plt.axis('off')\n", + "plt.show()\n", + "\n", + "# Preprocess the image before passing it to the network:\n", + "# subtract the mean, add a dimension, etc\n", + "img_pre = preprocess_image(img, data['mean_image'])\n", + "\n", + "# Extract features from the image\n", + "feats, _ = model.forward(img_pre, end=layer)\n", + "\n", + "# Invert the features\n", + "# You will need to play with these parameters.\n", + "kwargs = {\n", + " 'num_iterations': 1000,\n", + " 'learning_rate': 0,\n", + " 'l2_reg': 0,\n", + " 'show_every': 100,\n", + " 'blur_every': 0,\n", + "}\n", + "X = invert_features(feats, layer, model, **kwargs)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# DeepDream\n", + "In the summer of 2015, Google released a [blog post](http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html) describing a new method of generating images from neural networks, and they later [released code](https://github.com/google/deepdream) to generate these images.\n", + "\n", + "The idea is very simple. We pick some layer from the network, pass the starting image through the network to extract features at the chosen layer, set the gradient at that layer equal to the activations themselves, and then backpropagate to the image. This has the effect of modifying the image to amplify the activations at the chosen layer of the network.\n", + "\n", + "For DeepDream we usually extract features from one of the convolutional layers, allowing us to generate images of any resolution.\n", + "\n", + "We can implement this idea using our pretrained network. The results probably won't look as good as Google's since their network is much bigger, but we should still be able to generate some interesting images." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "def deepdream(X, layer, model, **kwargs):\n", + " \"\"\"\n", + " Generate a DeepDream image.\n", + " \n", + " Inputs:\n", + " - X: Starting image, of shape (1, 3, H, W)\n", + " - layer: Index of layer at which to dream\n", + " - model: A PretrainedCNN object\n", + " \n", + " Keyword arguments:\n", + " - learning_rate: How much to update the image at each iteration\n", + " - max_jitter: Maximum number of pixels for jitter regularization\n", + " - num_iterations: How many iterations to run for\n", + " - show_every: How often to show the generated image\n", + " \"\"\"\n", + " \n", + " X = X.copy()\n", + " \n", + " learning_rate = kwargs.pop('learning_rate', 5.0)\n", + " max_jitter = kwargs.pop('max_jitter', 16)\n", + " num_iterations = kwargs.pop('num_iterations', 100)\n", + " show_every = kwargs.pop('show_every', 25)\n", + " \n", + " for t in xrange(num_iterations):\n", + " # As a regularizer, add random jitter to the image\n", + " ox, oy = np.random.randint(-max_jitter, max_jitter+1, 2)\n", + " X = np.roll(np.roll(X, ox, -1), oy, -2)\n", + "\n", + " dX = None\n", + " ############################################################################\n", + " # TODO: Compute the image gradient dX using the DeepDream method. 
You'll #\n", + " # need to use the forward and backward methods of the model object to #\n", + " # extract activations and set gradients for the chosen layer. After #\n", + " # computing the image gradient dX, you should use the learning rate to #\n", + " # update the image X. #\n", + " ############################################################################\n", + " pass\n", + " ############################################################################\n", + " # END OF YOUR CODE #\n", + " ############################################################################\n", + " \n", + " # Undo the jitter\n", + " X = np.roll(np.roll(X, -ox, -1), -oy, -2)\n", + " \n", + " # As a regularizer, clip the image\n", + " mean_pixel = data['mean_image'].mean(axis=(1, 2), keepdims=True)\n", + " X = np.clip(X, -mean_pixel, 255.0 - mean_pixel)\n", + " \n", + " # Periodically show the image\n", + " if t == 0 or (t + 1) % show_every == 0:\n", + " img = deprocess_image(X, data['mean_image'], mean='pixel')\n", + " plt.imshow(img)\n", + " plt.title('t = %d' % (t + 1))\n", + " plt.gcf().set_size_inches(8, 8)\n", + " plt.axis('off')\n", + " plt.show()\n", + " return X" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Generate some images!\n", + "Try to generate a cool-looking DeepDream image using the pretrained network. You can try using different layers, or starting from different images. You can reduce the image size if it runs too slowly on your machine, or increase the image size if you are feeling ambitious." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "def read_image(filename, max_size):\n", + " \"\"\"\n", + " Read an image from disk and resize it so its larger side is max_size\n", + " \"\"\"\n", + " img = imread(filename)\n", + " H, W, _ = img.shape\n", + " if H >= W:\n", + " img = imresize(img, (max_size, int(W * float(max_size) / H)))\n", + " elif H < W:\n", + " img = imresize(img, (int(H * float(max_size) / W), max_size))\n", + " return img\n", + "\n", + "filename = 'kitten.jpg'\n", + "max_size = 256\n", + "img = read_image(filename, max_size)\n", + "plt.imshow(img)\n", + "plt.axis('off')\n", + "\n", + "# Preprocess the image by converting to float, transposing,\n", + "# and performing mean subtraction.\n", + "img_pre = preprocess_image(img, data['mean_image'], mean='pixel')\n", + "\n", + "out = deepdream(img_pre, 7, model, learning_rate=2000)" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.6", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/assignments2016/assignment3/ImageGradients.ipynb b/assignments2016/assignment3/ImageGradients.ipynb new file mode 100644 index 00000000..669cef26 --- /dev/null +++ b/assignments2016/assignment3/ImageGradients.ipynb @@ -0,0 +1,383 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Image Gradients\n", + "In this notebook we'll introduce the TinyImageNet dataset and a deep CNN that has been pretrained on this dataset. 
You will use this pretrained model to compute gradients with respect to images, and use these image gradients to produce class saliency maps and fooling images." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# As usual, a bit of setup\n", + "\n", + "import time, os, json\n", + "import numpy as np\n", + "import skimage.io\n", + "import matplotlib.pyplot as plt\n", + "\n", + "from cs231n.classifiers.pretrained_cnn import PretrainedCNN\n", + "from cs231n.data_utils import load_tiny_imagenet\n", + "from cs231n.image_utils import blur_image, deprocess_image\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Introducing TinyImageNet\n", + "\n", + "The TinyImageNet dataset is a subset of the ILSVRC-2012 classification dataset. It consists of 200 object classes, and for each object class it provides 500 training images, 50 validation images, and 50 test images. All images have been downsampled to 64x64 pixels. We have provided the labels for all training and validation images, but have withheld the labels for the test images.\n", + "\n", + "We have further split the full TinyImageNet dataset into two equal pieces, each with 100 object classes. We refer to these datasets as TinyImageNet-100-A and TinyImageNet-100-B; for this exercise you will work with TinyImageNet-100-A.\n", + "\n", + "To download the data, go into the `cs231n/datasets` directory and run the script `get_tiny_imagenet_a.sh`. Then run the following code to load the TinyImageNet-100-A dataset into memory.\n", + "\n", + "NOTE: The full TinyImageNet-100-A dataset will take up about 250MB of disk space, and loading the full TinyImageNet-100-A dataset into memory will use about 2.8GB of memory." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "data = load_tiny_imagenet('cs231n/datasets/tiny-imagenet-100-A', subtract_mean=True)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# TinyImageNet-100-A classes\n", + "Since ImageNet is based on the WordNet ontology, each class in ImageNet (and TinyImageNet) actually has several different names. For example \"pop bottle\" and \"soda bottle\" are both valid names for the same class. Run the following to see a list of all classes in TinyImageNet-100-A:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "for i, names in enumerate(data['class_names']):\n", + " print i, ' '.join('\"%s\"' % name for name in names)" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false + } + }, + { + "source": [ + "# Visualize Examples\n", + "Run the following to visualize some example images from random classes in TinyImageNet-100-A. It selects classes and images randomly, so you can run it several times to see different images." 
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Visualize some examples of the training data\n", + "classes_to_show = 7\n", + "examples_per_class = 5\n", + "\n", + "class_idxs = np.random.choice(len(data['class_names']), size=classes_to_show, replace=False)\n", + "for i, class_idx in enumerate(class_idxs):\n", + " train_idxs, = np.nonzero(data['y_train'] == class_idx)\n", + " train_idxs = np.random.choice(train_idxs, size=examples_per_class, replace=False)\n", + " for j, train_idx in enumerate(train_idxs):\n", + " img = deprocess_image(data['X_train'][train_idx], data['mean_image'])\n", + " plt.subplot(examples_per_class, classes_to_show, 1 + i + classes_to_show * j)\n", + " if j == 0:\n", + " plt.title(data['class_names'][class_idx][0])\n", + " plt.imshow(img)\n", + " plt.gca().axis('off')\n", + "\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Pretrained model\n", + "We have trained a deep CNN for you on the TinyImageNet-100-A dataset that we will use for image visualization. The model has 9 convolutional layers (with spatial batch normalization) and 1 fully-connected hidden layer (with batch normalization).\n", + "\n", + "To get the model, run the script `get_pretrained_model.sh` from the `cs231n/datasets` directory. After doing so, run the following to load the model from disk." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "model = PretrainedCNN(h5_file='cs231n/datasets/pretrained_model.h5')" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Pretrained model performance\n", + "Run the following to test the performance of the pretrained model on some random training and validation set images. You should see training accuracy around 90% and validation accuracy around 60%; this indicates a bit of overfitting, but it should work for our visualization experiments." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "batch_size = 100\n", + "\n", + "# Test the model on training data\n", + "mask = np.random.randint(data['X_train'].shape[0], size=batch_size)\n", + "X, y = data['X_train'][mask], data['y_train'][mask]\n", + "y_pred = model.loss(X).argmax(axis=1)\n", + "print 'Training accuracy: ', (y_pred == y).mean()\n", + "\n", + "# Test the model on validation data\n", + "mask = np.random.randint(data['X_val'].shape[0], size=batch_size)\n", + "X, y = data['X_val'][mask], data['y_val'][mask]\n", + "y_pred = model.loss(X).argmax(axis=1)\n", + "print 'Validation accuracy: ', (y_pred == y).mean()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Saliency Maps\n", + "Using this pretrained model, we will compute class saliency maps as described in Section 3.1 of [1].\n", + "\n", + "As mentioned in Section 2 of the paper, you should compute the gradient of the image with respect to the unnormalized class score, not with respect to the normalized class probability.\n", + "\n", + "You will need to use the `forward` and `backward` methods of the `PretrainedCNN` class to compute gradients with respect to the image. Open the file `cs231n/classifiers/pretrained_cnn.py` and read the documentation for these methods to make sure you know how they work. For example usage, you can see the `loss` method. 
Make sure to run the model in `test` mode when computing saliency maps.\n", + "\n", + "[1] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. \"Deep Inside Convolutional Networks: Visualising\n", + "Image Classification Models and Saliency Maps\", ICLR Workshop 2014." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "def compute_saliency_maps(X, y, model):\n", + " \"\"\"\n", + " Compute a class saliency map using the model for images X and labels y.\n", + " \n", + " Input:\n", + " - X: Input images, of shape (N, 3, H, W)\n", + " - y: Labels for X, of shape (N,)\n", + " - model: A PretrainedCNN that will be used to compute the saliency map.\n", + " \n", + " Returns:\n", + " - saliency: An array of shape (N, H, W) giving the saliency maps for the input\n", + " images.\n", + " \"\"\"\n", + " saliency = None\n", + " ##############################################################################\n", + " # TODO: Implement this function. You should use the forward and backward #\n", + " # methods of the PretrainedCNN class, and compute gradients with respect to #\n", + " # the unnormalized class score of the ground-truth classes in y. #\n", + " ##############################################################################\n", + " pass\n", + " ##############################################################################\n", + " # END OF YOUR CODE #\n", + " ##############################################################################\n", + " return saliency" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "Once you have completed the implementation in the cell above, run the following to visualize some class saliency maps on the validation set of TinyImageNet-100-A." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "def show_saliency_maps(mask):\n", + " mask = np.asarray(mask)\n", + " X = data['X_val'][mask]\n", + " y = data['y_val'][mask]\n", + "\n", + " saliency = compute_saliency_maps(X, y, model)\n", + "\n", + " for i in xrange(mask.size):\n", + " plt.subplot(2, mask.size, i + 1)\n", + " plt.imshow(deprocess_image(X[i], data['mean_image']))\n", + " plt.axis('off')\n", + " plt.title(data['class_names'][y[i]][0])\n", + " plt.subplot(2, mask.size, mask.size + i + 1)\n", + " plt.title(mask[i])\n", + " plt.imshow(saliency[i])\n", + " plt.axis('off')\n", + " plt.gcf().set_size_inches(10, 4)\n", + " plt.show()\n", + "\n", + "# Show some random images\n", + "mask = np.random.randint(data['X_val'].shape[0], size=5)\n", + "show_saliency_maps(mask)\n", + " \n", + "# These are some cherry-picked images that should give good results\n", + "show_saliency_maps([128, 3225, 2417, 1640, 4619])" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Fooling Images\n", + "We can also use image gradients to generate \"fooling images\" as discussed in [2]. Given an image and a target class, we can perform gradient ascent over the image to maximize the target class, stopping when the network classifies the image as the target class. 
Implement the following function to generate fooling images.\n", + "\n", + "[2] Szegedy et al, \"Intriguing properties of neural networks\", ICLR 2014" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "def make_fooling_image(X, target_y, model):\n", + " \"\"\"\n", + " Generate a fooling image that is close to X, but that the model classifies\n", + " as target_y.\n", + " \n", + " Inputs:\n", + " - X: Input image, of shape (1, 3, 64, 64)\n", + " - target_y: An integer in the range [0, 100)\n", + " - model: A PretrainedCNN\n", + " \n", + " Returns:\n", + " - X_fooling: An image that is close to X, but that is classified as target_y\n", + " by the model.\n", + " \"\"\"\n", + " X_fooling = X.copy()\n", + " ##############################################################################\n", + " # TODO: Generate a fooling image X_fooling that the model will classify as #\n", + " # the class target_y. Use gradient ascent on the target class score, using #\n", + " # the model.forward method to compute scores and the model.backward method #\n", + " # to compute image gradients. #\n", + " # #\n", + " # HINT: For most examples, you should be able to generate a fooling image #\n", + " # in fewer than 100 iterations of gradient ascent. #\n", + " ##############################################################################\n", + " pass\n", + " ##############################################################################\n", + " # END OF YOUR CODE #\n", + " ##############################################################################\n", + " return X_fooling" + ], + "outputs": [], + "metadata": { + "collapsed": true + } + }, + { + "source": [ + "Run the following to choose a random validation set image that is correctly classified by the network, and then make a fooling image.\n",
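+ "\n", + "As a non-authoritative sketch, the gradient ascent loop inside `make_fooling_image` might look roughly like this, under the same assumptions about `model.forward` and `model.backward` as in the class visualization notebook (the step size `step` is a made-up name for a hyperparameter you would choose yourself):\n", + "\n", + "~~~python\n", + "for i in xrange(100):\n", + "    scores, cache = model.forward(X_fooling, mode='test')\n", + "    if scores[0].argmax() == target_y:\n", + "        break                      # stop once the network is fooled\n", + "    dscores = np.zeros_like(scores)\n", + "    dscores[0, target_y] = 1.0     # ascend the unnormalized target class score\n", + "    dX, _ = model.backward(dscores, cache)\n", + "    X_fooling += step * dX\n", + "~~~"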
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Find a correctly classified validation image\n", + "while True:\n", + " i = np.random.randint(data['X_val'].shape[0])\n", + " X = data['X_val'][i:i+1]\n", + " y = data['y_val'][i:i+1]\n", + " y_pred = model.loss(X)[0].argmax()\n", + " if y_pred == y: break\n", + "\n", + "target_y = 67\n", + "X_fooling = make_fooling_image(X, target_y, model)\n", + "\n", + "# Make sure that X_fooling is classified as target_y\n", + "scores = model.loss(X_fooling)\n", + "assert scores[0].argmax() == target_y, 'The network is not fooled!'\n", + "\n", + "# Show original image, fooling image, and difference\n", + "plt.subplot(1, 3, 1)\n", + "plt.imshow(deprocess_image(X, data['mean_image']))\n", + "plt.axis('off')\n", + "plt.title(data['class_names'][y][0])\n", + "plt.subplot(1, 3, 2)\n", + "plt.imshow(deprocess_image(X_fooling, data['mean_image'], renorm=True))\n", + "plt.title(data['class_names'][target_y][0])\n", + "plt.axis('off')\n", + "plt.subplot(1, 3, 3)\n", + "plt.title('Difference')\n", + "plt.imshow(deprocess_image(X - X_fooling, data['mean_image']))\n", + "plt.axis('off')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.6", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/assignments2016/assignment3/LSTM_Captioning.ipynb b/assignments2016/assignment3/LSTM_Captioning.ipynb new file mode 100644 index 00000000..74c0bf44 --- /dev/null +++ b/assignments2016/assignment3/LSTM_Captioning.ipynb @@ -0,0 +1,483 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Image Captioning with LSTMs\n", + "In the previous exercise you implemented a vanilla RNN and applied it to image captioning. In this notebook you will implement the LSTM update rule and use it for image captioning." 
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# As usual, a bit of setup\n", + "\n", + "import time, os, json\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "\n", + "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n", + "from cs231n.rnn_layers import *\n", + "from cs231n.captioning_solver import CaptioningSolver\n", + "from cs231n.classifiers.rnn import CaptioningRNN\n", + "from cs231n.coco_utils import load_coco_data, sample_coco_minibatch, decode_captions\n", + "from cs231n.image_utils import image_from_url\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Load MS-COCO data\n", + "As in the previous notebook, we will use the Microsoft COCO dataset for captioning." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Load COCO data from disk; this returns a dictionary\n", + "# We'll work with dimensionality-reduced features for this notebook, but feel\n", + "# free to experiment with the original features by changing the flag below.\n", + "data = load_coco_data(pca_features=True)\n", + "\n", + "# Print out all the keys and values from the data dictionary\n", + "for k, v in data.iteritems():\n", + " if type(v) == np.ndarray:\n", + " print k, type(v), v.shape, v.dtype\n", + " else:\n", + " print k, type(v), len(v)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# LSTM\n", + "If you read recent papers, you'll see that many people use a variant of the vanilla RNN called the Long Short-Term Memory (LSTM) RNN. Vanilla RNNs can be tough to train on long sequences due to vanishing and exploding gradients caused by repeated matrix multiplication. LSTMs solve this problem by replacing the simple update rule of the vanilla RNN with a gating mechanism as follows.\n", + "\n", + "Similar to the vanilla RNN, at each timestep we receive an input $x_t\\in\\mathbb{R}^D$ and the previous hidden state $h_{t-1}\\in\\mathbb{R}^H$; the LSTM also maintains an $H$-dimensional *cell state*, so we also receive the previous cell state $c_{t-1}\\in\\mathbb{R}^H$. The learnable parameters of the LSTM are an *input-to-hidden* matrix $W_x\\in\\mathbb{R}^{4H\\times D}$, a *hidden-to-hidden* matrix $W_h\\in\\mathbb{R}^{4H\\times H}$ and a *bias vector* $b\\in\\mathbb{R}^{4H}$.\n", + "\n", + "At each timestep we first compute an *activation vector* $a\\in\\mathbb{R}^{4H}$ as $a=W_xx_t + W_hh_{t-1}+b$. We then divide this into four vectors $a_i,a_f,a_o,a_g\\in\\mathbb{R}^H$ where $a_i$ consists of the first $H$ elements of $a$, $a_f$ is the next $H$ elements of $a$, etc. 
We then compute the *input gate* $i\\in\\mathbb{R}^H$, *forget gate* $f\\in\\mathbb{R}^H$, *output gate* $o\\in\\mathbb{R}^H$, and *block input* $g\\in\\mathbb{R}^H$ as\n", + "\n", + "$$\n", + "\\begin{align*}\n", + "i = \\sigma(a_i) \\hspace{2pc}\n", + "f = \\sigma(a_f) \\hspace{2pc}\n", + "o = \\sigma(a_o) \\hspace{2pc}\n", + "g = \\tanh(a_g)\n", + "\\end{align*}\n", + "$$\n", + "\n", + "where $\\sigma$ is the sigmoid function and $\\tanh$ is the hyperbolic tangent, both applied elementwise.\n", + "\n", + "Finally we compute the next cell state $c_t$ and next hidden state $h_t$ as\n", + "\n", + "$$\n", + "c_{t} = f\\odot c_{t-1} + i\\odot g \\hspace{4pc}\n", + "h_t = o\\odot\\tanh(c_t)\n", + "$$\n", + "\n", + "where $\\odot$ is the elementwise product of vectors.\n", + "\n", + "In the rest of the notebook we will implement the LSTM update rule and apply it to the image captioning task." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "# LSTM: step forward\n", + "Implement the forward pass for a single timestep of an LSTM in the `lstm_step_forward` function in the file `cs231n/rnn_layers.py`. This should be similar to the `rnn_step_forward` function that you implemented above, but using the LSTM update rule instead.\n", + "\n", + "Once you are done, run the following to perform a simple test of your implementation. You should see errors around `1e-8` or less." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, D, H = 3, 4, 5\n", + "x = np.linspace(-0.4, 1.2, num=N*D).reshape(N, D)\n", + "prev_h = np.linspace(-0.3, 0.7, num=N*H).reshape(N, H)\n", + "prev_c = np.linspace(-0.4, 0.9, num=N*H).reshape(N, H)\n", + "Wx = np.linspace(-2.1, 1.3, num=4*D*H).reshape(D, 4 * H)\n", + "Wh = np.linspace(-0.7, 2.2, num=4*H*H).reshape(H, 4 * H)\n", + "b = np.linspace(0.3, 0.7, num=4*H)\n", + "\n", + "next_h, next_c, cache = lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)\n", + "\n", + "expected_next_h = np.asarray([\n", + " [ 0.24635157, 0.28610883, 0.32240467, 0.35525807, 0.38474904],\n", + " [ 0.49223563, 0.55611431, 0.61507696, 0.66844003, 0.7159181 ],\n", + " [ 0.56735664, 0.66310127, 0.74419266, 0.80889665, 0.858299 ]])\n", + "expected_next_c = np.asarray([\n", + " [ 0.32986176, 0.39145139, 0.451556, 0.51014116, 0.56717407],\n", + " [ 0.66382255, 0.76674007, 0.87195994, 0.97902709, 1.08751345],\n", + " [ 0.74192008, 0.90592151, 1.07717006, 1.25120233, 1.42395676]])\n", + "\n", + "print 'next_h error: ', rel_error(expected_next_h, next_h)\n", + "print 'next_c error: ', rel_error(expected_next_c, next_c)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# LSTM: step backward\n", + "Implement the backward pass for a single LSTM timestep in the function `lstm_step_backward` in the file `cs231n/rnn_layers.py`. Once you are done, run the following to perform numeric gradient checking on your implementation. You should see errors around `1e-8` or less.\n",
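+ "\n", + "For reference when deriving the backward pass, here is a sketch of the forward computation being differentiated, written directly from the update rule above (it assumes a `sigmoid` helper is available in `rnn_layers.py`; if there is none, define one):\n", + "\n", + "~~~python\n", + "a = x.dot(Wx) + prev_h.dot(Wh) + b       # shape (N, 4H)\n", + "ai, af, ao, ag = np.split(a, 4, axis=1)  # four (N, H) slices\n", + "i, f, o, g = sigmoid(ai), sigmoid(af), sigmoid(ao), np.tanh(ag)\n", + "next_c = f * prev_c + i * g\n", + "next_h = o * np.tanh(next_c)\n", + "~~~"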
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, D, H = 4, 5, 6\n", + "x = np.random.randn(N, D)\n", + "prev_h = np.random.randn(N, H)\n", + "prev_c = np.random.randn(N, H)\n", + "Wx = np.random.randn(D, 4 * H)\n", + "Wh = np.random.randn(H, 4 * H)\n", + "b = np.random.randn(4 * H)\n", + "\n", + "next_h, next_c, cache = lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)\n", + "\n", + "dnext_h = np.random.randn(*next_h.shape)\n", + "dnext_c = np.random.randn(*next_c.shape)\n", + "\n", + "# Note: the lambda arguments below are ignored on purpose;\n", + "# eval_numerical_gradient_array perturbs the closed-over arrays in place,\n", + "# so each closure still recomputes the forward pass with the perturbed value.\n", + "fx_h = lambda x: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[0]\n", + "fh_h = lambda h: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[0]\n", + "fc_h = lambda c: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[0]\n", + "fWx_h = lambda Wx: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[0]\n", + "fWh_h = lambda Wh: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[0]\n", + "fb_h = lambda b: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[0]\n", + "\n", + "fx_c = lambda x: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[1]\n", + "fh_c = lambda h: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[1]\n", + "fc_c = lambda c: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[1]\n", + "fWx_c = lambda Wx: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[1]\n", + "fWh_c = lambda Wh: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[1]\n", + "fb_c = lambda b: lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b)[1]\n", + "\n", + "num_grad = eval_numerical_gradient_array\n", + "\n", + "dx_num = num_grad(fx_h, x, dnext_h) + num_grad(fx_c, x, dnext_c)\n", + "dh_num = num_grad(fh_h, prev_h, dnext_h) + num_grad(fh_c, prev_h, dnext_c)\n", + "dc_num = num_grad(fc_h, prev_c, dnext_h) + num_grad(fc_c, prev_c, dnext_c)\n", + "dWx_num = num_grad(fWx_h, Wx, dnext_h) + num_grad(fWx_c, Wx, dnext_c)\n", + "dWh_num = num_grad(fWh_h, Wh, dnext_h) + num_grad(fWh_c, Wh, dnext_c)\n", + "db_num = num_grad(fb_h, b, dnext_h) + num_grad(fb_c, b, dnext_c)\n", + "\n", + "dx, dh, dc, dWx, dWh, db = lstm_step_backward(dnext_h, dnext_c, cache)\n", + "\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "print 'dh error: ', rel_error(dh_num, dh)\n", + "print 'dc error: ', rel_error(dc_num, dc)\n", + "print 'dWx error: ', rel_error(dWx_num, dWx)\n", + "print 'dWh error: ', rel_error(dWh_num, dWh)\n", + "print 'db error: ', rel_error(db_num, db)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# LSTM: forward\n", + "In the file `cs231n/rnn_layers.py`, implement the `lstm_forward` function to run an LSTM forward on an entire timeseries of data.\n", + "\n", + "When you are done, run the following to check your implementation. You should see an error around `1e-7`.\n",
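+ "\n", + "Schematically, the whole-sequence forward pass just iterates the single-step function; a sketch (assuming the initial cell state is zero, which is what the test below expects):\n", + "\n", + "~~~python\n", + "N, T, D = x.shape\n", + "H = h0.shape[1]\n", + "h = np.zeros((N, T, H))\n", + "prev_h, prev_c = h0, np.zeros_like(h0)\n", + "for t in xrange(T):\n", + "    prev_h, prev_c, step_cache = lstm_step_forward(x[:, t, :], prev_h, prev_c, Wx, Wh, b)\n", + "    h[:, t, :] = prev_h\n", + "~~~"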
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, D, H, T = 2, 5, 4, 3\n", + "x = np.linspace(-0.4, 0.6, num=N*T*D).reshape(N, T, D)\n", + "h0 = np.linspace(-0.4, 0.8, num=N*H).reshape(N, H)\n", + "Wx = np.linspace(-0.2, 0.9, num=4*D*H).reshape(D, 4 * H)\n", + "Wh = np.linspace(-0.3, 0.6, num=4*H*H).reshape(H, 4 * H)\n", + "b = np.linspace(0.2, 0.7, num=4*H)\n", + "\n", + "h, cache = lstm_forward(x, h0, Wx, Wh, b)\n", + "\n", + "expected_h = np.asarray([\n", + " [[ 0.01764008, 0.01823233, 0.01882671, 0.0194232 ],\n", + " [ 0.11287491, 0.12146228, 0.13018446, 0.13902939],\n", + " [ 0.31358768, 0.33338627, 0.35304453, 0.37250975]],\n", + " [[ 0.45767879, 0.4761092, 0.4936887, 0.51041945],\n", + " [ 0.6704845, 0.69350089, 0.71486014, 0.7346449 ],\n", + " [ 0.81733511, 0.83677871, 0.85403753, 0.86935314]]])\n", + "\n", + "print 'h error: ', rel_error(expected_h, h)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# LSTM: backward\n", + "Implement the backward pass for an LSTM over an entire timeseries of data in the function `lstm_backward` in the file `cs231n/rnn_layers.py`. When you are done, run the following to perform numeric gradient checking on your implementation. You should see errors around `1e-8` or less." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.rnn_layers import lstm_forward, lstm_backward\n", + "\n", + "N, D, T, H = 2, 3, 10, 6\n", + "\n", + "x = np.random.randn(N, T, D)\n", + "h0 = np.random.randn(N, H)\n", + "Wx = np.random.randn(D, 4 * H)\n", + "Wh = np.random.randn(H, 4 * H)\n", + "b = np.random.randn(4 * H)\n", + "\n", + "out, cache = lstm_forward(x, h0, Wx, Wh, b)\n", + "\n", + "dout = np.random.randn(*out.shape)\n", + "\n", + "dx, dh0, dWx, dWh, db = lstm_backward(dout, cache)\n", + "\n", + "fx = lambda x: lstm_forward(x, h0, Wx, Wh, b)[0]\n", + "fh0 = lambda h0: lstm_forward(x, h0, Wx, Wh, b)[0]\n", + "fWx = lambda Wx: lstm_forward(x, h0, Wx, Wh, b)[0]\n", + "fWh = lambda Wh: lstm_forward(x, h0, Wx, Wh, b)[0]\n", + "fb = lambda b: lstm_forward(x, h0, Wx, Wh, b)[0]\n", + "\n", + "dx_num = eval_numerical_gradient_array(fx, x, dout)\n", + "dh0_num = eval_numerical_gradient_array(fh0, h0, dout)\n", + "dWx_num = eval_numerical_gradient_array(fWx, Wx, dout)\n", + "dWh_num = eval_numerical_gradient_array(fWh, Wh, dout)\n", + "db_num = eval_numerical_gradient_array(fb, b, dout)\n", + "\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "print 'dh0 error: ', rel_error(dh0_num, dh0)\n", + "print 'dWx error: ', rel_error(dWx_num, dWx)\n", + "print 'dWh error: ', rel_error(dWh_num, dWh)\n", + "print 'db error: ', rel_error(db_num, db)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# LSTM captioning model\n", + "Now that you have implemented an LSTM, update the implementation of the `loss` method of the `CaptioningRNN` class in the file `cs231n/classifiers/rnn.py` to handle the case where `self.cell_type` is `lstm`. This should require adding fewer than 10 lines of code.\n", + "\n", + "Once you have done so, run the following to check your implementation. You should see a difference of less than `1e-10`.\n",
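+ "\n", + "Most of the change is a dispatch on `self.cell_type`; as a sketch (the surrounding variable names are illustrative, not prescribed):\n", + "\n", + "~~~python\n", + "if self.cell_type == 'rnn':\n", + "    h, rnn_cache = rnn_forward(x, h0, Wx, Wh, b)\n", + "elif self.cell_type == 'lstm':\n", + "    h, rnn_cache = lstm_forward(x, h0, Wx, Wh, b)\n", + "~~~"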
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, D, W, H = 10, 20, 30, 40\n", + "word_to_idx = {'<NULL>': 0, 'cat': 2, 'dog': 3}\n", + "V = len(word_to_idx)\n", + "T = 13\n", + "\n", + "model = CaptioningRNN(word_to_idx,\n", + " input_dim=D,\n", + " wordvec_dim=W,\n", + " hidden_dim=H,\n", + " cell_type='lstm',\n", + " dtype=np.float64)\n", + "\n", + "# Set all model parameters to fixed values\n", + "for k, v in model.params.iteritems():\n", + " model.params[k] = np.linspace(-1.4, 1.3, num=v.size).reshape(*v.shape)\n", + "\n", + "features = np.linspace(-0.5, 1.7, num=N*D).reshape(N, D)\n", + "captions = (np.arange(N * T) % V).reshape(N, T)\n", + "\n", + "loss, grads = model.loss(features, captions)\n", + "expected_loss = 9.82445935443\n", + "\n", + "print 'loss: ', loss\n", + "print 'expected loss: ', expected_loss\n", + "print 'difference: ', abs(loss - expected_loss)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Overfit LSTM captioning model\n", + "Run the following to overfit an LSTM captioning model on the same small dataset as we used for the RNN above." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "small_data = load_coco_data(max_train=50)\n", + "\n", + "small_lstm_model = CaptioningRNN(\n", + " cell_type='lstm',\n", + " word_to_idx=data['word_to_idx'],\n", + " input_dim=data['train_features'].shape[1],\n", + " hidden_dim=512,\n", + " wordvec_dim=256,\n", + " dtype=np.float32,\n", + " )\n", + "\n", + "small_lstm_solver = CaptioningSolver(small_lstm_model, small_data,\n", + " update_rule='adam',\n", + " num_epochs=50,\n", + " batch_size=25,\n", + " optim_config={\n", + " 'learning_rate': 5e-3,\n", + " },\n", + " lr_decay=0.995,\n", + " verbose=True, print_every=10,\n", + " )\n", + "\n", + "small_lstm_solver.train()\n", + "\n", + "# Plot the training losses\n", + "plt.plot(small_lstm_solver.loss_history)\n", + "plt.xlabel('Iteration')\n", + "plt.ylabel('Loss')\n", + "plt.title('Training loss history')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# LSTM test-time sampling\n", + "Modify the `sample` method of the `CaptioningRNN` class to handle the case where `self.cell_type` is `lstm`. This should take fewer than 10 lines of code.\n", + "\n", + "When you are done, run the following to sample from your overfit LSTM model on some training and validation set samples.\n",
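+ "\n", + "As in `loss`, the sampling loop mainly needs to carry a cell state alongside the hidden state; roughly (names illustrative, with `prev_c` initialized to zeros):\n", + "\n", + "~~~python\n", + "# one timestep inside the sampling loop:\n", + "if self.cell_type == 'rnn':\n", + "    prev_h, _ = rnn_step_forward(word_vecs, prev_h, Wx, Wh, b)\n", + "else:\n", + "    prev_h, prev_c, _ = lstm_step_forward(word_vecs, prev_h, prev_c, Wx, Wh, b)\n", + "~~~"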
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "for split in ['train', 'val']:\n", + " minibatch = sample_coco_minibatch(small_data, split=split, batch_size=2)\n", + " gt_captions, features, urls = minibatch\n", + " gt_captions = decode_captions(gt_captions, data['idx_to_word'])\n", + "\n", + " sample_captions = small_lstm_model.sample(features)\n", + " sample_captions = decode_captions(sample_captions, data['idx_to_word'])\n", + "\n", + " for gt_caption, sample_caption, url in zip(gt_captions, sample_captions, urls):\n", + " plt.imshow(image_from_url(url))\n", + " plt.title('%s\\n%s\\nGT:%s' % (split, sample_caption, gt_caption))\n", + " plt.axis('off')\n", + " plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Train a good captioning model!\n", + "Using the pieces you have implemented in this and the previous notebook, try to train a captioning model that gives decent qualitative results (better than the random garbage you saw with the overfit models) when sampling on the validation set. You can subsample the training set if you want; we just want to see samples on the validation set that are better than random.\n", + "\n", + "Don't spend too much time on this part; we don't have any explicit accuracy thresholds you need to meet." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "pass\n" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "pass\n" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.6", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/assignments2016/assignment3/RNN_Captioning.ipynb b/assignments2016/assignment3/RNN_Captioning.ipynb new file mode 100644 index 00000000..61bc2a46 --- /dev/null +++ b/assignments2016/assignment3/RNN_Captioning.ipynb @@ -0,0 +1,659 @@ +{ + "nbformat_minor": 0, + "nbformat": 4, + "cells": [ + { + "source": [ + "# Image Captioning with RNNs\n", + "In this exercise you will implement vanilla recurrent neural networks and use them to train a model that can generate novel captions for images." 
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# As usual, a bit of setup\n", + "\n", + "import time, os, json\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "\n", + "from cs231n.gradient_check import eval_numerical_gradient, eval_numerical_gradient_array\n", + "from cs231n.rnn_layers import *\n", + "from cs231n.captioning_solver import CaptioningSolver\n", + "from cs231n.classifiers.rnn import CaptioningRNN\n", + "from cs231n.coco_utils import load_coco_data, sample_coco_minibatch, decode_captions\n", + "from cs231n.image_utils import image_from_url\n", + "\n", + "%matplotlib inline\n", + "plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots\n", + "plt.rcParams['image.interpolation'] = 'nearest'\n", + "plt.rcParams['image.cmap'] = 'gray'\n", + "\n", + "# for auto-reloading external modules\n", + "# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython\n", + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "def rel_error(x, y):\n", + " \"\"\" returns relative error \"\"\"\n", + " return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Microsoft COCO\n", + "For this exercise we will use the 2014 release of the [Microsoft COCO dataset](http://mscoco.org/) which has become the standard testbed for image captioning. The dataset consists of 80,000 training images and 40,000 validation images, each annotated with 5 captions written by workers on Amazon Mechanical Turk.\n", + "\n", + "To download the data, change to the `cs231n/datasets` directory and run the script `get_coco_captioning.sh`.\n", + "\n", + "We have preprocessed the data and extracted features for you already. For all images we have extracted features from the fc7 layer of the VGG-16 network pretrained on ImageNet; these features are stored in the files `train2014_vgg16_fc7.h5` and `val2014_vgg16_fc7.h5` respectively. To cut down on processing time and memory requirements, we have reduced the dimensionality of the features from 4096 to 512; these features can be found in the files `train2014_vgg16_fc7_pca.h5` and `val2014_vgg16_fc7_pca.h5`.\n", + "\n", + "The raw images take up a lot of space (nearly 20GB) so we have not included them in the download. However all images are taken from Flickr, and URLs of the training and validation images are stored in the files `train2014_urls.txt` and `val2014_urls.txt` respectively. This allows you to download images on the fly for visualization. Since images are downloaded on-the-fly, **you must be connected to the internet to view images**.\n", + "\n", + "Dealing with strings is inefficient, so we will work with an encoded version of the captions. Each word is assigned an integer ID, allowing us to represent a caption by a sequence of integers. The mapping between integer IDs and words is in the file `coco2014_vocab.json`, and you can use the function `decode_captions` from the file `cs231n/coco_utils.py` to convert numpy arrays of integer IDs back into strings.\n", + "\n", + "There are a couple of special tokens that we add to the vocabulary. We prepend a special `<START>` token and append an `<END>` token to the beginning and end of each caption respectively. Rare words are replaced with a special `<UNK>` token (for \"unknown\"). 
In addition, since we want to train with minibatches containing captions of different lengths, we pad short captions with a special `<NULL>` token after the `<END>` token and don't compute loss or gradient for `<NULL>` tokens. Since they are a bit of a pain, we have taken care of all implementation details around special tokens for you.\n", + "\n", + "You can load all of the MS-COCO data (captions, features, URLs, and vocabulary) using the `load_coco_data` function from the file `cs231n/coco_utils.py`. Run the following cell to do so:" + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Load COCO data from disk; this returns a dictionary\n", + "# We'll work with dimensionality-reduced features for this notebook, but feel\n", + "# free to experiment with the original features by changing the flag below.\n", + "data = load_coco_data(pca_features=True)\n", + "\n", + "# Print out all the keys and values from the data dictionary\n", + "for k, v in data.iteritems():\n", + " if type(v) == np.ndarray:\n", + " print k, type(v), v.shape, v.dtype\n", + " else:\n", + " print k, type(v), len(v)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "## Look at the data\n", + "It is always a good idea to look at examples from the dataset before working with it.\n", + "\n", + "You can use the `sample_coco_minibatch` function from the file `cs231n/coco_utils.py` to sample minibatches of data from the data structure returned from `load_coco_data`. Run the following to sample a small minibatch of training data and show the images and their captions. Running it multiple times and looking at the results helps you to get a sense of the dataset.\n", + "\n", + "Note that we decode the captions using the `decode_captions` function and that we download the images on-the-fly using their Flickr URL, so **you must be connected to the internet to view images**." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Sample a minibatch and show the images and captions\n", + "batch_size = 3\n", + "\n", + "captions, features, urls = sample_coco_minibatch(data, batch_size=batch_size)\n", + "for i, (caption, url) in enumerate(zip(captions, urls)):\n", + " plt.imshow(image_from_url(url))\n", + " plt.axis('off')\n", + " caption_str = decode_captions(caption, data['idx_to_word'])\n", + " plt.title(caption_str)\n", + " plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Recurrent Neural Networks\n", + "As discussed in lecture, we will use recurrent neural network (RNN) language models for image captioning. The file `cs231n/rnn_layers.py` contains implementations of different layer types that are needed for recurrent neural networks, and the file `cs231n/classifiers/rnn.py` uses these layers to implement an image captioning model.\n", + "\n", + "We will first implement different types of RNN layers in `cs231n/rnn_layers.py`." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "source": [ + "# Vanilla RNN: step forward\n", + "Open the file `cs231n/rnn_layers.py`. This file implements the forward and backward passes for different types of layers that are commonly used in recurrent neural networks.\n", + "\n", + "First implement the function `rnn_step_forward` which implements the forward pass for a single timestep of a vanilla recurrent neural network. 
After doing so, run the following to check your implementation." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, D, H = 3, 10, 4\n", + "\n", + "x = np.linspace(-0.4, 0.7, num=N*D).reshape(N, D)\n", + "prev_h = np.linspace(-0.2, 0.5, num=N*H).reshape(N, H)\n", + "Wx = np.linspace(-0.1, 0.9, num=D*H).reshape(D, H)\n", + "Wh = np.linspace(-0.3, 0.7, num=H*H).reshape(H, H)\n", + "b = np.linspace(-0.2, 0.4, num=H)\n", + "\n", + "next_h, _ = rnn_step_forward(x, prev_h, Wx, Wh, b)\n", + "expected_next_h = np.asarray([\n", + " [-0.58172089, -0.50182032, -0.41232771, -0.31410098],\n", + " [ 0.66854692, 0.79562378, 0.87755553, 0.92795967],\n", + " [ 0.97934501, 0.99144213, 0.99646691, 0.99854353]])\n", + "\n", + "print 'next_h error: ', rel_error(expected_next_h, next_h)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Vanilla RNN: step backward\n", + "In the file `cs231n/rnn_layers.py` implement the `rnn_step_backward` function. After doing so, run the following to numerically gradient check your implementation. You should see errors less than `1e-8`." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "from cs231n.rnn_layers import rnn_step_forward, rnn_step_backward\n", + "\n", + "N, D, H = 4, 5, 6\n", + "x = np.random.randn(N, D)\n", + "h = np.random.randn(N, H)\n", + "Wx = np.random.randn(D, H)\n", + "Wh = np.random.randn(H, H)\n", + "b = np.random.randn(H)\n", + "\n", + "out, cache = rnn_step_forward(x, h, Wx, Wh, b)\n", + "\n", + "dnext_h = np.random.randn(*out.shape)\n", + "\n", + "fx = lambda x: rnn_step_forward(x, h, Wx, Wh, b)[0]\n", + "fh = lambda prev_h: rnn_step_forward(x, h, Wx, Wh, b)[0]\n", + "fWx = lambda Wx: rnn_step_forward(x, h, Wx, Wh, b)[0]\n", + "fWh = lambda Wh: rnn_step_forward(x, h, Wx, Wh, b)[0]\n", + "fb = lambda b: rnn_step_forward(x, h, Wx, Wh, b)[0]\n", + "\n", + "dx_num = eval_numerical_gradient_array(fx, x, dnext_h)\n", + "dprev_h_num = eval_numerical_gradient_array(fh, h, dnext_h)\n", + "dWx_num = eval_numerical_gradient_array(fWx, Wx, dnext_h)\n", + "dWh_num = eval_numerical_gradient_array(fWh, Wh, dnext_h)\n", + "db_num = eval_numerical_gradient_array(fb, b, dnext_h)\n", + "\n", + "dx, dprev_h, dWx, dWh, db = rnn_step_backward(dnext_h, cache)\n", + "\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "print 'dprev_h error: ', rel_error(dprev_h_num, dprev_h)\n", + "print 'dWx error: ', rel_error(dWx_num, dWx)\n", + "print 'dWh error: ', rel_error(dWh_num, dWh)\n", + "print 'db error: ', rel_error(db_num, db)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Vanilla RNN: forward\n", + "Now that you have implemented the forward and backward passes for a single timestep of a vanilla RNN, you will combine these pieces to implement an RNN that processes an entire sequence of data.\n", + "\n", + "In the file `cs231n/rnn_layers.py`, implement the function `rnn_forward`. This should be implemented using the `rnn_step_forward` function that you defined above. After doing so, run the following to check your implementation. You should see errors less than `1e-7`.\n",
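+ "\n", + "Schematically (a sketch, not the required code):\n", + "\n", + "~~~python\n", + "N, T, D = x.shape\n", + "H = h0.shape[1]\n", + "h = np.zeros((N, T, H))\n", + "prev_h = h0\n", + "for t in xrange(T):\n", + "    prev_h, step_cache = rnn_step_forward(x[:, t, :], prev_h, Wx, Wh, b)\n", + "    h[:, t, :] = prev_h\n", + "~~~"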
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, T, D, H = 2, 3, 4, 5\n", + "\n", + "x = np.linspace(-0.1, 0.3, num=N*T*D).reshape(N, T, D)\n", + "h0 = np.linspace(-0.3, 0.1, num=N*H).reshape(N, H)\n", + "Wx = np.linspace(-0.2, 0.4, num=D*H).reshape(D, H)\n", + "Wh = np.linspace(-0.4, 0.1, num=H*H).reshape(H, H)\n", + "b = np.linspace(-0.7, 0.1, num=H)\n", + "\n", + "h, _ = rnn_forward(x, h0, Wx, Wh, b)\n", + "expected_h = np.asarray([\n", + " [\n", + " [-0.42070749, -0.27279261, -0.11074945, 0.05740409, 0.22236251],\n", + " [-0.39525808, -0.22554661, -0.0409454, 0.14649412, 0.32397316],\n", + " [-0.42305111, -0.24223728, -0.04287027, 0.15997045, 0.35014525],\n", + " ],\n", + " [\n", + " [-0.55857474, -0.39065825, -0.19198182, 0.02378408, 0.23735671],\n", + " [-0.27150199, -0.07088804, 0.13562939, 0.33099728, 0.50158768],\n", + " [-0.51014825, -0.30524429, -0.06755202, 0.17806392, 0.40333043]]])\n", + "print 'h error: ', rel_error(expected_h, h)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Vanilla RNN: backward\n", + "In the file `cs231n/rnn_layers.py`, implement the backward pass for a vanilla RNN in the function `rnn_backward`. This should run back-propagation over the entire sequence, calling into the `rnn_step_backward` function that you defined above." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, D, T, H = 2, 3, 10, 5\n", + "\n", + "x = np.random.randn(N, T, D)\n", + "h0 = np.random.randn(N, H)\n", + "Wx = np.random.randn(D, H)\n", + "Wh = np.random.randn(H, H)\n", + "b = np.random.randn(H)\n", + "\n", + "out, cache = rnn_forward(x, h0, Wx, Wh, b)\n", + "\n", + "dout = np.random.randn(*out.shape)\n", + "\n", + "dx, dh0, dWx, dWh, db = rnn_backward(dout, cache)\n", + "\n", + "fx = lambda x: rnn_forward(x, h0, Wx, Wh, b)[0]\n", + "fh0 = lambda h0: rnn_forward(x, h0, Wx, Wh, b)[0]\n", + "fWx = lambda Wx: rnn_forward(x, h0, Wx, Wh, b)[0]\n", + "fWh = lambda Wh: rnn_forward(x, h0, Wx, Wh, b)[0]\n", + "fb = lambda b: rnn_forward(x, h0, Wx, Wh, b)[0]\n", + "\n", + "dx_num = eval_numerical_gradient_array(fx, x, dout)\n", + "dh0_num = eval_numerical_gradient_array(fh0, h0, dout)\n", + "dWx_num = eval_numerical_gradient_array(fWx, Wx, dout)\n", + "dWh_num = eval_numerical_gradient_array(fWh, Wh, dout)\n", + "db_num = eval_numerical_gradient_array(fb, b, dout)\n", + "\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "print 'dh0 error: ', rel_error(dh0_num, dh0)\n", + "print 'dWx error: ', rel_error(dWx_num, dWx)\n", + "print 'dWh error: ', rel_error(dWh_num, dWh)\n", + "print 'db error: ', rel_error(db_num, db)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Word embedding: forward\n", + "In deep learning systems, we commonly represent words using vectors. Each word of the vocabulary will be associated with a vector, and these vectors will be learned jointly with the rest of the system.\n", + "\n", + "In the file `cs231n/rnn_layers.py`, implement the function `word_embedding_forward` to convert words (represented by integers) into vectors. Run the following to check your implementation. You should see an error around `1e-8`.\n",
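+ "\n", + "The heart of the forward pass is a single integer-array indexing operation; as a sketch:\n", + "\n", + "~~~python\n", + "# x holds integer word indices of shape (N, T); W has shape (V, D)\n", + "out = W[x]  # shape (N, T, D): one row of W per word\n", + "~~~"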
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, T, V, D = 2, 4, 5, 3\n", + "\n", + "x = np.asarray([[0, 3, 1, 2], [2, 1, 0, 3]])\n", + "W = np.linspace(0, 1, num=V*D).reshape(V, D)\n", + "\n", + "out, _ = word_embedding_forward(x, W)\n", + "expected_out = np.asarray([\n", + " [[ 0., 0.07142857, 0.14285714],\n", + " [ 0.64285714, 0.71428571, 0.78571429],\n", + " [ 0.21428571, 0.28571429, 0.35714286],\n", + " [ 0.42857143, 0.5, 0.57142857]],\n", + " [[ 0.42857143, 0.5, 0.57142857],\n", + " [ 0.21428571, 0.28571429, 0.35714286],\n", + " [ 0., 0.07142857, 0.14285714],\n", + " [ 0.64285714, 0.71428571, 0.78571429]]])\n", + "\n", + "print 'out error: ', rel_error(expected_out, out)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Word embedding: backward\n", + "Implement the backward pass for the word embedding function in the function `word_embedding_backward`. After doing so run the following to numerically gradient check your implementation. You should see errors less than `1e-11`." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "N, T, V, D = 50, 3, 5, 6\n", + "\n", + "x = np.random.randint(V, size=(N, T))\n", + "W = np.random.randn(V, D)\n", + "\n", + "out, cache = word_embedding_forward(x, W)\n", + "dout = np.random.randn(*out.shape)\n", + "dW = word_embedding_backward(dout, cache)\n", + "\n", + "f = lambda W: word_embedding_forward(x, W)[0]\n", + "dW_num = eval_numerical_gradient_array(f, W, dout)\n", + "\n", + "print 'dW error: ', rel_error(dW, dW_num)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Temporal Affine layer\n", + "At every timestep we use an affine function to transform the RNN hidden vector at that timestep into scores for each word in the vocabulary. Because this is very similar to the affine layer that you implemented in assignment 2, we have provided this function for you in the `temporal_affine_forward` and `temporal_affine_backward` functions in the file `cs231n/rnn_layers.py`. Run the following to perform numeric gradient checking on the implementation." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "# Gradient check for temporal affine layer\n", + "N, T, D, M = 2, 3, 4, 5\n", + "\n", + "x = np.random.randn(N, T, D)\n", + "w = np.random.randn(D, M)\n", + "b = np.random.randn(M)\n", + "\n", + "out, cache = temporal_affine_forward(x, w, b)\n", + "\n", + "dout = np.random.randn(*out.shape)\n", + "\n", + "fx = lambda x: temporal_affine_forward(x, w, b)[0]\n", + "fw = lambda w: temporal_affine_forward(x, w, b)[0]\n", + "fb = lambda b: temporal_affine_forward(x, w, b)[0]\n", + "\n", + "dx_num = eval_numerical_gradient_array(fx, x, dout)\n", + "dw_num = eval_numerical_gradient_array(fw, w, dout)\n", + "db_num = eval_numerical_gradient_array(fb, b, dout)\n", + "\n", + "dx, dw, db = temporal_affine_backward(dout, cache)\n", + "\n", + "print 'dx error: ', rel_error(dx_num, dx)\n", + "print 'dw error: ', rel_error(dw_num, dw)\n", + "print 'db error: ', rel_error(db_num, db)" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Temporal Softmax loss\n", + "In an RNN language model, at every timestep we produce a score for each word in the vocabulary. 
We know the ground-truth word at each timestep, so we use a softmax loss function to compute loss and gradient at each timestep. We sum the losses over time and average them over the minibatch.\n",
+    "\n",
+    "However, there is one wrinkle: since we operate over minibatches and different captions may have different lengths, we append `<NULL>` tokens to the end of each caption so they all have the same length. We don't want these `<NULL>` tokens to count toward the loss or gradient, so in addition to scores and ground-truth labels our loss function also accepts a `mask` array that tells it which elements of the scores count towards the loss.\n",
+    "\n",
+    "Since this is very similar to the softmax loss function you implemented in assignment 1, we have implemented this loss function for you; look at the `temporal_softmax_loss` function in the file `cs231n/rnn_layers.py`.\n",
+    "\n",
+    "Run the following cell to sanity check the loss and perform numeric gradient checking on the function."
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "# Sanity check for temporal softmax loss\n",
+    "from cs231n.rnn_layers import temporal_softmax_loss\n",
+    "\n",
+    "N, T, V = 100, 1, 10\n",
+    "\n",
+    "def check_loss(N, T, V, p):\n",
+    "  x = 0.001 * np.random.randn(N, T, V)\n",
+    "  y = np.random.randint(V, size=(N, T))\n",
+    "  mask = np.random.rand(N, T) <= p\n",
+    "  print temporal_softmax_loss(x, y, mask)[0]\n",
+    "\n",
+    "check_loss(100, 1, 10, 1.0)   # Should be about 2.3\n",
+    "check_loss(100, 10, 10, 1.0)  # Should be about 23\n",
+    "check_loss(5000, 10, 10, 0.1) # Should be about 2.3\n",
+    "\n",
+    "# Gradient check for temporal softmax loss\n",
+    "N, T, V = 7, 8, 9\n",
+    "\n",
+    "x = np.random.randn(N, T, V)\n",
+    "y = np.random.randint(V, size=(N, T))\n",
+    "mask = (np.random.rand(N, T) > 0.5)\n",
+    "\n",
+    "loss, dx = temporal_softmax_loss(x, y, mask, verbose=False)\n",
+    "\n",
+    "dx_num = eval_numerical_gradient(lambda x: temporal_softmax_loss(x, y, mask)[0], x, verbose=False)\n",
+    "\n",
+    "print 'dx error: ', rel_error(dx, dx_num)"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "# RNN for image captioning\n",
+    "Now that you have implemented the necessary layers, you can combine them to build an image captioning model. Open the file `cs231n/classifiers/rnn.py` and look at the `CaptioningRNN` class.\n",
+    "\n",
+    "Implement the forward and backward pass of the model in the `loss` function. For now you only need to implement the case where `cell_type='rnn'` for vanilla RNNs; you will implement the LSTM case later. After doing so, run the following to check your forward pass using a small test case; you should see an error less than `1e-10`."
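For orientation, the forward pass you need in `loss` is essentially a chain of the layer functions checked above. A hedged sketch of one possible outline (forward only; variable names follow the scaffold comments in `rnn.py`, and the backward pass runs the matching `*_backward` functions in reverse order):

~~~python
# Possible forward-pass outline for CaptioningRNN.loss with cell_type='rnn'.
# Each call returns a cache that the corresponding backward call consumes.
h0, cache_proj = affine_forward(features, W_proj, b_proj)             # (N, H)
wordvecs, cache_embed = word_embedding_forward(captions_in, W_embed)  # (N, T, W)
h, cache_rnn = rnn_forward(wordvecs, h0, Wx, Wh, b)                   # (N, T, H)
scores, cache_scores = temporal_affine_forward(h, W_vocab, b_vocab)   # (N, T, V)
loss, dscores = temporal_softmax_loss(scores, captions_out, mask)
# Backward: temporal_affine_backward -> rnn_backward ->
# word_embedding_backward -> affine_backward, storing results in grads.
~~~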
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "N, D, W, H = 10, 20, 30, 40\n",
+    "word_to_idx = {'<NULL>': 0, 'cat': 2, 'dog': 3}\n",
+    "V = len(word_to_idx)\n",
+    "T = 13\n",
+    "\n",
+    "model = CaptioningRNN(word_to_idx,\n",
+    "          input_dim=D,\n",
+    "          wordvec_dim=W,\n",
+    "          hidden_dim=H,\n",
+    "          cell_type='rnn',\n",
+    "          dtype=np.float64)\n",
+    "\n",
+    "# Set all model parameters to fixed values\n",
+    "for k, v in model.params.iteritems():\n",
+    "  model.params[k] = np.linspace(-1.4, 1.3, num=v.size).reshape(*v.shape)\n",
+    "\n",
+    "features = np.linspace(-1.5, 0.3, num=(N * D)).reshape(N, D)\n",
+    "captions = (np.arange(N * T) % V).reshape(N, T)\n",
+    "\n",
+    "loss, grads = model.loss(features, captions)\n",
+    "expected_loss = 9.83235591003\n",
+    "\n",
+    "print 'loss: ', loss\n",
+    "print 'expected loss: ', expected_loss\n",
+    "print 'difference: ', abs(loss - expected_loss)"
+   ],
+   "outputs": [],
+   "metadata": {
+    "scrolled": false,
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "Run the following cell to perform numeric gradient checking on the `CaptioningRNN` class; you should see errors around `1e-7` or less."
+   ],
+   "cell_type": "markdown",
+   "metadata": {}
+  },
+  {
+   "execution_count": null,
+   "cell_type": "code",
+   "source": [
+    "batch_size = 2\n",
+    "timesteps = 3\n",
+    "input_dim = 4\n",
+    "wordvec_dim = 5\n",
+    "hidden_dim = 6\n",
+    "word_to_idx = {'<NULL>': 0, 'cat': 2, 'dog': 3}\n",
+    "vocab_size = len(word_to_idx)\n",
+    "\n",
+    "captions = np.random.randint(vocab_size, size=(batch_size, timesteps))\n",
+    "features = np.random.randn(batch_size, input_dim)\n",
+    "\n",
+    "model = CaptioningRNN(word_to_idx,\n",
+    "          input_dim=input_dim,\n",
+    "          wordvec_dim=wordvec_dim,\n",
+    "          hidden_dim=hidden_dim,\n",
+    "          cell_type='rnn',\n",
+    "          dtype=np.float64,\n",
+    "        )\n",
+    "\n",
+    "loss, grads = model.loss(features, captions)\n",
+    "\n",
+    "for param_name in sorted(grads):\n",
+    "  f = lambda _: model.loss(features, captions)[0]\n",
+    "  param_grad_num = eval_numerical_gradient(f, model.params[param_name], verbose=False, h=1e-6)\n",
+    "  e = rel_error(param_grad_num, grads[param_name])\n",
+    "  print '%s relative error: %e' % (param_name, e)"
+   ],
+   "outputs": [],
+   "metadata": {
+    "collapsed": false
+   }
+  },
+  {
+   "source": [
+    "# Overfit small data\n",
+    "Similar to the `Solver` class that we used to train image classification models on the previous assignment, on this assignment we use a `CaptioningSolver` class to train image captioning models. Open the file `cs231n/captioning_solver.py` and read through the `CaptioningSolver` class; it should look very familiar.\n",
+    "\n",
+    "Once you have familiarized yourself with the API, run the following to make sure your model can overfit a small sample of 50 training examples. You should see losses around 1."
+ ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "small_data = load_coco_data(max_train=50)\n", + "\n", + "small_rnn_model = CaptioningRNN(\n", + " cell_type='rnn',\n", + " word_to_idx=data['word_to_idx'],\n", + " input_dim=data['train_features'].shape[1],\n", + " hidden_dim=512,\n", + " wordvec_dim=256,\n", + " )\n", + "\n", + "small_rnn_solver = CaptioningSolver(small_rnn_model, small_data,\n", + " update_rule='adam',\n", + " num_epochs=50,\n", + " batch_size=25,\n", + " optim_config={\n", + " 'learning_rate': 5e-3,\n", + " },\n", + " lr_decay=0.95,\n", + " verbose=True, print_every=10,\n", + " )\n", + "\n", + "small_rnn_solver.train()\n", + "\n", + "# Plot the training losses\n", + "plt.plot(small_rnn_solver.loss_history)\n", + "plt.xlabel('Iteration')\n", + "plt.ylabel('Loss')\n", + "plt.title('Training loss history')\n", + "plt.show()" + ], + "outputs": [], + "metadata": { + "collapsed": false + } + }, + { + "source": [ + "# Test-time sampling\n", + "Unlike classification models, image captioning models behave very differently at training time and at test time. At training time, we have access to the ground-truth caption so we feed ground-truth words as input to the RNN at each timestep. At test time, we sample from the distribution over the vocabulary at each timestep, and feed the sample as input to the RNN at the next timestep.\n", + "\n", + "In the file `cs231n/classifiers/rnn.py`, implement the `sample` method for test-time sampling. After doing so, run the following to sample from your overfit model on both training and validation data. The samples on training data should be very good; the samples on validation data probably won't make sense." + ], + "cell_type": "markdown", + "metadata": {} + }, + { + "execution_count": null, + "cell_type": "code", + "source": [ + "for split in ['train', 'val']:\n", + " minibatch = sample_coco_minibatch(small_data, split=split, batch_size=2)\n", + " gt_captions, features, urls = minibatch\n", + " gt_captions = decode_captions(gt_captions, data['idx_to_word'])\n", + "\n", + " sample_captions = small_rnn_model.sample(features)\n", + " sample_captions = decode_captions(sample_captions, data['idx_to_word'])\n", + "\n", + " for gt_caption, sample_caption, url in zip(gt_captions, sample_captions, urls):\n", + " plt.imshow(image_from_url(url))\n", + " plt.title('%s\\n%s\\nGT:%s' % (split, sample_caption, gt_caption))\n", + " plt.axis('off')\n", + " plt.show()" + ], + "outputs": [], + "metadata": { + "scrolled": false, + "collapsed": false + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "name": "python2", + "language": "python" + }, + "language_info": { + "mimetype": "text/x-python", + "nbconvert_exporter": "python", + "name": "python", + "file_extension": ".py", + "version": "2.7.6", + "pygments_lexer": "ipython2", + "codemirror_mode": { + "version": 2, + "name": "ipython" + } + } + } +} \ No newline at end of file diff --git a/assignments2016/assignment3/collectSubmission.sh b/assignments2016/assignment3/collectSubmission.sh new file mode 100755 index 00000000..4e6b0b4c --- /dev/null +++ b/assignments2016/assignment3/collectSubmission.sh @@ -0,0 +1,2 @@ +rm -f assignment3.zip +zip -r assignment3.zip . 
-x "*.git" "*cs231n/datasets*" "*.ipynb_checkpoints*" "*README.md" "*collectSubmission.sh" "*requirements.txt" ".env/*" "*.pyc" "*cs231n/build/*" diff --git a/assignments2016/assignment3/cs231n/.gitignore b/assignments2016/assignment3/cs231n/.gitignore new file mode 100644 index 00000000..fbb42c24 --- /dev/null +++ b/assignments2016/assignment3/cs231n/.gitignore @@ -0,0 +1,3 @@ +build/* +im2col_cython.c +im2col_cython.so diff --git a/assignments2016/assignment3/cs231n/__init__.py b/assignments2016/assignment3/cs231n/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/assignments2016/assignment3/cs231n/captioning_solver.py b/assignments2016/assignment3/cs231n/captioning_solver.py new file mode 100644 index 00000000..6e3dddb1 --- /dev/null +++ b/assignments2016/assignment3/cs231n/captioning_solver.py @@ -0,0 +1,233 @@ +import numpy as np + +from cs231n import optim +from cs231n.coco_utils import sample_coco_minibatch + + +class CaptioningSolver(object): + """ + A CaptioningSolver encapsulates all the logic necessary for training + image captioning models. The CaptioningSolver performs stochastic gradient + descent using different update rules defined in optim.py. + + The solver accepts both training and validataion data and labels so it can + periodically check classification accuracy on both training and validation + data to watch out for overfitting. + + To train a model, you will first construct a CaptioningSolver instance, + passing the model, dataset, and various options (learning rate, batch size, + etc) to the constructor. You will then call the train() method to run the + optimization procedure and train the model. + + After the train() method returns, model.params will contain the parameters + that performed best on the validation set over the course of training. + In addition, the instance variable solver.loss_history will contain a list + of all losses encountered during training and the instance variables + solver.train_acc_history and solver.val_acc_history will be lists containing + the accuracies of the model on the training and validation set at each epoch. + + Example usage might look something like this: + + data = load_coco_data() + model = MyAwesomeModel(hidden_dim=100) + solver = CaptioningSolver(model, data, + update_rule='sgd', + optim_config={ + 'learning_rate': 1e-3, + }, + lr_decay=0.95, + num_epochs=10, batch_size=100, + print_every=100) + solver.train() + + + A CaptioningSolver works on a model object that must conform to the following + API: + + - model.params must be a dictionary mapping string parameter names to numpy + arrays containing parameter values. + + - model.loss(features, captions) must be a function that computes + training-time loss and gradients, with the following inputs and outputs: + + Inputs: + - features: Array giving a minibatch of features for images, of shape (N, D + - captions: Array of captions for those images, of shape (N, T) where + each element is in the range (0, V]. + + Returns: + - loss: Scalar giving the loss + - grads: Dictionary with the same keys as self.params mapping parameter + names to gradients of the loss with respect to those parameters. + """ + + def __init__(self, model, data, **kwargs): + """ + Construct a new CaptioningSolver instance. + + Required arguments: + - model: A model object conforming to the API described above + - data: A dictionary of training and validation data from load_coco_data + + Optional arguments: + - update_rule: A string giving the name of an update rule in optim.py. 
+ Default is 'sgd'. + - optim_config: A dictionary containing hyperparameters that will be + passed to the chosen update rule. Each update rule requires different + hyperparameters (see optim.py) but all update rules require a + 'learning_rate' parameter so that should always be present. + - lr_decay: A scalar for learning rate decay; after each epoch the learning + rate is multiplied by this value. + - batch_size: Size of minibatches used to compute loss and gradient during + training. + - num_epochs: The number of epochs to run for during training. + - print_every: Integer; training losses will be printed every print_every + iterations. + - verbose: Boolean; if set to false then no output will be printed during + training. + """ + self.model = model + self.data = data + + # Unpack keyword arguments + self.update_rule = kwargs.pop('update_rule', 'sgd') + self.optim_config = kwargs.pop('optim_config', {}) + self.lr_decay = kwargs.pop('lr_decay', 1.0) + self.batch_size = kwargs.pop('batch_size', 100) + self.num_epochs = kwargs.pop('num_epochs', 10) + + self.print_every = kwargs.pop('print_every', 10) + self.verbose = kwargs.pop('verbose', True) + + # Throw an error if there are extra keyword arguments + if len(kwargs) > 0: + extra = ', '.join('"%s"' % k for k in kwargs.keys()) + raise ValueError('Unrecognized arguments %s' % extra) + + # Make sure the update rule exists, then replace the string + # name with the actual function + if not hasattr(optim, self.update_rule): + raise ValueError('Invalid update_rule "%s"' % self.update_rule) + self.update_rule = getattr(optim, self.update_rule) + + self._reset() + + + def _reset(self): + """ + Set up some book-keeping variables for optimization. Don't call this + manually. + """ + # Set up some variables for book-keeping + self.epoch = 0 + self.best_val_acc = 0 + self.best_params = {} + self.loss_history = [] + self.train_acc_history = [] + self.val_acc_history = [] + + # Make a deep copy of the optim_config for each parameter + self.optim_configs = {} + for p in self.model.params: + d = {k: v for k, v in self.optim_config.iteritems()} + self.optim_configs[p] = d + + + def _step(self): + """ + Make a single gradient update. This is called by train() and should not + be called manually. + """ + # Make a minibatch of training data + minibatch = sample_coco_minibatch(self.data, + batch_size=self.batch_size, + split='train') + captions, features, urls = minibatch + + # Compute loss and gradient + loss, grads = self.model.loss(features, captions) + self.loss_history.append(loss) + + # Perform a parameter update + for p, w in self.model.params.iteritems(): + dw = grads[p] + config = self.optim_configs[p] + next_w, next_config = self.update_rule(w, dw, config) + self.model.params[p] = next_w + self.optim_configs[p] = next_config + + + # TODO: This does nothing right now; maybe implement BLEU? + def check_accuracy(self, X, y, num_samples=None, batch_size=100): + """ + Check accuracy of the model on the provided data. + + Inputs: + - X: Array of data, of shape (N, d_1, ..., d_k) + - y: Array of labels, of shape (N,) + - num_samples: If not None, subsample the data and only test the model + on num_samples datapoints. + - batch_size: Split X and y into batches of this size to avoid using too + much memory. + + Returns: + - acc: Scalar giving the fraction of instances that were correctly + classified by the model. 
+ """ + return 0.0 + + # Maybe subsample the data + N = X.shape[0] + if num_samples is not None and N > num_samples: + mask = np.random.choice(N, num_samples) + N = num_samples + X = X[mask] + y = y[mask] + + # Compute predictions in batches + num_batches = N / batch_size + if N % batch_size != 0: + num_batches += 1 + y_pred = [] + for i in xrange(num_batches): + start = i * batch_size + end = (i + 1) * batch_size + scores = self.model.loss(X[start:end]) + y_pred.append(np.argmax(scores, axis=1)) + y_pred = np.hstack(y_pred) + acc = np.mean(y_pred == y) + + return acc + + + def train(self): + """ + Run optimization to train the model. + """ + num_train = self.data['train_captions'].shape[0] + iterations_per_epoch = max(num_train / self.batch_size, 1) + num_iterations = self.num_epochs * iterations_per_epoch + + for t in xrange(num_iterations): + self._step() + + # Maybe print training loss + if self.verbose and t % self.print_every == 0: + print '(Iteration %d / %d) loss: %f' % ( + t + 1, num_iterations, self.loss_history[-1]) + + # At the end of every epoch, increment the epoch counter and decay the + # learning rate. + epoch_end = (t + 1) % iterations_per_epoch == 0 + if epoch_end: + self.epoch += 1 + for k in self.optim_configs: + self.optim_configs[k]['learning_rate'] *= self.lr_decay + + # Check train and val accuracy on the first iteration, the last + # iteration, and at the end of each epoch. + # TODO: Implement some logic to check Bleu on validation set periodically + + # At the end of training swap the best params into the model + # self.model.params = self.best_params + diff --git a/assignments2016/assignment3/cs231n/classifiers/__init__.py b/assignments2016/assignment3/cs231n/classifiers/__init__.py new file mode 100644 index 00000000..e69de29b diff --git a/assignments2016/assignment3/cs231n/classifiers/pretrained_cnn.py b/assignments2016/assignment3/cs231n/classifiers/pretrained_cnn.py new file mode 100644 index 00000000..a12dce98 --- /dev/null +++ b/assignments2016/assignment3/cs231n/classifiers/pretrained_cnn.py @@ -0,0 +1,252 @@ +import numpy as np +import h5py + +from cs231n.layers import * +from cs231n.fast_layers import * +from cs231n.layer_utils import * + + +class PretrainedCNN(object): + def __init__(self, dtype=np.float32, num_classes=100, input_size=64, h5_file=None): + self.dtype = dtype + self.conv_params = [] + self.input_size = input_size + self.num_classes = num_classes + + # TODO: In the future it would be nice if the architecture could be loaded from + # the HDF5 file rather than being hardcoded. For now this will have to do. 
+    self.conv_params.append({'stride': 2, 'pad': 2})
+    self.conv_params.append({'stride': 1, 'pad': 1})
+    self.conv_params.append({'stride': 2, 'pad': 1})
+    self.conv_params.append({'stride': 1, 'pad': 1})
+    self.conv_params.append({'stride': 2, 'pad': 1})
+    self.conv_params.append({'stride': 1, 'pad': 1})
+    self.conv_params.append({'stride': 2, 'pad': 1})
+    self.conv_params.append({'stride': 1, 'pad': 1})
+    self.conv_params.append({'stride': 2, 'pad': 1})
+
+    self.filter_sizes = [5, 3, 3, 3, 3, 3, 3, 3, 3]
+    self.num_filters = [64, 64, 128, 128, 256, 256, 512, 512, 1024]
+    hidden_dim = 512
+
+    self.bn_params = []
+
+    cur_size = input_size
+    prev_dim = 3
+    self.params = {}
+    for i, (f, next_dim) in enumerate(zip(self.filter_sizes, self.num_filters)):
+      fan_in = f * f * prev_dim
+      self.params['W%d' % (i + 1)] = np.sqrt(2.0 / fan_in) * np.random.randn(next_dim, prev_dim, f, f)
+      self.params['b%d' % (i + 1)] = np.zeros(next_dim)
+      self.params['gamma%d' % (i + 1)] = np.ones(next_dim)
+      self.params['beta%d' % (i + 1)] = np.zeros(next_dim)
+      self.bn_params.append({'mode': 'train'})
+      prev_dim = next_dim
+      if self.conv_params[i]['stride'] == 2: cur_size /= 2
+
+    # Add the fully-connected layers
+    fan_in = cur_size * cur_size * self.num_filters[-1]
+    self.params['W%d' % (i + 2)] = np.sqrt(2.0 / fan_in) * np.random.randn(fan_in, hidden_dim)
+    self.params['b%d' % (i + 2)] = np.zeros(hidden_dim)
+    self.params['gamma%d' % (i + 2)] = np.ones(hidden_dim)
+    self.params['beta%d' % (i + 2)] = np.zeros(hidden_dim)
+    self.bn_params.append({'mode': 'train'})
+    self.params['W%d' % (i + 3)] = np.sqrt(2.0 / hidden_dim) * np.random.randn(hidden_dim, num_classes)
+    self.params['b%d' % (i + 3)] = np.zeros(num_classes)
+
+    for k, v in self.params.iteritems():
+      self.params[k] = v.astype(dtype)
+
+    if h5_file is not None:
+      self.load_weights(h5_file)
+
+
+  def load_weights(self, h5_file, verbose=False):
+    """
+    Load pretrained weights from an HDF5 file.
+
+    Inputs:
+    - h5_file: Path to the HDF5 file where pretrained weights are stored.
+    - verbose: Whether to print debugging info
+    """
+
+    # Before loading weights we need to make a dummy forward pass to initialize
+    # the running averages in the bn_params
+    x = np.random.randn(1, 3, self.input_size, self.input_size)
+    y = np.random.randint(self.num_classes, size=1)
+    loss, grads = self.loss(x, y)
+
+    with h5py.File(h5_file, 'r') as f:
+      for k, v in f.iteritems():
+        v = np.asarray(v)
+        if k in self.params:
+          if verbose: print k, v.shape, self.params[k].shape
+          if v.shape == self.params[k].shape:
+            self.params[k] = v.copy()
+          elif v.T.shape == self.params[k].shape:
+            self.params[k] = v.T.copy()
+          else:
+            raise ValueError('shapes for %s do not match' % k)
+        if k.startswith('running_mean'):
+          i = int(k[12:]) - 1
+          assert self.bn_params[i]['running_mean'].shape == v.shape
+          self.bn_params[i]['running_mean'] = v.copy()
+          if verbose: print k, v.shape
+        if k.startswith('running_var'):
+          i = int(k[11:]) - 1
+          assert v.shape == self.bn_params[i]['running_var'].shape
+          self.bn_params[i]['running_var'] = v.copy()
+          if verbose: print k, v.shape
+
+    for k, v in self.params.iteritems():
+      self.params[k] = v.astype(self.dtype)
+
+
+  def forward(self, X, start=None, end=None, mode='test'):
+    """
+    Run part of the model forward, starting and ending at an arbitrary layer,
+    in either training mode or testing mode.
+
+    You can pass arbitrary input to the starting layer, and you will receive
+    output from the ending layer and a cache object that can be used to run
+    the model backward over the same set of layers.
+
+    For the purposes of this function, a "layer" is one of the following blocks:
+
+    [conv - spatial batchnorm - relu] (There are 9 of these)
+    [affine - batchnorm - relu] (There is one of these)
+    [affine] (There is one of these)
+
+    Inputs:
+    - X: The input to the starting layer. If start=0, then this should be an
+      array of shape (N, C, 64, 64).
+    - start: The index of the layer to start from. start=0 starts from the first
+      convolutional layer. Default is 0.
+    - end: The index of the layer to end at. end=10 ends at the last
+      fully-connected layer, returning class scores. Default is 10.
+    - mode: The mode to use, either 'test' or 'train'. We need this because
+      batch normalization behaves differently at training time and test time.
+
+    Returns:
+    - out: Output from the end layer.
+    - cache: A cache object that can be passed to the backward method to run the
+      network backward over the same range of layers.
+    """
+    X = X.astype(self.dtype)
+    if start is None: start = 0
+    if end is None: end = len(self.conv_params) + 1
+    layer_caches = []
+
+    prev_a = X
+    for i in xrange(start, end + 1):
+      i1 = i + 1
+      if 0 <= i < len(self.conv_params):
+        # This is a conv layer
+        w, b = self.params['W%d' % i1], self.params['b%d' % i1]
+        gamma, beta = self.params['gamma%d' % i1], self.params['beta%d' % i1]
+        conv_param = self.conv_params[i]
+        bn_param = self.bn_params[i]
+        bn_param['mode'] = mode
+
+        next_a, cache = conv_bn_relu_forward(prev_a, w, b, gamma, beta, conv_param, bn_param)
+      elif i == len(self.conv_params):
+        # This is the fully-connected hidden layer
+        w, b = self.params['W%d' % i1], self.params['b%d' % i1]
+        gamma, beta = self.params['gamma%d' % i1], self.params['beta%d' % i1]
+        bn_param = self.bn_params[i]
+        bn_param['mode'] = mode
+        next_a, cache = affine_bn_relu_forward(prev_a, w, b, gamma, beta, bn_param)
+      elif i == len(self.conv_params) + 1:
+        # This is the last fully-connected layer that produces scores
+        w, b = self.params['W%d' % i1], self.params['b%d' % i1]
+        next_a, cache = affine_forward(prev_a, w, b)
+      else:
+        raise ValueError('Invalid layer index %d' % i)
+
+      layer_caches.append(cache)
+      prev_a = next_a
+
+    out = prev_a
+    cache = (start, end, layer_caches)
+    return out, cache
+
+
+  def backward(self, dout, cache):
+    """
+    Run the model backward over a sequence of layers that were previously run
+    forward using the self.forward method.
+
+    Inputs:
+    - dout: Gradient with respect to the ending layer; this should have the same
+      shape as the out variable returned from the corresponding call to forward.
+    - cache: A cache object returned from self.forward.
+
+    Returns:
+    - dX: Gradient with respect to the start layer. This will have the same
+      shape as the input X passed to self.forward.
+    - grads: Gradient of all parameters in the layers. For example if you run
+      forward through two convolutional layers, then on the corresponding call
+      to backward grads will contain the gradients with respect to the weights,
+      biases, and spatial batchnorm parameters of those two convolutional
+      layers. The grads dictionary will therefore contain a subset of the keys
+      of self.params, and grads[k] and self.params[k] will have the same shape.
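+
+    Example usage (a hedged sketch: the layer index 5, the input X, and the
+    upstream gradient dfeats are placeholder assumptions, and the weights file
+    is the one fetched by get_pretrained_model.sh):
+
+      model = PretrainedCNN(h5_file='cs231n/datasets/pretrained_model.h5')
+      feats, cache = model.forward(X, start=0, end=5, mode='test')
+      dX, grads = model.backward(dfeats, cache)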
+ """ + start, end, layer_caches = cache + dnext_a = dout + grads = {} + for i in reversed(range(start, end + 1)): + i1 = i + 1 + if i == len(self.conv_params) + 1: + # This is the last fully-connected layer + dprev_a, dw, db = affine_backward(dnext_a, layer_caches.pop()) + grads['W%d' % i1] = dw + grads['b%d' % i1] = db + elif i == len(self.conv_params): + # This is the fully-connected hidden layer + temp = affine_bn_relu_backward(dnext_a, layer_caches.pop()) + dprev_a, dw, db, dgamma, dbeta = temp + grads['W%d' % i1] = dw + grads['b%d' % i1] = db + grads['gamma%d' % i1] = dgamma + grads['beta%d' % i1] = dbeta + elif 0 <= i < len(self.conv_params): + # This is a conv layer + temp = conv_bn_relu_backward(dnext_a, layer_caches.pop()) + dprev_a, dw, db, dgamma, dbeta = temp + grads['W%d' % i1] = dw + grads['b%d' % i1] = db + grads['gamma%d' % i1] = dgamma + grads['beta%d' % i1] = dbeta + else: + raise ValueError('Invalid layer index %d' % i) + dnext_a = dprev_a + + dX = dnext_a + return dX, grads + + + def loss(self, X, y=None): + """ + Classification loss used to train the network. + + Inputs: + - X: Array of data, of shape (N, 3, 64, 64) + - y: Array of labels, of shape (N,) + + If y is None, then run a test-time forward pass and return: + - scores: Array of shape (N, 100) giving class scores. + + If y is not None, then run a training-time forward and backward pass and + return a tuple of: + - loss: Scalar giving loss + - grads: Dictionary of gradients, with the same keys as self.params. + """ + # Note that we implement this by just caling self.forward and self.backward + mode = 'test' if y is None else 'train' + scores, cache = self.forward(X, mode=mode) + if mode == 'test': + return scores + loss, dscores = softmax_loss(scores, y) + dX, grads = self.backward(dscores, cache) + return loss, grads + diff --git a/assignments2016/assignment3/cs231n/classifiers/rnn.py b/assignments2016/assignment3/cs231n/classifiers/rnn.py new file mode 100644 index 00000000..d43bba80 --- /dev/null +++ b/assignments2016/assignment3/cs231n/classifiers/rnn.py @@ -0,0 +1,204 @@ +import numpy as np + +from cs231n.layers import * +from cs231n.rnn_layers import * + + +class CaptioningRNN(object): + """ + A CaptioningRNN produces captions from image features using a recurrent + neural network. + + The RNN receives input vectors of size D, has a vocab size of V, works on + sequences of length T, has an RNN hidden dimension of H, uses word vectors + of dimension W, and operates on minibatches of size N. + + Note that we don't use any regularization for the CaptioningRNN. + """ + + def __init__(self, word_to_idx, input_dim=512, wordvec_dim=128, + hidden_dim=128, cell_type='rnn', dtype=np.float32): + """ + Construct a new CaptioningRNN instance. + + Inputs: + - word_to_idx: A dictionary giving the vocabulary. It contains V entries, + and maps each string to a unique integer in the range [0, V). + - input_dim: Dimension D of input image feature vectors. + - wordvec_dim: Dimension W of word vectors. + - hidden_dim: Dimension H for the hidden state of the RNN. + - cell_type: What type of RNN to use; either 'rnn' or 'lstm'. + - dtype: numpy datatype to use; use float32 for training and float64 for + numeric gradient checking. 
+ """ + if cell_type not in {'rnn', 'lstm'}: + raise ValueError('Invalid cell_type "%s"' % cell_type) + + self.cell_type = cell_type + self.dtype = dtype + self.word_to_idx = word_to_idx + self.idx_to_word = {i: w for w, i in word_to_idx.iteritems()} + self.params = {} + + vocab_size = len(word_to_idx) + + self._null = word_to_idx[''] + self._start = word_to_idx.get('', None) + self._end = word_to_idx.get('', None) + + # Initialize word vectors + self.params['W_embed'] = np.random.randn(vocab_size, wordvec_dim) + self.params['W_embed'] /= 100 + + # Initialize CNN -> hidden state projection parameters + self.params['W_proj'] = np.random.randn(input_dim, hidden_dim) + self.params['W_proj'] /= np.sqrt(input_dim) + self.params['b_proj'] = np.zeros(hidden_dim) + + # Initialize parameters for the RNN + dim_mul = {'lstm': 4, 'rnn': 1}[cell_type] + self.params['Wx'] = np.random.randn(wordvec_dim, dim_mul * hidden_dim) + self.params['Wx'] /= np.sqrt(wordvec_dim) + self.params['Wh'] = np.random.randn(hidden_dim, dim_mul * hidden_dim) + self.params['Wh'] /= np.sqrt(hidden_dim) + self.params['b'] = np.zeros(dim_mul * hidden_dim) + + # Initialize output to vocab weights + self.params['W_vocab'] = np.random.randn(hidden_dim, vocab_size) + self.params['W_vocab'] /= np.sqrt(hidden_dim) + self.params['b_vocab'] = np.zeros(vocab_size) + + # Cast parameters to correct dtype + for k, v in self.params.iteritems(): + self.params[k] = v.astype(self.dtype) + + + def loss(self, features, captions): + """ + Compute training-time loss for the RNN. We input image features and + ground-truth captions for those images, and use an RNN (or LSTM) to compute + loss and gradients on all parameters. + + Inputs: + - features: Input image features, of shape (N, D) + - captions: Ground-truth captions; an integer array of shape (N, T) where + each element is in the range 0 <= y[i, t] < V + + Returns a tuple of: + - loss: Scalar loss + - grads: Dictionary of gradients parallel to self.params + """ + # Cut captions into two pieces: captions_in has everything but the last word + # and will be input to the RNN; captions_out has everything but the first + # word and this is what we will expect the RNN to generate. These are offset + # by one relative to each other because the RNN should produce word (t+1) + # after receiving word t. The first element of captions_in will be the START + # token, and the first element of captions_out will be the first word. + captions_in = captions[:, :-1] + captions_out = captions[:, 1:] + + # You'll need this + mask = (captions_out != self._null) + + # Weight and bias for the affine transform from image features to initial + # hidden state + W_proj, b_proj = self.params['W_proj'], self.params['b_proj'] + + # Word embedding matrix + W_embed = self.params['W_embed'] + + # Input-to-hidden, hidden-to-hidden, and biases for the RNN + Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b'] + + # Weight and bias for the hidden-to-vocab transformation. + W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab'] + + loss, grads = 0.0, {} + ############################################################################ + # TODO: Implement the forward and backward passes for the CaptioningRNN. # + # In the forward pass you will need to do the following: # + # (1) Use an affine transformation to compute the initial hidden state # + # from the image features. 
+    # (2) Use a word embedding layer to transform the words in captions_in    #
+    #     from indices to vectors, giving an array of shape (N, T, W).        #
+    # (3) Use either a vanilla RNN or LSTM (depending on self.cell_type) to   #
+    #     process the sequence of input word vectors and produce hidden state #
+    #     vectors for all timesteps, producing an array of shape (N, T, H).   #
+    # (4) Use a (temporal) affine transformation to compute scores over the   #
+    #     vocabulary at every timestep using the hidden states, giving an     #
+    #     array of shape (N, T, V).                                           #
+    # (5) Use (temporal) softmax to compute loss using captions_out, ignoring #
+    #     the points where the output word is <NULL> using the mask above.    #
+    #                                                                         #
+    # In the backward pass you will need to compute the gradient of the loss  #
+    # with respect to all model parameters. Use the loss and grads variables  #
+    # defined above to store loss and gradients; grads[k] should give the     #
+    # gradients for self.params[k].                                           #
+    ############################################################################
+    pass
+    ############################################################################
+    #                             END OF YOUR CODE                            #
+    ############################################################################
+
+    return loss, grads
+
+
+  def sample(self, features, max_length=30):
+    """
+    Run a test-time forward pass for the model, sampling captions for input
+    feature vectors.
+
+    At each timestep, we embed the current word, pass it and the previous hidden
+    state to the RNN to get the next hidden state, use the hidden state to get
+    scores for all vocab words, and choose the word with the highest score as
+    the next word. The initial hidden state is computed by applying an affine
+    transform to the input image features, and the initial word is the <START>
+    token.
+
+    For LSTMs you will also have to keep track of the cell state; in that case
+    the initial cell state should be zero.
+
+    Inputs:
+    - features: Array of input image features of shape (N, D).
+    - max_length: Maximum length T of generated captions.
+
+    Returns:
+    - captions: Array of shape (N, max_length) giving sampled captions,
+      where each element is an integer in the range [0, V). The first element
+      of captions should be the first sampled word, not the <START> token.
+    """
+    N = features.shape[0]
+    captions = self._null * np.ones((N, max_length), dtype=np.int32)
+
+    # Unpack parameters
+    W_proj, b_proj = self.params['W_proj'], self.params['b_proj']
+    W_embed = self.params['W_embed']
+    Wx, Wh, b = self.params['Wx'], self.params['Wh'], self.params['b']
+    W_vocab, b_vocab = self.params['W_vocab'], self.params['b_vocab']
+
+    ###########################################################################
+    # TODO: Implement test-time sampling for the model. You will need to     #
+    # initialize the hidden state of the RNN by applying the learned affine  #
+    # transform to the input image features. The first word that you feed to #
+    # the RNN should be the <START> token; its value is stored in the        #
+    # variable self._start. At each timestep you will need to:               #
+    # (1) Embed the previous word using the learned word embeddings          #
+    # (2) Make an RNN step using the previous hidden state and the embedded  #
+    #     current word to get the next hidden state.                         #
+    # (3) Apply the learned affine transformation to the next hidden state to#
+    #     get scores for all words in the vocabulary                         #
+    # (4) Select the word with the highest score as the next word, writing it#
+    #     to the appropriate slot in the captions variable                   #
+    #                                                                        #
+    # For simplicity, you do not need to stop generating after an <END> token#
+    # is sampled, but you can if you want to.                                #
+    #                                                                        #
+    # HINT: You will not be able to use the rnn_forward or lstm_forward      #
+    # functions; you'll need to call rnn_step_forward or lstm_step_forward in#
+    # a loop.                                                                #
+    ###########################################################################
+    pass
+    ############################################################################
+    #                             END OF YOUR CODE                            #
+    ############################################################################
+    return captions
diff --git a/assignments2016/assignment3/cs231n/coco_utils.py b/assignments2016/assignment3/cs231n/coco_utils.py
new file mode 100644
index 00000000..bc5f5793
--- /dev/null
+++ b/assignments2016/assignment3/cs231n/coco_utils.py
@@ -0,0 +1,84 @@
+import os, json
+import numpy as np
+import h5py
+
+
+def load_coco_data(base_dir='cs231n/datasets/coco_captioning',
+                   max_train=None,
+                   pca_features=True):
+  data = {}
+  caption_file = os.path.join(base_dir, 'coco2014_captions.h5')
+  with h5py.File(caption_file, 'r') as f:
+    for k, v in f.iteritems():
+      data[k] = np.asarray(v)
+
+  if pca_features:
+    train_feat_file = os.path.join(base_dir, 'train2014_vgg16_fc7_pca.h5')
+  else:
+    train_feat_file = os.path.join(base_dir, 'train2014_vgg16_fc7.h5')
+  with h5py.File(train_feat_file, 'r') as f:
+    data['train_features'] = np.asarray(f['features'])
+
+  if pca_features:
+    val_feat_file = os.path.join(base_dir, 'val2014_vgg16_fc7_pca.h5')
+  else:
+    val_feat_file = os.path.join(base_dir, 'val2014_vgg16_fc7.h5')
+  with h5py.File(val_feat_file, 'r') as f:
+    data['val_features'] = np.asarray(f['features'])
+
+  dict_file = os.path.join(base_dir, 'coco2014_vocab.json')
+  with open(dict_file, 'r') as f:
+    dict_data = json.load(f)
+    for k, v in dict_data.iteritems():
+      data[k] = v
+
+  train_url_file = os.path.join(base_dir, 'train2014_urls.txt')
+  with open(train_url_file, 'r') as f:
+    train_urls = np.asarray([line.strip() for line in f])
+  data['train_urls'] = train_urls
+
+  val_url_file = os.path.join(base_dir, 'val2014_urls.txt')
+  with open(val_url_file, 'r') as f:
+    val_urls = np.asarray([line.strip() for line in f])
+  data['val_urls'] = val_urls
+
+  # Maybe subsample the training data
+  if max_train is not None:
+    num_train = data['train_captions'].shape[0]
+    mask = np.random.randint(num_train, size=max_train)
+    data['train_captions'] = data['train_captions'][mask]
+    data['train_image_idxs'] = data['train_image_idxs'][mask]
+
+  return data
+
+
+def decode_captions(captions, idx_to_word):
+  singleton = False
+  if captions.ndim == 1:
+    singleton = True
+    captions = captions[None]
+  decoded = []
+  N, T = captions.shape
+  for i in xrange(N):
+    words = []
+    for t in xrange(T):
+      word = idx_to_word[captions[i, t]]
+      if word != '<NULL>':
+        words.append(word)
+      if word == '<END>':
+        break
+    decoded.append(' '.join(words))
+  if singleton:
+    decoded = decoded[0]
+  return decoded
+
+
+def sample_coco_minibatch(data, batch_size=100, split='train'):
+  split_size = data['%s_captions' % split].shape[0]
+  mask = np.random.choice(split_size, batch_size)
+  captions = data['%s_captions' % split][mask]
+  image_idxs = data['%s_image_idxs' % split][mask]
+  image_features = data['%s_features' % split][image_idxs]
urls = data['%s_urls' % split][image_idxs] + return captions, image_features, urls + diff --git a/assignments2016/assignment3/cs231n/data_utils.py b/assignments2016/assignment3/cs231n/data_utils.py new file mode 100644 index 00000000..0fca6f59 --- /dev/null +++ b/assignments2016/assignment3/cs231n/data_utils.py @@ -0,0 +1,219 @@ +import cPickle as pickle +import numpy as np +import os +from scipy.misc import imread + +def load_CIFAR_batch(filename): + """ load single batch of cifar """ + with open(filename, 'rb') as f: + datadict = pickle.load(f) + X = datadict['data'] + Y = datadict['labels'] + X = X.reshape(10000, 3, 32, 32).transpose(0,2,3,1).astype("float") + Y = np.array(Y) + return X, Y + +def load_CIFAR10(ROOT): + """ load all of cifar """ + xs = [] + ys = [] + for b in range(1,6): + f = os.path.join(ROOT, 'data_batch_%d' % (b, )) + X, Y = load_CIFAR_batch(f) + xs.append(X) + ys.append(Y) + Xtr = np.concatenate(xs) + Ytr = np.concatenate(ys) + del X, Y + Xte, Yte = load_CIFAR_batch(os.path.join(ROOT, 'test_batch')) + return Xtr, Ytr, Xte, Yte + + +def get_CIFAR10_data(num_training=49000, num_validation=1000, num_test=1000, + subtract_mean=True): + """ + Load the CIFAR-10 dataset from disk and perform preprocessing to prepare + it for classifiers. These are the same steps as we used for the SVM, but + condensed to a single function. + """ + # Load the raw CIFAR-10 data + cifar10_dir = 'cs231n/datasets/cifar-10-batches-py' + X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir) + + # Subsample the data + mask = range(num_training, num_training + num_validation) + X_val = X_train[mask] + y_val = y_train[mask] + mask = range(num_training) + X_train = X_train[mask] + y_train = y_train[mask] + mask = range(num_test) + X_test = X_test[mask] + y_test = y_test[mask] + + # Normalize the data: subtract the mean image + if subtract_mean: + mean_image = np.mean(X_train, axis=0) + X_train -= mean_image + X_val -= mean_image + X_test -= mean_image + + # Transpose so that channels come first + X_train = X_train.transpose(0, 3, 1, 2).copy() + X_val = X_val.transpose(0, 3, 1, 2).copy() + X_test = X_test.transpose(0, 3, 1, 2).copy() + + # Package data into a dictionary + return { + 'X_train': X_train, 'y_train': y_train, + 'X_val': X_val, 'y_val': y_val, + 'X_test': X_test, 'y_test': y_test, + } + + +def load_tiny_imagenet(path, dtype=np.float32, subtract_mean=True): + """ + Load TinyImageNet. Each of TinyImageNet-100-A, TinyImageNet-100-B, and + TinyImageNet-200 have the same directory structure, so this can be used + to load any of them. + + Inputs: + - path: String giving path to the directory to load. + - dtype: numpy datatype used to load the data. + - subtract_mean: Whether to subtract the mean training image. + + Returns: A dictionary with the following entries: + - class_names: A list where class_names[i] is a list of strings giving the + WordNet names for class i in the loaded dataset. + - X_train: (N_tr, 3, 64, 64) array of training images + - y_train: (N_tr,) array of training labels + - X_val: (N_val, 3, 64, 64) array of validation images + - y_val: (N_val,) array of validation labels + - X_test: (N_test, 3, 64, 64) array of testing images. + - y_test: (N_test,) array of test labels; if test labels are not available + (such as in student code) then y_test will be None. 
+  - mean_image: (3, 64, 64) array giving mean training image
+  """
+  # First load wnids
+  with open(os.path.join(path, 'wnids.txt'), 'r') as f:
+    wnids = [x.strip() for x in f]
+
+  # Map wnids to integer labels
+  wnid_to_label = {wnid: i for i, wnid in enumerate(wnids)}
+
+  # Use words.txt to get names for each class
+  with open(os.path.join(path, 'words.txt'), 'r') as f:
+    wnid_to_words = dict(line.split('\t') for line in f)
+    for wnid, words in wnid_to_words.iteritems():
+      wnid_to_words[wnid] = [w.strip() for w in words.split(',')]
+  class_names = [wnid_to_words[wnid] for wnid in wnids]
+
+  # Next load training data.
+  X_train = []
+  y_train = []
+  for i, wnid in enumerate(wnids):
+    if (i + 1) % 20 == 0:
+      print 'loading training data for synset %d / %d' % (i + 1, len(wnids))
+    # To figure out the filenames we need to open the boxes file
+    boxes_file = os.path.join(path, 'train', wnid, '%s_boxes.txt' % wnid)
+    with open(boxes_file, 'r') as f:
+      filenames = [x.split('\t')[0] for x in f]
+    num_images = len(filenames)
+
+    X_train_block = np.zeros((num_images, 3, 64, 64), dtype=dtype)
+    y_train_block = wnid_to_label[wnid] * np.ones(num_images, dtype=np.int64)
+    for j, img_file in enumerate(filenames):
+      img_file = os.path.join(path, 'train', wnid, 'images', img_file)
+      img = imread(img_file)
+      if img.ndim == 2:
+        # grayscale file
+        img.shape = (64, 64, 1)
+      X_train_block[j] = img.transpose(2, 0, 1)
+    X_train.append(X_train_block)
+    y_train.append(y_train_block)
+
+  # We need to concatenate all training data
+  X_train = np.concatenate(X_train, axis=0)
+  y_train = np.concatenate(y_train, axis=0)
+
+  # Next load validation data
+  with open(os.path.join(path, 'val', 'val_annotations.txt'), 'r') as f:
+    img_files = []
+    val_wnids = []
+    for line in f:
+      img_file, wnid = line.split('\t')[:2]
+      img_files.append(img_file)
+      val_wnids.append(wnid)
+    num_val = len(img_files)
+    y_val = np.array([wnid_to_label[wnid] for wnid in val_wnids])
+    X_val = np.zeros((num_val, 3, 64, 64), dtype=dtype)
+    for i, img_file in enumerate(img_files):
+      img_file = os.path.join(path, 'val', 'images', img_file)
+      img = imread(img_file)
+      if img.ndim == 2:
+        img.shape = (64, 64, 1)
+      X_val[i] = img.transpose(2, 0, 1)
+
+  # Next load test images
+  # Students won't have test labels, so we need to iterate over files in the
+  # images directory.
+  img_files = os.listdir(os.path.join(path, 'test', 'images'))
+  X_test = np.zeros((len(img_files), 3, 64, 64), dtype=dtype)
+  for i, img_file in enumerate(img_files):
+    img_file = os.path.join(path, 'test', 'images', img_file)
+    img = imread(img_file)
+    if img.ndim == 2:
+      img.shape = (64, 64, 1)
+    X_test[i] = img.transpose(2, 0, 1)
+
+  y_test = None
+  y_test_file = os.path.join(path, 'test', 'test_annotations.txt')
+  if os.path.isfile(y_test_file):
+    with open(y_test_file, 'r') as f:
+      img_file_to_wnid = {}
+      for line in f:
+        line = line.split('\t')
+        img_file_to_wnid[line[0]] = line[1]
+    y_test = [wnid_to_label[img_file_to_wnid[img_file]] for img_file in img_files]
+    y_test = np.array(y_test)
+
+  mean_image = X_train.mean(axis=0)
+  if subtract_mean:
+    X_train -= mean_image[None]
+    X_val -= mean_image[None]
+    X_test -= mean_image[None]
+
+  return {
+    'class_names': class_names,
+    'X_train': X_train,
+    'y_train': y_train,
+    'X_val': X_val,
+    'y_val': y_val,
+    'X_test': X_test,
+    'y_test': y_test,
+    'mean_image': mean_image,
+  }
+
+
+def load_models(models_dir):
+  """
+  Load saved models from disk.
This will attempt to unpickle all files in a + directory; any files that give errors on unpickling (such as README.txt) will + be skipped. + + Inputs: + - models_dir: String giving the path to a directory containing model files. + Each model file is a pickled dictionary with a 'model' field. + + Returns: + A dictionary mapping model file names to models. + """ + models = {} + for model_file in os.listdir(models_dir): + with open(os.path.join(models_dir, model_file), 'rb') as f: + try: + models[model_file] = pickle.load(f)['model'] + except pickle.UnpicklingError: + continue + return models diff --git a/assignments2016/assignment3/cs231n/datasets/get_coco_captioning.sh b/assignments2016/assignment3/cs231n/datasets/get_coco_captioning.sh new file mode 100755 index 00000000..683e34e4 --- /dev/null +++ b/assignments2016/assignment3/cs231n/datasets/get_coco_captioning.sh @@ -0,0 +1,3 @@ +wget "http://cs231n.stanford.edu/coco_captioning.zip" +unzip coco_captioning.zip +rm coco_captioning.zip diff --git a/assignments2016/assignment3/cs231n/datasets/get_pretrained_model.sh b/assignments2016/assignment3/cs231n/datasets/get_pretrained_model.sh new file mode 100755 index 00000000..d4a6ceb2 --- /dev/null +++ b/assignments2016/assignment3/cs231n/datasets/get_pretrained_model.sh @@ -0,0 +1 @@ +wget http://cs231n.stanford.edu/pretrained_model.h5 diff --git a/assignments2016/assignment3/cs231n/datasets/get_tiny_imagenet_a.sh b/assignments2016/assignment3/cs231n/datasets/get_tiny_imagenet_a.sh new file mode 100755 index 00000000..6d975605 --- /dev/null +++ b/assignments2016/assignment3/cs231n/datasets/get_tiny_imagenet_a.sh @@ -0,0 +1,3 @@ +wget http://cs231n.stanford.edu/tiny-imagenet-100-A.zip +unzip tiny-imagenet-100-A.zip +rm tiny-imagenet-100-A.zip diff --git a/assignments2016/assignment3/cs231n/fast_layers.py b/assignments2016/assignment3/cs231n/fast_layers.py new file mode 100644 index 00000000..ea0ce0bc --- /dev/null +++ b/assignments2016/assignment3/cs231n/fast_layers.py @@ -0,0 +1,270 @@ +import numpy as np +try: + from cs231n.im2col_cython import col2im_cython, im2col_cython + from cs231n.im2col_cython import col2im_6d_cython +except ImportError: + print 'run the following from the cs231n directory and try again:' + print 'python setup.py build_ext --inplace' + print 'You may also need to restart your iPython kernel' + +from cs231n.im2col import * + + +def conv_forward_im2col(x, w, b, conv_param): + """ + A fast implementation of the forward pass for a convolutional layer + based on im2col and col2im. 
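+
+  As a rough illustration of the idea (a sketch, not literal code from this
+  file): for x of shape (N, C, H, W) = (1, 1, 4, 4) with 2x2 filters, stride 2
+  and no padding, im2col lays out each receptive field as one column, and the
+  convolution becomes a single matrix multiply:
+
+    x_cols = im2col_indices(x, 2, 2, padding=0, stride=2)  # (C*2*2, N*2*2) = (4, 4)
+    res = w.reshape(w.shape[0], -1).dot(x_cols) + b.reshape(-1, 1)  # (F, 4)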
+ """ + N, C, H, W = x.shape + num_filters, _, filter_height, filter_width = w.shape + stride, pad = conv_param['stride'], conv_param['pad'] + + # Check dimensions + assert (W + 2 * pad - filter_width) % stride == 0, 'width does not work' + assert (H + 2 * pad - filter_height) % stride == 0, 'height does not work' + + # Create output + out_height = (H + 2 * pad - filter_height) / stride + 1 + out_width = (W + 2 * pad - filter_width) / stride + 1 + out = np.zeros((N, num_filters, out_height, out_width), dtype=x.dtype) + + # x_cols = im2col_indices(x, w.shape[2], w.shape[3], pad, stride) + x_cols = im2col_cython(x, w.shape[2], w.shape[3], pad, stride) + res = w.reshape((w.shape[0], -1)).dot(x_cols) + b.reshape(-1, 1) + + out = res.reshape(w.shape[0], out.shape[2], out.shape[3], x.shape[0]) + out = out.transpose(3, 0, 1, 2) + + cache = (x, w, b, conv_param, x_cols) + return out, cache + + +def conv_forward_strides(x, w, b, conv_param): + N, C, H, W = x.shape + F, _, HH, WW = w.shape + stride, pad = conv_param['stride'], conv_param['pad'] + + # Check dimensions + #assert (W + 2 * pad - WW) % stride == 0, 'width does not work' + #assert (H + 2 * pad - HH) % stride == 0, 'height does not work' + + # Pad the input + p = pad + x_padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), mode='constant') + + # Figure out output dimensions + H += 2 * pad + W += 2 * pad + out_h = (H - HH) / stride + 1 + out_w = (W - WW) / stride + 1 + + # Perform an im2col operation by picking clever strides + shape = (C, HH, WW, N, out_h, out_w) + strides = (H * W, W, 1, C * H * W, stride * W, stride) + strides = x.itemsize * np.array(strides) + x_stride = np.lib.stride_tricks.as_strided(x_padded, + shape=shape, strides=strides) + x_cols = np.ascontiguousarray(x_stride) + x_cols.shape = (C * HH * WW, N * out_h * out_w) + + # Now all our convolutions are a big matrix multiply + res = w.reshape(F, -1).dot(x_cols) + b.reshape(-1, 1) + + # Reshape the output + res.shape = (F, N, out_h, out_w) + out = res.transpose(1, 0, 2, 3) + + # Be nice and return a contiguous array + # The old version of conv_forward_fast doesn't do this, so for a fair + # comparison we won't either + out = np.ascontiguousarray(out) + + cache = (x, w, b, conv_param, x_cols) + return out, cache + + +def conv_backward_strides(dout, cache): + x, w, b, conv_param, x_cols = cache + stride, pad = conv_param['stride'], conv_param['pad'] + + N, C, H, W = x.shape + F, _, HH, WW = w.shape + _, _, out_h, out_w = dout.shape + + db = np.sum(dout, axis=(0, 2, 3)) + + dout_reshaped = dout.transpose(1, 0, 2, 3).reshape(F, -1) + dw = dout_reshaped.dot(x_cols.T).reshape(w.shape) + + dx_cols = w.reshape(F, -1).T.dot(dout_reshaped) + dx_cols.shape = (C, HH, WW, N, out_h, out_w) + dx = col2im_6d_cython(dx_cols, N, C, H, W, HH, WW, pad, stride) + + return dx, dw, db + + +def conv_backward_im2col(dout, cache): + """ + A fast implementation of the backward pass for a convolutional layer + based on im2col and col2im. 
+ """ + x, w, b, conv_param, x_cols = cache + stride, pad = conv_param['stride'], conv_param['pad'] + + db = np.sum(dout, axis=(0, 2, 3)) + + num_filters, _, filter_height, filter_width = w.shape + dout_reshaped = dout.transpose(1, 2, 3, 0).reshape(num_filters, -1) + dw = dout_reshaped.dot(x_cols.T).reshape(w.shape) + + dx_cols = w.reshape(num_filters, -1).T.dot(dout_reshaped) + # dx = col2im_indices(dx_cols, x.shape, filter_height, filter_width, pad, stride) + dx = col2im_cython(dx_cols, x.shape[0], x.shape[1], x.shape[2], x.shape[3], + filter_height, filter_width, pad, stride) + + return dx, dw, db + + +conv_forward_fast = conv_forward_strides +conv_backward_fast = conv_backward_strides + + +def max_pool_forward_fast(x, pool_param): + """ + A fast implementation of the forward pass for a max pooling layer. + + This chooses between the reshape method and the im2col method. If the pooling + regions are square and tile the input image, then we can use the reshape + method which is very fast. Otherwise we fall back on the im2col method, which + is not much faster than the naive method. + """ + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + + same_size = pool_height == pool_width == stride + tiles = H % pool_height == 0 and W % pool_width == 0 + if same_size and tiles: + out, reshape_cache = max_pool_forward_reshape(x, pool_param) + cache = ('reshape', reshape_cache) + else: + out, im2col_cache = max_pool_forward_im2col(x, pool_param) + cache = ('im2col', im2col_cache) + return out, cache + + +def max_pool_backward_fast(dout, cache): + """ + A fast implementation of the backward pass for a max pooling layer. + + This switches between the reshape method an the im2col method depending on + which method was used to generate the cache. + """ + method, real_cache = cache + if method == 'reshape': + return max_pool_backward_reshape(dout, real_cache) + elif method == 'im2col': + return max_pool_backward_im2col(dout, real_cache) + else: + raise ValueError('Unrecognized method "%s"' % method) + + +def max_pool_forward_reshape(x, pool_param): + """ + A fast implementation of the forward pass for the max pooling layer that uses + some clever reshaping. + + This can only be used for square pooling regions that tile the input. + """ + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + assert pool_height == pool_width == stride, 'Invalid pool params' + assert H % pool_height == 0 + assert W % pool_height == 0 + x_reshaped = x.reshape(N, C, H / pool_height, pool_height, + W / pool_width, pool_width) + out = x_reshaped.max(axis=3).max(axis=4) + + cache = (x, x_reshaped, out) + return out, cache + + +def max_pool_backward_reshape(dout, cache): + """ + A fast implementation of the backward pass for the max pooling layer that + uses some clever broadcasting and reshaping. + + This can only be used if the forward pass was computed using + max_pool_forward_reshape. + + NOTE: If there are multiple argmaxes, this method will assign gradient to + ALL argmax elements of the input rather than picking one. In this case the + gradient will actually be incorrect. However this is unlikely to occur in + practice, so it shouldn't matter much. One possible solution is to split the + upstream gradient equally among all argmax elements; this should result in a + valid subgradient. 
You can make this happen by uncommenting the line below; + however this results in a significant performance penalty (about 40% slower) + and is unlikely to matter in practice so we don't do it. + """ + x, x_reshaped, out = cache + + dx_reshaped = np.zeros_like(x_reshaped) + out_newaxis = out[:, :, :, np.newaxis, :, np.newaxis] + mask = (x_reshaped == out_newaxis) + dout_newaxis = dout[:, :, :, np.newaxis, :, np.newaxis] + dout_broadcast, _ = np.broadcast_arrays(dout_newaxis, dx_reshaped) + dx_reshaped[mask] = dout_broadcast[mask] + dx_reshaped /= np.sum(mask, axis=(3, 5), keepdims=True) + dx = dx_reshaped.reshape(x.shape) + + return dx + + +def max_pool_forward_im2col(x, pool_param): + """ + An implementation of the forward pass for max pooling based on im2col. + + This isn't much faster than the naive version, so it should be avoided if + possible. + """ + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + + assert (H - pool_height) % stride == 0, 'Invalid height' + assert (W - pool_width) % stride == 0, 'Invalid width' + + out_height = (H - pool_height) / stride + 1 + out_width = (W - pool_width) / stride + 1 + + x_split = x.reshape(N * C, 1, H, W) + x_cols = im2col(x_split, pool_height, pool_width, padding=0, stride=stride) + x_cols_argmax = np.argmax(x_cols, axis=0) + x_cols_max = x_cols[x_cols_argmax, np.arange(x_cols.shape[1])] + out = x_cols_max.reshape(out_height, out_width, N, C).transpose(2, 3, 0, 1) + + cache = (x, x_cols, x_cols_argmax, pool_param) + return out, cache + + +def max_pool_backward_im2col(dout, cache): + """ + An implementation of the backward pass for max pooling based on im2col. + + This isn't much faster than the naive version, so it should be avoided if + possible. 
+ """ + x, x_cols, x_cols_argmax, pool_param = cache + N, C, H, W = x.shape + pool_height, pool_width = pool_param['pool_height'], pool_param['pool_width'] + stride = pool_param['stride'] + + dout_reshaped = dout.transpose(2, 3, 0, 1).flatten() + dx_cols = np.zeros_like(x_cols) + dx_cols[x_cols_argmax, np.arange(dx_cols.shape[1])] = dout_reshaped + dx = col2im_indices(dx_cols, (N * C, 1, H, W), pool_height, pool_width, + padding=0, stride=stride) + dx = dx.reshape(x.shape) + + return dx diff --git a/assignments2016/assignment3/cs231n/gradient_check.py b/assignments2016/assignment3/cs231n/gradient_check.py new file mode 100644 index 00000000..2d6b1f62 --- /dev/null +++ b/assignments2016/assignment3/cs231n/gradient_check.py @@ -0,0 +1,124 @@ +import numpy as np +from random import randrange + +def eval_numerical_gradient(f, x, verbose=True, h=0.00001): + """ + a naive implementation of numerical gradient of f at x + - f should be a function that takes a single argument + - x is the point (numpy array) to evaluate the gradient at + """ + + fx = f(x) # evaluate function value at original point + grad = np.zeros_like(x) + # iterate over all indexes in x + it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite']) + while not it.finished: + + # evaluate function at x+h + ix = it.multi_index + oldval = x[ix] + x[ix] = oldval + h # increment by h + fxph = f(x) # evalute f(x + h) + x[ix] = oldval - h + fxmh = f(x) # evaluate f(x - h) + x[ix] = oldval # restore + + # compute the partial derivative with centered formula + grad[ix] = (fxph - fxmh) / (2 * h) # the slope + if verbose: + print ix, grad[ix] + it.iternext() # step to next dimension + + return grad + + +def eval_numerical_gradient_array(f, x, df, h=1e-5): + """ + Evaluate a numeric gradient for a function that accepts a numpy + array and returns a numpy array. + """ + grad = np.zeros_like(x) + it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite']) + while not it.finished: + ix = it.multi_index + + oldval = x[ix] + x[ix] = oldval + h + pos = f(x).copy() + x[ix] = oldval - h + neg = f(x).copy() + x[ix] = oldval + + grad[ix] = np.sum((pos - neg) * df) / (2 * h) + it.iternext() + return grad + + +def eval_numerical_gradient_blobs(f, inputs, output, h=1e-5): + """ + Compute numeric gradients for a function that operates on input + and output blobs. + + We assume that f accepts several input blobs as arguments, followed by a blob + into which outputs will be written. For example, f might be called like this: + + f(x, w, out) + + where x and w are input Blobs, and the result of f will be written to out. 
+
+  Inputs:
+  - f: function
+  - inputs: tuple of input blobs
+  - output: output blob
+  - h: step size
+  """
+  numeric_diffs = []
+  for input_blob in inputs:
+    diff = np.zeros_like(input_blob.diffs)
+    it = np.nditer(input_blob.vals, flags=['multi_index'],
+                   op_flags=['readwrite'])
+    while not it.finished:
+      idx = it.multi_index
+      orig = input_blob.vals[idx]
+
+      input_blob.vals[idx] = orig + h
+      f(*(inputs + (output,)))
+      pos = np.copy(output.vals)
+      input_blob.vals[idx] = orig - h
+      f(*(inputs + (output,)))
+      neg = np.copy(output.vals)
+      input_blob.vals[idx] = orig
+
+      diff[idx] = np.sum((pos - neg) * output.diffs) / (2.0 * h)
+
+      it.iternext()
+    numeric_diffs.append(diff)
+  return numeric_diffs
+
+
+def eval_numerical_gradient_net(net, inputs, output, h=1e-5):
+  return eval_numerical_gradient_blobs(lambda *args: net.forward(),
+                inputs, output, h=h)
+
+
+def grad_check_sparse(f, x, analytic_grad, num_checks=10, h=1e-5):
+  """
+  Sample a few random elements and only return the numerical gradient
+  in these dimensions.
+  """
+
+  for i in xrange(num_checks):
+    ix = tuple([randrange(m) for m in x.shape])
+
+    oldval = x[ix]
+    x[ix] = oldval + h # increment by h
+    fxph = f(x) # evaluate f(x + h)
+    x[ix] = oldval - h # decrement by h
+    fxmh = f(x) # evaluate f(x - h)
+    x[ix] = oldval # reset
+
+    grad_numerical = (fxph - fxmh) / (2 * h)
+    grad_analytic = analytic_grad[ix]
+    rel_error = abs(grad_numerical - grad_analytic) / (abs(grad_numerical) + abs(grad_analytic))
+    print 'numerical: %f analytic: %f, relative error: %e' % (grad_numerical, grad_analytic, rel_error)
+
diff --git a/assignments2016/assignment3/cs231n/im2col.py b/assignments2016/assignment3/cs231n/im2col.py
new file mode 100644
index 00000000..1942eab6
--- /dev/null
+++ b/assignments2016/assignment3/cs231n/im2col.py
@@ -0,0 +1,55 @@
+import numpy as np
+
+
+def get_im2col_indices(x_shape, field_height, field_width, padding=1, stride=1):
+  # First figure out what the size of the output should be
+  N, C, H, W = x_shape
+  assert (H + 2 * padding - field_height) % stride == 0
+  assert (W + 2 * padding - field_width) % stride == 0
+  out_height = (H + 2 * padding - field_height) / stride + 1
+  out_width = (W + 2 * padding - field_width) / stride + 1
+
+  i0 = np.repeat(np.arange(field_height), field_width)
+  i0 = np.tile(i0, C)
+  i1 = stride * np.repeat(np.arange(out_height), out_width)
+  j0 = np.tile(np.arange(field_width), field_height * C)
+  j1 = stride * np.tile(np.arange(out_width), out_height)
+  i = i0.reshape(-1, 1) + i1.reshape(1, -1)
+  j = j0.reshape(-1, 1) + j1.reshape(1, -1)
+
+  k = np.repeat(np.arange(C), field_height * field_width).reshape(-1, 1)
+
+  return (k, i, j)
+
+
+def im2col_indices(x, field_height, field_width, padding=1, stride=1):
+  """ An implementation of im2col based on some fancy indexing """
+  # Zero-pad the input
+  p = padding
+  x_padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), mode='constant')
+
+  k, i, j = get_im2col_indices(x.shape, field_height, field_width, padding,
+                               stride)
+
+  cols = x_padded[:, k, i, j]
+  C = x.shape[1]
+  cols = cols.transpose(1, 2, 0).reshape(field_height * field_width * C, -1)
+  return cols
+
+
+def col2im_indices(cols, x_shape, field_height=3, field_width=3, padding=1,
+                   stride=1):
+  """ An implementation of col2im based on fancy indexing and np.add.at """
+  N, C, H, W = x_shape
+  H_padded, W_padded = H + 2 * padding, W + 2 * padding
+  x_padded = np.zeros((N, C, H_padded, W_padded), dtype=cols.dtype)
+  k, i, j = get_im2col_indices(x_shape, field_height, field_width, padding,
+                               stride)
+
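+  # Editorial note: neighboring receptive fields overlap, so a single pixel of
+  # x_padded can receive contributions from several columns; the unbuffered
+  # scatter-add via np.add.at below accumulates all of them instead of letting
+  # later writes overwrite earlier ones.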
cols_reshaped = cols.reshape(C * field_height * field_width, -1, N) + cols_reshaped = cols_reshaped.transpose(2, 0, 1) + np.add.at(x_padded, (slice(None), k, i, j), cols_reshaped) + if padding == 0: + return x_padded + return x_padded[:, :, padding:-padding, padding:-padding] + +pass diff --git a/assignments2016/assignment3/cs231n/im2col_cython.pyx b/assignments2016/assignment3/cs231n/im2col_cython.pyx new file mode 100644 index 00000000..d6e33c6f --- /dev/null +++ b/assignments2016/assignment3/cs231n/im2col_cython.pyx @@ -0,0 +1,121 @@ +import numpy as np +cimport numpy as np +cimport cython + +# DTYPE = np.float64 +# ctypedef np.float64_t DTYPE_t + +ctypedef fused DTYPE_t: + np.float32_t + np.float64_t + +def im2col_cython(np.ndarray[DTYPE_t, ndim=4] x, int field_height, + int field_width, int padding, int stride): + cdef int N = x.shape[0] + cdef int C = x.shape[1] + cdef int H = x.shape[2] + cdef int W = x.shape[3] + + cdef int HH = (H + 2 * padding - field_height) / stride + 1 + cdef int WW = (W + 2 * padding - field_width) / stride + 1 + + cdef int p = padding + cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.pad(x, + ((0, 0), (0, 0), (p, p), (p, p)), mode='constant') + + cdef np.ndarray[DTYPE_t, ndim=2] cols = np.zeros( + (C * field_height * field_width, N * HH * WW), + dtype=x.dtype) + + # Moving the inner loop to a C function with no bounds checking works, but does + # not seem to help performance in any measurable way. + + im2col_cython_inner(cols, x_padded, N, C, H, W, HH, WW, + field_height, field_width, padding, stride) + return cols + + +@cython.boundscheck(False) +cdef int im2col_cython_inner(np.ndarray[DTYPE_t, ndim=2] cols, + np.ndarray[DTYPE_t, ndim=4] x_padded, + int N, int C, int H, int W, int HH, int WW, + int field_height, int field_width, int padding, int stride) except? -1: + cdef int c, ii, jj, row, yy, xx, i, col + + for c in range(C): + for yy in range(HH): + for xx in range(WW): + for ii in range(field_height): + for jj in range(field_width): + row = c * field_width * field_height + ii * field_height + jj + for i in range(N): + col = yy * WW * N + xx * N + i + cols[row, col] = x_padded[i, c, stride * yy + ii, stride * xx + jj] + + + +def col2im_cython(np.ndarray[DTYPE_t, ndim=2] cols, int N, int C, int H, int W, + int field_height, int field_width, int padding, int stride): + cdef np.ndarray x = np.empty((N, C, H, W), dtype=cols.dtype) + cdef int HH = (H + 2 * padding - field_height) / stride + 1 + cdef int WW = (W + 2 * padding - field_width) / stride + 1 + cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.zeros((N, C, H + 2 * padding, W + 2 * padding), + dtype=cols.dtype) + + # Moving the inner loop to a C-function with no bounds checking improves + # performance quite a bit for col2im. + col2im_cython_inner(cols, x_padded, N, C, H, W, HH, WW, + field_height, field_width, padding, stride) + if padding > 0: + return x_padded[:, :, padding:-padding, padding:-padding] + return x_padded + + +@cython.boundscheck(False) +cdef int col2im_cython_inner(np.ndarray[DTYPE_t, ndim=2] cols, + np.ndarray[DTYPE_t, ndim=4] x_padded, + int N, int C, int H, int W, int HH, int WW, + int field_height, int field_width, int padding, int stride) except? 
-1: + cdef int c, ii, jj, row, yy, xx, i, col + + for c in range(C): + for ii in range(field_height): + for jj in range(field_width): + row = c * field_width * field_height + ii * field_height + jj + for yy in range(HH): + for xx in range(WW): + for i in range(N): + col = yy * WW * N + xx * N + i + x_padded[i, c, stride * yy + ii, stride * xx + jj] += cols[row, col] + + +@cython.boundscheck(False) +@cython.wraparound(False) +cdef col2im_6d_cython_inner(np.ndarray[DTYPE_t, ndim=6] cols, + np.ndarray[DTYPE_t, ndim=4] x_padded, + int N, int C, int H, int W, int HH, int WW, + int out_h, int out_w, int pad, int stride): + + cdef int c, hh, ww, n, h, w + for n in range(N): + for c in range(C): + for hh in range(HH): + for ww in range(WW): + for h in range(out_h): + for w in range(out_w): + x_padded[n, c, stride * h + hh, stride * w + ww] += cols[c, hh, ww, n, h, w] + + +def col2im_6d_cython(np.ndarray[DTYPE_t, ndim=6] cols, int N, int C, int H, int W, + int HH, int WW, int pad, int stride): + cdef np.ndarray x = np.empty((N, C, H, W), dtype=cols.dtype) + cdef int out_h = (H + 2 * pad - HH) / stride + 1 + cdef int out_w = (W + 2 * pad - WW) / stride + 1 + cdef np.ndarray[DTYPE_t, ndim=4] x_padded = np.zeros((N, C, H + 2 * pad, W + 2 * pad), + dtype=cols.dtype) + + col2im_6d_cython_inner(cols, x_padded, N, C, H, W, HH, WW, out_h, out_w, pad, stride) + + if pad > 0: + return x_padded[:, :, pad:-pad, pad:-pad] + return x_padded diff --git a/assignments2016/assignment3/cs231n/image_utils.py b/assignments2016/assignment3/cs231n/image_utils.py new file mode 100644 index 00000000..300ffb66 --- /dev/null +++ b/assignments2016/assignment3/cs231n/image_utils.py @@ -0,0 +1,98 @@ +import urllib2, os, tempfile + +import numpy as np +from scipy.misc import imread + +from cs231n.fast_layers import conv_forward_fast + + +""" +Utility functions used for viewing and processing images. +""" + + +def blur_image(X): + """ + A very gentle image blurring operation, to be used as a regularizer for image + generation. 
+
+  Inputs:
+  - X: Image data of shape (N, 3, H, W)
+
+  Returns:
+  - X_blur: Blurred version of X, of shape (N, 3, H, W)
+  """
+  w_blur = np.zeros((3, 3, 3, 3))
+  b_blur = np.zeros(3)
+  blur_param = {'stride': 1, 'pad': 1}
+  for i in xrange(3):
+    w_blur[i, i] = np.asarray([[1, 2, 1], [2, 188, 2], [1, 2, 1]], dtype=np.float32)
+  w_blur /= 200.0
+  return conv_forward_fast(X, w_blur, b_blur, blur_param)[0]
+
+
+def preprocess_image(img, mean_img, mean='image'):
+  """
+  Convert to float, transpose, and subtract mean pixel
+
+  Input:
+  - img: (H, W, 3)
+
+  Returns:
+  - (1, 3, H, W)
+  """
+  if mean == 'image':
+    mean = mean_img
+  elif mean == 'pixel':
+    mean = mean_img.mean(axis=(1, 2), keepdims=True)
+  elif mean == 'none':
+    mean = 0
+  else:
+    raise ValueError('mean must be image or pixel or none')
+  return img.astype(np.float32).transpose(2, 0, 1)[None] - mean
+
+
+def deprocess_image(img, mean_img, mean='image', renorm=False):
+  """
+  Add mean pixel, transpose, and convert to uint8
+
+  Input:
+  - (1, 3, H, W) or (3, H, W)
+
+  Returns:
+  - (H, W, 3)
+  """
+  if mean == 'image':
+    mean = mean_img
+  elif mean == 'pixel':
+    mean = mean_img.mean(axis=(1, 2), keepdims=True)
+  elif mean == 'none':
+    mean = 0
+  else:
+    raise ValueError('mean must be image or pixel or none')
+  if img.ndim == 3:
+    img = img[None]
+  img = (img + mean)[0].transpose(1, 2, 0)
+  if renorm:
+    low, high = img.min(), img.max()
+    img = 255.0 * (img - low) / (high - low)
+  return img.astype(np.uint8)
+
+
+def image_from_url(url):
+  """
+  Read an image from a URL. Returns a numpy array with the pixel data.
+  We write the image to a temporary file then read it back. Kinda gross.
+  """
+  try:
+    f = urllib2.urlopen(url)
+    _, fname = tempfile.mkstemp()
+    with open(fname, 'wb') as ff:
+      ff.write(f.read())
+    img = imread(fname)
+    os.remove(fname)
+    return img
+  # HTTPError is a subclass of URLError, so it must be caught first;
+  # otherwise the HTTPError branch would be unreachable.
+  except urllib2.HTTPError as e:
+    print 'HTTP Error: ', e.code, url
+  except urllib2.URLError as e:
+    print 'URL Error: ', e.reason, url
diff --git a/assignments2016/assignment3/cs231n/layer_utils.py b/assignments2016/assignment3/cs231n/layer_utils.py
new file mode 100644
index 00000000..0a04d333
--- /dev/null
+++ b/assignments2016/assignment3/cs231n/layer_utils.py
@@ -0,0 +1,141 @@
+from cs231n.layers import *
+from cs231n.fast_layers import *
+
+
+def affine_relu_forward(x, w, b):
+  """
+  Convenience layer that performs an affine transform followed by a ReLU
+
+  Inputs:
+  - x: Input to the affine layer
+  - w, b: Weights for the affine layer
+
+  Returns a tuple of:
+  - out: Output from the ReLU
+  - cache: Object to give to the backward pass
+  """
+  a, fc_cache = affine_forward(x, w, b)
+  out, relu_cache = relu_forward(a)
+  cache = (fc_cache, relu_cache)
+  return out, cache
+
+
+def affine_relu_backward(dout, cache):
+  """
+  Backward pass for the affine-relu convenience layer
+  """
+  fc_cache, relu_cache = cache
+  da = relu_backward(dout, relu_cache)
+  dx, dw, db = affine_backward(da, fc_cache)
+  return dx, dw, db
+
+
+def affine_bn_relu_forward(x, w, b, gamma, beta, bn_param):
+  """
+  Convenience layer that performs an affine transform, batch normalization,
+  and ReLU.
+
+  Inputs:
+  - x: Array of shape (N, D1); input to the affine layer
+  - w, b: Arrays of shape (D1, D2) and (D2,) giving the weight and bias for
+    the affine transform.
+  - gamma, beta: Arrays of shape (D2,) and (D2,) giving scale and shift
+    parameters for batch normalization.
+  - bn_param: Dictionary of parameters for batch normalization.
+ + Returns: + - out: Output from ReLU, of shape (N, D2) + - cache: Object to give to the backward pass. + """ + a, fc_cache = affine_forward(x, w, b) + a_bn, bn_cache = batchnorm_forward(a, gamma, beta, bn_param) + out, relu_cache = relu_forward(a_bn) + cache = (fc_cache, bn_cache, relu_cache) + return out, cache + + +def affine_bn_relu_backward(dout, cache): + """ + Backward pass for the affine-batchnorm-relu convenience layer. + """ + fc_cache, bn_cache, relu_cache = cache + da_bn = relu_backward(dout, relu_cache) + da, dgamma, dbeta = batchnorm_backward(da_bn, bn_cache) + dx, dw, db = affine_backward(da, fc_cache) + return dx, dw, db, dgamma, dbeta + + +def conv_relu_forward(x, w, b, conv_param): + """ + A convenience layer that performs a convolution followed by a ReLU. + + Inputs: + - x: Input to the convolutional layer + - w, b, conv_param: Weights and parameters for the convolutional layer + + Returns a tuple of: + - out: Output from the ReLU + - cache: Object to give to the backward pass + """ + a, conv_cache = conv_forward_fast(x, w, b, conv_param) + out, relu_cache = relu_forward(a) + cache = (conv_cache, relu_cache) + return out, cache + + +def conv_relu_backward(dout, cache): + """ + Backward pass for the conv-relu convenience layer. + """ + conv_cache, relu_cache = cache + da = relu_backward(dout, relu_cache) + dx, dw, db = conv_backward_fast(da, conv_cache) + return dx, dw, db + + +def conv_bn_relu_forward(x, w, b, gamma, beta, conv_param, bn_param): + a, conv_cache = conv_forward_fast(x, w, b, conv_param) + an, bn_cache = spatial_batchnorm_forward(a, gamma, beta, bn_param) + out, relu_cache = relu_forward(an) + cache = (conv_cache, bn_cache, relu_cache) + return out, cache + + +def conv_bn_relu_backward(dout, cache): + conv_cache, bn_cache, relu_cache = cache + dan = relu_backward(dout, relu_cache) + da, dgamma, dbeta = spatial_batchnorm_backward(dan, bn_cache) + dx, dw, db = conv_backward_fast(da, conv_cache) + return dx, dw, db, dgamma, dbeta + + +def conv_relu_pool_forward(x, w, b, conv_param, pool_param): + """ + Convenience layer that performs a convolution, a ReLU, and a pool. + + Inputs: + - x: Input to the convolutional layer + - w, b, conv_param: Weights and parameters for the convolutional layer + - pool_param: Parameters for the pooling layer + + Returns a tuple of: + - out: Output from the pooling layer + - cache: Object to give to the backward pass + """ + a, conv_cache = conv_forward_fast(x, w, b, conv_param) + s, relu_cache = relu_forward(a) + out, pool_cache = max_pool_forward_fast(s, pool_param) + cache = (conv_cache, relu_cache, pool_cache) + return out, cache + + +def conv_relu_pool_backward(dout, cache): + """ + Backward pass for the conv-relu-pool convenience layer + """ + conv_cache, relu_cache, pool_cache = cache + ds = max_pool_backward_fast(dout, pool_cache) + da = relu_backward(ds, relu_cache) + dx, dw, db = conv_backward_fast(da, conv_cache) + return dx, dw, db + diff --git a/assignments2016/assignment3/cs231n/layers.py b/assignments2016/assignment3/cs231n/layers.py new file mode 100644 index 00000000..9fc9b80f --- /dev/null +++ b/assignments2016/assignment3/cs231n/layers.py @@ -0,0 +1,302 @@ +import numpy as np + + +def affine_forward(x, w, b): + """ + Computes the forward pass for an affine (fully-connected) layer. + + The input x has shape (N, d_1, ..., d_k) where x[i] is the ith input. 
+ We multiply this against a weight matrix of shape (D, M) where + D = \prod_i d_i + + Inputs: + x - Input data, of shape (N, d_1, ..., d_k) + w - Weights, of shape (D, M) + b - Biases, of shape (M,) + + Returns a tuple of: + - out: output, of shape (N, M) + - cache: (x, w, b) + """ + out = x.reshape(x.shape[0], -1).dot(w) + b + cache = (x, w, b) + return out, cache + + +def affine_backward(dout, cache): + """ + Computes the backward pass for an affine layer. + + Inputs: + - dout: Upstream derivative, of shape (N, M) + - cache: Tuple of: + - x: Input data, of shape (N, d_1, ... d_k) + - w: Weights, of shape (D, M) + + Returns a tuple of: + - dx: Gradient with respect to x, of shape (N, d1, ..., d_k) + - dw: Gradient with respect to w, of shape (D, M) + - db: Gradient with respect to b, of shape (M,) + """ + x, w, b = cache + dx = dout.dot(w.T).reshape(x.shape) + dw = x.reshape(x.shape[0], -1).T.dot(dout) + db = np.sum(dout, axis=0) + return dx, dw, db + + +def relu_forward(x): + """ + Computes the forward pass for a layer of rectified linear units (ReLUs). + + Input: + - x: Inputs, of any shape + + Returns a tuple of: + - out: Output, of the same shape as x + - cache: x + """ + out = np.maximum(0, x) + cache = x + return out, cache + + +def relu_backward(dout, cache): + """ + Computes the backward pass for a layer of rectified linear units (ReLUs). + + Input: + - dout: Upstream derivatives, of any shape + - cache: Input x, of same shape as dout + + Returns: + - dx: Gradient with respect to x + """ + x = cache + dx = np.where(x > 0, dout, 0) + return dx + + +def batchnorm_forward(x, gamma, beta, bn_param): + """ + Forward pass for batch normalization. + + During training the sample mean and (uncorrected) sample variance are + computed from minibatch statistics and used to normalize the incoming data. + During training we also keep an exponentially decaying running mean of the mean + and variance of each feature, and these averages are used to normalize data + at test-time. + + At each timestep we update the running averages for mean and variance using + an exponential decay based on the momentum parameter: + + running_mean = momentum * running_mean + (1 - momentum) * sample_mean + running_var = momentum * running_var + (1 - momentum) * sample_var + + Note that the batch normalization paper suggests a different test-time + behavior: they compute sample mean and variance for each feature using a + large number of training images rather than using a running average. For + this implementation we have chosen to use running averages instead since + they do not require an additional estimation step; the torch7 implementation + of batch normalization also uses running averages. + + Input: + - x: Data of shape (N, D) + - gamma: Scale parameter of shape (D,) + - beta: Shift paremeter of shape (D,) + - bn_param: Dictionary with the following keys: + - mode: 'train' or 'test'; required + - eps: Constant for numeric stability + - momentum: Constant for running mean / variance. 
+ - running_mean: Array of shape (D,) giving running mean of features + - running_var Array of shape (D,) giving running variance of features + + Returns a tuple of: + - out: of shape (N, D) + - cache: A tuple of values needed in the backward pass + """ + mode = bn_param['mode'] + eps = bn_param.get('eps', 1e-5) + momentum = bn_param.get('momentum', 0.9) + + N, D = x.shape + running_mean = bn_param.get('running_mean', np.zeros(D, dtype=x.dtype)) + running_var = bn_param.get('running_var', np.zeros(D, dtype=x.dtype)) + + out, cache = None, None + if mode == 'train': + # Compute output + mu = x.mean(axis=0) + xc = x - mu + var = np.mean(xc ** 2, axis=0) + std = np.sqrt(var + eps) + xn = xc / std + out = gamma * xn + beta + + cache = (mode, x, gamma, xc, std, xn, out) + + # Update running average of mean + running_mean *= momentum + running_mean += (1 - momentum) * mu + + # Update running average of variance + running_var *= momentum + running_var += (1 - momentum) * var + elif mode == 'test': + # Using running mean and variance to normalize + std = np.sqrt(running_var + eps) + xn = (x - running_mean) / std + out = gamma * xn + beta + cache = (mode, x, xn, gamma, beta, std) + else: + raise ValueError('Invalid forward batchnorm mode "%s"' % mode) + + # Store the updated running means back into bn_param + bn_param['running_mean'] = running_mean + bn_param['running_var'] = running_var + + return out, cache + + +def batchnorm_backward(dout, cache): + """ + Backward pass for batch normalization. + + For this implementation, you should write out a computation graph for + batch normalization on paper and propagate gradients backward through + intermediate nodes. + + Inputs: + - dout: Upstream derivatives, of shape (N, D) + - cache: Variable of intermediates from batchnorm_forward. + + Returns a tuple of: + - dx: Gradient with respect to inputs x, of shape (N, D) + - dgamma: Gradient with respect to scale parameter gamma, of shape (D,) + - dbeta: Gradient with respect to shift parameter beta, of shape (D,) + """ + mode = cache[0] + if mode == 'train': + mode, x, gamma, xc, std, xn, out = cache + + N = x.shape[0] + dbeta = dout.sum(axis=0) + dgamma = np.sum(xn * dout, axis=0) + dxn = gamma * dout + dxc = dxn / std + dstd = -np.sum((dxn * xc) / (std * std), axis=0) + dvar = 0.5 * dstd / std + dxc += (2.0 / N) * xc * dvar + dmu = np.sum(dxc, axis=0) + dx = dxc - dmu / N + elif mode == 'test': + mode, x, xn, gamma, beta, std = cache + dbeta = dout.sum(axis=0) + dgamma = np.sum(xn * dout, axis=0) + dxn = gamma * dout + dx = dxn / std + else: + raise ValueError(mode) + + return dx, dgamma, dbeta + + +def spatial_batchnorm_forward(x, gamma, beta, bn_param): + """ + Computes the forward pass for spatial batch normalization. + + Inputs: + - x: Input data of shape (N, C, H, W) + - gamma: Scale parameter, of shape (C,) + - beta: Shift parameter, of shape (C,) + - bn_param: Dictionary with the following keys: + - mode: 'train' or 'test'; required + - eps: Constant for numeric stability + - momentum: Constant for running mean / variance. momentum=0 means that + old information is discarded completely at every time step, while + momentum=1 means that new information is never incorporated. The + default of momentum=0.9 should work well in most situations. 
+    - running_mean: Array of shape (C,) giving running mean of features
+    - running_var: Array of shape (C,) giving running variance of features
+
+  Returns a tuple of:
+  - out: Output data, of shape (N, C, H, W)
+  - cache: Values needed for the backward pass
+  """
+  N, C, H, W = x.shape
+  x_flat = x.transpose(0, 2, 3, 1).reshape(-1, C)
+  out_flat, cache = batchnorm_forward(x_flat, gamma, beta, bn_param)
+  out = out_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
+  return out, cache
+
+
+def spatial_batchnorm_backward(dout, cache):
+  """
+  Computes the backward pass for spatial batch normalization.
+
+  Inputs:
+  - dout: Upstream derivatives, of shape (N, C, H, W)
+  - cache: Values from the forward pass
+
+  Returns a tuple of:
+  - dx: Gradient with respect to inputs, of shape (N, C, H, W)
+  - dgamma: Gradient with respect to scale parameter, of shape (C,)
+  - dbeta: Gradient with respect to shift parameter, of shape (C,)
+  """
+  N, C, H, W = dout.shape
+  dout_flat = dout.transpose(0, 2, 3, 1).reshape(-1, C)
+  dx_flat, dgamma, dbeta = batchnorm_backward(dout_flat, cache)
+  dx = dx_flat.reshape(N, H, W, C).transpose(0, 3, 1, 2)
+  return dx, dgamma, dbeta
+
+
+def svm_loss(x, y):
+  """
+  Computes the loss and gradient for multiclass SVM classification.
+
+  Inputs:
+  - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
+    for the ith input.
+  - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
+    0 <= y[i] < C
+
+  Returns a tuple of:
+  - loss: Scalar giving the loss
+  - dx: Gradient of the loss with respect to x
+  """
+  N = x.shape[0]
+  correct_class_scores = x[np.arange(N), y]
+  margins = np.maximum(0, x - correct_class_scores[:, np.newaxis] + 1.0)
+  margins[np.arange(N), y] = 0
+  loss = np.sum(margins) / N
+  num_pos = np.sum(margins > 0, axis=1)
+  dx = np.zeros_like(x)
+  dx[margins > 0] = 1
+  dx[np.arange(N), y] -= num_pos
+  dx /= N
+  return loss, dx
+
+
+def softmax_loss(x, y):
+  """
+  Computes the loss and gradient for softmax classification.
+
+  Inputs:
+  - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
+    for the ith input.
+  - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
+    0 <= y[i] < C
+
+  Returns a tuple of:
+  - loss: Scalar giving the loss
+  - dx: Gradient of the loss with respect to x
+  """
+  probs = np.exp(x - np.max(x, axis=1, keepdims=True))
+  probs /= np.sum(probs, axis=1, keepdims=True)
+  N = x.shape[0]
+  loss = -np.sum(np.log(probs[np.arange(N), y])) / N
+  dx = probs.copy()
+  dx[np.arange(N), y] -= 1
+  dx /= N
+  return loss, dx
+
diff --git a/assignments2016/assignment3/cs231n/optim.py b/assignments2016/assignment3/cs231n/optim.py
new file mode 100644
index 00000000..210e716a
--- /dev/null
+++ b/assignments2016/assignment3/cs231n/optim.py
@@ -0,0 +1,85 @@
+import numpy as np
+
+"""
+This file implements various first-order update rules that are commonly used for
+training neural networks. Each update rule accepts current weights and the
+gradient of the loss with respect to those weights and produces the next set of
+weights. Each update rule has the same interface:
+
+def update(w, dw, config=None):
+
+Inputs:
+  - w: A numpy array giving the current weights.
+  - dw: A numpy array of the same shape as w giving the gradient of the
+    loss with respect to w.
+  - config: A dictionary containing hyperparameter values such as learning rate,
+    momentum, etc. If the update rule requires caching values over many
+    iterations, then config will also hold these cached values.
+ +Returns: + - next_w: The next point after the update. + - config: The config dictionary to be passed to the next iteration of the + update rule. + +NOTE: For most update rules, the default learning rate will probably not perform +well; however the default values of the other hyperparameters should work well +for a variety of different problems. + +For efficiency, update rules may perform in-place updates, mutating w and +setting next_w equal to w. +""" + + +def sgd(w, dw, config=None): + """ + Performs vanilla stochastic gradient descent. + + config format: + - learning_rate: Scalar learning rate. + """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-2) + + w -= config['learning_rate'] * dw + return w, config + + +def adam(x, dx, config=None): + """ + Uses the Adam update rule, which incorporates moving averages of both the + gradient and its square and a bias correction term. + + config format: + - learning_rate: Scalar learning rate. + - beta1: Decay rate for moving average of first moment of gradient. + - beta2: Decay rate for moving average of second moment of gradient. + - epsilon: Small scalar used for smoothing to avoid dividing by zero. + - m: Moving average of gradient. + - v: Moving average of squared gradient. + - t: Iteration number. + """ + if config is None: config = {} + config.setdefault('learning_rate', 1e-3) + config.setdefault('beta1', 0.9) + config.setdefault('beta2', 0.999) + config.setdefault('epsilon', 1e-8) + config.setdefault('m', np.zeros_like(x)) + config.setdefault('v', np.zeros_like(x)) + config.setdefault('t', 0) + + next_x = None + beta1, beta2, eps = config['beta1'], config['beta2'], config['epsilon'] + t, m, v = config['t'], config['m'], config['v'] + m = beta1 * m + (1 - beta1) * dx + v = beta2 * v + (1 - beta2) * (dx * dx) + t += 1 + alpha = config['learning_rate'] * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t) + x -= alpha * (m / (np.sqrt(v) + eps)) + config['t'] = t + config['m'] = m + config['v'] = v + next_x = x + + return next_x, config + + diff --git a/assignments2016/assignment3/cs231n/rnn_layers.py b/assignments2016/assignment3/cs231n/rnn_layers.py new file mode 100644 index 00000000..d2ce0fe0 --- /dev/null +++ b/assignments2016/assignment3/cs231n/rnn_layers.py @@ -0,0 +1,420 @@ +import numpy as np + + +""" +This file defines layer types that are commonly used for recurrent neural +networks. +""" + + +def rnn_step_forward(x, prev_h, Wx, Wh, b): + """ + Run the forward pass for a single timestep of a vanilla RNN that uses a tanh + activation function. + + The input data has dimension D, the hidden state has dimension H, and we use + a minibatch size of N. + + Inputs: + - x: Input data for this timestep, of shape (N, D). + - prev_h: Hidden state from previous timestep, of shape (N, H) + - Wx: Weight matrix for input-to-hidden connections, of shape (D, H) + - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H) + - b: Biases of shape (H,) + + Returns a tuple of: + - next_h: Next hidden state, of shape (N, H) + - cache: Tuple of values needed for the backward pass. + """ + next_h, cache = None, None + ############################################################################## + # TODO: Implement a single forward step for the vanilla RNN. Store the next # + # hidden state and any values you need for the backward pass in the next_h # + # and cache variables respectively. 
#
+  ##############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+  return next_h, cache
+
+
+def rnn_step_backward(dnext_h, cache):
+  """
+  Backward pass for a single timestep of a vanilla RNN.
+
+  Inputs:
+  - dnext_h: Gradient of loss with respect to next hidden state
+  - cache: Cache object from the forward pass
+
+  Returns a tuple of:
+  - dx: Gradients of input data, of shape (N, D)
+  - dprev_h: Gradients of previous hidden state, of shape (N, H)
+  - dWx: Gradients of input-to-hidden weights, of shape (D, H)
+  - dWh: Gradients of hidden-to-hidden weights, of shape (H, H)
+  - db: Gradients of bias vector, of shape (H,)
+  """
+  dx, dprev_h, dWx, dWh, db = None, None, None, None, None
+  ##############################################################################
+  # TODO: Implement the backward pass for a single step of a vanilla RNN.      #
+  #                                                                            #
+  # HINT: For the tanh function, you can compute the local derivative in terms #
+  # of the output value from tanh.                                             #
+  ##############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+  return dx, dprev_h, dWx, dWh, db
+
+
+def rnn_forward(x, h0, Wx, Wh, b):
+  """
+  Run a vanilla RNN forward on an entire sequence of data. We assume an input
+  sequence composed of T vectors, each of dimension D. The RNN uses a hidden
+  size of H, and we work over a minibatch containing N sequences. After running
+  the RNN forward, we return the hidden states for all timesteps.
+
+  Inputs:
+  - x: Input data for the entire timeseries, of shape (N, T, D).
+  - h0: Initial hidden state, of shape (N, H)
+  - Wx: Weight matrix for input-to-hidden connections, of shape (D, H)
+  - Wh: Weight matrix for hidden-to-hidden connections, of shape (H, H)
+  - b: Biases of shape (H,)
+
+  Returns a tuple of:
+  - h: Hidden states for the entire timeseries, of shape (N, T, H).
+  - cache: Values needed in the backward pass
+  """
+  h, cache = None, None
+  ##############################################################################
+  # TODO: Implement forward pass for a vanilla RNN running on a sequence of    #
+  # input data. You should use the rnn_step_forward function that you defined  #
+  # above.                                                                     #
+  ##############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+  return h, cache
+
+
+def rnn_backward(dh, cache):
+  """
+  Compute the backward pass for a vanilla RNN over an entire sequence of data.
+
+  Inputs:
+  - dh: Upstream gradients of all hidden states, of shape (N, T, H)
+  - cache: Values needed in the backward pass, as returned by rnn_forward
+
+  Returns a tuple of:
+  - dx: Gradient of inputs, of shape (N, T, D)
+  - dh0: Gradient of initial hidden state, of shape (N, H)
+  - dWx: Gradient of input-to-hidden weights, of shape (D, H)
+  - dWh: Gradient of hidden-to-hidden weights, of shape (H, H)
+  - db: Gradient of biases, of shape (H,)
+  """
+  dx, dh0, dWx, dWh, db = None, None, None, None, None
+  ##############################################################################
+  # TODO: Implement the backward pass for a vanilla RNN running an entire      #
+  # sequence of data. You should use the rnn_step_backward function that you   #
+  # defined above.                                                             #
+  ##############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+  return dx, dh0, dWx, dWh, db
+
+
+def word_embedding_forward(x, W):
+  """
+  Forward pass for word embeddings. We operate on minibatches of size N where
+  each sequence has length T. We assume a vocabulary of V words, assigning each
+  to a vector of dimension D.
+
+  Inputs:
+  - x: Integer array of shape (N, T) giving indices of words. Each element idx
+    of x must be in the range 0 <= idx < V.
+  - W: Weight matrix of shape (V, D) giving word vectors for all words.
+
+  Returns a tuple of:
+  - out: Array of shape (N, T, D) giving word vectors for all input words.
+  - cache: Values needed for the backward pass
+  """
+  out, cache = None, None
+  ##############################################################################
+  # TODO: Implement the forward pass for word embeddings.                      #
+  #                                                                            #
+  # HINT: This should be very simple.                                          #
+  ##############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+  return out, cache
+
+
+def word_embedding_backward(dout, cache):
+  """
+  Backward pass for word embeddings. We cannot back-propagate into the words
+  since they are integers, so we only return gradient for the word embedding
+  matrix.
+
+  HINT: Look up the function np.add.at
+
+  Inputs:
+  - dout: Upstream gradients of shape (N, T, D)
+  - cache: Values from the forward pass
+
+  Returns:
+  - dW: Gradient of word embedding matrix, of shape (V, D).
+  """
+  dW = None
+  ##############################################################################
+  # TODO: Implement the backward pass for word embeddings.                     #
+  #                                                                            #
+  # HINT: Look up the function np.add.at                                       #
+  ##############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+  return dW
+
+
+def sigmoid(x):
+  """
+  A numerically stable version of the logistic sigmoid function.
+  """
+  pos_mask = (x >= 0)
+  neg_mask = (x < 0)
+  z = np.zeros_like(x)
+  z[pos_mask] = np.exp(-x[pos_mask])
+  z[neg_mask] = np.exp(x[neg_mask])
+  top = np.ones_like(x)
+  top[neg_mask] = z[neg_mask]
+  return top / (1 + z)
+
+
+def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
+  """
+  Forward pass for a single timestep of an LSTM.
+
+  The input data has dimension D, the hidden state has dimension H, and we use
+  a minibatch size of N.
+
+  Inputs:
+  - x: Input data, of shape (N, D)
+  - prev_h: Previous hidden state, of shape (N, H)
+  - prev_c: Previous cell state, of shape (N, H)
+  - Wx: Input-to-hidden weights, of shape (D, 4H)
+  - Wh: Hidden-to-hidden weights, of shape (H, 4H)
+  - b: Biases, of shape (4H,)
+
+  Returns a tuple of:
+  - next_h: Next hidden state, of shape (N, H)
+  - next_c: Next cell state, of shape (N, H)
+  - cache: Tuple of values needed for backward pass.
+  """
+  next_h, next_c, cache = None, None, None
+  #############################################################################
+  # TODO: Implement the forward pass for a single timestep of an LSTM.        #
+  # You may want to use the numerically stable sigmoid implementation above.  #
+  #############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+
+  return next_h, next_c, cache
+
+
+def lstm_step_backward(dnext_h, dnext_c, cache):
+  """
+  Backward pass for a single timestep of an LSTM.
+
+  Inputs:
+  - dnext_h: Gradients of next hidden state, of shape (N, H)
+  - dnext_c: Gradients of next cell state, of shape (N, H)
+  - cache: Values from the forward pass
+
+  Returns a tuple of:
+  - dx: Gradient of input data, of shape (N, D)
+  - dprev_h: Gradient of previous hidden state, of shape (N, H)
+  - dprev_c: Gradient of previous cell state, of shape (N, H)
+  - dWx: Gradient of input-to-hidden weights, of shape (D, 4H)
+  - dWh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
+  - db: Gradient of biases, of shape (4H,)
+  """
+  dx, dprev_h, dprev_c, dWx, dWh, db = None, None, None, None, None, None
+  #############################################################################
+  # TODO: Implement the backward pass for a single timestep of an LSTM.       #
+  #                                                                           #
+  # HINT: For sigmoid and tanh you can compute local derivatives in terms of  #
+  # the output value from the nonlinearity.                                   #
+  #############################################################################
+  pass
+  ##############################################################################
+  #                               END OF YOUR CODE                             #
+  ##############################################################################
+
+  return dx, dprev_h, dprev_c, dWx, dWh, db
+
+
+def lstm_forward(x, h0, Wx, Wh, b):
+  """
+  Forward pass for an LSTM over an entire sequence of data. We assume an input
+  sequence composed of T vectors, each of dimension D. The LSTM uses a hidden
+  size of H, and we work over a minibatch containing N sequences. After running
+  the LSTM forward, we return the hidden states for all timesteps.
+
+  Note that no initial cell state is passed as input; the initial cell state
+  is set to zero. Also note that the cell state is not returned; it is an
+  internal variable to the LSTM and is not accessed from outside.
+
+  Inputs:
+  - x: Input data of shape (N, T, D)
+  - h0: Initial hidden state of shape (N, H)
+  - Wx: Weights for input-to-hidden connections, of shape (D, 4H)
+  - Wh: Weights for hidden-to-hidden connections, of shape (H, 4H)
+  - b: Biases of shape (4H,)
+
+  Returns a tuple of:
+  - h: Hidden states for all timesteps of all sequences, of shape (N, T, H)
+  - cache: Values needed for the backward pass.
+  """
+  h, cache = None, None
+  #############################################################################
+  # TODO: Implement the forward pass for an LSTM over an entire timeseries.
# + # You should use the lstm_step_forward function that you just defined. # + ############################################################################# + pass + ############################################################################## + # END OF YOUR CODE # + ############################################################################## + + return h, cache + + +def lstm_backward(dh, cache): + """ + Backward pass for an LSTM over an entire sequence of data.] + + Inputs: + - dh: Upstream gradients of hidden states, of shape (N, T, H) + - cache: Values from the forward pass + + Returns a tuple of: + - dx: Gradient of input data of shape (N, T, D) + - dh0: Gradient of initial hidden state of shape (N, H) + - dWx: Gradient of input-to-hidden weight matrix of shape (D, 4H) + - dWh: Gradient of hidden-to-hidden weight matrix of shape (H, 4H) + - db: Gradient of biases, of shape (4H,) + """ + dx, dh0, dWx, dWh, db = None, None, None, None, None + ############################################################################# + # TODO: Implement the backward pass for an LSTM over an entire timeseries. # + # You should use the lstm_step_backward function that you just defined. # + ############################################################################# + pass + ############################################################################## + # END OF YOUR CODE # + ############################################################################## + + return dx, dh0, dWx, dWh, db + + +def temporal_affine_forward(x, w, b): + """ + Forward pass for a temporal affine layer. The input is a set of D-dimensional + vectors arranged into a minibatch of N timeseries, each of length T. We use + an affine function to transform each of those vectors into a new vector of + dimension M. + + Inputs: + - x: Input data of shape (N, T, D) + - w: Weights of shape (D, M) + - b: Biases of shape (M,) + + Returns a tuple of: + - out: Output data of shape (N, T, M) + - cache: Values needed for the backward pass + """ + N, T, D = x.shape + M = b.shape[0] + out = x.reshape(N * T, D).dot(w).reshape(N, T, M) + b + cache = x, w, b, out + return out, cache + + +def temporal_affine_backward(dout, cache): + """ + Backward pass for temporal affine layer. + + Input: + - dout: Upstream gradients of shape (N, T, M) + - cache: Values from forward pass + + Returns a tuple of: + - dx: Gradient of input, of shape (N, T, D) + - dw: Gradient of weights, of shape (D, M) + - db: Gradient of biases, of shape (M,) + """ + x, w, b, out = cache + N, T, D = x.shape + M = b.shape[0] + + dx = dout.reshape(N * T, M).dot(w.T).reshape(N, T, D) + dw = dout.reshape(N * T, M).T.dot(x.reshape(N * T, D)).T + db = dout.sum(axis=(0, 1)) + + return dx, dw, db + + +def temporal_softmax_loss(x, y, mask, verbose=False): + """ + A temporal version of softmax loss for use in RNNs. We assume that we are + making predictions over a vocabulary of size V for each timestep of a + timeseries of length T, over a minibatch of size N. The input x gives scores + for all vocabulary elements at all timesteps, and y gives the indices of the + ground-truth element at each timestep. We use a cross-entropy loss at each + timestep, summing the loss over all timesteps and averaging across the + minibatch. + + As an additional complication, we may want to ignore the model output at some + timesteps, since sequences of different length may have been combined into a + minibatch and padded with NULL tokens. 
The optional mask argument tells us + which elements should contribute to the loss. + + Inputs: + - x: Input scores, of shape (N, T, V) + - y: Ground-truth indices, of shape (N, T) where each element is in the range + 0 <= y[i, t] < V + - mask: Boolean array of shape (N, T) where mask[i, t] tells whether or not + the scores at x[i, t] should contribute to the loss. + + Returns a tuple of: + - loss: Scalar giving loss + - dx: Gradient of loss with respect to scores x. + """ + + N, T, V = x.shape + + x_flat = x.reshape(N * T, V) + y_flat = y.reshape(N * T) + mask_flat = mask.reshape(N * T) + + probs = np.exp(x_flat - np.max(x_flat, axis=1, keepdims=True)) + probs /= np.sum(probs, axis=1, keepdims=True) + loss = -np.sum(mask_flat * np.log(probs[np.arange(N * T), y_flat])) / N + dx_flat = probs.copy() + dx_flat[np.arange(N * T), y_flat] -= 1 + dx_flat /= N + dx_flat *= mask_flat[:, None] + + if verbose: print 'dx_flat: ', dx_flat.shape + + dx = dx_flat.reshape(N, T, V) + + return loss, dx + diff --git a/assignments2016/assignment3/cs231n/setup.py b/assignments2016/assignment3/cs231n/setup.py new file mode 100644 index 00000000..9a2e6ca0 --- /dev/null +++ b/assignments2016/assignment3/cs231n/setup.py @@ -0,0 +1,14 @@ +from distutils.core import setup +from distutils.extension import Extension +from Cython.Build import cythonize +import numpy + +extensions = [ + Extension('im2col_cython', ['im2col_cython.pyx'], + include_dirs = [numpy.get_include()] + ), +] + +setup( + ext_modules = cythonize(extensions), +) diff --git a/assignments2016/assignment3/frameworkpython b/assignments2016/assignment3/frameworkpython new file mode 100755 index 00000000..a0fa5517 --- /dev/null +++ b/assignments2016/assignment3/frameworkpython @@ -0,0 +1,13 @@ +#!/bin/bash + +# what real Python executable to use +PYVER=2.7 +PATHTOPYTHON=/usr/local/bin/ +PYTHON=${PATHTOPYTHON}python${PYVER} + +# find the root of the virtualenv, it should be the parent of the dir this script is in +ENV=`$PYTHON -c "import os; print os.path.abspath(os.path.join(os.path.dirname(\"$0\"), '..'))"` + +# now run Python with the virtualenv set as Python's HOME +export PYTHONHOME=$ENV +exec $PYTHON "$@" diff --git a/assignments2016/assignment3/kitten.jpg b/assignments2016/assignment3/kitten.jpg new file mode 100644 index 00000000..e421ec1d Binary files /dev/null and b/assignments2016/assignment3/kitten.jpg differ diff --git a/assignments2016/assignment3/requirements.txt b/assignments2016/assignment3/requirements.txt new file mode 100644 index 00000000..3e6c302d --- /dev/null +++ b/assignments2016/assignment3/requirements.txt @@ -0,0 +1,46 @@ +Cython==0.23.4 +Jinja2==2.8 +MarkupSafe==0.23 +Pillow==3.0.0 +Pygments==2.0.2 +appnope==0.1.0 +argparse==1.2.1 +backports-abc==0.4 +backports.ssl-match-hostname==3.5.0.1 +certifi==2015.11.20.1 +cycler==0.9.0 +decorator==4.0.6 +functools32==3.2.3-2 +gnureadline==6.3.3 +ipykernel==4.2.2 +ipython==4.0.1 +ipython-genutils==0.1.0 +ipywidgets==4.1.1 +jsonschema==2.5.1 +jupyter==1.0.0 +jupyter-client==4.1.1 +jupyter-console==4.0.3 +jupyter-core==4.0.6 +matplotlib==1.5.0 +mistune==0.7.1 +nbconvert==4.1.0 +nbformat==4.0.1 +notebook==4.0.6 +numpy==1.10.4 +path.py==8.1.2 +pexpect==4.0.1 +pickleshare==0.5 +ptyprocess==0.5 +pyparsing==2.0.7 +python-dateutil==2.4.2 +pytz==2015.7 +pyzmq==15.1.0 +qtconsole==4.1.1 +scipy==0.16.1 +simplegeneric==0.8.1 +singledispatch==3.4.0.3 +six==1.10.0 +terminado==0.5 +tornado==4.3 +traitlets==4.0.0 +wsgiref==0.1.2 diff --git a/assignments2016/assignment3/sky.jpg 
b/assignments2016/assignment3/sky.jpg
new file mode 100644
index 00000000..81fe60ab
Binary files /dev/null and b/assignments2016/assignment3/sky.jpg differ
diff --git a/assignments2016/assignment3/start_ipython_osx.sh b/assignments2016/assignment3/start_ipython_osx.sh
new file mode 100755
index 00000000..4815b001
--- /dev/null
+++ b/assignments2016/assignment3/start_ipython_osx.sh
@@ -0,0 +1,4 @@
+# Assume the virtualenv is called .env
+
+cp frameworkpython .env/bin
+.env/bin/frameworkpython -m IPython notebook
diff --git a/aws-tutorial.md b/aws-tutorial.md
index 15ab379f..ea753c8b 100644
--- a/aws-tutorial.md
+++ b/aws-tutorial.md
@@ -3,168 +3,130 @@ layout: page
 title: AWS Tutorial
 permalink: /aws-tutorial/
 ---
-For GPU instances, we also have an Amazon Machine Image (AMI) that you can use
-to launch GPU instances on Amazon EC2. This tutorial goes through how to set up
-your own EC2 instance with the provided AMI. **We do not currently
-distribute AWS credits to CS231N students but you are welcome to use this
-snapshot on your own budget.**
-
-**TL;DR** for the AWS-savvy: Our image is
-`cs231n_caffe_torch7_keras_lasagne_v2`, AMI ID: `ami-125b2c72` in the us-west-1
-region. Use a `g2.2xlarge` instance. Caffe, Torch7, Theano, Keras and Lasagne
-are pre-installed. Python bindings of caffe are available. It has CUDA 7.5 and
-CuDNN v3.
-
-First, if you don't have an AWS account already, create one by going to the [AWS
-homepage](http://aws.amazon.com/), and clicking on the yellow "Sign In to the
-Console" button. It will direct you to a signup page which looks like the
-following.
+
+GPU 인스턴스를 사용할 경우, 아마존 EC2에서 GPU 인스턴스를 띄울 수 있는 아마존 머신 이미지(AMI)가 있습니다. 이 튜토리얼은 제공된 AMI로 자신의 EC2 인스턴스를 설정하는 방법을 설명합니다. **현재 CS231N 학생들에게 AWS 크레딧을 제공하지는 않지만, 이 스냅샷은 여러분의 예산으로 자유롭게 사용하셔도 좋습니다.**
+
+**요약** AWS가 익숙한 분들: 사용할 이미지는 `cs231n_caffe_torch7_keras_lasagne_v2`이고, AMI ID는 `ami-125b2c72`, region은 US West (N. California)입니다. 인스턴스는 `g2.2xlarge`를 사용합니다. 이 이미지에는 Caffe, Torch7, Theano, Keras 그리고 Lasagne가 설치되어 있습니다. 그리고 Caffe의 Python binding을 사용할 수 있습니다. 생성한 인스턴스는 CUDA 7.5와 CuDNN v3를 포함하고 있습니다.
+
+첫째로, AWS 계정이 아직 없다면 [AWS 홈페이지](http://aws.amazon.com/)에 접속하여 "가입"이라고 적혀 있는 노란색 버튼을 눌러 계정을 생성합니다. 버튼을 누르면 아래 그림과 같은 가입 페이지가 나타납니다.
- +
-Select the "I am a new user" checkbox, click the "Sign in using our secure -server" button, and follow the subsequent pages to provide the required details. -They will ask for a credit card information, and also a phone verification, so -have your phone and credit card ready. +이메일 또는 휴대폰 번호를 입력하고 "새 사용자입니다."를 선택합니다, "보안서버를 사용하여 로그인"을 누르면 세부사항을 입력하는 페이지들이 나오게 됩니다. 이 과정에서 신용카드 정보입력과 핸드폰 인증절차를 진행하게 됩니다. 가입을 위해서 핸드폰과 신용카드를 준비해주세요. -Once you have signed up, go back to the [AWS homepage](http://aws.amazon.com), -click on "Sign In to the Console", and this time sign in using your username and -password. +가입을 완료했다면 [AWS 홈페이지](http://aws.amazon.com)로 돌아가 "콘솔에 로그인" 버튼을 클릭합니다. 그리고 이메일과 비밀번호를 입력해 로그인을 진행합니다.
- +
-Once you have signed in, you will be greeted by a page like this: +로그인을 완료했다면 다음과 같은 페이지가 여러분을 맞아줍니다.
- +
-Make sure that the region information on the top right is set to N. California.
-If it is not, change it to N. California by selecting from the dropdown menu
-there.
+오른쪽 상단의 region이 N. California로 설정되어 있는지 확인합니다. 만약 제대로 설정되어 있지 않다면 드롭다운 메뉴에서 N. California로 설정합니다.

-(Note that the subsequent steps requires your account to be "Verified" by
- Amazon. This may take up to 2 hrs, and you may not be able to launch instances
- until your account verification is complete.)
+(다음 단계로 진행하려면 여러분의 계정이 아마존으로부터 "인증"되어야 합니다. 인증에는 최대 2시간까지 걸릴 수 있으며, 인증이 완료되기 전까지는 인스턴스를 실행할 수 없을 수도 있습니다.)

-Next, click on the EC2 link (first link under the Compute category). You will go
-to a dashboard page like this:
+다음으로 EC2 링크(Compute 카테고리의 첫 번째 링크)를 클릭합니다. 그러면 다음과 같은 대시보드 페이지로 이동합니다.

<div class='fig figcenter fighighlight'>
- +
-Click the blue "Launch Instance" button, and you will be redirected to a page
-like the following:
+"Launch Instance"라고 적혀 있는 파란색 버튼을 클릭합니다. 그러면 다음과 같은 페이지로 이동하게 됩니다.

<div class='fig figcenter fighighlight'>
- +
-Click on the "Community AMIs" link on the left sidebar, and search for "cs231n"
-in the search box. You should be able to see the AMI
-`cs231n_caffe_torch7_keras_lasagne_v2` (AMI ID: `ami-125b2c72`). Select that
-AMI, and continue to the next step to choose your instance type.
+왼쪽의 사이드바 메뉴에서 "Community AMIs"를 클릭합니다. 그리고 검색창에 "cs231n"을 입력합니다. 검색 결과에 `cs231n_caffe_torch7_keras_lasagne_v2`(AMI ID: `ami-125b2c72`)가 나타납니다. 이 AMI를 선택하고 다음 단계에서 인스턴스 타입을 선택합니다.

<div class='fig figcenter fighighlight'>
- +
-Choose the instance type `g2.2xlarge`, and click on "Review and Launch".
+인스턴스 타입으로 `g2.2xlarge`를 선택하고 "Review and Launch"를 클릭합니다.

<div class='fig figcenter fighighlight'>
- +
-In the next page, click on Launch. +다음 화면에서 Launch를 클릭합니다.
- +
-You will be then prompted to create or use an existing key-pair. If you already
-use AWS and have a key-pair, you can use that, or alternately you can create a
-new one by choosing "Create a new key pair" from the drop-down menu and giving
-it some name of your choice. You should then download the key pair, and keep it
-somewhere that you won't accidentally delete. Remember that there is **NO WAY**
-to get to your instance if you lose your key.
+클릭하게 되면 기존에 사용하던 key-pair를 사용할 것인지 새로 key-pair를 만들 것인지 묻는 창이 뜨게 됩니다. 만약 AWS를 이미 사용하고 있다면 사용하던 key를 사용할 수 있습니다. 혹은 드롭다운 메뉴에서 "Create a new key pair"를 선택하여 새로 key를 생성할 수 있습니다. 그리고 생성한 key를 다운로드해야 합니다. 다운로드한 key를 실수로 삭제하지 않도록 각별한 주의를 기울여야 합니다. 만약 key를 잃어버릴 경우 인스턴스에 **접속할 수 없습니다.**

<div class='fig figcenter fighighlight'>
- +
- +
-Once you download your key, you should change the permissions of the key to
-user-only RW, In Linux/OSX you can do it by:
+key 다운로드가 완료되면 key의 권한을 user-only RW로 바꿉니다. Linux/OSX 사용자는 다음 명령어로 권한을 수정할 수 있습니다.

-```
+~~~
 $ chmod 600 PEM_FILENAME
-```
-Here `PEM_FILENAME` is the full file name of the .pem file you just downloaded.
+~~~
+
+여기서 `PEM_FILENAME`은 방금 전에 다운로드한 .pem 파일의 이름입니다.
+
+권한 수정을 마쳤다면 "Launch Instances"를 클릭합니다. 그러면 생성한 인스턴스가 지금 시작되고 있다는(Your instances are now launching) 메시지가 나타납니다.

-After this is done, click on "Launch Instances", and you should see a screen
-showing that your instances are launching:

<div class='fig figcenter fighighlight'>
- +
-Click on "View Instances" to see your instance state. It should change to
-"Running" and "2/2 status checks passed" as shown below within some time. You
-are now ready to ssh into the instance.
+"View Instances"를 클릭하여 인스턴스의 상태를 확인합니다. 잠시 후 아래 그림과 같이 "Running" 및 "2/2 status checks passed" 상태로 바뀝니다. 이제 ssh를 통해 생성한 인스턴스에 접속할 수 있습니다.

<div class='fig figcenter fighighlight'>
- +
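+
+(Editor's note) The console steps above can also be scripted. The sketch below is a minimal illustration only, assuming the `boto3` library is installed and your AWS credentials are configured; the key pair name `MY_KEY` is a hypothetical placeholder.
+
+~~~python
+import boto3
+
+# EC2 client in the region used by this tutorial (N. California = us-west-1)
+ec2 = boto3.client('ec2', region_name='us-west-1')
+
+# launch one g2.2xlarge instance from the course AMI described above
+resp = ec2.run_instances(
+    ImageId='ami-125b2c72',    # cs231n_caffe_torch7_keras_lasagne_v2
+    InstanceType='g2.2xlarge',
+    KeyName='MY_KEY',          # placeholder: use your own key pair name
+    MinCount=1,
+    MaxCount=1,
+)
+print resp['Instances'][0]['InstanceId']
+~~~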
-First, note down the Public IP of the instance from the instance listing. Then,
-do:
+먼저, 인스턴스 리스트에서 인스턴스의 Public IP를 기억해 둡니다. 그리고 다음을 진행합니다.

-```
+~~~
 ssh -i PEM_FILENAME ubuntu@PUBLIC_IP
-```
+~~~

-Now you should be logged in to the instance. You can check that Caffe is working
-by doing:
+이제 인스턴스에 로그인이 됩니다. 다음 명령어를 통해 Caffe가 작동 중인지 확인할 수 있습니다.

-```
+~~~
 $ cd caffe
 $ ./build/tools/caffe time --gpu 0 --model examples/mnist/lenet.prototxt
-```
+~~~

-We have Caffe, Theano, Torch7, Keras and Lasagne pre-installed. Caffe python
-bindings are also available by default. We have CUDA 7.5 and CuDNN v3 installed.
+생성한 인스턴스에는 Caffe, Theano, Torch7, Keras 그리고 Lasagne이 설치되어 있습니다. 또한 Caffe Python bindings를 기본적으로 사용할 수 있게 설정되어 있습니다. 그리고 인스턴스에는 CUDA 7.5와 CuDNN v3가 설치되어 있습니다.

-If you encounter any error such as
+만약 아래와 같은 에러가 발생한다면

-```
+~~~
 Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered
-```
-
-you might want to terminate your instance and start over again. I have observed
-this rarely, and I am not sure what causes this.
-
-About how to use these instances:
-
-- The root directory is only 12GB, and only ~ 3GB of that is free.
-- There should be a 60GB `/mnt` directory that you can use to put your data,
-model checkpoints, models etc.
-- Remember that the `/mnt` directory won't be persistent across
-reboots/terminations.
-- Stop your instances when are done for the day to avoid incurring charges. GPU
-instances are costly. Use your funds wisely. Terminate them when you are sure
-you are done with your instance (disk storage also costs something, and can be
-significant if you have a large disk footprint).
-- Look into creating custom alarms to automatically stop your instances when
-they are not doing anything.
-- If you need access to a large dataset and don't want to download it every time
-you spin up an instance, the best way to go would be to create an AMI for that
-and attach that AMI to your machine when configuring your instance (before
-launching but after you have selected the AMI).
+~~~
+
+생성한 인스턴스를 terminate하고 인스턴스 생성부터 다시 시작해야 합니다. 오류가 발생하는 정확한 이유는 알 수 없지만 이런 현상이 드물게 일어난다고 합니다.
+
+생성한 인스턴스를 사용하는 방법:
+
+- root directory는 총 12GB이며, 그중 ~3GB 정도만 여유 공간입니다.
+- 데이터, model checkpoint, model 등을 저장할 수 있는 60GB의 공간이 `/mnt`에 있습니다.
+- 인스턴스를 reboot/terminate 하면 `/mnt` 디렉토리의 자료는 소멸됩니다.
+- 추가 비용이 발생하지 않도록 그날의 작업이 끝나면 인스턴스를 stop해야 합니다. GPU 인스턴스는 사용료가 높으므로 예산을 현명하게 사용해야 합니다. 인스턴스를 더 이상 사용하지 않을 것이 확실하다면 terminate합니다. (디스크 공간 또한 과금이 되며, 큰 용량의 디스크를 사용한다면 과금이 많이 될 수 있습니다.)
+- 인스턴스가 아무 작업도 하지 않을 때 자동으로 stop되도록 custom alarm을 만드는 방법을 알아보는 것도 좋습니다.
+- 만약 큰 데이터셋에 접근해야 하고 인스턴스를 새로 시작할 때마다 그 데이터셋을 다운로드하고 싶지 않다면, 가장 좋은 방법은 해당 데이터셋을 담은 AMI를 만들어 두고 인스턴스를 설정할 때 그 AMI를 인스턴스에 연결하는 것입니다. (이 작업은 AMI를 선택한 후, 인스턴스를 실행(launch)하기 전에 설정해야 합니다.)
+
+---
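+
+(Editor's note) As a rough illustration of the "stop your instance when you are done" advice above, the same action can be scripted. A minimal sketch, assuming `boto3` is installed and AWS credentials are configured; the instance ID below is a hypothetical placeholder.
+
+~~~python
+import boto3
+
+ec2 = boto3.client('ec2', region_name='us-west-1')
+
+# stop (not terminate) the instance so compute charges end; the root
+# volume is kept, so the instance can be started again later
+ec2.stop_instances(InstanceIds=['i-0123456789abcdef0'])
+~~~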

+번역: 김우정 (gnujoow) +

diff --git a/captions/En/Lecture10_en.srt b/captions/En/Lecture10_en.srt
new file mode 100644
index 00000000..668f02dd
--- /dev/null
+++ b/captions/En/Lecture10_en.srt
@@ -0,0 +1,4776 @@
+1
+00:00:00,000 --> 00:00:04,129
+trust us
+
+2
+00:00:04,129 --> 00:00:12,109
+ok that works, ok good, we'll get started
+soon. So today we'll be talking about
+
+3
+00:00:12,109 --> 00:00:15,199
+recurrent neural networks, which is one
+of my favorite topics, one of my favorite
+
+4
+00:00:15,199 --> 00:00:18,960
+models to play with; you can plug them into
+neural networks just about everywhere, and they're a lot of
+
+5
+00:00:18,960 --> 00:00:23,009
+fun to play with. In terms of
+administrative items, recall that
+
+6
+00:00:23,009 --> 00:00:26,089
+your midterm is on Wednesday, this
+Wednesday. You can tell that I'm really
+
+7
+00:00:26,089 --> 00:00:32,738
+excited; I don't know if you guys are excited.
+Very exciting to me. Assignment 3 will
+
+8
+00:00:32,738 --> 00:00:37,979
+be out this Wednesday, so it
+will be out on Wednesday and it's due two
+
+9
+00:00:37,979 --> 00:00:40,429
+weeks from now on Monday, but since
+we're shifting it, I think, to
+
+10
+00:00:40,429 --> 00:00:43,399
+Wednesday - we planned to have released it
+today but we're going to be shipping it at
+
+11
+00:00:43,399 --> 00:00:47,129
+roughly Wednesday - so we'll probably
+shift the deadline by a few days. And
+
+12
+00:00:47,130 --> 00:00:51,179
+assignment 2, if I'm not mistaken, was due on
+Friday, so if you're using three late days then
+
+13
+00:00:51,179 --> 00:00:55,119
+you'd be handing it in today; hopefully
+not too many of you are doing that. Are
+
+14
+00:00:55,119 --> 00:01:01,089
+people done with assignment 2? Are many people
+done? Okay, most of you. Looking great, you're
+
+15
+00:01:01,090 --> 00:01:04,549
+doing well. So currently in the class
+we're talking about convolutional neural
+
+16
+00:01:04,549 --> 00:01:07,820
+networks. Last class specifically we
+looked at visualizing and understanding
+
+17
+00:01:07,819 --> 00:01:11,618
+convolutional neural networks, so we looked
+at a whole bunch of pretty pictures and
+
+18
+00:01:11,618 --> 00:01:14,938
+videos, and we had a lot of fun trying to
+interpret exactly what these convolutional
+
+19
+00:01:14,938 --> 00:01:17,828
+neural networks are doing, what they're
+learning, how they're working and so on,
+
+20
+00:01:17,828 --> 00:01:24,188
+and so we debugged this in several
+ways, as you may recall from the
+
+21
+00:01:24,188 --> 00:01:27,408
+lecture. Actually, over the weekend I
+stumbled upon some other visualizations
+
+22
+00:01:27,409 --> 00:01:32,569
+that are new. I found these on Twitter and
+they look really cool, and I'm not sure
+
+23
+00:01:32,569 --> 00:01:37,118
+how people made these, because
+there's not too much description to them,
+
+24
+00:01:37,118 --> 00:01:43,099
+but it looks like this is turtles, a tarantula,
+and then this is a chain and some kind
+
+25
+00:01:43,099 --> 00:01:47,468
+of a dog. And so the way you do this, I
+think, is something like they do this
+
+26
+00:01:47,468 --> 00:01:50,509
+optimization into images again, but
+they're using a different regularizer on
+
+27
+00:01:50,509 --> 00:01:53,679
+the image; in this case I think they're
+using a bilateral filter, which is this
+
+28
+00:01:53,679 --> 00:01:57,049
+kind of a fancy filter. So if you put
+that regularization on the image, my
+
+29
+00:01:57,049 --> 00:01:59,420
+impression is that these are the kinds
+of visualizations that you achieve
+
+30
+00:01:59,420 --> 00:02:03,659
+instead. So that looks pretty cool, but I
+am not sure exactly what's going on; I
+
+31
+00:02:03,659 --> 00:02:04,549
+guess we'll find out soon
+
+32
+00:02:04,549 --> 00:02:10,360
+ok, so today we're going to be talking
+about recurrent neural networks. What's
+
+33
+00:02:10,360 --> 00:02:13,520
+nice about recurrent neural networks is
+that they offer a lot of flexibility in
+
+34
+00:02:13,520 --> 00:02:15,870
+how to wire up your network
+architectures
+
+35
+00:02:15,870 --> 00:02:18,650
+normally when you work with neural networks,
+you're in the case on the very left here,
+
+36
+00:02:18,650 --> 00:02:22,849
+where you are given a fixed-size input vector
+here in red, then you process it with
+
+37
+00:02:22,848 --> 00:02:27,639
+some hidden layers in green, and then
+produce a fixed-size output vector. So an
+
+38
+00:02:27,639 --> 00:02:30,738
+image comes in, which is a fixed-size
+vector, and we're producing a fixed-
+
+39
+00:02:30,739 --> 00:02:34,469
+size vector which is the class scores.
+With recurrent neural networks we
+
+40
+00:02:34,469 --> 00:02:38,239
+can actually operate over sequences:
+sequences at the input, output, or both at
+
+41
+00:02:38,239 --> 00:02:41,319
+the same time. So for example in the case
+of image captioning, and we'll see some
+
+42
+00:02:41,318 --> 00:02:44,689
+of it today, you're given a fixed-size
+image and then through a recurrent
+
+43
+00:02:44,689 --> 00:02:47,829
+neural network we're going to produce a
+sequence of words that describe the
+
+44
+00:02:47,829 --> 00:02:52,560
+content of that image, so that's going to
+be a sentence that is the caption for
+
+45
+00:02:52,560 --> 00:02:55,969
+that image. In the case of sentiment
+classification in NLP, for example,
+
+46
+00:02:55,969 --> 00:02:59,759
+we're consuming a number of words in
+sequence and we'll try to classify
+
+47
+00:02:59,759 --> 00:03:03,828
+whether the sentiment of that sentence is
+positive or negative. In the case of
+
+48
+00:03:03,829 --> 00:03:07,590
+machine translation we can have a
+recurrent neural network that takes a
+
+49
+00:03:07,590 --> 00:03:12,069
+number of words in, say, English and is then
+asked to produce a number of words in
+
+50
+00:03:12,068 --> 00:03:17,119
+French as the translation. So we'd feed this into
+a recurrent neural network in what
+
+51
+00:03:17,120 --> 00:03:20,280
+we call a sequence-to-sequence kind of
+setup, and this recurrent network would
+
+52
+00:03:20,280 --> 00:03:25,169
+just perform translation of arbitrary
+sentences in English into French. And in
+
+53
+00:03:25,169 --> 00:03:28,000
+the last case, for example, we have video
+classification, where you might want to
+
+54
+00:03:28,000 --> 00:03:31,699
+imagine classifying every single frame
+of a video with some number of classes,
+
+55
+00:03:31,699 --> 00:03:35,429
+but crucially you don't want the
+prediction to be only a function of the
+
+56
+00:03:35,430 --> 00:03:38,739
+current time step, the current frame of
+the video, but also all the things that
+
+57
+00:03:38,739 --> 00:03:41,909
+have come before it in the video, and
+recurrent neural networks allow you to
+
+58
+00:03:41,909 --> 00:03:44,680
+wire up an architecture where the
+prediction at every single time step
+
+59
+00:03:44,680 --> 00:03:48,760
+is a function of all the frames that
+have come in up to that point. Now even
+
+60
+00:03:48,759 --> 00:03:52,388
+if you don't have sequences as input
+or output, you can still use recurrent
+
+61
+00:03:52,389 --> 00:03:55,250
+neural networks, even in the case on the
+very left, because you can process your
+
+62
+00:03:55,250 --> 00:04:01,560
+fixed-size inputs or outputs sequentially.
+For example, one of my favorite examples
+
+63
+00:04:01,560 --> 00:04:05,189
+of this is from people at DeepMind
+from a while ago, who were trying to
+
+64
+00:04:05,189 --> 00:04:09,750
+transcribe house numbers, and instead of
+just having this big image fed into a
+
+65
+00:04:09,750 --> 00:04:13,530
+convnet and trying to classify exactly what
+house numbers are in there, they came up
+
+66
+00:04:13,530 --> 00:04:16,649
+with a recurrent neural network policy
+where there's a small convnet and it
+
+67
+00:04:16,649 --> 00:04:19,779
+is steered around the image spatially with
+a recurrent neural network, and so the
+
+68
+00:04:19,779 --> 00:04:23,969
+recurrent network learned to basically
+read out house numbers from left to right
+
+69
+00:04:23,970 --> 00:04:26,870
+sequentially. And so we have a fixed-size image as
+input, but we're processing it
+
+70
+00:04:26,870 --> 00:04:32,019
+sequentially. Conversely - and this is
+also a well-known paper
+
+71
+00:04:32,019 --> 00:04:35,879
+called DRAW - this is a generative model. What you're
+seeing here are samples from the model,
+
+72
+00:04:35,879 --> 00:04:39,490
+where it's coming up with these digit
+samples, but crucially we're not just
+
+73
+00:04:39,490 --> 00:04:42,860
+predicting these digits in a single time step,
+but we have a recurrent network and we
+
+74
+00:04:42,860 --> 00:04:47,540
+think of the output as a canvas, and the
+network goes in and paints it over time, and
+
+75
+00:04:47,540 --> 00:04:50,200
+so you're giving yourself more of a chance to
+actually do some computation before you
+
+76
+00:04:50,199 --> 00:04:53,479
+actually produce your output; it's a
+more powerful kind of form of processing
+
+77
+00:04:53,480 --> 00:05:14,189
+data. There was a question about the specifics
+of exactly what this means - for now,
+
+78
+00:05:14,189 --> 00:05:19,310
+arrows just indicate functional
+dependence, so things are
+
+79
+00:05:19,310 --> 00:05:23,139
+a function of things before them, and we're
+going to see exactly what that looks like in
+
+80
+00:05:23,139 --> 00:05:37,168
+a bit. Okay, so these are generated house
+numbers: the network looked at a lot
+
+81
+00:05:37,168 --> 00:05:41,219
+of house numbers and came up with a way
+of painting them, and so these are not in
+
+82
+00:05:41,220 --> 00:05:44,830
+the training data; these are made-up
+numbers from the model. None of these are
+
+83
+00:05:44,829 --> 00:05:48,219
+actually in the training set; these are made
+up
+
+84
+00:05:48,220 --> 00:05:51,689
+yeah, they look quite real, but they're
+actually made up by the model
+
+85
+00:05:51,689 --> 00:05:55,809
+so a recurrent neural network is
+basically this box here in
+
+86
+00:05:55,809 --> 00:06:00,979
+green, and it has a state, and it
+basically receives inputs through time: at
+
+87
+00:06:00,978 --> 00:06:04,859
+every single time
+step we can feed an input vector into
+
+88
+00:06:04,860 --> 00:06:08,538
+the RNN, and it has some state
+internally, and then it can modify that
+
+89
+00:06:08,538 --> 00:06:12,988
+state as a function of what it
+receives at every single time step. And
+
+90
+00:06:12,988 --> 00:06:17,258
+there will of course be weights inside the RNN,
+and so when we tune those weights the
+
+91
+00:06:17,259 --> 00:06:20,829
+RNN will have different behavior in terms of
+how its state evolves as it receives
+
+92
+00:06:20,829 --> 00:06:25,769
+inputs. Usually we can also be
+interested in producing an output
+
+93
+00:06:25,769 --> 00:06:30,429
+based on the RNN's state, so we can produce
+these vectors on top of the RNN, but
+
+94
+00:06:30,428 --> 00:06:33,988
+so you'll see me show pictures like this,
+but I'd just like to note that the RNN
+
+95
+00:06:33,988 --> 00:06:36,688
+is really just the block in the middle
+
+96
+00:06:36,689 --> 00:06:39,489
+which has a state, and it can receive
+vectors over time, and then we can base
+
+97
+00:06:39,488 --> 00:06:44,838
+some predictions on top of its state in
+some applications. So concretely, the way
+
+98
+00:06:44,838 --> 00:06:50,610
+this will look is that the RNN has some
+kind of a state, which here I'm denoting
+
+99
+00:06:50,610 --> 00:06:55,399
+as a vector h, and this can also be a
+collection of vectors or just a more
+
+100
+00:06:55,399 --> 00:07:00,939
+general state, and we're going to update it
+as a function of the previous hidden
+
+101
+00:07:00,939 --> 00:07:05,769
+state h at time t minus one
+and the current input vector x_t, and this
+
+102
+00:07:05,769 --> 00:07:08,338
+is going to be done through a function
+which I'll call a recurrence function,
+
+103
+00:07:08,338 --> 00:07:13,728
+and that function will have parameters W,
+and so as we change those Ws we're
+
+104
+00:07:13,728 --> 00:07:16,228
+going to see that the RNN has different
+behaviors, and then of course we want
+
+105
+00:07:16,228 --> 00:07:19,338
+some specific behavior of the RNN, so
+we're going to be training those weights
+
+106
+00:07:19,338 --> 00:07:23,639
+on data; you'll see examples of that soon. For
+now I'd like to note that the same
+
+107
+00:07:23,639 --> 00:07:28,209
+function is used at every single time
+step, a fixed function with weights W,
+
+108
+00:07:28,209 --> 00:07:31,778
+and we apply that single function at
+every single time step, and that allows
+
+109
+00:07:31,778 --> 00:07:35,928
+us to use the recurrent network on
+sequences without having to commit to
+
+110
+00:07:35,928 --> 00:07:38,778
+the size of the sequence, because we
+apply the exact same function at every
+
+111
+00:07:38,778 --> 00:07:43,528
+single time step, no matter how long the
+input or output sequences are. So in the
+
+112
+00:07:43,528 --> 00:07:46,769
+specific case of a recurrent neural
+network, the
+
+113
+00:07:46,769 --> 00:07:50,309
+simplest way you can set this up, the
+simplest recurrence you can use, is what
+
+114
+00:07:50,309 --> 00:07:54,569
+I'll refer to as a vanilla RNN. In this case
+the state of the recurrent neural network is
+
+115
+00:07:54,569 --> 00:08:00,569
+just a single hidden vector h, and then we have a
+recurrence formula that basically tells you
+
+116
+00:08:00,569 --> 00:08:04,039
+how you should update your hidden state
+h as a function of the previous hidden
+
+117
+00:08:04,038 --> 00:08:04,688
+state
+
+118
+00:08:04,689 --> 00:08:08,369
+and the current input x_t, and in
+particular, in the simplest case, we're
+
+119
+00:08:08,369 --> 00:08:10,349
+going to have these weight matrices
+W_hh
+
+120
+00:08:10,348 --> 00:08:15,238
+and W_xh, and they're going to basically
+project both the hidden state from
+
+121
+00:08:15,238 --> 00:08:18,238
+the previous time step and the current
+input, and then those are going to add,
+
+122
+00:08:18,238 --> 00:08:21,978
+and then we squash them with a tanh,
+and that's how we update the state at
+
+123
+00:08:21,978 --> 00:08:26,199
+time t. So this recurrence is telling you
+how h will change as a function of its
+
+124
+00:08:26,199 --> 00:08:29,769
+history and also the current input at
+this time step.
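
The recurrence just described can be written down in a few lines of numpy; this is a minimal sketch with arbitrary illustrative sizes (the lecture's own min-char-rnn gist, walked through later, follows the same pattern):

~~~python
import numpy as np

# One vanilla RNN time step: h_t = tanh(Whh @ h_{t-1} + Wxh @ x_t),
# with a prediction y_t = Why @ h_t read off on top of the state.
hidden_size, input_size, output_size = 100, 4, 4   # illustrative sizes
rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.01, (hidden_size, input_size))   # input -> hidden
Whh = rng.normal(0, 0.01, (hidden_size, hidden_size))  # hidden -> hidden
Why = rng.normal(0, 0.01, (output_size, hidden_size))  # hidden -> output

def rnn_step(x, h_prev):
    """The same fixed function, applied at every single time step."""
    h = np.tanh(Whh @ h_prev + Wxh @ x)   # update the hidden state
    y = Why @ h                           # unnormalized log probabilities
    return h, y
~~~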
+
+125
+00:08:29,769 --> 00:08:34,129
+And then we can make predictions; we can base predictions on
+top of h, for example using just another
+
+126
+00:08:34,129 --> 00:08:37,528
+matrix projection on top of the hidden
+state. So this is the simplest complete
+
+127
+00:08:37,528 --> 00:08:42,288
+case in which you can wire up a neural
+network. So, to give you an example of
+
+128
+00:08:42,288 --> 00:08:46,639
+how this will work: right now I've just
+talked about x, h and y in abstract
+
+129
+00:08:46,639 --> 00:08:49,299
+terms, as vectors; we can
+actually endow these vectors with
+
+130
+00:08:49,299 --> 00:08:53,059
+semantics. And so one of the ways in
+which we can use a recurrent neural
+
+131
+00:08:53,059 --> 00:08:56,149
+network is in the case of character-
+level language models, and this is one of
+
+132
+00:08:56,149 --> 00:08:59,899
+my favorite ways of explaining RNNs,
+because it's intuitive and fun to look at.
+
+133
+00:08:59,899 --> 00:09:04,698
+So in this case we have character-level
+language models using RNNs, and the
+
+134
+00:09:04,698 --> 00:09:07,859
+way this will work is we will feed a
+sequence of characters into the
+
+135
+00:09:07,860 --> 00:09:10,899
+recurrent neural network, and at every
+single time step we'll ask the recurrent
+
+136
+00:09:10,899 --> 00:09:14,299
+neural network to predict the next
+character in the sequence; it will predict
+
+137
+00:09:14,299 --> 00:09:16,909
+an entire distribution for what it
+thinks should come next in the sequence
+
+138
+00:09:16,909 --> 00:09:21,120
+it has seen so far. So suppose that
+in this very simple example we have the
+
+139
+00:09:21,120 --> 00:09:25,610
+training sequence "hello", and so we have a
+vocabulary of four characters,
+
+140
+00:09:25,610 --> 00:09:29,870
+h, e, l, o, and we're going to try to get a
+recurrent neural network to learn to
+
+141
+00:09:29,870 --> 00:09:33,289
+predict the next character in the sequence
+on this training data. So the way this
+
+142
+00:09:33,289 --> 00:09:37,000
+will work is we'll feed in
+every one of these characters one at a
+
+143
+00:09:37,000 --> 00:09:40,509
+time into the recurrent neural network;
+we'll feed in an 'h' at the first time
+
+144
+00:09:40,509 --> 00:09:47,110
+step - and here the x-axis is time -
+then 'e', 'l' and 'l', and
+
+145
+00:09:47,110 --> 00:09:50,629
+here we're encoding characters using what we
+call a one-hot representation, where we
+
+146
+00:09:50,629 --> 00:09:53,889
+just turn on the bit that corresponds
+to that character's order in the vocabulary.
+
+147
+00:09:53,889 --> 00:09:58,129
+Now we're going to use the recurrence
+formula that I've shown you, where at
+
+148
+00:09:58,129 --> 00:10:01,860
+every single time step - suppose we start
+off with h all zero - we apply this
+
+149
+00:10:01,860 --> 00:10:04,720
+recurrence to compute the hidden state
+vector at every single time step using
+
+150
+00:10:04,720 --> 00:10:08,790
+this fixed recurrence formula. So suppose
+here we have only three numbers in the hidden state:
+
+151
+00:10:08,789 --> 00:10:11,099
+we're going to end up with a
+three-dimensional representation that
+
+152
+00:10:11,100 --> 00:10:13,040
+basically at any point in time
+
+153
+00:10:13,039 --> 00:10:15,759
+summarizes all the characters that have
+come until then,
+
+154
+00:10:15,759 --> 00:10:20,159
+and so we apply this recurrence
+at every single time step, and now
+
+155
+00:10:20,159 --> 00:10:23,139
+we're going to predict at every single
+time step what the next character should be.
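
A sketch of the one-hot encoding in this "hello" example might look as follows (the vocabulary ordering and the tiny weight matrices are assumptions taken from the example):

~~~python
import numpy as np

vocab = ['h', 'e', 'l', 'o']                 # the four characters
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    x = np.zeros((len(vocab), 1))
    x[char_to_ix[ch]] = 1.0                  # turn on this character's bit
    return x

# Feed "hell" through the recurrence one character at a time; h ends up
# summarizing all the characters seen so far.
rng = np.random.default_rng(0)
Wxh, Whh = rng.normal(0, 0.01, (3, 4)), rng.normal(0, 0.01, (3, 3))
h = np.zeros((3, 1))                         # 3-dim hidden state, as above
for ch in 'hell':
    h = np.tanh(Whh @ h + Wxh @ one_hot(ch))
~~~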
+
+156
+00:10:23,139 --> 00:10:27,569
+So for example,
+since we have four characters in
+
+157
+00:10:27,570 --> 00:10:32,100
+this vocabulary, we're going to predict four
+numbers at every single time step. So for
+
+158
+00:10:32,100 --> 00:10:37,139
+example, at the very first time step we
+fed in the letter 'h', and the RNN, with
+
+159
+00:10:37,139 --> 00:10:40,799
+its current setting of weights, computed
+these unnormalized log probabilities
+
+160
+00:10:40,799 --> 00:10:42,959
+here for what it thinks should come next:
+
+161
+00:10:42,960 --> 00:10:47,950
+it thinks that 'h' is 1.0 likely to come
+next, it thinks that 'e' is 2.2 likely, 'l'
+
+162
+00:10:47,950 --> 00:10:52,640
+is -3.0 likely, and 'o' is 4.1
+likely, in terms of unnormalized log
+
+163
+00:10:52,639 --> 00:10:56,409
+probabilities. Of course we know that in
+this training sequence
+
+164
+00:10:56,409 --> 00:11:00,669
+'e' should follow 'h', so in fact this 2.2,
+which is shown in green, is the correct
+
+165
+00:11:00,669 --> 00:11:04,559
+answer in this case, and so we want that
+to be high and we want all these
+
+166
+00:11:04,559 --> 00:11:07,799
+other numbers to be low. So at every
+single time step we have basically a
+
+167
+00:11:07,799 --> 00:11:12,209
+target for what the next character should
+be in the sequence, and so we just want
+
+168
+00:11:12,210 --> 00:11:15,470
+those numbers to be high and all the
+other numbers to be low, and that's of
+
+169
+00:11:15,470 --> 00:11:19,950
+course encoded in the
+loss function, and then that
+
+170
+00:11:19,950 --> 00:11:23,220
+gets backpropagated through these
+connections. So another way to think
+
+171
+00:11:23,220 --> 00:11:26,600
+about it is that at every single time step
+we basically have a softmax classifier:
+
+172
+00:11:26,600 --> 00:11:31,300
+every one of these is a softmax
+classifier over the next character, and
+
+173
+00:11:31,299 --> 00:11:34,269
+at every single point we know what the
+next character should be, and so we just
+
+174
+00:11:34,269 --> 00:11:37,879
+get all those losses flowing down from
+the top, and they will all flow through
+
+175
+00:11:37,879 --> 00:11:41,179
+this graph backwards through all the arrows;
+we're going to get gradients on all the
+
+176
+00:11:41,179 --> 00:11:44,479
+weight matrices, and then we'll know how
+to shift the matrices so that the
+
+177
+00:11:44,480 --> 00:11:50,039
+correct probabilities are coming out of the
+RNN. So we'd be shaping those weights
+
+178
+00:11:50,039 --> 00:11:53,599
+so that the RNN
+has the correct behavior as you're feeding
+
+179
+00:11:53,600 --> 00:11:57,750
+in characters, and you can imagine how we can
+train this. Were there questions about
+
+180
+00:11:57,750 --> 00:12:02,879
+the diagram
+
+181
+00:12:02,879 --> 00:12:08,750
+yeah, thank you. So, as I
+mentioned, we use the recurrence, the
+
+182
+00:12:08,750 --> 00:12:13,320
+same function, always; so we have a
+single W_xh at every time step, we
+
+183
+00:12:13,320 --> 00:12:17,010
+have a single W_hy at every time step, and
+the same W_hh applied at every time step
+
+184
+00:12:17,009 --> 00:12:23,830
+here. So we've used W_xh, W_hy and W_hh four
+times in this diagram, and in back-
+
+185
+00:12:23,830 --> 00:12:27,720
+propagation we have to
+account for that, because we'll have all
+
+186
+00:12:27,720 --> 00:12:30,750
+these gradients adding up to the same
+weight matrix, because it has been used at multiple time steps.
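
Because the same three matrices act at every unrolled time step, each step contributes a term to the same parameter gradient; a compact sketch of that forward/backward bookkeeping (shapes and data are illustrative, loosely following the min-char-rnn structure):

~~~python
import numpy as np

V, H, T = 4, 3, 4                       # vocab size, hidden size, time steps
rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.01, (H, V))
Whh = rng.normal(0, 0.01, (H, H))
Why = rng.normal(0, 0.01, (V, H))
inputs, targets = [0, 1, 2, 2], [1, 2, 2, 3]   # e.g. "hell" -> "ello"

xs, hs, ps = {}, {-1: np.zeros((H, 1))}, {}
for t in range(T):                      # forward: unroll the recurrence
    xs[t] = np.zeros((V, 1)); xs[t][inputs[t]] = 1
    hs[t] = np.tanh(Whh @ hs[t - 1] + Wxh @ xs[t])
    y = Why @ hs[t]
    ps[t] = np.exp(y) / np.sum(np.exp(y))      # softmax at this step

dWxh, dWhh, dWhy = map(np.zeros_like, (Wxh, Whh, Why))
dhnext = np.zeros((H, 1))
for t in reversed(range(T)):            # backward through time
    dy = ps[t].copy(); dy[targets[t]] -= 1     # softmax cross-entropy grad
    dWhy += dy @ hs[t].T                       # accumulate, don't overwrite
    dh = Why.T @ dy + dhnext
    dhraw = (1 - hs[t] ** 2) * dh              # backprop through tanh
    dWxh += dhraw @ xs[t].T                    # accumulate
    dWhh += dhraw @ hs[t - 1].T                # accumulate
    dhnext = Whh.T @ dhraw
~~~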
+
+187
+00:12:30,750 --> 00:12:35,879
+And this is what
+allows us to process, you know, variably
+
+188
+00:12:35,879 --> 00:12:38,960
+sized inputs, because at every time step
+we're doing the same thing, so we're not a
+
+189
+00:12:38,960 --> 00:12:48,540
+function of the absolute length of the
+sequence. And to your question, what are common
+
+190
+00:12:48,539 --> 00:12:52,579
+ways of initializing the first h: I
+think setting it to zero is quite
+
+191
+00:12:52,580 --> 00:13:00,650
+common in the beginning. But does the
+order in which we receive the data
+
+192
+00:13:00,649 --> 00:13:01,289
+matter?
+
+193
+00:13:01,289 --> 00:13:11,299
+yes - so are you asking about feeding these
+characters in a different order? So if
+
+194
+00:13:11,299 --> 00:13:14,359
+you see, if this was a longer sequence,
+the order, in this case,
+
+195
+00:13:14,360 --> 00:13:17,870
+does matter, because at every
+single point in time, if you think about
+
+196
+00:13:17,870 --> 00:13:21,299
+it functionally like this, the vector
+at this time step is a function of
+
+197
+00:13:21,299 --> 00:13:26,859
+everything that has come before it, right?
+And so the order just matters for as long
+
+198
+00:13:26,860 --> 00:13:31,590
+as you're reading it. And we're going to
+go through some specific
+
+199
+00:13:31,590 --> 00:13:36,149
+examples, which I think will clarify some
+of these points. To look at a specific
+
+200
+00:13:36,149 --> 00:13:38,980
+example: in fact, if you want to write a
+character-level language model, it's
+
+201
+00:13:38,980 --> 00:13:43,350
+quite short, so I wrote a gist that you
+can find on GitHub, where this is a
+
+202
+00:13:43,350 --> 00:13:47,220
+hundred-line implementation in numpy of a
+character-level RNN, and we can go
+
+203
+00:13:47,220 --> 00:13:49,840
+through it; I'll actually step through
+this with you so you can see concretely
+
+204
+00:13:49,840 --> 00:13:53,220
+how we could train a recurrent neural
+network in practice. And so I'm going to
+
+205
+00:13:53,220 --> 00:13:58,250
+step through this; we're going to go
+through all the blocks. In the beginning,
+
+206
+00:13:58,250 --> 00:14:02,389
+as you'll see, the only dependency here
+is numpy. We're loading in some text data, so
+
+207
+00:14:02,389 --> 00:14:05,569
+our input here is just a large
+sequence of
+
+208
+00:14:05,570 --> 00:14:10,090
+characters - in this case an input.txt
+file - and then we get all the
+
+209
+00:14:10,090 --> 00:14:14,810
+characters in that file and we find all
+the unique characters in that file,
+
+210
+00:14:14,809 --> 00:14:18,179
+and create these mapping dictionaries that
+map from characters to indices and
+
+211
+00:14:18,179 --> 00:14:23,120
+from indices to characters. We basically
+order our characters: say we read in
+
+212
+00:14:23,120 --> 00:14:27,350
+a whole bunch of data from the file,
+and we have a hundred
+
+213
+00:14:27,350 --> 00:14:30,860
+unique characters or something like that, and we
+order them in a sequence, so we
+
+214
+00:14:30,860 --> 00:14:36,300
+associate indices to every character. Then
+here we're going to initialize
+
+215
+00:14:36,299 --> 00:14:39,899
+first our hidden size, a hyperparameter, as
+you'll see, with recurrent neural
+
+216
+00:14:39,899 --> 00:14:43,100
+networks; here I'm setting it to a
+hundred. We have a learning rate, and the
+
+217
+00:14:43,100 --> 00:14:46,720
+sequence length here is set to
+twenty-five.
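
The data preparation and hyperparameters just described look roughly like this (paraphrasing the gist; the file name and exact variable names are assumptions):

~~~python
# data = open('input.txt', 'r').read()   # in the gist; a stand-in here:
data = "hello there, this string stands in for a large text file"
chars = list(set(data))                  # the unique characters
data_size, vocab_size = len(data), len(chars)
char_to_ix = {ch: i for i, ch in enumerate(chars)}   # char -> index
ix_to_char = {i: ch for i, ch in enumerate(chars)}   # index -> char

hidden_size = 100     # size of the hidden state, a hyperparameter
seq_length = 25       # how many characters we unroll and backprop through
learning_rate = 1e-1
~~~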
+
+218
+00:14:46,720 --> 00:14:51,019
+This is a parameter you'll become aware of; the
+problem is that if our input data is way
+
+219
+00:14:51,019 --> 00:14:53,899
+too large, say like millions of time
+steps, there's no way you can put an
+
+220
+00:14:53,899 --> 00:14:56,870
+RNN on top of all of it, because
+we need to maintain all of that stuff in
+
+221
+00:14:56,870 --> 00:15:00,070
+memory so that you can do back-
+propagation. In fact we won't be able to
+
+222
+00:15:00,070 --> 00:15:03,540
+keep all of it in memory and
+backprop through all of it, so we'll go
+
+223
+00:15:03,539 --> 00:15:07,139
+in chunks through our input data; in this
+case we're going through chunks of 25 at
+
+224
+00:15:07,139 --> 00:15:09,230
+a time. So, as you'll see in a bit,
+
+225
+00:15:09,230 --> 00:15:14,769
+we have this entire dataset, but we'll be
+going in chunks of 25 characters at a
+
+226
+00:15:14,769 --> 00:15:19,509
+time, and every time we're just going to
+backpropagate through 25 characters at a time,
+
+227
+00:15:19,509 --> 00:15:22,149
+because we can't afford to do back-
+propagation for longer, because we have
+
+228
+00:15:22,149 --> 00:15:26,899
+to remember all that stuff. And so we're
+going in chunks here of 25, and then we
+
+229
+00:15:26,899 --> 00:15:30,789
+have all these W matrices, which here I'm
+initializing randomly, and some biases; so W_xh,
+
+230
+00:15:30,789 --> 00:15:34,709
+W_hh and W_hy, and those are
+all of our parameters that we're going
+
+231
+00:15:34,710 --> 00:15:36,790
+to train with backprop
+
+232
+00:15:36,789 --> 00:15:40,699
+I'm going to skip over the loss function
+here, and I'm going to get to the bottom
+
+233
+00:15:40,700 --> 00:15:44,020
+of the script; here we have a main loop,
+and I'm going to go through some of this
+
+234
+00:15:44,019 --> 00:15:48,399
+main loop now. So there is some
+initialization here of various things to zero
+
+235
+00:15:48,399 --> 00:15:50,829
+in the beginning, and then we're looping
+forever
+
+236
+00:15:50,830 --> 00:15:54,960
+what we're doing here is sampling a batch
+of data, so here is where I actually
+
+237
+00:15:54,960 --> 00:15:58,970
+take a batch of 25 characters out of
+this dataset; that's in the list
+
+238
+00:15:58,970 --> 00:16:03,019
+inputs, and the list inputs basically
+just has 25 integers corresponding to the
+
+239
+00:16:03,019 --> 00:16:06,919
+characters. The targets, as you'll see, are
+just all the same characters but offset
+
+240
+00:16:06,919 --> 00:16:09,909
+by one, because those are the indices
+that we're trying to predict at every
+
+241
+00:16:09,909 --> 00:16:15,269
+single time step. So inputs and
+targets are just lists of 25 characters;
+
+242
+00:16:15,269 --> 00:16:20,689
+targets is offset by one into the future.
+So that's how we sample batches
+
+243
+00:16:20,690 --> 00:16:26,480
+from the data. Here we have some sampling
+code: every once in a while during
+
+244
+00:16:26,480 --> 00:16:30,659
+training, of course, I try
+to generate some samples of what it
+
+245
+00:16:30,659 --> 00:16:35,370
+currently thinks these character
+sequences should look like.
+
+246
+00:16:35,370 --> 00:16:40,320
+The way we use character-level
+RNNs at test time is that we're
+
+247
+00:16:40,320 --> 00:16:43,570
+going to seed it with some characters,
+and then this RNN always gives us the
+
+248
+00:16:43,570 --> 00:16:46,379
+distribution over the next character in the
+sequence, so you can imagine sampling
+
+249
+00:16:46,379 --> 00:16:49,259
+from it, and then you feed in the next character.
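
The chunking just described - 25 input indices, with targets the same indices shifted one step into the future - can be sketched like this (variable names assumed):

~~~python
import numpy as np

data = "a long training text would go here ..."   # stand-in for the dataset
chars = sorted(set(data))
char_to_ix = {ch: i for i, ch in enumerate(chars)}
seq_length, hidden_size = 25, 100

hprev = np.zeros((hidden_size, 1))   # hidden state carried chunk to chunk
for p in range(0, len(data) - seq_length - 1, seq_length):
    inputs  = [char_to_ix[ch] for ch in data[p:p + seq_length]]
    targets = [char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]]
    # a loss function would run forward/backward over just these 25 steps,
    # taking hprev in and returning the updated hidden state
~~~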
+You keep getting a sample from the
+
+250
+00:16:49,259 --> 00:16:52,769
+distribution, and you keep doing this:
+keep feeding all the samples back into the
+
+251
+00:16:52,769 --> 00:16:56,549
+RNN, and you can just generate arbitrary
+text data. That's what this code will do;
+
+252
+00:16:56,549 --> 00:17:00,549
+it calls the sample function, so
+we'll get to that in a bit. Then here
+
+253
+00:17:00,549 --> 00:17:04,250
+I'm calling the loss function; the loss
+function receives the inputs, the targets,
+
+254
+00:17:04,250 --> 00:17:09,160
+and it also receives this hprev: hprev
+is short for the hidden state vector
+
+255
+00:17:09,160 --> 00:17:13,900
+from the previous chunk. So we're going
+in chunks of 25 and we are keeping
+
+256
+00:17:13,900 --> 00:17:18,179
+track of the latest hidden state vector at
+the end of those 25 letters, so that
+
+257
+00:17:18,179 --> 00:17:22,400
+when we feed in the next chunk we can
+feed that in as the initial h at that
+
+258
+00:17:22,400 --> 00:17:26,140
+time. So we're making sure that the
+hidden states are basically correctly
+
+259
+00:17:26,140 --> 00:17:30,700
+propagated from chunk to chunk,
+but we're only backpropagating
+
+260
+00:17:30,700 --> 00:17:35,558
+through those 25 time steps. So we get from this
+function the loss, and gradients on
+
+261
+00:17:35,558 --> 00:17:39,319
+all the weight matrices and all the
+biases, and we're just printing the loss,
+
+262
+00:17:39,319 --> 00:17:44,149
+and then here's the parameter update, where
+we use all those gradients, and here we're
+
+263
+00:17:44,150 --> 00:17:47,429
+actually performing the update, which you
+should recognize as an Adagrad update:
+
+264
+00:17:47,429 --> 00:17:53,100
+I have all
+these cached
+
+265
+00:17:53,099 --> 00:17:56,819
+variables for the gradients squared, which
+I'm accumulating, and then I perform the
+
+266
+00:17:56,819 --> 00:18:00,639
+Adagrad update. So now let's go into the
+loss function and see what that looks like.
+
+267
+00:18:00,640 --> 00:18:05,790
+Now, the loss function is this block of
+code; it really consists of a forward and
+
+268
+00:18:05,789 --> 00:18:08,990
+a backward method, so we're computing the
+forward pass and then the backward
+
+269
+00:18:08,990 --> 00:18:13,130
+pass in green. So I'll go through those
+two steps. The forward pass you should
+
+270
+00:18:13,130 --> 00:18:18,919
+recognize: basically we get those inputs and
+targets, we receive these 25
+
+271
+00:18:18,919 --> 00:18:23,360
+indices, and we're iterating through
+them from 1 to 25; we create this x
+
+272
+00:18:23,359 --> 00:18:27,500
+input vector, which is all zeros, and
+then we set the one-hot encoding, so
+
+273
+00:18:27,500 --> 00:18:32,169
+whatever the index in inputs is, we turn
+that bit on with a one. So we're feeding in
+
+274
+00:18:32,169 --> 00:18:34,110
+characters with that one-hot encoding
+
+275
+00:18:34,109 --> 00:18:39,229
+here, and computing the recurrence formula
+using this equation; so hs[t] -
+
+276
+00:18:39,230 --> 00:18:42,210
+these are dictionaries - to keep
+track of everything at every single
+
+277
+00:18:42,210 --> 00:18:46,910
+time step. So we compute the state
+vector and the output using the
+
+278
+00:18:46,910 --> 00:18:50,779
+recurrence formula in these two lines,
+and then over there I'm computing the
+
+279
+00:18:50,779 --> 00:18:54,440
+softmax function, normalizing this so
+that we get probabilities, and then
+
+280
+00:18:54,440 --> 00:18:58,190
+the loss is the negative log probability
+of the correct answer.
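
The Adagrad update being described keeps a cached sum of squared gradients per parameter and scales each step by it; a minimal sketch (the 1e-8 smoothing term is the usual choice, assumed here):

~~~python
import numpy as np

learning_rate = 1e-1
param = np.random.randn(100, 100) * 0.01   # e.g. one weight matrix
mem = np.zeros_like(param)                 # cached sum of squared gradients

def adagrad_step(param, dparam, mem):
    mem += dparam * dparam                 # accumulate squared gradients
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8)  # adaptive step
    return param, mem
~~~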
+
+281
+00:18:58,190 --> 00:19:02,779
+So that's just a softmax classifier loss over there;
+that's the forward pass, and we're going to
+
+282
+00:19:02,779 --> 00:19:06,899
+backpropagate through the graph. So in
+the backward pass we go backwards
+
+283
+00:19:06,900 --> 00:19:08,530
+through that sequence, from 25
+
+284
+00:19:08,529 --> 00:19:12,899
+all the way back to one, and - maybe you'll
+recognize, I don't know how much detail I
+
+285
+00:19:12,900 --> 00:19:16,509
+want to go into here, but you'll recognize
+that I'm backpropagating through a softmax,
+
+286
+00:19:16,509 --> 00:19:19,089
+backpropagating through the activation
+functions; I'm backpropagating through
+
+287
+00:19:19,089 --> 00:19:23,379
+all of it, and I'm just adding up all the
+gradients on all the parameters. And
+
+288
+00:19:23,380 --> 00:19:27,210
+one thing to note here especially is
+that for these gradients on the weight
+
+289
+00:19:27,210 --> 00:19:31,210
+matrices, like W_hh, I'm using a plus-
+equals, because at every single time step
+
+290
+00:19:31,210 --> 00:19:34,590
+all of these weight matrices get a
+gradient, and we need to accumulate
+
+291
+00:19:34,589 --> 00:19:37,449
+it into all the weight matrices, because
+we keep using all these weight matrices
+
+292
+00:19:37,450 --> 00:19:43,980
+at every time step, and so we
+just backprop into them over time, and
+
+293
+00:19:43,980 --> 00:19:48,130
+that gives us the gradients, and then we
+can use those in the parameter
+
+294
+00:19:48,130 --> 00:19:52,580
+update. And then here we have, finally, the
+sampling function. So here is where we
+
+295
+00:19:52,579 --> 00:19:55,960
+try to actually get the RNN to
+generate new text data based on what it
+
+296
+00:19:55,960 --> 00:19:59,058
+has seen in the training data, based on the
+statistics of the characters and how
+
+297
+00:19:59,058 --> 00:20:02,048
+they follow each other in the training
+data. So we initialize with some random
+
+298
+00:20:02,048 --> 00:20:06,759
+character, and then we go on until we
+get tired: we compute the recurrence
+
+299
+00:20:06,759 --> 00:20:09,289
+formula, get the probability
+distribution, sample from the
+
+300
+00:20:09,289 --> 00:20:10,450
+distribution,
+
+301
+00:20:10,450 --> 00:20:15,640
+encode the sample in a one-hot
+representation, and then we feed it in at the
+
+302
+00:20:15,640 --> 00:20:22,460
+next time step. So we keep doing this until
+we've generated, say, 200 characters. So is there any
+
+303
+00:20:22,460 --> 00:20:27,190
+question about just the rough layout
+of how this works?
+
+304
+00:20:27,190 --> 00:21:04,680
+We have 25 softmax classifiers in every chunk,
+and we backprop all of those at the same
+
+305
+00:21:04,680 --> 00:21:14,910
+time, and they all add up in the connections
+going backwards. Next: do we use
+
+306
+00:21:14,910 --> 00:21:19,259
+regularization here? You'll see that I
+actually do not - I guess I skipped it
+
+307
+00:21:19,259 --> 00:21:23,720
+here - but you can in general. I think
+sometimes I tried regularization; I don't
+
+308
+00:21:23,720 --> 00:21:27,269
+think it is as common to use it in
+recurrent nets. Sometimes it
+
+309
+00:21:27,269 --> 00:21:38,379
+gave me worse results, so sometimes
+I skip it; it's kind of a hyperparameter
+
+310
+00:21:38,380 --> 00:21:48,260
+yeah, that's right, that's right. So
+in this sequence of 25 characters here we are
+
+311
+00:21:48,259 --> 00:21:51,839
+at a very low level, on characters, and we
+don't actually care about words.
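
The sampling loop just walked through - seed a character, get a distribution, sample, feed the sample back in - might be sketched like this (loosely following the gist's sample function; names and biases are assumptions):

~~~python
import numpy as np

def sample(h, seed_ix, n, Wxh, Whh, Why, bh, by, vocab_size):
    """Seed with one character index, then sample n more characters."""
    x = np.zeros((vocab_size, 1)); x[seed_ix] = 1
    ixes = []
    for _ in range(n):
        h = np.tanh(Wxh @ x + Whh @ h + bh)     # the recurrence formula
        y = Why @ h + by
        p = np.exp(y) / np.sum(np.exp(y))       # distribution over next char
        ix = np.random.choice(range(vocab_size), p=p.ravel())
        x = np.zeros((vocab_size, 1)); x[ix] = 1  # feed the sample back in
        ixes.append(ix)
    return ixes
~~~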
+
+312
+00:21:51,839 --> 00:21:56,289
+We don't even know that words exist; it's just character
+indices, so this RNN in fact doesn't
+
+313
+00:21:56,289 --> 00:21:58,569
+know anything about characters or
+language or anything like that; these are just
+
+314
+00:21:58,569 --> 00:22:08,009
+indices and sequences of indices, and
+that's what we're modeling. Question: could
+
+315
+00:22:08,009 --> 00:22:13,460
+spaces be used as delimiters or
+something like that, instead of just
+
+316
+00:22:13,460 --> 00:22:18,630
+constant chunks of 25? I think you maybe
+could, but then you kind of have
+
+317
+00:22:18,630 --> 00:22:22,530
+to make assumptions about language; we'll
+see soon why you wouldn't want to do that,
+
+318
+00:22:22,529 --> 00:22:25,359
+because you can plug anything into this,
+and we'll see that we can have a lot of
+
+319
+00:22:25,359 --> 00:22:31,539
+fun with that. OK, so now we can
+take a whole bunch of text - we don't
+
+320
+00:22:31,539 --> 00:22:34,889
+care where it came from; it's a sequence of
+characters - and we feed it into the RNN,
+
+321
+00:22:34,890 --> 00:22:40,670
+and we can train the RNN to create
+text like it. So for example you can
+
+322
+00:22:40,670 --> 00:22:44,789
+take all of William Shakespeare's works;
+you can concatenate it all into just a giant
+
+323
+00:22:44,789 --> 00:22:48,289
+sequence of characters, and you put it into
+the recurrent neural network and try to
+
+324
+00:22:48,289 --> 00:22:51,909
+predict the next character in the sequence
+on William Shakespeare's works. And
+
+325
+00:22:51,910 --> 00:22:54,650
+so when you do that, of course, in the
+beginning the recurrent neural network
+
+326
+00:22:54,650 --> 00:22:59,030
+has random parameters, so it's just
+producing garbage at the very start;
+
+327
+00:22:59,029 --> 00:23:03,200
+it's just random characters. But then
+when you train it, the RNN will start to
+
+328
+00:23:03,200 --> 00:23:06,930
+understand that OK, there are actually
+things like spaces, there are words; it starts
+
+329
+00:23:06,930 --> 00:23:11,490
+to experiment with quotes, and it
+basically learns some of the very short
+
+330
+00:23:11,490 --> 00:23:16,420
+words like "here" or "on" and so on. And then
+as you train more and more, this
+
+331
+00:23:16,420 --> 00:23:18,820
+becomes more and more refined, and the
+recurrent neural network learns that
+
+332
+00:23:18,819 --> 00:23:22,609
+when you open a quote you should close
+it later, or that sentences end
+
+333
+00:23:22,609 --> 00:23:26,379
+with a dot. It learns all this
+stuff statistically, just from the raw
+
+334
+00:23:26,380 --> 00:23:29,630
+patterns, without actually having to hand-
+code anything. And in the end you can
+
+335
+00:23:29,630 --> 00:23:30,580
+sample entire
+
+336
+00:23:30,579 --> 00:23:34,349
+Shakespeare based on this, on a character
+level. So just to give an idea about what
+
+337
+00:23:34,349 --> 00:23:38,740
+kind of stuff comes out: "Alas, I think he
+shall be come approached and the day
+
+338
+00:23:38,740 --> 00:23:42,900
+When little srain would be attain'd into being
+never fed, And who is but a chain and
+
+339
+00:23:42,900 --> 00:23:45,460
+subjects of his death, I should not sleep."
+
+340
+00:23:45,460 --> 00:23:56,909
+That's the kind of stuff that you would
+get out of this recurrent network. You're
+
+341
+00:23:56,909 --> 00:24:02,679
+bringing up a very subtle point, which I'd
+like to get back to in a bit. Okay, so
+
+342
+00:24:02,679 --> 00:24:05,980
+we can run this on Shakespeare, but we
+can run this on basically anything. So
+
+343
+00:24:05,980 --> 00:24:08,960
+we were playing
+with this with Justin, I
+think, roughly a year ago, and so
+
+344
+00:24:08,960 --> 00:24:12,990
+Justin found this book on
+algebraic geometry, and this is just a
+
+345
+00:24:12,990 --> 00:24:18,069
+large LaTeX source file; we took that
+LaTeX source file for this algebraic geometry book and
+
+346
+00:24:18,069 --> 00:24:23,398
+fed it to the RNN, and the RNN can learn
+to basically generate mathematics. So
+
+347
+00:24:23,398 --> 00:24:27,199
+these are samples it emitted: it
+just spits out LaTeX, and then we
+
+348
+00:24:27,200 --> 00:24:30,009
+can compile it - of course it doesn't work
+right away; we had to tune it a tiny bit -
+
+349
+00:24:30,009 --> 00:24:33,890
+but basically, after we tweaked
+some of the mistakes that the RNN had made, you
+
+350
+00:24:33,890 --> 00:24:37,200
+can compile it and you can get
+generated mathematics, and you'll see that
+
+351
+00:24:37,200 --> 00:24:42,460
+it basically creates all these proofs; it
+puts these little squares at the
+
+352
+00:24:42,460 --> 00:24:47,090
+end of proofs, it creates lemmas and so
+on
+
+353
+00:24:47,089 --> 00:24:52,428
+sometimes it tries to create
+diagrams, with varying amounts of success,
+
+354
+00:24:52,429 --> 00:24:56,720
+and my favorite part about this
+is that on the top left, the proof here
+
+355
+00:24:56,720 --> 00:24:59,650
+is omitted
+
+356
+00:24:59,650 --> 00:25:05,780
+the RNN is just lazy. But otherwise
+this stuff is quite indistinguishable, I
+
+357
+00:25:05,779 --> 00:25:12,480
+would say, from actual algebraic geometry.
+"Let X be a scheme of X" - OK, I'm not sure
+
+358
+00:25:12,480 --> 00:25:16,160
+about that part, but otherwise the
+gestalt of this looks very good.
+
+359
+00:25:16,160 --> 00:25:19,529
+You can throw arbitrary things at it, so I tried to
+find the hardest arbitrary thing that I
+
+360
+00:25:19,529 --> 00:25:22,769
+could throw at the character-level RNN; I
+decided that source code is actually
+
+361
+00:25:22,769 --> 00:25:27,879
+very difficult. So I took all of the Linux
+source, which is just a lot of C code -
+
+362
+00:25:27,880 --> 00:25:30,850
+you can concatenate it and you end up with, I
+think, some hundreds of megabytes of just
+
+363
+00:25:30,849 --> 00:25:35,079
+C code and header files - and then just
+threw it into the RNN, and then it can
+
+364
+00:25:35,079 --> 00:25:39,849
+learn to generate code. And so this is
+generated code from the RNN, and you
+
+365
+00:25:39,849 --> 00:25:42,949
+can see that it basically creates function
+declarations, it knows about inputs,
+
+366
+00:25:42,950 --> 00:25:47,460
+syntactically it makes very few mistakes,
+it knows about variables and sort of how to
+
+367
+00:25:47,460 --> 00:25:53,230
+use them, sometimes it indents the code, it
+creates its own bogus comments;
+
+368
+00:25:53,230 --> 00:25:58,089
+syntactically, it's very rare to find that
+it would open a bracket and not close it,
+
+369
+00:25:58,089 --> 00:26:01,808
+and so on. This actually is relatively
+easy for the RNN to learn. And some of
+
+370
+00:26:01,808 --> 00:26:04,058
+the mistakes that it makes, actually, are that,
+for example, it
+
+371
+00:26:04,058 --> 00:26:07,240
+declares some variables that it never
+ends up using, or it uses some variables
+
+372
+00:26:07,240 --> 00:26:09,929
+that it never declared, and so some of
+this high-level stuff is still missing,
+
+373
+00:26:09,929 --> 00:26:12,509
+but otherwise it can do just fine
+
+374
+00:26:12,509 --> 00:26:17,460
+it also knows how to recite the
+GNU GPL license character by character,
+
+375
+00:26:17,460 --> 00:26:22,009
+which it
+has learned from the data, and it
+knows that after the GPL license there
+
+376
+00:26:22,009 --> 00:26:25,779
+are some include files, there are some macros, and
+then there's some code. So that's
+
+377
+00:26:25,779 --> 00:26:33,879
+basically what it has learned. Now, the gist
+I showed is very small,
+
+378
+00:26:33,880 --> 00:26:37,169
+just a toy thing to show you what's
+going on. Then there's char-rnn,
+
+379
+00:26:37,169 --> 00:26:41,230
+which is a fuller kind of implementation
+in Torch, which is basically this gist
+
+380
+00:26:41,230 --> 00:26:45,009
+scaled up, and it runs on GPU, so
+you can play with that yourself. And
+
+381
+00:26:45,009 --> 00:26:49,269
+this sample in particular came from
+that; it's a three-layer LSTM,
+
+382
+00:26:49,269 --> 00:26:52,289
+and we'll see what that means soon;
+it's a more complex kind of neural
+
+383
+00:26:52,289 --> 00:26:58,839
+network. To just give an idea about how
+this works: there's a paper that we
+
+384
+00:26:58,839 --> 00:27:02,089
+worked on just
+last year, where we were basically trying to
+
+385
+00:27:02,089 --> 00:27:08,949
+pretend that we're neuroscientists, and
+we ran a character-level RNN over some test text,
+
+386
+00:27:08,950 --> 00:27:13,110
+so the RNN is reading this text or
+this snippet of code, and we're looking at
+
+387
+00:27:13,109 --> 00:27:17,119
+a specific cell in its hidden state,
+coloring the text based on whether or
+
+388
+00:27:17,119 --> 00:27:18,699
+not that cell is excited or not
+
+389
+00:27:18,700 --> 00:27:23,470
+ok, so you can see that many of the hidden-state
+
+390
+00:27:23,470 --> 00:27:27,110
+neurons are not interpretable; they kind of
+fire on and off in kind of weird ways,
+
+391
+00:27:27,109 --> 00:27:29,829
+because some of them
+have to do quite low-level character-
+
+392
+00:27:29,829 --> 00:27:33,859
+level stuff, like how often does 'e' come
+after 'h' and stuff like that. But some of
+
+393
+00:27:33,859 --> 00:27:37,928
+the cells are quite interpretable. So for
+example, we found cells like a quote-
+
+394
+00:27:37,929 --> 00:27:41,830
+detection cell, so this cell just turns
+on when it sees a quote and then it stays
+
+395
+00:27:41,829 --> 00:27:46,460
+on until the quote closes, and so it
+quite reliably keeps track of this, and
+
+396
+00:27:46,460 --> 00:27:50,610
+it just comes out from backpropagation:
+the RNN decided that the
+
+397
+00:27:50,609 --> 00:27:54,329
+character-level statistics are different
+inside and outside of quotes, and this is
+
+398
+00:27:54,329 --> 00:27:57,639
+a useful feature to learn, and so it
+dedicates some of its hidden state to
+
+399
+00:27:57,640 --> 00:28:00,650
+keeping track of whether or not you're
+inside a quote. And this goes back to
+
+400
+00:28:00,650 --> 00:28:05,159
+your question, which I want to point out
+here: this RNN was trained on, I
+
+401
+00:28:05,159 --> 00:28:06,500
+think, a sequence length of
+
+402
+00:28:06,500 --> 00:28:10,269
+a hundred, but if you measure the length of
+this quote, it's actually much more than a
+
+403
+00:28:10,269 --> 00:28:16,220
+hundred - I think it's like 250 - and so we
+only backpropagated up to
+
+404
+00:28:16,220 --> 00:28:20,190
+a hundred, and so that's the only extent
+over which the cell could actually learn
+
+405
+00:28:20,190 --> 00:28:23,460
+this, because it wouldn't be able to
+spot dependencies that are much longer
+
+406
+00:28:23,460 --> 00:28:27,809
+than that. But I think basically this
+seems to show that you can train
this quote-
+detection cell as useful on sequences of less than a hundred,
+
+408
+00:28:31,160 --> 00:28:36,580
+and then it generalizes properly to
+longer sequences. So this cell
+
+409
+00:28:36,579 --> 00:28:39,859
+seems to work for more than a hundred
+steps, even if it was only trained - even
+
+410
+00:28:39,859 --> 00:28:44,759
+if it was only able to spot
+dependencies of less than a hundred steps. This
+
+411
+00:28:44,759 --> 00:28:48,890
+is another dataset here; this is, I think,
+Leo Tolstoy's War and Peace. In
+
+412
+00:28:48,890 --> 00:28:52,460
+this dataset there's a newline
+character roughly every 80
+
+413
+00:28:52,460 --> 00:28:57,819
+characters - at 80 characters, roughly,
+there's a newline - and there's
+
+414
+00:28:57,819 --> 00:29:02,470
+a line-length tracking cell that we found,
+which starts off at like 1 and then
+
+415
+00:29:02,470 --> 00:29:06,539
+it slowly decays over time, and you might
+imagine that a cell like this is
+
+416
+00:29:06,539 --> 00:29:09,019
+actually very useful in predicting the
+newline character at the end, because
+
+417
+00:29:09,019 --> 00:29:13,059
+this RNN needs to count out 80 time steps
+so that it knows when a newline
+
+418
+00:29:13,059 --> 00:29:15,149
+character is likely to come next
+
+419
+00:29:15,150 --> 00:29:19,280
+ok, so there are line-length tracking cells; we
+found cells that actually respond only inside
+
+420
+00:29:19,279 --> 00:29:23,970
+if statements, we found cells that
+only respond inside quotes and strings,
+
+421
+00:29:23,970 --> 00:29:28,710
+we found cells that get more excited
+the deeper you nest in an expression, and so
+
+422
+00:29:28,710 --> 00:29:33,150
+there are all kinds of interesting cells that you
+can actually find inside these RNNs that
+
+423
+00:29:33,150 --> 00:29:36,710
+completely come out just from the back-
+propagation, and so that's quite magical,
+
+424
+00:29:36,710 --> 00:29:42,130
+I suppose, but
+
+425
+00:29:42,130 --> 00:29:49,110
+this LSTM has, I think, about
+2,100 cells, so you just go through
+
+426
+00:29:49,109 --> 00:29:54,589
+them, and some of them look like this, but
+I would say for roughly 5 percent of them
+
+427
+00:29:54,589 --> 00:30:00,429
+you spot something interesting; so you
+just go through them manually
+
+428
+00:30:00,430 --> 00:30:05,310
+sorry - so we are running the
+entire RNN completely intact, but we're only
+
+429
+00:30:05,309 --> 00:30:09,679
+looking at a single hidden state - at
+the firing of one single cell in
+
+430
+00:30:09,680 --> 00:30:14,470
+the RNN. So we're running the RNN normally, but
+we're just kind of recording from one
+
+431
+00:30:14,470 --> 00:30:20,900
+cell in the hidden state, if that makes
+sense. So for this cell, the entire RNN
+
+432
+00:30:20,900 --> 00:30:23,940
+is running; I'm only visualizing one part of the
+hidden state. Basically there are many
+
+433
+00:30:23,940 --> 00:30:27,740
+other hidden cells that are
+involved in different ways, and they're
+
+434
+00:30:27,740 --> 00:30:30,349
+all behaving in different ways, and
+they're all doing different things
+
+435
+00:30:30,349 --> 00:30:41,899
+inside the RNN's hidden state
+
+436
+00:30:41,900 --> 00:30:50,150
+but you can get similar results with one
+layer
+
+437
+00:30:50,150 --> 00:31:00,490
+these cells always fire between negative
+one and one - yeah - and this is from
+
+438
+00:31:00,490 --> 00:31:04,120
+an LSTM, which we haven't covered
+yet; the cells fire between minus one and one.
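
Mechanically, "recording from one cell" is simple: run the network over text and log one coordinate of the hidden state at each character. A sketch, with an untrained toy RNN standing in for the trained LSTM (all names and sizes are assumptions):

~~~python
import numpy as np

rng = np.random.default_rng(0)
V, H = 128, 100                       # ASCII vocab and hidden size, assumed
Wxh, Whh = rng.normal(0, 0.01, (H, V)), rng.normal(0, 0.01, (H, H))
cell = 42                             # the single cell we record from

h, trace = np.zeros((H, 1)), []
text = 'print("hello")  # some test text'
for ch in text:
    x = np.zeros((V, 1)); x[ord(ch) % V] = 1
    h = np.tanh(Whh @ h + Wxh @ x)
    trace.append(float(h[cell]))      # this cell's firing on this character
# Coloring each character of `text` by trace[i] gives plots like those shown.
~~~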
+439
+00:31:04,119 --> 00:31:11,869
+So that's the scale of
+this picture. So RNNs are pretty
+
+440
+00:31:11,869 --> 00:31:15,609
+cool, and you can actually train these
+sequence models. Roughly
+
+441
+00:31:15,609 --> 00:31:19,039
+one year ago, several people came to
+realize that you can actually use
+
+442
+00:31:19,039 --> 00:31:22,039
+these in a very neat application in the
+context of computer vision, to perform
+
+443
+00:31:22,039 --> 00:31:25,210
+image captioning. In this context we're
+taking a single image, and we'd like to
+
+444
+00:31:25,210 --> 00:31:27,840
+describe it with a sequence of words,
+and these RNNs are very good at
+
+445
+00:31:27,839 --> 00:31:32,490
+understanding how sequences develop over
+time. So in this particular model I'm
+
+446
+00:31:32,490 --> 00:31:36,240
+going to describe - this is actually work
+from roughly a year ago that happens to be my
+
+447
+00:31:36,240 --> 00:31:43,039
+paper, and I have pictures from my
+paper, so I'm going to use those - we
+
+448
+00:31:43,039 --> 00:31:46,629
+are feeding an image into a
+convolutional neural network, and
+
+449
+00:31:46,630 --> 00:31:48,990
+you'll see that this whole model is
+actually just made up of two modules:
+
+450
+00:31:48,990 --> 00:31:51,750
+there's the convnet that is doing the
+processing of the image, and the
+
+451
+00:31:51,750 --> 00:31:55,460
+recurrent net, which is
+very good at modeling sequences. So,
+
+452
+00:31:55,460 --> 00:31:58,470
+if you remember my analogy from the very
+beginning of the course, where this is
+
+453
+00:31:58,470 --> 00:32:01,039
+kind of like playing with Lego blocks:
+we're going to take those two modules
+
+454
+00:32:01,039 --> 00:32:04,509
+and stick them together; that corresponds
+to the arrow in between. And so what
+
+455
+00:32:04,509 --> 00:32:07,829
+we're doing effectively here is we're
+conditioning this RNN generative model:
+
+456
+00:32:07,829 --> 00:32:11,349
+we're not just telling it to sample text at
+random, but we're conditioning the
+
+457
+00:32:11,349 --> 00:32:14,939
+generative process on the output of the
+convolutional network, and I'll show you exactly
+
+458
+00:32:14,940 --> 00:32:21,220
+what that looks like. So I'm going
+to show you what the forward pass of
+
+459
+00:32:21,220 --> 00:32:24,110
+the neural network is. Suppose we have a
+test image and we're trying to describe
+
+460
+00:32:24,109 --> 00:32:27,679
+it with a sequence of words. The way
+this model would process the image is as
+
+461
+00:32:27,680 --> 00:32:31,240
+follows: we take the image and plug it into
+a convolutional neural network - in this
+
+462
+00:32:31,240 --> 00:32:35,250
+case it's a VGGNet - so we go through a
+whole bunch of conv, pool and so on,
+
+463
+00:32:35,250 --> 00:32:37,349
+until we arrive at the end.
+
+464
+00:32:37,349 --> 00:32:40,149
+Normally at the end we have the
+softmax classifier, which is giving you
+
+465
+00:32:40,150 --> 00:32:44,440
+a probability distribution over, say, 1000
+categories of images. In this case we're
+
+466
+00:32:44,440 --> 00:32:47,420
+going to actually get rid of that
+classifier, and instead we're going to
+
+467
+00:32:47,420 --> 00:32:50,750
+redirect the representation at the top of
+the convolutional network into the recurrent
+
+468
+00:32:50,750 --> 00:32:54,880
+neural network. So we begin the generation
+of the RNN with a special
+
+469
+00:32:54,880 --> 00:33:00,410
+start vector; the inputs to the RNN
+here are, I think, 300-dimensional, and
+
+470
+00:33:00,410 --> 00:33:02,700
+this is a special three-hundred-
+dimensional vector that we always plug
+
+471
+00:33:02,700 --> 00:33:05,750
+in at the first iteration; it tells the RNN that
+this is the beginning of the sequence.
+
+472
+00:33:05,750 --> 00:33:09,039
+Then we're going to perform the
+recurrence formula that I've shown you
+
+473
+00:33:09,039 --> 00:33:13,769
+before for a recurrent neural network.
+Normally we compute this recurrence,
+
+474
+00:33:13,769 --> 00:33:18,779
+which we've seen already, where we compute
+W_xh times x plus W_hh times h; now we want
+
+475
+00:33:18,779 --> 00:33:23,500
+to additionally condition this recurrent
+neural network, not only on the current
+
+476
+00:33:23,500 --> 00:33:28,089
+input and the current hidden state - which we
+initialize to zero, so that term goes away at
+
+477
+00:33:28,089 --> 00:33:33,649
+the first time step - but we additionally
+condition by adding W_ih times v,
+
+478
+00:33:33,650 --> 00:33:38,040
+where v is the top of the convnet
+here. So we've added an interaction, an
+
+479
+00:33:38,039 --> 00:33:43,399
+added weight matrix W_ih, which tells us how
+this image information merges into the
+
+480
+00:33:43,400 --> 00:33:46,380
+very first time step of the recurrent neural
+network. Now there are many ways to
+
+481
+00:33:46,380 --> 00:33:48,940
+actually play with this recurrence and
+many ways to actually plug the image
+
+482
+00:33:48,940 --> 00:33:51,690
+into the RNN, and this is only one
+of them, and one of the simpler ones,
+
+483
+00:33:51,690 --> 00:33:55,750
+perhaps. And at the very first time step
+here, this y0 vector is the
+
+484
+00:33:55,750 --> 00:34:00,009
+distribution over the first word in the
+sequence. So the way this works,
+
+485
+00:34:00,009 --> 00:34:05,490
+you might imagine, for example: you can
+see that these straw textures in the man's
+
+486
+00:34:05,490 --> 00:34:09,699
+hat can be recognized by the convolutional
+network as straw-like stuff, and then
+
+487
+00:34:09,699 --> 00:34:12,939
+through this interaction W_ih might
+condition the hidden state to go into a
+
+488
+00:34:12,940 --> 00:34:17,039
+particular state where the probability
+of the word "straw" is slightly higher,
+
+489
+00:34:17,039 --> 00:34:20,519
+right? So you might imagine that the
+straw-like textures can influence the
+
+490
+00:34:20,519 --> 00:34:23,940
+probability of "straw", so one of the
+numbers inside y0 ends up higher because
+
+491
+00:34:23,940 --> 00:34:28,470
+of those textures. And so the RNN from
+now on has to kind of juggle two tasks:
+
+492
+00:34:28,469 --> 00:34:32,269
+it has to predict the next
+word in the sequence, in this case, and it
+
+493
+00:34:32,269 --> 00:34:36,550
+has to remember the image information. So
+we sample from that softmax, and
+
+494
+00:34:36,550 --> 00:34:40,629
+suppose the most likely word that we
+sampled from that distribution was
+
+495
+00:34:40,628 --> 00:34:44,710
+indeed the word "straw"; we take
+"straw" and we plug it into
+
+496
+00:34:44,710 --> 00:34:47,519
+the recurrent neural network at the
+bottom again, and in this case I think
+
+497
+00:34:47,519 --> 00:34:52,190
+we're using word-level embeddings, so
+the word "straw" is associated with
+
+498
+00:34:52,190 --> 00:34:55,750
+a three-hundred-dimensional vector; we're going
+to learn a three-hundred-
+
+499
+00:34:55,750 --> 00:35:00,010
+dimensional representation for every single
+unique word in the vocabulary, and we plug those
+
+500
+00:35:00,010 --> 00:35:02,940
+three hundred numbers into the RNN and
+forward propagate again.
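
The conditioned first step being described - the usual recurrence plus an extra W_ih v term carrying the image - can be sketched as follows (all sizes here are illustrative assumptions, not the paper's exact ones):

~~~python
import numpy as np

rng = np.random.default_rng(0)
D, H, E = 4096, 512, 300   # convnet feature, hidden, embedding sizes (assumed)
Wxh = rng.normal(0, 0.01, (H, E))
Whh = rng.normal(0, 0.01, (H, H))
Wih = rng.normal(0, 0.01, (H, D))

v = rng.normal(0, 1, (D, 1))           # image feature from the top of the convnet
x_start = rng.normal(0, 0.01, (E, 1))  # the special learned START vector
h0 = np.zeros((H, 1))                  # initial hidden state, zeros

# First time step only: the usual recurrence plus the image term.
h1 = np.tanh(Wxh @ x_start + Whh @ h0 + Wih @ v)
~~~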
+
+501
+00:35:02,940 --> 00:35:07,090
+We get the distribution over the second word in
+the sequence, inside y1; so we get all these probabilities and we
+
+502
+00:35:07,090 --> 00:35:08,010
+sample from it again
+
+503
+00:35:08,010 --> 00:35:12,490
+suppose that the word "hat" is likely now; we
+take hat's three-hundred-dimensional representation
+
+504
+00:35:12,489 --> 00:35:18,299
+and get the distribution over there, and
+then we sample again, and we sample until
+
+505
+00:35:18,300 --> 00:35:21,350
+we sample a special END token, which is
+really the period at the end of the
+
+506
+00:35:21,349 --> 00:35:24,900
+sentence, and that tells us that the
+RNN is now done generating. And at this
+
+507
+00:35:24,900 --> 00:35:30,280
+point the RNN would have described this
+image as "straw hat", period. OK, so the
+
+508
+00:35:30,280 --> 00:35:34,010
+number of dimensions in this y
+vector is the number of words in your
+
+509
+00:35:34,010 --> 00:35:39,220
+vocabulary, plus one for the special END token,
+and we are always feeding in three-hundred-
+
+510
+00:35:39,219 --> 00:35:43,609
+dimensional vectors that correspond to different
+words, plus a special START token. And
+
+511
+00:35:43,610 --> 00:35:46,250
+then we just backpropagate
+through the whole thing as a single
+
+512
+00:35:46,250 --> 00:35:49,769
+model; you can initialize this at random, or
+you can initialize your VGGNet with pre-
+
+513
+00:35:49,769 --> 00:35:52,099
+trained weights from ImageNet, and then
+
+514
+00:35:52,099 --> 00:35:56,319
+you get the distributions, you encode the
+gradient, and then you backprop through
+
+515
+00:35:56,320 --> 00:35:59,700
+the whole thing as a single model,
+training it all jointly, and you get
+
+516
+00:35:59,699 --> 00:36:08,389
+an image captioner. Lots of
+questions - OK. Yes, the three-hundred-
+
+517
+00:36:08,389 --> 00:36:12,609
+dimensional embeddings are just
+independent of the image, so every word
+
+518
+00:36:12,610 --> 00:36:18,430
+has 300 numbers associated with it, and
+we're going to backprop into them. So
+
+519
+00:36:18,429 --> 00:36:21,769
+you initialize them at random, and then you
+can backprop into these vectors,
+
+520
+00:36:21,769 --> 00:36:25,360
+right? So those embeddings will shift
+around; they're just parameters. Another
+
+521
+00:36:25,360 --> 00:36:30,530
+way to think about it: it's equivalent to having
+a one-hot representation for all the
+
+522
+00:36:30,530 --> 00:36:34,960
+words, and then you have a giant W matrix,
+where every single one-hot vector is
+
+523
+00:36:34,960 --> 00:36:40,130
+multiplied with W -
+and if W has 300 outputs -
+
+524
+00:36:40,130 --> 00:36:43,530
+then it's effectively plucking out a
+single row of W, which is that word's
+
+525
+00:36:43,530 --> 00:36:47,560
+embedding; it's kind of equivalent. So
+if you don't like those
+
+526
+00:36:47,559 --> 00:36:50,279
+embeddings, just think of it as a
+one-hot representation, and you can
+
+527
+00:36:50,280 --> 00:36:58,920
+think of it that way. Does the model learn to
+output the END token? Yes - in the training
+
+528
+00:36:58,920 --> 00:37:02,769
+data, the correct sequence that we expect
+from the RNN is first word, second word
+
+529
+00:37:02,769 --> 00:37:07,969
+and so forth, and so every single
+training example sort of has a special
+
+530
+00:37:07,969 --> 00:37:10,288
+END token at the end. Go ahead
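
The equivalence just mentioned - a one-hot vector times an embedding matrix is the same as plucking out one row - is easy to check (sizes and the word index here are made up):

~~~python
import numpy as np

vocab_size, embed_dim = 10000, 300
W_embed = np.random.randn(vocab_size, embed_dim) * 0.01

ix = 1234                              # index of some word, e.g. "straw"
one_hot = np.zeros(vocab_size); one_hot[ix] = 1

via_matmul = one_hot @ W_embed         # the matrix-multiply view
via_lookup = W_embed[ix]               # the row-pluck view
assert np.allclose(via_matmul, via_lookup)
~~~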
you just plug in the + +533 +00:37:32,998 --> 00:37:36,718 +very first time step and then the Arnon +has to juggle these both tasks that has + +534 +00:37:36,719 --> 00:37:40,829 +to remember about the image what it +needs to remember through the art and it + +535 +00:37:40,829 --> 00:37:45,179 +also has to produce all these outfits +and somehow it wants to do that there's + +536 +00:37:45,179 --> 00:38:04,209 +some headway the reasons I can give you +after class right that's true + +537 +00:38:04,208 --> 00:38:10,208 +a single instance will correspond to an +image and a sequence of words and so we + +538 +00:38:10,208 --> 00:38:16,328 +would plug in those words here and we +will talk in that image and we shall I + +539 +00:38:16,329 --> 00:38:22,159 +come so it's a train time you have all +those weren't planning on the bottom of + +540 +00:38:22,159 --> 00:38:25,528 +the image London and then you unroll +this graph and you have your losses in + +541 +00:38:25,528 --> 00:38:29,389 +the background and then you can do +batches of images if you're careful and + +542 +00:38:29,389 --> 00:38:33,108 +so if your images they sometimes have +different lengths sequences in the + +543 +00:38:33,108 --> 00:38:36,199 +training data have to be careful with +that because you have to say that ok I'm + +544 +00:38:36,199 --> 00:38:41,059 +willing to process batches of up to +twenty words maybe and then some of + +545 +00:38:41,059 --> 00:38:44,499 +those sentences will be shorter or +longer a need to in your code you know + +546 +00:38:44,498 --> 00:38:48,188 +worry about that because some some some +sentences are longer than others + +547 +00:38:48,188 --> 00:38:55,368 +we have way too many questions I have +stuff to go + +548 +00:38:55,369 --> 00:39:03,450 +yes thank you so that propagate +everything completely jointly and two in + +549 +00:39:03,449 --> 00:39:07,538 +training so you can pre train with the +internet and then you put those words + +550 +00:39:07,539 --> 00:39:10,190 +there but then you just want to train +everything jointly and that's a big + +551 +00:39:10,190 --> 00:39:15,429 +advantage actually because we can we can +figure out what features to look for in + +552 +00:39:15,429 --> 00:39:20,368 +order to better describe the image that +the end so when you train this in + +553 +00:39:20,369 --> 00:39:23,890 +practice we tried this on the census +data sets one of the more common wants + +554 +00:39:23,889 --> 00:39:27,368 +is called Microsoft Coco so just to give +you an idea of what it looks like it's + +555 +00:39:27,369 --> 00:39:31,499 +roughly 800,000 images and five sentence +descriptions for each image these were + +556 +00:39:31,498 --> 00:39:35,288 +obtained using Amazon Mechanical Turk so +you just ask people please give us a + +557 +00:39:35,289 --> 00:39:39,710 +sentence description for an image and +your record and end up your data set and + +558 +00:39:39,710 --> 00:39:43,249 +so when you train this model the kinds +of results that you can expect or + +559 +00:39:43,248 --> 00:39:49,078 +roughly what is kinda like this so this +is our in describing these images so + +560 +00:39:49,079 --> 00:39:52,329 +this it says that this is a man in black +shirt playing guitar or construction + +561 +00:39:52,329 --> 00:39:55,710 +worker in Orange City West working on +the road or two young girls are playing + +562 +00:39:55,710 --> 00:40:00,528 +with Lego toy or boy is doing backflip +on a wakeboard and of course that's not + +563 +00:40:00,528 --> 00:40:04,650 +a wakeboard but it's 
+
+564
+00:40:04,650 --> 00:40:07,680
+I also like to show this is a young boy
+holding a baseball bat
+
+565
+00:40:07,679 --> 00:40:12,338
+this is a cat sitting on a couch with
+a remote control that's a woman
+
+566
+00:40:12,338 --> 00:40:15,710
+holding a teddy bear in front of a
+mirror
+
+567
+00:40:15,710 --> 00:40:22,400
+I'm pretty sure that the texture here
+probably is what what made it
+
+568
+00:40:22,400 --> 00:40:26,289
+think that it's a teddy bear and the
+last one is a horse standing in the
+
+569
+00:40:26,289 --> 00:40:30,409
+middle of a street road so there's no
+horse obviously so I'm not sure what
+
+570
+00:40:30,409 --> 00:40:34,858
+happened there so this is just the
+simplest kind of model that came out
+
+571
+00:40:34,858 --> 00:40:37,619
+last year there were many people who tried
+to work on top of these kinds of models
+
+572
+00:40:37,619 --> 00:40:41,559
+and make them more complex I'd just like to
+give you an idea of one extension that is
+
+573
+00:40:41,559 --> 00:40:44,929
+interesting just to get an idea of how how
+people play with this basic architecture
+
+574
+00:40:44,929 --> 00:40:51,329
+so this is a paper from last year where
+if you notice in the current model we
+
+575
+00:40:51,329 --> 00:40:55,608
+only feed in the image a single time
+at the beginning and one way you
+
+576
+00:40:55,608 --> 00:40:59,480
+can play with this is to actually allow the
+recurrent neural network to look back to
+
+577
+00:40:59,480 --> 00:41:03,130
+the image and reference parts of the
+image while it's describing the words
+
+578
+00:41:03,130 --> 00:41:07,180
+so as you're generating
+every single word you allow the RNN
+
+579
+00:41:07,179 --> 00:41:10,460
+to actually make a lookup back into the
+image and look for different features of
+
+580
+00:41:10,460 --> 00:41:13,470
+what it might want to describe next and
+you can actually do this in a fully
+
+581
+00:41:13,469 --> 00:41:17,899
+trainable way so the RNN not only
+emits these words but also decides
+
+582
+00:41:17,900 --> 00:41:21,289
+where to look next in the image and so
+the way this works is not only does the
+
+583
+00:41:21,289 --> 00:41:24,259
+RNN output your probability
+distribution for the next word in the sequence
+
+584
+00:41:24,260 --> 00:41:29,250
+but this convnet here gives you this
+volume so say in this case we forwarded
+
+585
+00:41:29,250 --> 00:41:37,389
+the convnet and got a 14 by 14 by
+512 activation volume and at every
+
+586
+00:41:37,389 --> 00:41:40,179
+single time step we don't just emit
+that distribution but we also emit a
+
+587
+00:41:40,179 --> 00:41:44,358
+five hundred and twelve dimensional
+vector that is kind of like a lookup key
+
+588
+00:41:44,358 --> 00:41:48,019
+for what you want to look for next in the
+image and so actually I don't think this
+
+589
+00:41:48,019 --> 00:41:51,210
+is what they did in in this particular
+paper but this is one way you can wire
+
+590
+00:41:51,210 --> 00:41:54,510
+something like this up and so this
+vector is emitted from the RNN just
+
+591
+00:41:54,510 --> 00:41:58,430
+like it's just predicted using some
+weights and then this vector can be dot
+
+592
+00:41:58,429 --> 00:42:03,618
+producted with all these 14 by 14
+locations so we do all these dot products
+
+593
+00:42:03,619 --> 00:42:09,108
+and we compute basically a
+14 by 14 compatibility map and then we
+
+594
+00:42:09,108 --> 00:42:13,949
+put a softmax on this so basically we
+normalize all of this so that you
+
+595
+00:42:13,949 --> 00:42:17,149
+get this what we call an attention
+over the image so it's a 14 by 14
+
+596
+00:42:17,150 --> 00:42:21,230
+probability map of what's interesting for
+the RNN right now in the image and
+
+597
+00:42:21,230 --> 00:42:25,889
+then we use this probability map to do a
+weighted sum of these guys weighted by this
+
+598
+00:42:25,889 --> 00:42:27,239
+saliency
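+
+As a rough illustration of the soft attention wiring just described, here is a minimal NumPy sketch; the shapes follow the lecture's 14 by 14 by 512 example, but the variable names and key-emission step are assumptions, one possible wiring rather than the paper's exact method:
+
+~~~python
+import numpy as np
+
+def softmax(x):
+    e = np.exp(x - x.max())
+    return e / e.sum()
+
+conv_features = np.random.randn(14, 14, 512)  # convnet activation volume
+lookup_key = np.random.randn(512)             # 512-d key emitted by the RNN
+
+# Dot product the key with each of the 14x14 locations -> compatibility map
+compat = np.einsum('hwc,c->hw', conv_features, lookup_key)
+
+# Softmax-normalize into a 14x14 attention (probability) map
+attention = softmax(compat.ravel()).reshape(14, 14)
+
+# Weighted sum of the features, fed back into the RNN at the next step
+context = np.einsum('hw,hwc->c', attention, conv_features)  # shape (512,)
+~~~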
+
+599
+00:42:27,239 --> 00:42:30,929
+and so the RNN can basically emit
+what it thinks is currently
+
+600
+00:42:30,929 --> 00:42:36,089
+interesting for it and it goes back and
+you end up doing a weighted sum of
+
+601
+00:42:36,090 --> 00:42:39,850
+different kinds of features that the
+LSTM wants to look at at this point
+
+602
+00:42:39,849 --> 00:42:44,809
+in time and so for example the RNN is
+generating stuff and it might decide
+
+603
+00:42:44,809 --> 00:42:49,400
+that ok I'd like to look for something
+object-like now it emits a vector of
+
+604
+00:42:49,400 --> 00:42:53,220
+numbers that keys on object-like stuff that
+interacts with the convnet
+
+605
+00:42:53,219 --> 00:42:57,379
+activations and maybe some of
+the object-like regions of that
+
+606
+00:42:57,380 --> 00:43:01,700
+activation volume
+light up and you see that in this
+
+607
+00:43:01,699 --> 00:43:05,949
+14 by 14 map and then you just end up
+focusing your attention on that part of
+
+608
+00:43:05,949 --> 00:43:10,059
+the image through this interaction and
+so you can basically just do lookups
+
+609
+00:43:10,059 --> 00:43:14,130
+into the image while you're describing
+the sentence and so this is something we
+
+610
+00:43:14,130 --> 00:43:17,360
+refer to as soft attention and we'll
+actually get into this in a few lectures
+
+611
+00:43:17,360 --> 00:43:21,050
+so we're going to cover things like this
+where the RNN can actually have
+
+612
+00:43:21,050 --> 00:43:26,880
+selective attention over its inputs as
+it's processing the input and so I
+
+613
+00:43:26,880 --> 00:43:30,030
+just wanted to bring it up roughly now
+just to give you a preview of what that
+
+614
+00:43:30,030 --> 00:43:34,490
+looks like okay now if we want to make
+our models more complex one of the ways
+
+615
+00:43:34,489 --> 00:43:39,259
+we can do that is to stack them up in
+layers so this gives you you know more
+
+616
+00:43:39,260 --> 00:43:43,570
+depth and deeper stuff usually works better so the
+way we stack these up one of the ways at
+
+617
+00:43:43,570 --> 00:43:46,809
+least you can stack recurrent neural
+networks and there's many ways but this
+
+618
+00:43:46,809 --> 00:43:49,409
+is just one of them that people use in
+practice is you can
+
+619
+00:43:49,409 --> 00:43:53,339
+straight up just plug RNNs into each
+other so the input for one RNN is
+
+620
+00:43:53,340 --> 00:43:59,170
+the hidden state vector of the
+previous one and so in this image we have
+
+621
+00:43:59,170 --> 00:44:02,750
+the time axis going horizontally and
+then going upwards we have different
+
+622
+00:44:02,750 --> 00:44:05,960
+RNNs and so in this particular
+image there are three separate recurrent
+
+623
+00:44:05,960 --> 00:44:09,858
+neural networks each with their own set
+of weights and these recurrent networks
+
+624
+00:44:09,858 --> 00:44:16,299
+just feed into each other and so
+this is always trained jointly there's no
+
+625
+00:44:16,300 --> 00:44:19,119
+training the first one then the second one it's
+all just a single computational graph and
+
+626
+00:44:19,119 --> 00:44:22,700
+backprop goes through this
+recurrence formula to the top of it
+
+627
+00:44:22,699 --> 00:44:25,980
+I've rewritten it slightly to make it
+look more general but we're still
+
+628
+00:44:25,980 --> 00:44:29,280
+doing the exact same thing we still
+have the same formula we're taking a
+
+629
+00:44:29,280 --> 00:44:35,390
+vector from below in depth
+and a vector from before in time we're
+
+630
+00:44:35,389 --> 00:44:39,469
+concatenating them and putting
+them through this W transformation and
+
+631
+00:44:39,469 --> 00:44:40,519
+squashing with the tanh
+
+632
+00:44:40,519 --> 00:44:44,509
+so if you remember if you are slightly
+confused about this there there was a
+
+633
+00:44:44,510 --> 00:44:51,760
+W_xh times x plus W_hh times h you can
+rewrite this as a concatenation of x and
+
+634
+00:44:51,760 --> 00:44:56,260
+h multiplied by a single matrix right so
+it's as if I stack x and h into a
+
+635
+00:44:56,260 --> 00:45:03,680
+single column vector and then I have
+this W matrix where basically what ends
+
+636
+00:45:03,679 --> 00:45:07,690
+up happening is that your W_xh is the
+first part of this matrix and W_hh
+
+637
+00:45:07,690 --> 00:45:12,700
+is the second part of your matrix and
+so this kind of formula can be rewritten
+
+638
+00:45:12,699 --> 00:45:16,099
+as a formula where you stack up your
+inputs and you have a single W
+
+639
+00:45:16,099 --> 00:45:24,759
+transformation it's the same formula
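+
+To see this concretely, here is a tiny NumPy sketch of the rewriting just described; the dimensions and variable names are illustrative assumptions:
+
+~~~python
+import numpy as np
+
+n_in, n_hid = 300, 512
+x = np.random.randn(n_in)           # input vector "from below"
+h = np.random.randn(n_hid)          # previous hidden state "from before"
+W_xh = np.random.randn(n_hid, n_in)
+W_hh = np.random.randn(n_hid, n_hid)
+
+# Two matrix multiplies ...
+h_next_a = np.tanh(W_xh @ x + W_hh @ h)
+
+# ... equal one multiply by the horizontally stacked matrix [W_xh, W_hh]
+W = np.hstack([W_xh, W_hh])         # shape (512, 812)
+h_next_b = np.tanh(W @ np.concatenate([x, h]))
+
+assert np.allclose(h_next_a, h_next_b)
+~~~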
+
+640
+00:45:24,760 --> 00:45:29,780
+so that's how we can stack these
+RNNs up and then they are now indexed by
+
+641
+00:45:29,780 --> 00:45:33,510
+both time and by the layer at which they
+occur now one way we can also make these
+
+642
+00:45:33,510 --> 00:45:37,030
+more complex is not just by stacking
+them but actually using a slightly
+
+643
+00:45:37,030 --> 00:45:40,300
+better recurrence formula so right now
+so far we've seen this very simple
+
+644
+00:45:40,300 --> 00:45:44,480
+recurrence formula for the recurrent neural
+network in practice you will actually
+
+645
+00:45:44,480 --> 00:45:48,170
+rarely ever use a formula like this the
+basic RNN is very rarely used
+
+646
+00:45:48,170 --> 00:45:52,059
+instead you'll use what we call an
+LSTM a long short-term memory so this
+
+647
+00:45:52,059 --> 00:45:56,500
+is basically used in all the papers
+now so this is the formula you would be
+
+648
+00:45:56,500 --> 00:46:00,989
+using also in your project if you were to
+use recurrent networks but what I'd like you
+
+649
+00:46:00,989 --> 00:46:04,729
+to notice at this point is everything is
+exactly the same as with an RNN it's
+
+650
+00:46:04,730 --> 00:46:09,050
+just that the recurrence formula is a
+slightly more complex function ok we're
+
+651
+00:46:09,050 --> 00:46:13,789
+still taking the vector from below
+in depth and the vector from before in
+
+652
+00:46:13,789 --> 00:46:18,309
+time the previous hidden state we're
+
+653
+00:46:18,309 --> 00:46:21,869
+concatenating them and putting them through a W
+transformation but now we have this more
+
+654
+00:46:21,869 --> 00:46:25,539
+complex way of how we actually achieve
+the new hidden state at this point in
+
+655
+00:46:25,539 --> 00:46:28,900
+time so we're just being slightly more
+complex in how we combine the vector from
+
+656
+00:46:28,900 --> 00:46:33,050
+below and before to actually perform an
+update on the hidden state just a more
+complex formula so we'll go into some
+details of exactly what motivates this
+
+657
+00:46:33,050 --> 00:46:41,609
+formula and why it might be a better
+idea to actually use an LSTM
+
+658
+00:46:41,608 --> 00:46:49,909
+and it makes sense trust me we'll go
+through it just right now so if you
+
+659
+00:46:49,909 --> 00:46:56,480
+look up LSTM in some online video or you go
+to Google Images you'll find diagrams
+
+660
+00:46:56,480 --> 00:47:00,989
+like this which is really not helping I
+think anyone the first time I saw them
+
+661
+00:47:00,989 --> 00:47:04,048
+I was really scared like this one really
+scared me I wasn't really sure what's going
+
+662
+00:47:04,048 --> 00:47:08,170
+on I understand LSTMs and I still
+don't know what these two diagrams are
+
+663
+00:47:08,170 --> 00:47:14,289
+so ok so I'm going to try to break down
+the LSTM and it's kind of a tricky thing
+
+664
+00:47:14,289 --> 00:47:18,329
+to put into a diagram you really have to
+kind of step through it so the lecture
+
+665
+00:47:18,329 --> 00:47:24,220
+format is perfect for the LSTM ok so
+here we have the LSTM equations and I'm
+
+666
+00:47:24,219 --> 00:47:28,238
+going to focus on the first part here at
+the top where we take these two vectors
+
+667
+00:47:28,239 --> 00:47:32,720
+from below and from before so x and h h is
+our previous hidden state and x is our input and
+
+668
+00:47:32,719 --> 00:47:37,848
+we map them through that transformation
+W and now if both x and h are of size n
+
+669
+00:47:37,849 --> 00:47:40,950
+so there are n numbers in them we're
+going to end up producing 4n
+
+670
+00:47:40,949 --> 00:47:46,068
+numbers ok through this W matrix which
+is of size 4n by 2n so we have these
+
+671
+00:47:46,068 --> 00:47:51,108
+four n-dimensional vectors i f o and g
+they're short for input forget output
+
+672
+00:47:51,108 --> 00:47:57,328
+and g I'm not sure what that's short for just g
+and so the i f o go through sigmoid
+
+673
+00:47:57,329 --> 00:48:05,859
+gates and g goes through a tanh gate now
+the way this this actually works the
+
+674
+00:48:05,858 --> 00:48:09,420
+best way to think about it is one thing
+I forgot to mention actually in the
+
+675
+00:48:09,420 --> 00:48:15,028
+previous slide is normally a recurrent neural
+network has this single h vector at
+
+676
+00:48:15,028 --> 00:48:18,018
+every single time step but an LSTM
+actually has two vectors at every
+
+677
+00:48:18,018 --> 00:48:23,618
+single time step the h and what we call c the
+cell state vector so at every single
+
+678
+00:48:23,619 --> 00:48:29,470
+time step we have both the h in parallel
+and this c vector here as shown in
+
+679
+00:48:29,469 --> 00:48:33,558
+yellow so we basically have two vectors at
+every single point in space here and
+
+680
+00:48:33,559 --> 00:48:37,849
+what they're doing is they're basically
+operating over this cell state so
+
+681
+00:48:37,849 --> 00:48:41,680
+depending on what's before you and below
+you and that is your context you end up
+
+682
+00:48:41,679 --> 00:48:45,199
+operating over the cell state with these
+
+683
+00:48:45,199 --> 00:48:50,509
+i f o and g elements and the way to think
+about it is I'm going to go through a
+
+684
+00:48:50,510 --> 00:48:58,290
+lot of this but the way to think about this is think of
+i f and o as just binary either 0 or 1 we want
+
+685
+00:48:58,289 --> 00:49:01,199
+them to be we want them to have this
+interpretation of a gate we like to think
+
+686
+00:49:01,199 --> 00:49:05,449
+of them as zeros or ones we of course
+make them sigmoids because we want
+
+687
+00:49:05,449 --> 00:49:08,348
+this to be differentiable so that we can
+backpropagate through everything but
+
+688
+00:49:08,349 --> 00:49:11,960
+just think of i f o as just binary things
+that we're computing based on our context
+
+689
+00:49:11,960 --> 00:49:17,740
+and then what this formula is doing
+here see you can see that based on what
+
+690
+00:49:17,739 --> 00:49:22,250
+these gates are and what g is we're
+going to end up updating this c value
+
+691
+00:49:22,250 --> 00:49:29,289
+and in particular there's this f the forget
+gate that will be used to shut off and
+
+692
+00:49:29,289 --> 00:49:34,869
+reset some of the cells to zero so the cells
+are best thought of as counters and
+
+693
+00:49:34,869 --> 00:49:38,700
+these counters basically we can either
+reset them to zero with this interaction
+
+694
+00:49:38,699 --> 00:49:42,368
+this is an elementwise
+multiplication there my laser pointer is running out of
+
+695
+00:49:42,369 --> 00:49:45,530
+battery so
+
+696
+00:49:45,530 --> 00:49:50,140
+if that interaction is 0 then you can see that we'll
+zero out the cell so we can reset the
+
+697
+00:49:50,139 --> 00:49:53,969
+counter and then we can also add to the
+counter so we can add through this
+
+698
+00:49:53,969 --> 00:50:00,459
+interaction i times g and since i is
+between 0 and 1 and g is between negative one
+
+699
+00:50:00,460 --> 00:50:05,900
+and 1 we're basically adding a number between
+minus one and one to every cell so at every
+
+700
+00:50:05,900 --> 00:50:09,338
+single time step we have these counters
+in all the cells and we can reset these
+
+701
+00:50:09,338 --> 00:50:13,588
+counters to zero with the forget gate or we can
+choose to add a number between minus one and
+
+702
+00:50:13,588 --> 00:50:18,039
+one to every single cell ok so that's how we
+perform the cell update and then the
+
+703
+00:50:18,039 --> 00:50:24,029
+hidden update ends up being a squashed
+cell so tanh of c a squashed cell that is
+
+704
+00:50:24,030 --> 00:50:28,760
+modulated by this output gate so only some of
+the cell state ends up leaking into the
+
+705
+00:50:28,760 --> 00:50:33,500
+hidden state and it's modulated by this vector
+o so we only choose to reveal some of
+
+706
+00:50:33,500 --> 00:50:39,530
+the cells into the hidden state in a
+learnable way there are several things
+
+707
+00:50:39,530 --> 00:50:43,910
+to to kind of highlight here one maybe
+the most confusing part here is that we're
+
+708
+00:50:43,909 --> 00:50:47,500
+adding a number between minus one and one with
+i times g here but that's kind of
+
+709
+00:50:47,500 --> 00:50:51,809
+confusing because if we only had a g
+there instead then g is already between
+
+710
+00:50:51,809 --> 00:50:56,679
+minus one and one so why do we need i times g what
+is that actually giving us when all we
+
+711
+00:50:56,679 --> 00:50:58,279
+want is to increment c by
+
+712
+00:50:58,280 --> 00:51:02,330
+a number between minus one and one and so
+that's kind of the most subtle part about the
+
+713
+00:51:02,329 --> 00:51:08,989
+LSTM I think one answer is that if
+you think about the g it's a function of
+
+714
+00:51:08,989 --> 00:51:16,159
+it's a linear function of your context
+no one has a laser pointer by chance right
+
+715
+00:51:16,159 --> 00:51:26,649
+ok so g is your context through a tanh
+ok so g is a linear function of your
+
+716
+00:51:26,650 --> 00:51:30,579
+previous context squashed by the tanh and
+so if we were adding just g instead of
+
+717
+00:51:30,579 --> 00:51:35,349
+i times g then that would be kind of
+like a very simple function so by adding
+
+718
+00:51:35,349 --> 00:51:38,929
+this i and then having a multiplicative
+interaction you're actually getting a richer function
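+
+Putting the equations just described together, here is a minimal NumPy sketch of a single LSTM step; the gate names follow the lecture's i, f, o, g, the 4n-by-2n weight shape follows the slide description, and the bias term is omitted for brevity:
+
+~~~python
+import numpy as np
+
+def sigmoid(z):
+    return 1.0 / (1.0 + np.exp(-z))
+
+def lstm_step(x, h_prev, c_prev, W):
+    """One LSTM time step; x and h_prev are n-vectors, W is (4n, 2n)."""
+    n = h_prev.shape[0]
+    z = W @ np.concatenate([x, h_prev])   # produce 4n numbers
+    i = sigmoid(z[0:n])                   # input gate, think ~0 or ~1
+    f = sigmoid(z[n:2*n])                 # forget gate: reset counters
+    o = sigmoid(z[2*n:3*n])               # output gate: reveal cells
+    g = np.tanh(z[3*n:4*n])               # candidate value in (-1, 1)
+    c = f * c_prev + i * g                # counters: reset and/or add
+    h = o * np.tanh(c)                    # squashed cell leaks into hidden
+    return h, c
+
+# usage sketch: n = 512; W = 0.01 * np.random.randn(4 * n, 2 * n)
+~~~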
+
+719
+00:51:38,929 --> 00:51:42,710
+one that you can actually express
+in terms of what we're adding
+
+720
+00:51:42,710 --> 00:51:47,010
+to the cell state as a function of the
+previous hidden state and another way to think
+
+721
+00:51:47,010 --> 00:51:50,620
+about this is that we're basically
+decoupling these two concepts of how
+
+722
+00:51:50,619 --> 00:51:54,159
+much do we want to add to the cell state
+which is g and then do we want to
+
+723
+00:51:54,159 --> 00:51:58,129
+update the cell state at all which is i so i is
+like do we actually want this operation
+
+724
+00:51:58,130 --> 00:52:03,280
+to go through and g is what we want to
+add by decoupling these two that also maybe
+
+725
+00:52:03,280 --> 00:52:08,470
+dynamically has some nice properties in
+terms of how this LSTM trains but
+
+726
+00:52:08,469 --> 00:52:12,039
+that's how we end up with the LSTM
+formulas and I'm going to actually go
+
+727
+00:52:12,039 --> 00:52:14,059
+through this in more detail as well
+
+728
+00:52:14,059 --> 00:52:21,400
+ok so think about this as the cell c flowing
+through and now the first interaction
+
+729
+00:52:21,400 --> 00:52:28,269
+here is this f dot c so f is a
+sigmoid of the context and so we're basically
+
+730
+00:52:28,269 --> 00:52:32,559
+gating the cells with a multiplicative
+interaction so if f is zero we will
+
+731
+00:52:32,559 --> 00:52:38,409
+shut off the cell and reset the counter
+the i times g part is basically giving
+
+732
+00:52:38,409 --> 00:52:44,799
+you it's basically adding to the
+cell state and then the cell state leaks
+
+733
+00:52:44,800 --> 00:52:51,100
+into the hidden state but only through a
+tanh and then that gets gated by o so the
+
+734
+00:52:51,099 --> 00:52:55,380
+o vector can decide which parts
+of the cell state to actually reveal into
+
+735
+00:52:55,380 --> 00:52:59,610
+the hidden state and then you'll
+notice that this hidden state not only
+
+736
+00:52:59,610 --> 00:53:03,720
+goes to the next iteration of the LSTM
+but it also actually flows up to a
+
+737
+00:53:03,719 --> 00:53:07,159
+higher layer because this is the hidden
+state vector that we actually end
+
+738
+00:53:07,159 --> 00:53:11,250
+up plugging into LSTMs above us or it
+goes into a prediction
+
+739
+00:53:11,250 --> 00:53:14,510
+and so when you unroll this basically
+the way it looks like it's kind of like
+
+740
+00:53:14,510 --> 00:53:19,270
+this which now I have a confusing
+diagram of my own that's I guess what we ended
+
+741
+00:53:19,269 --> 00:53:24,550
+up with but you get your input vectors
+from below you have your hidden state from
+
+742
+00:53:24,550 --> 00:53:26,090
+before
+
+743
+00:53:26,090 --> 00:53:31,030
+these determine your gates i f o and g they're
+all n-dimensional vectors and then they
+
+744
+00:53:31,030 --> 00:53:35,110
+end up modulating how you operate over
+the cell state and then once you
+
+745
+00:53:35,110 --> 00:53:38,610
+actually reset some counters
+and once you add numbers between minus one and
+
+746
+00:53:38,610 --> 00:53:42,630
+one to your counters the cell state leaks
+out some of it leaks out in a learnable
+
+747
+00:53:42,630 --> 00:53:45,840
+way and then it can either go up to the
+prediction or it can go to the next
+
+748
+00:53:45,840 --> 00:53:52,269
+iteration of the LSTM going forward
+and so that's the so this looks ugly so
+
+749
+00:53:52,269 --> 00:53:58,429
+we're getting to so the question
+probably in your mind is why did we go
+
+750
+00:53:58,429 --> 00:54:02,649
+through all of this there's something why
+does this look this particular way I
+
+751
+00:54:02,650 --> 00:54:05,639
+should also note at this point that
+there are many variants of the LSTM at
+
+752
+00:54:05,639 --> 00:54:09,309
+this point people have played a lot
+with these equations over the years
+
+753
+00:54:09,309 --> 00:54:12,840
+and we've kind of converged on this as
+being like a reasonable thing but
+
+754
+00:54:12,840 --> 00:54:15,510
+there are many little tweaks you can make
+to this that don't actually
+
+755
+00:54:15,510 --> 00:54:18,930
+deteriorate the performance by a lot you
+can remove some of those gates like
+
+756
+00:54:18,929 --> 00:54:20,359
+maybe the input gate and so on
+
+757
+00:54:20,360 --> 00:54:25,200
+it turns out that the tanh of c
+there can be just c and it works just fine
+
+758
+00:54:25,199 --> 00:54:28,619
+normally but with the tanh of c it works
+slightly better sometimes and I
+
+759
+00:54:28,619 --> 00:54:33,869
+don't think we have very good reasons
+for why and so we end up with a bit of a
+
+760
+00:54:33,869 --> 00:54:37,039
+monster but I think it actually kind of
+makes sense in terms of just these counters
+
+761
+00:54:37,039 --> 00:54:40,739
+that can be reset to zero or you can add
+small numbers between minus one and one to them
+
+762
+00:54:40,739 --> 00:54:46,039
+and so it's kind of like a nice actually
+relatively simple model now to understand
+
+763
+00:54:46,039 --> 00:54:49,300
+exactly why this is much better than an
+RNN we have to go to a slightly
+
+764
+00:54:49,300 --> 00:54:55,330
+different picture to draw the
+distinction so the recurrent neural
+
+765
+00:54:55,329 --> 00:54:59,259
+network has some state vector right
+and you're operating over it and you're
+
+766
+00:54:59,260 --> 00:55:02,260
+completely transforming it through
+this recurrence formula and so you end
+
+767
+00:55:02,260 --> 00:55:06,280
+up changing your state vector from time
+step to time step you'll notice that the LSTM
+
+768
+00:55:06,280 --> 00:55:11,140
+instead has the cell state flowing
+through and what we're doing effectively
+
+769
+00:55:11,139 --> 00:55:15,250
+is we're looking at the cell and some
+of it leaks into the hidden state the hidden
+
+770
+00:55:15,250 --> 00:55:19,329
+state we use for deciding how to operate over
+the cell and if we forget about the forget gates we
+
+771
+00:55:19,329 --> 00:55:22,869
+end up basically just tweaking the cell
+with an
+
+772
+00:55:22,869 --> 00:55:28,509
+additive interaction here so so there's
+some stuff that gets computed as a function
+
+773
+00:55:28,510 --> 00:55:33,040
+of the cell state and then whatever it
+is we end up adding to the cell state
+
+774
+00:55:33,039 --> 00:55:37,190
+instead of just transforming it right
+away so it's an additive instead of a
+
+775
+00:55:37,190 --> 00:55:38,429
+transformative
+
+776
+00:55:38,429 --> 00:55:42,929
+interaction or something like that now
+does this actually remind you of something
+
+777
+00:55:42,929 --> 00:55:48,839
+that we've already covered in the class
+with this in mind yes that's right yeah
+
+778
+00:55:48,840 --> 00:55:53,240
+so in fact this is basically the
+same thing as what happens in ResNets so
+
+779
+00:55:53,239 --> 00:55:56,299
+normally with a convnet we're
+transforming the representation ResNets have
+
+780
+00:55:56,300 --> 00:56:00,019
+these skip connections here and you'll
+see that basically ResNets have this
+
+781
+00:56:00,019 --> 00:56:04,690
+additive interaction so we
+
+782
+00:56:04,690 --> 00:56:10,240
+have this x here now we do some
+computation based on
+
+783
+00:56:10,239 --> 00:56:12,959
+x and then we have an additive
+interaction with x and so that's the
+
+784
+00:56:12,960 --> 00:56:18,440
+basic block of ResNets and that's in
+fact what happens with an LSTM as
+
+785
+00:56:18,440 --> 00:56:22,619
+well we have these interactions where
+here the x is your cell and we go
+
+786
+00:56:22,619 --> 00:56:26,900
+off and do some function and then we
+choose to add to this cell state but the
+
+787
+00:56:26,900 --> 00:56:31,519
+LSTM unlike ResNets also has
+these forget gates and these
+
+788
+00:56:31,519 --> 00:56:33,679
+forget gates can choose to shut
+off some parts of the signal as well but
+
+789
+00:56:33,679 --> 00:56:36,710
+otherwise it looks very much like a
+ResNet so I think it's kind of
+
+790
+00:56:36,710 --> 00:56:40,429
+interesting that we're converging on very
+similar looking architectures
+
+791
+00:56:40,429 --> 00:56:43,809
+that work both in convnets and in
+recurrent neural networks where it seems
+
+792
+00:56:43,809 --> 00:56:48,739
+like dynamically somehow it's much nicer
+to actually have these additive
+
+793
+00:56:48,739 --> 00:56:49,779
+interactions that allow you to actually
+backpropagate much more effectively so
+
+794
+00:56:49,780 --> 00:56:53,860
+to that point
+
+795
+00:56:53,860 --> 00:56:57,760
+think about the the backpropagation
+dynamics between the RNN and the LSTM
+
+796
+00:56:57,760 --> 00:57:01,120
+especially in the LSTM it's very clear
+that if I inject some gradient at
+
+797
+00:57:01,119 --> 00:57:05,239
+some time step here so if I inject
+gradient at the end of this diagram
+
+798
+00:57:05,239 --> 00:57:09,299
+then these plus interactions are just
+like a gradient superhighway here right
+
+799
+00:57:09,300 --> 00:57:13,240
+like these gradients will just flow through
+all of these addition interactions right
+
+800
+00:57:13,239 --> 00:57:16,849
+because addition distributes gradient equally so
+if I plug in gradient at any point in time
+
+801
+00:57:16,849 --> 00:57:20,809
+here it's just going to flow all the way back
+and then of course the gradient also
+
+802
+00:57:20,809 --> 00:57:25,630
+flows through these other paths and they end up
+contributing their gradients into the
+
+803
+00:57:25,630 --> 00:57:30,110
+gradient flow but you'll never end up
+with what we refer to in RNNs as this
+
+804
+00:57:30,110 --> 00:57:32,880
+problem called vanishing gradients where
+these gradients just die off go to zero
+
+805
+00:57:32,880 --> 00:57:36,640
+as you backpropagate through and I'll
+show an example
+
+806
+00:57:36,639 --> 00:57:40,670
+of exactly why this happens in a bit so in an
+RNN we have this vanishing
+
+807
+00:57:40,670 --> 00:57:45,210
+gradient problem I'll show you why that
+happens in an LSTM because of this
+
+808
+00:57:45,210 --> 00:57:47,130
+superhighway of just additions these
+gradients at every single time step that
+
+809
+00:57:47,130 --> 00:57:54,829
+we inject into the LSTM from above
+
+810
+00:57:54,829 --> 00:57:57,339
+just flow through the cells and your
+gradients don't end up vanishing at this
+
+811
+00:57:57,338 --> 00:58:01,849
+point maybe I'll take some questions are
+there questions about what's confusing
+here about the LSTM and then after
+that I'm going to show why RNNs have these
+
+812
+00:58:01,849 --> 00:58:03,059
+vanishing gradients
+
+813
+00:58:03,059 --> 00:58:09,789
+yes the 0 vector is that
important
+
+814
+00:58:09,789 --> 00:58:13,400
+turns out that I think that one
+specifically is not super important so
+
+815
+00:58:13,400 --> 00:58:16,660
+there's a paper I'm going to show you
+LSTM a Search Space Odyssey they
+
+816
+00:58:16,659 --> 00:58:21,719
+really played with this take stuff out
+put stuff in and there's also like
+
+817
+00:58:21,719 --> 00:58:25,588
+these peephole connections you can you can
+add so this cell state here that can be
+
+818
+00:58:25,588 --> 00:58:29,538
+actually plugged in with the hidden state
+vector as an input so people really play
+
+819
+00:58:29,539 --> 00:58:32,049
+with this architecture and they've tried
+lots of iterations of exactly these
+
+820
+00:58:32,048 --> 00:58:37,230
+equations and what you end up with is
+almost everything works about equal some
+
+821
+00:58:37,230 --> 00:58:40,490
+of it works slightly worse sometimes so
+it's very kind of confusing in this in
+
+822
+00:58:40,489 --> 00:58:45,699
+this way I'll show you a paper where they
+took the LSTM update
+
+823
+00:58:45,699 --> 00:58:49,538
+equations and treated them as just expression
+trees and then they did
+
+824
+00:58:49,539 --> 00:58:52,950
+this like random mutation stuff and they
+tried all kinds of different graphs and
+
+825
+00:58:52,949 --> 00:58:57,028
+updates you can have and most of them
+work about the same some of them break some of
+
+826
+00:58:57,028 --> 00:58:59,858
+them work about the same but nothing
+really does much better than
+
+827
+00:58:59,858 --> 00:59:08,150
+the LSTM ok then the question is going
+to be why recurrent neural networks have
+
+828
+00:59:08,150 --> 00:59:15,389
+terrible backward gradient flow there's a video also
+
+829
+00:59:15,389 --> 00:59:22,000
+showing the vanishing gradients problem
+in recurrent neural networks with
+
+830
+00:59:22,000 --> 00:59:29,250
+respect to LSTMs so what we're showing
+here is we're looking at a recurrent
+
+831
+00:59:29,250 --> 00:59:33,039
+neural network over many periods many
+time steps and we're injecting gradient
+
+832
+00:59:33,039 --> 00:59:36,760
+at say the hundred and twenty-eighth
+time step and we're backpropagating
+
+833
+00:59:36,760 --> 00:59:40,028
+the gradient through the network and
+we're looking at what is the gradient
+
+834
+00:59:40,028 --> 00:59:44,699
+for I think the input-to-hidden matrix
+one of the weight matrices at every
+
+835
+00:59:44,699 --> 00:59:49,009
+single time step so remember that to
+actually get the full update for the
+
+836
+00:59:49,010 --> 00:59:52,289
+weights we're actually adding all those
+gradients up here and so what's what's
+
+837
+00:59:52,289 --> 00:59:56,760
+what's being shown here is that as we
+backprop we only inject the gradient at
+
+838
+00:59:56,760 --> 01:00:00,799
+the 120th time step we do backprop back
+through time and these are the slices
+
+839
+01:00:00,798 --> 01:00:04,088
+of that backpropagation what you're seeing
+is that the LSTM gives you lots of
+
+840
+01:00:04,088 --> 01:00:06,699
+gradient throughout this
+backpropagation so there's lots of
+
+841
+01:00:06,699 --> 01:00:11,000
+information that is flowing through but in
+the RNN it just instantly dies off that
+
+842
+01:00:11,000 --> 01:00:15,210
+gradient just we say vanishes it
+just becomes tiny numbers there's no
+
+843
+01:00:15,210 --> 01:00:18,750
+gradient so in this case I think
+after about ten time steps or so
+
+844
+01:00:18,750 --> 01:00:22,679
+like 10 time steps all the
+information that we injected did not
+
+845
+01:00:22,679 --> 01:00:26,149
+flow through the network and you can't
+learn very long dependencies because all
+
+846
+01:00:26,150 --> 01:00:29,720
+the correlation structure has
+just died down there so we'll see why this
+
+847
+01:00:29,719 --> 01:00:39,399
+happens dynamically in a bit the
+comments on this channel are so funny it's like
+
+848
+01:00:39,400 --> 01:00:40,490
+YouTube or something
+
+849
+01:00:40,489 --> 01:00:44,779
+ok
+
+850
+01:00:44,780 --> 01:00:53,170
+ok so let's look at a very simple example
+here we have a recurrent neural network
+
+851
+01:00:53,170 --> 01:00:56,300
+that I'm going to unfold for you in this
+recurrent neural network I'm not showing
+
+852
+01:00:56,300 --> 01:01:03,960
+any inputs we only have the state
+updates so we have this W_hh matrix and the
+
+853
+01:01:03,960 --> 01:01:07,260
+hidden-to-hidden interaction and I'm
+going to basically forward this recurrent
+
+854
+01:01:07,260 --> 01:01:12,380
+neural network for some T
+time steps here I'm using T equals fifty so
+
+855
+01:01:12,380 --> 01:01:16,260
+what I'm doing is W_hh times the previous
+hidden state and then a threshold on top of that
+
+856
+01:01:16,260 --> 01:01:20,570
+so this is just the forward pass
+ignoring any input vectors coming in it's
+
+857
+01:01:20,570 --> 01:01:25,280
+just W_hh times h threshold W_hh times
+h threshold and so on
+
+858
+01:01:25,280 --> 01:01:29,500
+that's the forward pass and then the
+backward pass here where I'm injecting a
+
+859
+01:01:29,500 --> 01:01:33,820
+random gradient at the last time
+step so at the 50th time step I'm
+
+860
+01:01:33,820 --> 01:01:37,880
+injecting a gradient which is random and
+then I go backwards and I backprop so
+
+861
+01:01:37,880 --> 01:01:41,059
+when you backprop through this right you
+have to backprop through here I'm using a
+
+862
+01:01:41,059 --> 01:01:46,170
+ReLU so I backprop through the ReLU
+then through the W_hh multiply and so on
+
+863
+01:01:46,170 --> 01:01:51,800
+and so the thing to note here is so here
+I am doing the ReLU backprop I back-
+
+864
+01:01:51,800 --> 01:01:54,980
+propagate through the ReLU by just
+zeroing out anything where the inputs
+
+865
+01:01:54,980 --> 01:02:02,309
+were less than zero and here I am
+backpropping through the W_hh times h operation
+
+866
+01:02:02,309 --> 01:02:06,570
+where we actually multiplied by the W_hh
+matrix before we did the nonlinearity so
+
+867
+01:02:06,570 --> 01:02:09,570
+there's something very funky going on
+when you actually look at what happens
+
+868
+01:02:09,570 --> 01:02:13,300
+to these dhs which is the gradient on
+the h's as you go backwards through time
+
+869
+01:02:13,300 --> 01:02:18,160
+it has a very kind of funny structure
+that is very worrying as you look at
+
+870
+01:02:18,159 --> 01:02:22,210
+like how this gets chained up in the
+loop like what we're doing here with
+
+871
+01:02:22,210 --> 01:02:33,409
+these two time steps
+
+872
+01:02:33,409 --> 01:02:43,849
+zeros yes I think sometimes that
+maybe the outputs the ReLUs were all
+
+873
+01:02:43,849 --> 01:02:47,630
+dead and you may have killed it
+but that's not really the issue the
+
+874
+01:02:47,630 --> 01:02:51,470
+more worrying issue is well that would
+be an issue as well but I think the one worrying
+
+875
+01:02:51,469 --> 01:02:55,500
+issue that people can easily spot here
+as well is you'll see that we're
+
+876
+01:02:55,500 --> 01:03:00,380
+multiplying by this W_hh matrix over and
+over and over again because in the
+
+877
+01:03:00,380 --> 01:03:04,840
+forward pass we multiplied by W_hh at
+every single iteration
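+
+Here is a compact NumPy sketch of the toy experiment being described, assuming a hidden size of 100 and a small random initialization (both are illustrative choices):
+
+~~~python
+import numpy as np
+
+np.random.seed(0)
+n, T = 100, 50
+Whh = 0.01 * np.random.randn(n, n)          # hidden-to-hidden weights
+hs = [np.random.randn(n)]                   # initial hidden state
+
+# forward pass: h = ReLU(Whh h), repeated T times, no inputs
+for t in range(T):
+    hs.append(np.maximum(0, Whh @ hs[-1]))
+
+# backward pass: inject a random gradient at the last time step
+dh = np.random.randn(n)
+for t in reversed(range(T)):
+    dh = dh * (hs[t + 1] > 0)               # backprop through the ReLU
+    dh = Whh.T @ dh                         # backprop through the Whh multiply
+    print(t, np.linalg.norm(dh))            # norm shrinks (or explodes)
+~~~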
+
+878
+01:03:04,840 --> 01:03:09,670
+and as we backpropagate
+through all the hidden states we end up
+
+879
+01:03:09,670 --> 01:03:13,820
+backpropagating through this
+formula and the backprop turns out to be that
+
+880
+01:03:13,820 --> 01:03:19,000
+you take your gradient signal and
+multiply it by the W_hh matrix and so we end
+
+881
+01:03:19,000 --> 01:03:26,199
+up the gradient gets multiplied by W_hh
+thresholded then multiplied by W_hh
+
+882
+01:03:26,199 --> 01:03:32,019
+thresholded so we end up multiplying by this
+matrix W_hh fifty times and so the
+
+883
+01:03:32,019 --> 01:03:37,509
+issue with this is that for the gradient signal
+basically two things can happen like if
+
+884
+01:03:37,510 --> 01:03:41,080
+you think about working with scalar
+values suppose they're scalars not matrices
+
+885
+01:03:41,079 --> 01:03:45,469
+if I take a number that's random and
+then I have a second number and I keep
+
+886
+01:03:45,469 --> 01:03:48,509
+multiplying the first number by the
+second number again and again and
+
+887
+01:03:48,510 --> 01:03:55,990
+again what does that sequence go to
+there are two cases right if I keep multiplying by the same
+
+888
+01:03:55,989 --> 01:04:01,849
+number it either dies off or it explodes
+unless your second number is exactly
+
+889
+01:04:01,849 --> 01:04:05,119
+one that's the only case where you
+don't actually explode but otherwise
+
+890
+01:04:05,119 --> 01:04:09,679
+really bad things happen it either
+dies or explodes and here we have matrices
+
+891
+01:04:09,679 --> 01:04:12,659
+we don't have a single number but
+in fact the same thing happens a
+
+892
+01:04:12,659 --> 01:04:16,599
+generalization of it happens if the
+spectral radius of this W_hh matrix
+
+893
+01:04:16,599 --> 01:04:21,839
+which is the largest eigenvalue of that
+matrix is greater than one then this
+
+894
+01:04:21,840 --> 01:04:25,220
+gradient signal will explode if it's lower
+than one the gradient will completely die
+
+895
+01:04:25,219 --> 01:04:30,549
+and so basically the RNN because
+of this recurrence
+
+896
+01:04:30,550 --> 01:04:34,680
+formula ends up with these just
+terrible dynamics and it's very unstable
+
+897
+01:04:34,679 --> 01:04:39,949
+and it just dies or explodes and so in
+practice the way this was handled was
+
+898
+01:04:39,949 --> 01:04:44,439
+you can control the exploding gradients
+with one simple hack if your gradients are
+
+899
+01:04:44,440 --> 01:04:45,720
+exploding you clip them
+
+900
+01:04:45,719 --> 01:04:50,789
+so people actually do this in practice it's
+like a very patchy solution but if
+
+901
+01:04:50,789 --> 01:04:55,119
+your gradient is above five in
+norm then you clamp it to five elementwise or
+
+902
+01:04:55,119 --> 01:04:58,150
+something like that so you can do that
+this is called gradient clipping that's how you
+
+903
+01:04:58,150 --> 01:05:01,829
+address the exploding gradient problem
+and then your gradients don't
+
+904
+01:05:01,829 --> 01:05:06,049
+explode anymore but the gradients can still
+vanish in a recurrent neural network and the LSTM
+
+905
+01:05:06,050 --> 01:05:08,310
+is very good with the vanishing
+gradient problem because of these
+
+906
+01:05:08,309 --> 01:05:12,429
+highways of cells that are only changed
+with additive interactions where the
+
+907
+01:05:12,429 --> 01:05:17,309
+gradients just flow they never die down
+as they do when you keep
+
+908
+01:05:17,309 --> 01:05:21,000
+multiplying by the same W_hh or something
+like that that's roughly why LSTMs are
+
+909
+01:05:21,000 --> 01:05:26,909
+just better dynamically so in practice we always
+use LSTMs and we do do gradient clipping usually
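+
+A minimal sketch of the clipping hack just mentioned, using the clip-by-norm variant with the threshold of five quoted above:
+
+~~~python
+import numpy as np
+
+def clip_gradient(grad, max_norm=5.0):
+    """Rescale the gradient if its L2 norm exceeds max_norm."""
+    norm = np.linalg.norm(grad)
+    if norm > max_norm:
+        grad = grad * (max_norm / norm)
+    return grad
+
+# usage sketch: dWhh = clip_gradient(dWhh) before the parameter update
+~~~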
+
+910
+01:05:26,909 --> 01:05:30,149
+that's because the gradients in
+the LSTM can potentially explode
+
+911
+01:05:30,150 --> 01:05:33,400
+still even if they don't usually vanish
+
+912
+01:05:33,400 --> 01:05:48,608
+that would work for vanilla
+recurrent neural networks as well but for LSTMs it's not clear where you
+
+913
+01:05:48,608 --> 01:05:53,769
+would plug it in it's not clear in this
+equation like exactly how you would plug
+
+914
+01:05:53,769 --> 01:06:00,619
+in a ReLU and where maybe instead of
+the tanh for g you could put a
+
+915
+01:06:00,619 --> 01:06:08,690
+ReLU here but then the cells would only
+grow in a single direction right so
+
+916
+01:06:08,690 --> 01:06:11,980
+maybe then you can't actually end up
+making them smaller so that's not a great
+
+917
+01:06:11,980 --> 01:06:18,539
+idea I suppose you know so there is
+basically there's no clear way to plug
+
+918
+01:06:18,539 --> 01:06:25,380
+in a ReLU here so yeah one thing I'll note
+in terms of these superhighways of
+
+919
+01:06:25,380 --> 01:06:29,780
+gradients this this viewpoint actually
+breaks down when you have forget gates
+
+920
+01:06:29,780 --> 01:06:33,310
+because when you have forget gates
+where we can forget some of this cell
+
+921
+01:06:33,309 --> 01:06:37,150
+state with the multiplicative interaction then
+whenever a forget gate turns on and it
+
+922
+01:06:37,150 --> 01:06:41,470
+kills the gradient then of course the
+backward flow will stop so these super-
+
+923
+01:06:41,469 --> 01:06:45,250
+highways are only kind of true if you
+don't have any forget gates but if you
+
+924
+01:06:45,250 --> 01:06:50,000
+have a forget gate there then it can
+kill the gradient and so in practice
+
+925
+01:06:50,000 --> 01:06:54,710
+when we play with LSTMs or we use
+LSTMs I suppose sometimes people when
+
+926
+01:06:54,710 --> 01:06:58,099
+they initialize the forget gates they
+initialize them with a positive bias because
+
+927
+01:06:58,099 --> 01:06:58,769
+that biases the
+
+928
+01:06:58,769 --> 01:07:05,699
+forget gates to be turned on to be always kind
+of turned on I suppose in the beginning
+
+929
+01:07:05,699 --> 01:07:08,679
+so in the beginning the gradients flow very
+well and then the LSTM can learn how
+
+930
+01:07:08,679 --> 01:07:12,779
+to shut them off once it wants to later on so
+people play with that bias on the forget
+
+931
+01:07:12,780 --> 01:07:17,530
+gates sometimes and so on the last slide
+here I wanted to mention about LSTMs
+
+932
+01:07:17,530 --> 01:07:21,580
+so many people have basically played with
+this quite a bit so there's the Search
+
+933
+01:07:21,579 --> 01:07:26,119
+Space Odyssey paper where they tried various
+changes to the architecture there's a
+
+934
+01:07:26,119 --> 01:07:32,829
+paper here that tries to do this search
+over a huge number of potential changes to
+
+935
+01:07:32,829 --> 01:07:36,940
+the LSTM equations and they did a large
+search and they didn't find anything
+
+936
+01:07:36,940 --> 01:07:42,300
+that works substantially better than
+just the LSTM so yeah and then there's
+
+937
+01:07:42,300 --> 01:07:45,560
+the GRU which is also actually relatively
+popular and I would actually
+
+938
+01:07:45,559 --> 01:07:50,159
+recommend that you might want to use
+it the GRU is a change on the LSTM it
+
+939
+01:07:50,159 --> 01:07:54,460
+also has these additive interactions and what's
+nice
+
+940
+01:07:54,460 --> 01:07:59,400
+about it is that it's a shorter
+smaller formula and it only has a single
+
+941
+01:07:59,400 --> 01:08:03,130
+h vector it doesn't have a c it
+
+942
+01:08:03,130 --> 01:08:07,590
+only has an h so implementation-wise it's
+just nicer you only have to remember a single
+
+943
+01:08:07,590 --> 01:08:12,190
+vector in your forward pass not two
+vectors so it's just a smaller simpler thing
+
+944
+01:08:12,190 --> 01:08:16,730
+that seems to have most of the benefits
+of an LSTM so it's called the GRU and it
+
+945
+01:08:16,729 --> 01:08:19,939
+almost always works about as well as the LSTM
+in my experience and so you might
+
+946
+01:08:19,939 --> 01:08:28,088
+want to use it or you can use the LSTM
+they both kind of work about the same
+
+947
+01:08:28,088 --> 01:08:29,130
+and so the summary is that RNNs are very
+nice but the raw RNN does not actually
+
+948
+01:08:29,130 --> 01:08:32,420
+work very well
+
+949
+01:08:32,420 --> 01:08:36,000
+so use LSTMs instead what's nice about
+them is that we're having
+
+950
+01:08:36,000 --> 01:08:39,579
+these additive interactions that allow
+gradients to flow much better and you don't
+
+951
+01:08:39,579 --> 01:08:44,269
+get the vanishing gradient problem we still
+have to worry a bit about the exploding
+
+952
+01:08:44,270 --> 01:08:46,670
+gradient problem so it's common to see
+people clip these gradients sometimes and I
+
+953
+01:08:46,670 --> 01:08:50,838
+would say we're looking for better simpler
+architectures we're really trying to
+
+954
+01:08:50,838 --> 01:08:53,899
+understand how come there's something
+deeper going on with the connection
+
+955
+01:08:53,899 --> 01:08:57,579
+between ResNets and LSTMs and
+there's something deeper about these
+
+956
+01:08:57,579 --> 01:09:02,210
+additive interactions that I think we're not
+fully understanding yet exactly why that
+
+957
+01:09:02,210 --> 01:09:05,119
+works so well and which parts of it are
+crucial and so I think we need to
+
+958
+01:09:05,119 --> 01:09:10,979
+understand this both theoretically and
+empirically in this space and it's a very
+
+959
+01:09:10,979 --> 01:09:23,469
+wide open area of research so the question
+is whether these can I suppose explode so it's not
+
+960
+01:09:23,470 --> 01:09:27,020
+as clear why they would but you keep
+injecting gradient into the cell state
+
+961
+01:09:27,020 --> 01:09:30,069
+and so maybe the gradient can sometimes get
+larger
+
+962
+01:09:30,069 --> 01:09:33,960
+it's common to clip them but I think it's maybe
+not as important as in an RNN
+
+963
+01:09:33,960 --> 01:09:40,829
+and I'm not a hundred percent sure
+about that point on the biological basis I
+
+964
+01:09:40,829 --> 01:09:46,640
+have no idea that's interesting yeah I
+think we should end it here but I'm
+
+965
+01:09:46,640 --> 01:09:47,569
+happy to take your questions here
+
diff --git a/captions/En/Lecture11_en.srt b/captions/En/Lecture11_en.srt
new file mode 100644
index 00000000..4aefae64
--- /dev/null
+++ b/captions/En/Lecture11_en.srt
@@ -0,0 +1,4833 @@
+1
+00:00:00,000 --> 00:00:03,428
+all right we have a lot of stuff to get
+through today so I'd like to get started
+
+2
+00:00:03,428 --> 00:00:08,669
+so today we're going to talk about CNNs
+in practice and talk about a lot of
+
+3
+00:00:08,669 --> 00:00:12,050
+really low-level sort of implementation
+details that are really common to get
+
+4
+00:00:12,050 --> 00:00:15,980
+these things to work when you're
+actually training things but first as
+
+5
+00:00:15,980 --> 00:00:20,189
+usual we have some administrative stuff
+to talk about number one is that through
+
+6
+00:00:20,189 --> 00:00:24,600
+a really heroic effort by all the TAs
+all the midterms are graded so you
+
+7
+00:00:24,600 --> 00:00:27,740
+guys should definitely thank them for
+that and you can either pick them up
+
+8
+00:00:27,739 --> 00:00:34,920
+after class today or in any of these
+office hours that are up here also keep
+
+9
+00:00:34,920 --> 00:00:38,609
+in mind that your project milestones are
+going to be due tonight at midnight so
+
+10
+00:00:38,609 --> 00:00:41,628
+make sure that I hope you've been
+working on your projects for the last
+
+11
+00:00:41,628 --> 00:00:45,579
+couple for the last week or so and have
+made some really exciting progress so
+
+12
+00:00:45,579 --> 00:00:51,289
+make sure to write that up and put it in
+the assignments tab on Dropbox no no not
+
+13
+00:00:51,289 --> 00:00:55,460
+on Dropbox but on the assignments tab on
+Coursework sorry I know this is
+
+14
+00:00:55,460 --> 00:00:58,910
+really confusing but the assignments tab
+just like just like assignment two
+
+15
+00:00:58,909 --> 00:01:04,000
+assignment two we're working on grading
+hopefully we'll have that done sometime
+
+16
+00:01:04,000 --> 00:01:10,140
+this week and remember that assignment
+three is out so how's that been going
+
+17
+00:01:10,140 --> 00:01:17,159
+anyone anyone done okay that's good one
+person's done so the rest of you should get
+
+18
+00:01:17,159 --> 00:01:22,740
+started because it's due in a week so we
+have some fun stats from the midterm so
+
+19
+00:01:22,739 --> 00:01:26,379
+don't freak out when you see your grade
+because we actually had this really nice
+
+20
+00:01:26,379 --> 00:01:30,759
+beautiful Gaussian distribution with a
+beautiful standard deviation we don't
+
+21
+00:01:30,759 --> 00:01:34,549
+need to batch normalize this thing it's
+already perfect I'd also like to point
+
+22
+00:01:34,549 --> 00:01:38,049
+out that someone got the max score a
+hundred and three which means they got
+
+23
+00:01:38,049 --> 00:01:43,470
+everything right and the bonus so that
+means it wasn't hard enough so maybe
+
+24
+00:01:43,469 --> 00:01:49,500
+we also have some per-question stats that's
+the per-question breakdown of average score
+
+25
+00:01:49,500 --> 00:01:52,450
+for every single question in the midterm
+so if you want if you got something
+
+26
+00:01:52,450 --> 00:01:55,510
+wrong and you want to see if everyone
+else got it wrong too you can go check
+
+27
+00:01:55,510 --> 00:01:59,380
+these stats later on your own
+time we have stats for the true false
+
+28
+00:01:59,379 --> 00:02:00,959
+and the multiple choice
+
+29
+00:02:00,959 --> 00:02:04,729
+keep in mind actually that for two of
+the true false we decided during grading
+
+30
+00:02:04,730 --> 00:02:07,090
+that they were a little bit unfair so we threw them
+out and just gave you all the
+
+31
+00:02:07,090 --> 00:02:12,960
+points which is why two of those are at a
+hundred percent we have these stats for
+
+32
+00:02:12,960 --> 00:02:19,810
+all the individual questions so go ahead
+and have fun with those later
+
+33
+00:02:19,810 --> 00:02:24,379
+last time I know it's been a while but
+we had a midterm and we had a holiday
+
+34
+00:02:24,379 --> 00:02:28,030
+but if you can remember like over a week
+ago we were talking about recurrent
+
+35
+00:02:28,030 --> 00:02:31,509
+networks we talked about how recurrent
+networks can be used for modeling
+
+36
+00:02:31,509 --> 00:02:35,500
+sequences you know normally these
+feedforward networks take an input and they
+
+37
+00:02:35,500 --> 00:02:39,139
+model this feedforward function but these
+recurrent networks we talked about how
+
+38
+00:02:39,139 --> 00:02:43,208
+they can model different kinds of
+sequence problems we talked about two
+
+39
+00:02:43,209 --> 00:02:48,319
+particular implementations of recurrent
+networks vanilla RNNs and LSTMs and you
+
+40
+00:02:48,319 --> 00:02:51,539
+implement both of those on the
+assignment so you should know what they
+
+41
+00:02:51,539 --> 00:02:56,079
+are we talked about how these
+recurrent neural networks can be
+
+42
+00:02:56,080 --> 00:03:01,010
+used for language models and had some
+fun showing some sample generated text
+
+43
+00:03:01,009 --> 00:03:06,329
+on what was it Shakespeare and algebraic
+geometry that's right we talked about how
+
+44
+00:03:06,330 --> 00:03:09,590
+we can combine recurrent networks with
+convolutional networks to do image
+
+45
+00:03:09,590 --> 00:03:14,180
+captioning and we played a little bit
+this game of being RNN neuroscientists
+
+46
+00:03:14,180 --> 00:03:17,700
+and diving into the cells of the
+RNNs and trying to interpret what
+
+47
+00:03:17,699 --> 00:03:21,879
+they're doing and we saw that sometimes
+we have these interpretable cells that
+
+48
+00:03:21,879 --> 00:03:27,049
+are for example activating inside quoted
+statements which is pretty cool but
+
+49
+00:03:27,049 --> 00:03:28,890
+today we're going to talk about
+something totally different
+
+50
+00:03:28,889 --> 00:03:33,339
+here we're gonna talk about
+really a lot of low-level things that
+
+51
+00:03:33,340 --> 00:03:37,830
+you need to know to get CNNs working in
+practice so there's three major themes
+
+52
+00:03:37,830 --> 00:03:41,600
+it's a little bit of a potpourri but
+we're going to try to tie it together so
+
+53
+00:03:41,599 --> 00:03:45,349
+the first is really squeezing all the
+juice that you can out of your data so I
+
+54
+00:03:45,349 --> 00:03:48,219
+know a lot of you especially for
+projects you don't have large datasets
+
+55
+00:03:48,219 --> 00:03:51,789
+we're going to talk about data
+augmentation and transfer learning which
+
+56
+00:03:51,789 --> 00:03:55,079
+are two really powerful useful
+techniques especially when you're
+
+57
+00:03:55,080 --> 00:03:56,350
+working with small datasets
+
+58
+00:03:56,349 --> 00:04:00,889
+we're going to really dive deep into
+convolutions and talk a lot more about
+
+59
+00:04:00,889 --> 00:04:05,959
+those both how you can design efficient
+architectures using convolutions and
+
+60
+00:04:05,960 --> 00:04:10,480
+also how convolutions are efficiently
+implemented in practice and then finally
+
+61
+00:04:10,479 --> 00:04:13,269
+we're gonna talk about something that
+usually gets lumped under implementation
+
+62
+00:04:13,270 --> 00:04:17,480
+details and doesn't even make it into
+papers and that's stuff like suppose one has a
+
+63
+00:04:17,480 --> 00:04:21,750
+CPU and GPU what kinds of bottlenecks do you
+experience in training how do you
+
+64
+00:04:21,750 --> 00:04:26,069
+distribute training over multiple over
+multiple devices that's a lot of stuff so
+
+65
+00:04:26,069 --> 00:04:31,620
+we should get started so first let's
+talk about data augmentation I think
+
+66
+00:04:31,620 --> 00:04:34,910
+we've sort of mentioned this maybe in
+passing so far in the lectures but never
+
+67
+00:04:34,910 --> 00:04:39,780
+really talked about it so normally when
+you're training CNNs you're really
+
+68
+00:04:39,779 --> 00:04:44,179
+familiar with this type of pipeline where
+during training you're gonna load images
+
+69
+00:04:44,180 --> 00:04:48,379
+and labels up off the disk you're gonna
+feed the image through to your CNN then
+
+70
+00:04:48,379 --> 00:04:51,009
+you're going to use the image together
+with the label to compute some loss
+
+71
+00:04:51,009 --> 00:04:55,610
+function and backpropagate and update the
+CNN and repeat forever so you
+
+72
+00:04:55,610 --> 00:05:00,970
+should be really familiar with that by
+now the thing about data augmentation is
+
+73
+00:05:00,970 --> 00:05:05,960
+we just add one little step to this
+pipeline which is here so after we load
+
+74
+00:05:05,959 --> 00:05:09,849
+the image off the disk we're going to
+transform it in some way before passing
+
+75
+00:05:09,850 --> 00:05:13,910
+it to the CNN and this transformation
+should preserve the label
+
+76
+00:05:13,910 --> 00:05:19,090
+then we compute the loss backpropagate
+and update the CNN so it's really simple and the trick is just
+
+77
+00:05:19,089 --> 00:05:24,089
+what kinds of transformations you should be
+using so with data augmentation the idea is
+
+78
+00:05:24,089 --> 00:05:27,679
+really simple it's sort of this way that
+lets you artificially expand your training
+
+79
+00:05:27,680 --> 00:05:32,030
+set through clever usage of different
+kinds of transformations so if you
+
+80
+00:05:32,029 --> 00:05:35,409
+remember the computer is really seeing
+these images as these giant grids of
+
+81
+00:05:35,410 --> 00:05:39,189
+pixels and there are these different
+kinds of transformations we can make
+
+82
+00:05:39,189 --> 00:05:43,230
+that should preserve the label but which
+will change all the pixels if you
+
+83
+00:05:43,230 --> 00:05:46,770
+imagine like shifting that cat one pixel
+to the left it's still a cat but all the
+
+84
+00:05:46,769 --> 00:05:50,539
+pixels are going to change so
+when you talk about data augmentation
+
+85
+00:05:50,540 --> 00:05:54,680
+you sort of imagine that you're
+expanding your training set these
+
+86
+00:05:54,680 --> 00:05:58,629
+new training samples will
+be correlated but they will still
+
+87
+00:05:58,629 --> 00:06:03,389
+help you train bigger models
+while preventing overfitting and this is very
+
+88
+00:06:03,389 --> 00:06:04,959
+very widely used in practice
+
+89
+00:06:04,959 --> 00:06:08,668
+pretty much any CNN you see that's
+winning competitions or doing well on
+
+90
+00:06:08,668 --> 00:06:09,810
+benchmarks is using some data augmen-
+
+91
+00:06:09,810 --> 00:06:15,889
+tation so the easiest form of data
+augmentation is horizontal flipping if
+
+92
+00:06:15,889 --> 00:06:18,699
+we take this cat and you look at the
+mirror image the mirror image should
+
+93
+00:06:18,699 --> 00:06:22,949
+still be a cat and this is really really
+easy to implement in numpy you can just
+
+94
+00:06:22,949 --> 00:06:27,159
+do it with a single call a single line
+of code it's similarly easy in Torch and other
+
+95
+00:06:27,160 --> 00:06:32,040
+frameworks this is really easy and very
+widely used something else that's very
+
+96
+00:06:32,040 --> 00:06:37,120
+widely used is to take random crops from
+the training images so at training time
+
+97
+00:06:37,120 --> 00:06:40,949
+we're gonna load up our image and we're
+gonna take a patch of that image at a
+
+98
+00:06:40,949 --> 00:06:42,629
+random scale and location
+
+99
+00:06:42,629 --> 00:06:47,189
+and resize it to whatever fixed size our
+CNN is expecting and then use that as our
+
+100
+00:06:47,189 --> 00:06:51,389
+training example and again this is very
+very widely used just to give you a flavor
+
+101
+00:06:51,389 --> 00:06:56,610
+of how exactly this is used I looked up
+the details for ResNet so they
+
+102
+00:06:56,610 --> 00:07:01,639
+actually do this at training time for each training
+image they pick a random number
+
+103
+00:07:01,639 --> 00:07:05,620
+resize the whole image so that the
+shorter side is that number then sample
+
+104
+00:07:05,620 --> 00:07:09,720
+a random 224 by 224 crop from the
+resized image and then use that as
+
+105
+00:07:09,720 --> 00:07:13,990
+their training sample so that's pretty
+easy to implement and usually helps
+
+106
+00:07:13,990 --> 00:07:20,560
+quite a bit so when you're using this
+form of data augmentation usually things
+
+107
+00:07:20,560 --> 00:07:25,269
+change a little bit at test time so at
+training time when using this form of
+
+108
+00:07:25,269 --> 00:07:29,079
+data augmentation the network is not
+really trained on full images it's trained
+
+109
+00:07:29,079 --> 00:07:34,219
+on these crops so it doesn't really make
+sense or seem fair to try to force the
+
+110
+00:07:34,220 --> 00:07:38,900
+network to look at the whole image at
+test time so usually in practice when
+
+111
+00:07:38,899 --> 00:07:42,879
+you're doing this kind of random
+cropping for data augmentation at test
+
+112
+00:07:42,879 --> 00:07:48,379
+time you'll have some fixed set of crops
+and use these for testing so very
+
+113
+00:07:48,379 --> 00:07:52,019
+commonly you'll see ten
+crops you'll take the upper left hand
+
+114
+00:07:52,019 --> 00:07:52,649
+corner
+
+115
+00:07:52,649 --> 00:07:56,189
+the upper right hand corner the two
+bottom corners and the center that gives you
+
+116
+00:07:56,189 --> 00:08:00,800
+five together with the horizontal flips that
+gives you 10 you'll take those 10 crops at
+
+117
+00:08:00,800 --> 00:08:06,460
+test time pass them through the network and
+average the scores of those 10 crops so
+
+118
+00:08:06,459 --> 00:08:09,519
+ResNet actually takes this a little bit
+one step further and they actually do
+
+119
+00:08:09,519 --> 00:08:14,759
+multiscale testing multiple scales at test time
+as well this is something that tends to
+
+120
+00:08:14,759 --> 00:08:20,649
+help performance in practice and again it's
+very easy to implement and very widely used
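+
+A minimal NumPy sketch of the two training-time augmentations just described, horizontal flipping and a random crop at a random scale and location; the 224 output size follows the ResNet description, and the nearest-neighbor resize is a dependency-free stand-in for a proper resize:
+
+~~~python
+import numpy as np
+
+def augment(img, out_size=224):
+    """img: (H, W, 3) array with min(H, W) >= out_size; flipped random crop."""
+    # horizontal flip with probability 0.5 -- a single line in numpy
+    if np.random.rand() < 0.5:
+        img = img[:, ::-1, :]
+
+    # random crop: pick a crop size (random scale), then a random location
+    h, w = img.shape[:2]
+    size = np.random.randint(out_size, min(h, w) + 1)
+    y = np.random.randint(0, h - size + 1)
+    x = np.random.randint(0, w - size + 1)
+    crop = img[y:y + size, x:x + size]
+
+    # resize the crop to the fixed size the CNN expects
+    idx = np.arange(out_size) * size // out_size   # nearest-neighbor indices
+    return crop[idx][:, idx]
+
+# usage sketch: batch = np.stack([augment(img) for img in images])
+~~~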
+The idea is that each pixel in the
+
+131
+00:09:06,440 --> 00:09:11,390
+training data is this vector of length 3, an RGB, and if we collect those pixels
+
+132
+00:09:11,389 --> 00:09:15,129
+over the entire training set, then you get a sense of what kinds of colors
+
+133
+00:09:15,129 --> 00:09:19,330
+generally exist in the training data. Then using principal component analysis
+
+134
+00:09:19,330 --> 00:09:23,930
+gives us three principal component directions in color space that kind of
+
+135
+00:09:23,929 --> 00:09:27,879
+tell us the directions along which color tends to vary in the dataset.
+
+136
+00:09:27,879 --> 00:09:32,429
+So then at training time, for color augmentation,
+
+137
+00:09:32,429 --> 00:09:35,889
+we can actually use these principal components of the colors of the training
+
+138
+00:09:35,889 --> 00:09:41,419
+set to choose exactly how to jitter the color. This is again a
+
+139
+00:09:41,419 --> 00:09:46,719
+little bit more complicated but it is pretty widely used; this type of PCA-
+
+140
+00:09:46,720 --> 00:09:51,580
+driven data augmentation for color I think was introduced with the AlexNet
+
+141
+00:09:51,580 --> 00:09:58,310
+paper in 2012, and it's also used in ResNet, for example. So data augmentation
+
+142
+00:09:58,309 --> 00:10:02,829
+is really this very general thing, right: you just want to think about, for your
+
+143
+00:10:02,830 --> 00:10:06,420
+dataset, what kinds of transformations you want your classifier to be
+
+144
+00:10:06,419 --> 00:10:11,179
+invariant to, and then you want to introduce those types of variations to
+
+145
+00:10:11,179 --> 00:10:15,229
+your training data at training time. You can really go crazy here and get
+
+146
+00:10:15,230 --> 00:10:18,740
+creative and really think about your data and what types of invariances
+
+147
+00:10:18,740 --> 00:10:23,659
+make sense for your data. You might want to try maybe random
+
+148
+00:10:23,659 --> 00:10:27,708
+rotations; depending on your data maybe rotations of a couple degrees make sense;
+
+149
+00:10:27,708 --> 00:10:31,399
+you could try different kinds of stretching and shearing to simulate
+
+150
+00:10:31,399 --> 00:10:33,189
+maybe affine transformations of your data,
+
+151
+00:10:33,190 --> 00:10:36,990
+and you could really go crazy here and try to get creative and think of
+
+152
+00:10:36,990 --> 00:10:43,840
+interesting ways to augment your data. Another thing I'd like to point out
+
+153
+00:10:43,840 --> 00:10:49,009
+is that this idea of data augmentation really fits into a larger theme that we've
+
+154
+00:10:49,009 --> 00:10:54,090
+seen repeated many times throughout the course, and this theme is that one trick
+
+155
+00:10:54,090 --> 00:10:58,420
+that's really useful in practice for preventing overfitting, as a regularizer,
+
+156
+00:10:58,419 --> 00:11:02,209
+is that during the forward pass during training, when we're training our
+
+157
+00:11:02,210 --> 00:11:05,930
+network, we add some kind of weird stochastic noise to kind of mess with
+
+158
+00:11:05,929 --> 00:11:10,629
+the network. For example, with data augmentation we're actually modifying
+
+159
+00:11:10,629 --> 00:11:14,210
+the training data that we put into the network; with things like dropout or
+
+160
+00:11:14,210 --> 00:11:18,860
+DropConnect you're taking random parts of the network and either setting
+
+161
+00:11:18,860 --> 00:11:22,730
+the activations or the weights to zero, randomly.
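+
+(Going back a moment: here is a minimal numpy sketch of the AlexNet-style PCA color jitter just described; the jitter strength sigma is an assumed value, not from the lecture:)
+
+~~~python
+import numpy as np
+
+def pca_color_jitter(images, image, sigma=0.1):
+    """AlexNet-style color jitter: a sketch under assumed conventions.
+
+    `images` is an N x H x W x 3 training array used only to estimate
+    color statistics; `image` is the single H x W x 3 example to jitter.
+    """
+    # Collect all training pixels as rows of length 3 (R, G, B).
+    pixels = images.reshape(-1, 3).astype(np.float64)
+    # Principal components of the color distribution over the training set.
+    cov = np.cov(pixels, rowvar=False)      # 3 x 3 color covariance
+    eigvals, eigvecs = np.linalg.eigh(cov)  # directions of color variation
+    # Sample a random strength along each principal direction and add the
+    # same color offset to every pixel of the image.
+    alpha = np.random.normal(0.0, sigma, size=3)
+    offset = eigvecs @ (alpha * eigvals)
+    return image + offset
+~~~
+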
+
+162
+00:11:22,730 --> 00:11:28,450
+This also appears kind of with batch normalization: with batch
+
+163
+00:11:28,450 --> 00:11:31,930
+normalization your normalization constants depend on the other things in
+
+164
+00:11:31,929 --> 00:11:35,000
+the minibatch that you're normalizing with, and during training
+
+165
+00:11:35,000 --> 00:11:39,440
+the same image might end up appearing in many batches with different other images,
+
+166
+00:11:39,440 --> 00:11:43,840
+so that actually introduces this type of noise at training time. But for all of
+
+167
+00:11:43,840 --> 00:11:47,690
+these examples, at test time we average out this noise: for data augmentation
+
+168
+00:11:47,690 --> 00:11:52,790
+we'll take averages over many different samples of the data; for dropout
+
+169
+00:11:52,789 --> 00:11:56,870
+and DropConnect you can sort of evaluate and marginalize this out a
+
+170
+00:11:56,870 --> 00:12:01,090
+little more analytically; and for batch normalization we keep these running
+
+171
+00:12:01,090 --> 00:12:05,269
+means. So I just think that's kind of a nice way to unify a lot of these ideas for
+
+172
+00:12:05,269 --> 00:12:08,960
+regularization: you can add noise in the forward pass and then
+
+173
+00:12:08,960 --> 00:12:13,540
+marginalize over it at test time. So keep that in mind if you're trying to come up with
+
+174
+00:12:13,539 --> 00:12:20,250
+other creative ways to regularize your networks. So the main takeaways
+
+175
+00:12:20,250 --> 00:12:24,149
+for data augmentation are that, one, it's usually really simple to implement,
+
+176
+00:12:24,149 --> 00:12:28,329
+so you should almost always be using it, there's not really any excuse not to;
+
+177
+00:12:28,330 --> 00:12:32,730
+it's very, very useful, especially for small datasets, which I think many of you
+
+178
+00:12:32,730 --> 00:12:36,850
+are using for your projects; and it also fits in nicely with this framework of
+
+179
+00:12:36,850 --> 00:12:41,509
+noise at training time and marginalization at test time. So I think that's pretty
+
+180
+00:12:41,509 --> 00:12:45,360
+much all there is to say about data augmentation; if there are any questions
+
+181
+00:12:45,360 --> 00:12:45,840
+about that
+
+182
+00:12:45,840 --> 00:13:01,840
+I'm happy to talk about it now. Yeah? [question] A lot of the time it would take
+
+183
+00:13:01,840 --> 00:13:05,790
+a lot of disk space to try to dump these augmented things to disk, so sometimes
+
+184
+00:13:05,789 --> 00:13:08,879
+people get creative and even have background threads that are fetching data
+
+185
+00:13:08,879 --> 00:13:16,799
+and doing augmentation on the fly. Right, so I think that's clear; we can talk about
+
+186
+00:13:16,799 --> 00:13:21,069
+the next idea. So there's this myth floating around that when you work with
+
+187
+00:13:21,070 --> 00:13:25,770
+CNNs you really need a lot of data, but it turns out that with transfer
+
+188
+00:13:25,769 --> 00:13:33,029
+learning this myth is busted. There's this really simple recipe that you can
+
+189
+00:13:33,029 --> 00:13:37,769
+use for transfer learning, and that's: first, you take whatever your favorite
+
+190
+00:13:37,769 --> 00:13:42,879
+CNN architecture is, AlexNet or VGG or what have you, and you either train it on
+
+191
+00:13:42,879 --> 00:13:46,970
+ImageNet yourself or, more commonly, you download a pretrained model
+
+192
+00:13:46,970 --> 00:13:51,360
+from the internet; that's easy to do, it just takes 20 minutes to download, many hours
+
+193
+00:13:51,360 --> 00:13:56,590
+to train, but you probably won't do that part. Next there are sort of two general
+
+194
+00:13:56,590 --> 00:14:00,910
+cases: one, if your dataset is really small and you really don't have many
+
+195
+00:14:00,909 --> 00:14:05,019
+images whatsoever, then you can just treat this classifier as a fixed feature
+
+196
+00:14:05,019 --> 00:14:10,110
+extractor. One way to look at this is that you'll take the last layer of the
+
+197
+00:14:10,110 --> 00:14:15,580
+network, the softmax classification layer from the original model, take it away, and
+
+198
+00:14:15,580 --> 00:14:18,370
+replace it with some kind of linear classifier for the task that you
+
+199
+00:14:18,370 --> 00:14:21,810
+actually care about, and now you'll freeze the rest of the network and
+
+200
+00:14:21,809 --> 00:14:26,969
+retrain only that top layer. This is sort of equivalent to just training a
+
+201
+00:14:26,970 --> 00:14:31,230
+linear classifier directly on top of features extracted from the network, so
+
+202
+00:14:31,230 --> 00:14:35,149
+what you'll see a lot of times in practice for this case is that, sort of
+
+203
+00:14:35,149 --> 00:14:38,399
+as a preprocessing step, you'll just dump features to disk for all of your
+
+204
+00:14:38,399 --> 00:14:42,100
+training images and then work entirely on top of those cached features, and that
+
+205
+00:14:42,100 --> 00:14:48,110
+can help speed things up quite a bit. That's quite easy to do, it's very, very
+
+206
+00:14:48,110 --> 00:14:51,250
+common, and usually provides a very strong baseline for a lot of problems
+
+207
+00:14:51,250 --> 00:14:56,169
+that you might encounter in practice. And if you have a little bit more data, then
+
+208
+00:14:56,169 --> 00:14:58,599
+you can actually afford to train bigger
+
+209
+00:14:58,600 --> 00:15:03,949
+models. Depending on the size of your dataset, usually you'll freeze some
+
+210
+00:15:03,948 --> 00:15:07,669
+of the lower layers of the network, and then instead of retraining only the
+
+211
+00:15:07,669 --> 00:15:11,919
+last layer you'll pick some number of the last layers to train, depending on how
+
+212
+00:15:11,919 --> 00:15:16,349
+large your dataset is; generally when you have a larger dataset available for
+
+213
+00:15:16,350 --> 00:15:21,350
+training, you can afford to train more of these final layers. And again,
+
+214
+00:15:21,350 --> 00:15:26,060
+similar to the trick over here, what you'll see very commonly is
+
+215
+00:15:26,059 --> 00:15:29,729
+that instead of explicitly recomputing this frozen part every time, you'll just dump
+
+216
+00:15:29,730 --> 00:15:35,019
+the features from the last frozen layer to disk and then work on this top part in memory, and that
+
+217
+00:15:35,019 --> 00:15:47,490
+can speed things up quite a lot. [question] That's something you basically
+
+218
+00:15:47,490 --> 00:15:51,959
+have to try and see, but especially for this type of small dataset it will work
+
+219
+00:15:51,958 --> 00:15:55,799
+in some instances; if you just want to do image retrieval, a pretty
+
+220
+00:15:55,799 --> 00:16:01,338
+strong baseline is to just use L2 distance on CNN features. So maybe for this
+
+221
+00:16:01,339 --> 00:16:05,110
+type of approach, think about how many samples you'd expect to need to train
+
+222
+00:16:05,110 --> 00:16:10,470
+a model like an SVM or something, and if you have
+
+223
+00:16:10,470 --> 00:16:15,310
+more data than you would expect to need for an SVM, then try fine-tuning; it's not a hard rule at all.
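+
+(A sketch of the fixed-feature-extractor recipe; `extract_features` is a hypothetical helper standing in for the frozen pretrained CNN run up to its last hidden layer, and the training loop is a generic softmax classifier, not code from the lecture:)
+
+~~~python
+import numpy as np
+
+def build_linear_baseline(train_images, train_labels, extract_features):
+    # As a preprocessing step, run the forward pass once and cache the
+    # features; in practice you would dump these to disk.
+    feats = np.stack([extract_features(img) for img in train_images])
+    np.save('train_feats.npy', feats)  # cache so we never re-run the CNN
+
+    # Train only a linear softmax classifier on top of the cached features.
+    N, D = feats.shape
+    C = int(train_labels.max()) + 1
+    W = 0.001 * np.random.randn(D, C)
+    for _ in range(100):
+        scores = feats @ W
+        scores -= scores.max(axis=1, keepdims=True)      # numeric stability
+        probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
+        dscores = probs
+        dscores[np.arange(N), train_labels] -= 1          # softmax gradient
+        W -= 1e-3 * (feats.T @ dscores / N)               # SGD step
+    return W
+~~~
+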
+
+224
+00:16:15,309 --> 00:16:28,879
+[question] Yeah, it depends: sometimes you actually will
+
+225
+00:16:28,879 --> 00:16:32,309
+run through the full forward pass each time, but sometimes you just run the forward pass
+
+226
+00:16:32,309 --> 00:16:36,818
+once and dump these features to disk; that's pretty common,
+
+227
+00:16:36,818 --> 00:16:41,458
+actually, it saves compute.
+
+228
+00:16:41,458 --> 00:16:59,729
+[question] The last layer you initialize from random, since you'll probably have different classes or a
+
+229
+00:16:59,730 --> 00:17:03,350
+regression problem or something, but then these other intermediate layers
+
+230
+00:17:03,350 --> 00:17:08,750
+you initialize from whatever was in the pretrained model. And actually, in
+
+231
+00:17:08,750 --> 00:17:15,068
+practice when you fine-tune, a nice tip is that there will
+
+232
+00:17:15,068 --> 00:17:18,588
+be two, well I guess three, types of layers when you're fine-
+
+233
+00:17:18,588 --> 00:17:22,349
+tuning: there'll be the frozen layers, which you can think of as having a
+
+234
+00:17:22,349 --> 00:17:27,448
+learning rate of zero; there are these new layers that you initialize
+
+235
+00:17:27,449 --> 00:17:32,548
+from scratch, and typically those have maybe a higher learning rate but not too
+
+236
+00:17:32,548 --> 00:17:36,528
+high, maybe one tenth of what the network was originally trained with; and
+
+237
+00:17:36,528 --> 00:17:40,079
+then you'll have these intermediate layers that you're initializing from
+
+238
+00:17:40,079 --> 00:17:43,269
+the pretrained network but are planning to modify during joint optimization
+
+239
+00:17:43,269 --> 00:17:47,470
+and fine-tuning; these intermediate layers will tend to get a very small
+
+240
+00:17:47,470 --> 00:17:56,589
+learning rate, maybe one one-hundredth of the original. Yeah?
+
+241
+00:17:56,589 --> 00:18:04,319
+[question] That's something people have tried to investigate, and they found that generally
+
+242
+00:18:04,319 --> 00:18:08,079
+this type of transfer learning, fine-tuning approach works
+
+243
+00:18:08,079 --> 00:18:11,710
+better when the network was originally trained with similar types of data,
+
+244
+00:18:11,710 --> 00:18:16,610
+whatever that means; but in fact these very low-level features are things
+
+245
+00:18:16,609 --> 00:18:20,308
+like edges and colors and Gabor filters, which are probably going to be applicable
+
+246
+00:18:20,308 --> 00:18:24,190
+to just about any type of visual data, so especially these lower level features I
+
+247
+00:18:24,190 --> 00:18:29,009
+think are generally pretty applicable to almost anything. And by the way, another
+
+248
+00:18:29,009 --> 00:18:33,788
+tip that you sometimes see in practice for fine-tuning is that you
+
+249
+00:18:33,788 --> 00:18:37,609
+might actually have a multi-stage approach, where first you freeze the
+
+250
+00:18:37,609 --> 00:18:42,079
+entire network and only train this last layer, and then after this last
+
+251
+00:18:42,079 --> 00:18:46,939
+layer seems to be converging, go back and actually fine-tune the intermediate layers too. You can
+
+252
+00:18:46,940 --> 00:18:51,519
+sometimes have this problem that, because this last layer is initialized
+
+253
+00:18:51,519 --> 00:18:54,690
+randomly, you might have very large gradients that kind of mess up the pretrained
+
+254
+00:18:54,690 --> 00:18:59,070
+initialization, so the two ways to get around that are either freezing this part
+
+255
+00:18:59,069 --> 00:19:02,788
+at first until it converges, or having this varying learning rate
+
+256
+00:19:02,788 --> 00:19:08,658
+between the two regimes of the network.
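+
+(A minimal sketch of the three-regime learning rates described above; the layer names and the base rate are hypothetical, and the ratios just mirror the rough numbers from the lecture:)
+
+~~~python
+original_lr = 1e-2  # assumed rate the network was originally trained with
+
+layer_learning_rates = {
+    'conv1': 0.0,                  # frozen: learning rate of zero
+    'conv2': 0.0,                  # frozen
+    'conv3': original_lr / 100.0,  # intermediate, initialized from pretrained
+    'conv4': original_lr / 100.0,  # intermediate
+    'fc_new': original_lr / 10.0,  # new layer, initialized from scratch
+}
+
+def sgd_step(params, grads):
+    """One SGD step, scaling each layer's update by its own learning rate."""
+    for name, lr in layer_learning_rates.items():
+        if lr > 0.0:
+            params[name] -= lr * grads[name]
+~~~
+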
+
+257
+00:19:08,659 --> 00:19:14,470
+So this idea of transfer learning actually works really well; there were a couple of pretty early papers from 2013
+
+258
+00:19:14,470 --> 00:19:19,390
+and 2014, when CNNs first started getting popular. This one in particular,
+
+259
+00:19:19,390 --> 00:19:24,490
+the "astounding baseline" paper, was pretty cool: what they did is they took what
+
+260
+00:19:24,490 --> 00:19:26,009
+at the time was one of the best
+
+261
+00:19:26,009 --> 00:19:30,470
+CNNs out there, which was OverFeat; they just extracted features from OverFeat and
+
+262
+00:19:30,470 --> 00:19:33,640
+applied these features to a bunch of different standard datasets and standard
+
+263
+00:19:33,640 --> 00:19:38,679
+problems in computer vision. The idea
+
+264
+00:19:38,679 --> 00:19:42,210
+is that they compared against what were at the time these very specialized
+
+265
+00:19:42,210 --> 00:19:45,298
+pipelines and very specialized architectures for each individual
+
+266
+00:19:45,298 --> 00:19:49,408
+problem and dataset, and for each problem they just replaced this very
+
+267
+00:19:49,409 --> 00:19:54,380
+specialized pipeline with very simple linear models on top of features from
+
+268
+00:19:54,380 --> 00:19:58,559
+OverFeat. They did this for a whole bunch of different datasets and found
+
+269
+00:19:58,558 --> 00:20:01,940
+that in general these OverFeat features were a very, very
+
+270
+00:20:01,940 --> 00:20:06,080
+strong baseline; for some problems they were actually better than existing
+
+271
+00:20:06,079 --> 00:20:08,428
+methods, and for some problems they were
+
+272
+00:20:08,429 --> 00:20:12,879
+a bit worse but still quite competitive. So this was a really cool paper that
+
+273
+00:20:12,878 --> 00:20:16,118
+just demonstrated that these are really strong features that can be used in a
+
+274
+00:20:16,118 --> 00:20:19,949
+lot of different tasks and tend to work quite well. Another paper along those
+
+275
+00:20:19,950 --> 00:20:25,419
+lines was from Berkeley, the DeCAF paper, and DeCAF later got
+
+276
+00:20:25,419 --> 00:20:33,610
+caffeinated and became Caffe, so that's kind of the lineage there.
+
+277
+00:20:33,609 --> 00:20:37,388
+So kind of the recipe for transfer learning is that you should think about
+
+278
+00:20:37,388 --> 00:20:43,398
+this little two-by-two matrix: how similar is your dataset to what the pretrained
+
+279
+00:20:43,398 --> 00:20:47,989
+model was trained on, and how much data do you have, and what should you do in those four
+
+280
+00:20:47,990 --> 00:20:53,240
+different quadrants. Generally, if you have a very similar dataset and very
+
+281
+00:20:53,240 --> 00:20:57,538
+little data, just using the network as a fixed feature extractor and training
+
+282
+00:20:57,538 --> 00:21:02,429
+simple linear models on top of those features tends to work very well. If you
+
+283
+00:21:02,429 --> 00:21:06,470
+have a little bit more data, then you can try fine-tuning: actually
+
+284
+00:21:06,470 --> 00:21:10,509
+initializing the network from pretrained weights and running
+
+285
+00:21:10,509 --> 00:21:15,868
+the optimization from there. In the other column there are a few little tricks: in this box
+
+286
+00:21:15,868 --> 00:21:20,099
+you might be in trouble, so you can try to
+get creative: maybe instead of
+
+287
+00:21:20,099 --> 00:21:23,998
+extracting features from the very last layer, you might try extracting features
+
+288
+00:21:23,999 --> 00:21:27,470
+from different layers of the convnet, and that can sometimes help.
+
+289
+00:21:27,470 --> 00:21:32,819
+The intuition there is that maybe for something like MRI data, probably these
+
+290
+00:21:32,819 --> 00:21:37,178
+very top-level features are very specific to ImageNet categories, but these
+
+291
+00:21:37,179 --> 00:21:42,059
+very low-level features are things like edges and stuff like that that might be
+
+292
+00:21:42,058 --> 00:21:47,980
+more transferable to non-ImageNet-type datasets. And obviously
+
+293
+00:21:47,980 --> 00:21:51,099
+in this box you're in better shape, and again you can just sort of initialize
+
+294
+00:21:51,099 --> 00:21:57,928
+and fine-tune. Another thing I'd like to point out is that this idea of initializing
+
+295
+00:21:57,929 --> 00:22:01,590
+with pretrained models and fine-tuning is actually not the exception; this is
+
+296
+00:22:01,589 --> 00:22:05,439
+pretty much standard practice in almost any larger system that you'll see in
+
+297
+00:22:05,440 --> 00:22:09,070
+computer vision these days, and we've actually seen two examples of this
+
+298
+00:22:09,069 --> 00:22:13,220
+already in the course. For example, if you remember from a few lectures ago,
+
+299
+00:22:13,220 --> 00:22:17,220
+we talked about object detection, where we had a CNN looking at the image,
+
+300
+00:22:17,220 --> 00:22:21,620
+region proposals, and all this other crazy stuff, but this part
+
+301
+00:22:21,619 --> 00:22:25,529
+was a CNN looking at the image; and in image captioning we had a CNN looking at the
+
+302
+00:22:25,529 --> 00:22:29,399
+image. So in both of those cases those CNNs were initialized from ImageNet
+
+303
+00:22:29,400 --> 00:22:34,080
+models, and that really helps to solve these other more specialized problems
+
+304
+00:22:34,079 --> 00:22:38,839
+even without gigantic datasets. Also, for the image captioning model in
+
+305
+00:22:38,839 --> 00:22:42,829
+particular, part of the model includes these word embeddings, which you should
+
+306
+00:22:42,829 --> 00:22:47,500
+have seen by now on the homework if you've started on it, and those word vectors
+
+307
+00:22:47,500 --> 00:22:50,099
+you can actually initialize from something else that was maybe
+
+308
+00:22:50,099 --> 00:22:54,019
+pretrained on a bunch of text, and that can sometimes help,
+
+309
+00:22:54,019 --> 00:22:58,668
+maybe in some situations where you might not have a lot of captioning data available.
+
+310
+00:22:58,669 --> 00:23:15,490
+[question] Yeah, that can help sometimes; it depends on the problem, depends on the
+
+311
+00:23:15,490 --> 00:23:18,859
+network, but it's definitely something you can try, and it especially might
+
+312
+00:23:18,859 --> 00:23:27,548
+help when you're in this box; yeah, that's a good trick too. The takeaway
+
+313
+00:23:27,548 --> 00:23:31,210
+about fine-tuning is that you should really use it; it's a really good idea.
+
+314
+00:23:31,210 --> 00:23:35,950
+It works really well in practice, you should probably almost
+
+315
+00:23:35,950 --> 00:23:39,900
+always be using it, and to some extent you generally don't want to be training
+
+316
+00:23:39,900 --> 00:23:42,519
+these things from scratch unless you have really, really large datasets
+
+317
+00:23:42,519 --> 00:23:45,970
+available; in almost all circumstances it's much more convenient
+
+318
+00:23:45,970 --> 00:23:52,279
+to fine-tune an existing model. And by the way, Caffe has this model zoo where
+
+319
+00:23:52,279 --> 00:23:58,230
+you can download many famous ImageNet models;
+
+320
+00:23:58,230 --> 00:24:01,880
+actually for the residual networks the official model got released recently, so
+
+321
+00:24:01,880 --> 00:24:06,130
+you can even download and play with that, which would be pretty cool. And these Caffe
+
+322
+00:24:06,130 --> 00:24:09,020
+model zoo models are sort of a little bit of a standard in the
+
+323
+00:24:09,019 --> 00:24:13,759
+community, so you can even load Caffe models into other frameworks like
+
+324
+00:24:13,759 --> 00:24:17,658
+Torch. So that's something to keep in mind; these Caffe models are quite
+
+325
+00:24:17,659 --> 00:24:21,030
+useful. Right,
+
+326
+00:24:21,029 --> 00:24:26,889
+any further questions on fine-tuning or transfer learning?
+
+327
+00:24:26,890 --> 00:24:46,650
+[question] Yeah, that's quite large, so you might try a highly
+
+328
+00:24:46,650 --> 00:24:50,250
+regularized linear model on top of that, or you might try putting a small conv
+
+329
+00:24:50,250 --> 00:24:53,109
+net on top of that to maybe reduce the dimensionality; you can get creative here,
+
+330
+00:24:53,109 --> 00:24:56,399
+but I think there are things you can try that might work
+
+331
+00:24:56,400 --> 00:25:03,640
+for your data, depending. Right, so I think we should talk more about
+
+332
+00:25:03,640 --> 00:25:07,740
+convolutions. For all these networks we've talked about, really the
+
+333
+00:25:07,740 --> 00:25:11,920
+convolutions are the computational workhorse that's doing a lot of the work
+
+334
+00:25:11,920 --> 00:25:18,090
+in the network, so we need to talk about two things about convolutions. The first
+
+335
+00:25:18,089 --> 00:25:22,809
+is how to stack them: how can we design efficient network architectures that
+
+336
+00:25:22,809 --> 00:25:28,789
+combine many layers of convolution to achieve some nice results. So here's
+
+337
+00:25:28,789 --> 00:25:33,230
+a question: suppose that we have a network that has two layers of
+
+338
+00:25:33,230 --> 00:25:37,190
+three by three convolutions; this would be the input, this would be the activation map
+
+339
+00:25:37,190 --> 00:25:40,120
+after the first layer, this would be the activation map after two layers of
+
+340
+00:25:40,119 --> 00:25:45,959
+convolution. The question is: for a neuron on this second layer, how big of a region
+
+341
+00:25:45,960 --> 00:25:49,640
+of the input does it see? This was on your midterm, so I hope you guys
+
+342
+00:25:49,640 --> 00:25:53,920
+all know the answer to this.
+
+343
+00:25:53,920 --> 00:26:01,298
+Anyone? OK, maybe that was a hard exam question,
+
+344
+00:26:01,298 --> 00:26:05,230
+but the answer is five by five, and it's pretty easy to see from this
+
+345
+00:26:05,230 --> 00:26:08,989
+diagram why: this neuron in the second layer is looking at this
+
+346
+00:26:08,989 --> 00:26:13,619
+entire volume in the intermediate layer, and each pixel in the
+
+347
+00:26:13,618 --> 00:26:18,138
+intermediate layer is looking at a three by three region in the input. So when you
+
+348
+00:26:18,138 --> 00:26:22,738
+look at all three of these together, in this second
+
+349
+00:26:22,739 --> 00:26:26,200
+layer, this neuron is actually looking at this
+
+350
+00:26:26,200 --> 00:26:32,669
+entire five by five region of the input. OK, so now the question is: if we had
+
+351
+00:26:32,669 --> 00:26:36,820
+three three-by-three convolutions stacked in a row, how big of a region in the
+
+352
+00:26:36,819 --> 00:26:43,700
+input would they see? Yeah, seven by seven: by the same kind of reasoning, these receptive fields
+
+353
+00:26:43,700 --> 00:26:49,739
+just kind of build up with successive convolutions. So the point to make here
+
+354
+00:26:49,739 --> 00:26:53,940
+is that three 3x3 convolutions actually give you very
+
+355
+00:26:53,940 --> 00:26:57,919
+similar representational power, is my claim, to a single seven by seven
+
+356
+00:26:57,919 --> 00:27:02,619
+convolution. You might debate the exact semantics of this, and you could
+
+357
+00:27:02,618 --> 00:27:05,528
+try to prove theorems about it and things like that, but just in an
+
+358
+00:27:05,528 --> 00:27:09,940
+intuitive sense, three 3x3 convolutions can represent similar types
+
+359
+00:27:09,940 --> 00:27:14,100
+of functions as a single seven by seven convolution, since they're looking at the
+
+360
+00:27:14,099 --> 00:27:22,189
+same region of the input. Now we can actually dig further
+
+361
+00:27:22,190 --> 00:27:27,399
+into this idea and compare more concretely between a single seven by seven
+
+362
+00:27:27,398 --> 00:27:32,618
+convolution versus a stack of three 3x3 convolutions. Let's suppose
+
+363
+00:27:32,618 --> 00:27:38,638
+that we have an input image that's H by W by C, and we want to have convolutions
+
+364
+00:27:38,638 --> 00:27:43,329
+that preserve the depth, so we have C filters, and we want to have them
+
+365
+00:27:43,329 --> 00:27:48,019
+preserve height and width, so we just set padding appropriately. Then we want
+
+366
+00:27:48,019 --> 00:27:51,528
+to compare concretely what is the difference between a single seven by
+
+367
+00:27:51,528 --> 00:27:56,648
+seven versus a stack of three by threes. So first, how many weights do each of these
+
+368
+00:27:56,648 --> 00:28:01,748
+two things have? Anyone have a guess on how many weights the single seven by seven
+
+369
+00:28:01,749 --> 00:28:09,519
+convolution has? You can forget about biases, they're confusing.
+
+370
+00:28:09,519 --> 00:28:19,869
+I heard some answers; my answer, I hope I got it right,
+
+371
+00:28:19,869 --> 00:28:24,319
+was 49 C squared: you've got the seven by seven convolution, each filter is looking
+
+372
+00:28:24,319 --> 00:28:29,809
+at a depth of C, and you've got C such filters, so 49 C squared. But now for the
+
+373
+00:28:29,809 --> 00:28:34,649
+three by three convolutions, we have three layers of convolutions,
+
+374
+00:28:34,650 --> 00:28:38,990
+each filter is three by three by C, and each layer has C filters; when you
+
+375
+00:28:38,990 --> 00:28:43,980
+multiply that all out, we see that three 3x3 convolutions only have 27 C squared
+
+376
+00:28:43,980 --> 00:28:49,079
+parameters. And assuming that we have ReLUs in between each of these
+
+377
+00:28:49,079 --> 00:28:54,049
+convolutions, we see that the stack of three 3x3 convolutions actually has
+
+378
+00:28:54,049 --> 00:28:58,649
+fewer parameters, which is good, and more nonlinearity, which is good. So this kind
+
+379
+00:28:58,650 --> 00:29:02,960
+of gives you some intuition for why a stack of multiple three by
+
+380
+00:29:02,960 --> 00:29:06,440
+three convolutions might actually be preferable to a single seven by seven convolution.
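+
+(A quick arithmetic check of those claims, ignoring biases; C is an assumed channel count, any value gives the same ratios:)
+
+~~~python
+C = 64  # assumed channel count
+
+params_7x7 = 7 * 7 * C * C              # one 7x7 conv, C -> C: 49 C^2 weights
+params_3x3_stack = 3 * (3 * 3 * C * C)  # three 3x3 convs, C -> C: 27 C^2 weights
+print(params_7x7, params_3x3_stack)     # 49*C*C versus 27*C*C
+
+# Receptive field of a stack of stride-1 k x k convs grows by k-1 per layer,
+# matching the 5x5 (two layers) and 7x7 (three layers) answers above.
+def receptive_field(n_layers, k=3):
+    return 1 + n_layers * (k - 1)
+
+assert receptive_field(2) == 5 and receptive_field(3) == 7
+~~~
+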
+
+381
+00:29:06,440 --> 00:29:11,559
+And we can actually take this one step further and think about
+
+382
+00:29:11,559 --> 00:29:14,750
+not just the number of learnable parameters but actually how many floating
+
+383
+00:29:14,750 --> 00:29:19,099
+point operations these things take. So anyone have a guess for how many
+
+384
+00:29:19,099 --> 00:29:29,669
+operations these things take? OK, that sounds hard, right, but actually this is
+
+385
+00:29:29,670 --> 00:29:33,740
+pretty easy, because each of these filters we're going to be using at every
+
+386
+00:29:33,740 --> 00:29:37,819
+position in the image, so actually the number of multiply-adds is
+
+387
+00:29:37,819 --> 00:29:42,099
+just going to be height times width times the number of learnable weights. So you
+
+388
+00:29:42,099 --> 00:29:47,789
+can see that actually over here, again,
+
+389
+00:29:47,789 --> 00:29:52,440
+comparing between these two, the seven by seven not only has more learnable
+
+390
+00:29:52,440 --> 00:29:57,460
+parameters but actually costs a lot more to compute as well. So the stack of
+
+391
+00:29:57,460 --> 00:30:03,140
+three 3x3 convolutions again gives us more nonlinearity for less compute, and
+
+392
+00:30:03,140 --> 00:30:06,170
+that kind of gives you some intuition for why having multiple layers of
+
+393
+00:30:06,170 --> 00:30:12,300
+three by three convolutions is preferable to large filters. But then you
+
+394
+00:30:12,299 --> 00:30:15,750
+might ask another question: we've been pushing towards smaller and
+
+395
+00:30:15,750 --> 00:30:20,109
+smaller filters, but why stop at three by three, right? We could actually go smaller
+
+396
+00:30:20,109 --> 00:30:21,859
+than that; maybe the same logic would extend.
+
+397
+00:30:21,859 --> 00:30:27,798
+You're shaking your head, you don't believe it? That's true, with one by one you don't get the
+
+398
+00:30:27,798 --> 00:30:33,539
+receptive field. So what we're going to do here is compare a single
+
+399
+00:30:33,539 --> 00:30:39,019
+three by three convolution versus a slightly fancier architecture, the bottleneck architecture.
+
+400
+00:30:39,019 --> 00:30:45,150
+Here we're going to assume an input of H by W by C, and here we can actually do
+
+401
+00:30:45,150 --> 00:30:50,070
+this cool trick: we do a single one by one convolution with C over 2
+
+402
+00:30:50,069 --> 00:30:54,609
+filters to actually reduce the dimensionality of the volume, so now this
+
+403
+00:30:54,609 --> 00:30:57,990
+thing is going to have the same spatial extent but half the number of features
+
+404
+00:30:57,990 --> 00:31:03,480
+in depth. Now after we do this bottleneck, we're going to do a three by three
+
+405
+00:31:03,480 --> 00:31:08,929
+convolution at this reduced dimensionality; so now this three by
+
+406
+00:31:08,929 --> 00:31:13,610
+three convolution takes C over 2 input features and produces C over 2 output
+
+407
+00:31:13,609 --> 00:31:18,000
+features, and now we restore the dimensionality with another one by one
+
+408
+00:31:18,000 --> 00:31:23,558
+convolution to go from C over 2 back to C. This is kind of a funky
+
+409
+00:31:23,558 --> 00:31:27,910
+architecture; this idea of using one by one convolutions everywhere is
+
+410
+00:31:27,910 --> 00:31:31,669
+sometimes called network in
+network, because it has this intuition that
+
+411
+00:31:31,669 --> 00:31:35,730
+a one by one convolution is kind of similar to sliding a fully connected
+
+412
+00:31:35,730 --> 00:31:42,480
+network over each position of your input volume. This idea also appears in
+
+413
+00:31:42,480 --> 00:31:46,259
+GoogLeNet and in ResNet, this idea of using these one by one bottleneck
+
+414
+00:31:46,259 --> 00:31:52,679
+convolutions. So we can compare this bottleneck sandwich to a single
+
+415
+00:31:52,679 --> 00:31:56,390
+three by three convolution with C filters and run through the same logic.
+
+416
+00:31:56,390 --> 00:32:01,270
+I won't force you to compute this in your heads, so you'll have
+
+417
+00:32:01,269 --> 00:32:02,720
+to trust me on this:
+
+418
+00:32:02,720 --> 00:32:08,700
+this bottleneck stack has three and a quarter C squared parameters, whereas
+
+419
+00:32:08,700 --> 00:32:12,360
+this one over here has nine C squared parameters. And again, if we're sticking
+
+420
+00:32:12,359 --> 00:32:15,879
+ReLUs in between each of these convolutions, then this bottleneck
+
+421
+00:32:15,880 --> 00:32:20,620
+sandwich is giving us more nonlinearity for a smaller number of
+
+422
+00:32:20,619 --> 00:32:28,899
+parameters. And actually, similar to what we saw with the three by three versus
+
+423
+00:32:28,900 --> 00:32:33,200
+seven by seven, the number of parameters is tied directly to the computation, so
+
+424
+00:32:33,200 --> 00:32:35,389
+this bottleneck sandwich is also
+
+425
+00:32:35,388 --> 00:32:39,788
+much faster to compute. So this idea of one by one bottlenecks has
+
+426
+00:32:39,788 --> 00:32:52,669
+received quite a lot of usage recently, in GoogLeNet and ResNet especially. Yeah? [question] So
+
+427
+00:32:52,669 --> 00:32:56,579
+you might sometimes think of it as a projection from
+
+428
+00:32:56,578 --> 00:33:00,308
+a lower dimensional feature space back to a higher dimensional space, and if
+
+429
+00:33:00,308 --> 00:33:03,868
+you think about stacking many of these things on top of each other, as happens
+
+430
+00:33:03,868 --> 00:33:09,499
+in ResNet, then the one coming immediately after this one is
+
+431
+00:33:09,499 --> 00:33:11,088
+going to be another one by one,
+
+432
+00:33:11,088 --> 00:33:14,858
+so you're kind of stacking many, many one by one convolutions on top of
+
+433
+00:33:14,858 --> 00:33:18,918
+each other, and a one by one convolution is a little bit like sliding a
+
+434
+00:33:18,919 --> 00:33:23,409
+multi-layer fully connected network over each depth column, so maybe think
+
+435
+00:33:23,409 --> 00:33:27,229
+about that a little bit. But it turns out that actually you don't really
+
+436
+00:33:27,229 --> 00:33:31,200
+need the spatial extent, and even just comparing the sandwich to a single three
+
+437
+00:33:31,200 --> 00:33:35,769
+by three conv, you sort of have the same input and output volume sizes but
+
+438
+00:33:35,769 --> 00:33:41,429
+with more nonlinearity, cheaper compute, and fewer parameters, so those are
+
+439
+00:33:41,429 --> 00:33:46,089
+all kind of nice features. But there's one problem with this, which is
+
+440
+00:33:46,088 --> 00:33:49,668
+that we're still using a three by three convolution in there somewhere, and
+
+441
+00:33:49,669 --> 00:33:54,709
+you might wonder if we really need it, and the answer is no, it turns out.
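+
+(Checking the bottleneck parameter count claimed above, again ignoring biases; C is an assumed channel count:)
+
+~~~python
+C = 64  # assumed channel count
+
+plain_3x3 = 3 * 3 * C * C                    # plain 3x3 conv, C -> C: 9 C^2
+bottleneck = (1 * 1 * C * (C // 2)           # 1x1 conv, C -> C/2
+              + 3 * 3 * (C // 2) * (C // 2)  # 3x3 conv at reduced depth
+              + 1 * 1 * (C // 2) * C)        # 1x1 conv, C/2 -> C
+print(plain_3x3, bottleneck)  # 9*C*C versus 3.25*C*C, as on the slide
+~~~
+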
+
+442
+00:33:54,709 --> 00:33:59,808
+So one crazy thing that I've seen recently is that you can factor
+
+443
+00:33:59,808 --> 00:34:05,608
+the three by three convolution into a three by one and a one by three, and compared to
+
+444
+00:34:05,608 --> 00:34:09,469
+the single three by three convolution this ends up saving you some parameters
+
+445
+00:34:09,469 --> 00:34:14,428
+as well. So if you really go crazy, you can combine this one by
+
+446
+00:34:14,429 --> 00:34:18,019
+three and three by one together with this bottlenecking idea, and things just
+
+447
+00:34:18,018 --> 00:34:22,358
+get really cheap, and that's basically what Google has done in their most
+
+448
+00:34:22,358 --> 00:34:27,038
+recent version of Inception. There's this kind of crazy paper, "Rethinking the
+
+449
+00:34:27,039 --> 00:34:30,389
+Inception Architecture for Computer Vision", where they play a lot of these
+
+450
+00:34:30,389 --> 00:34:34,169
+crazy tricks about factoring convolutions in weird ways and having a
+
+451
+00:34:34,168 --> 00:34:37,138
+lot of one by one bottlenecks and then projections back up to different
+
+452
+00:34:37,139 --> 00:34:40,608
+dimensions. And if you thought the original GoogLeNet with their
+
+453
+00:34:40,608 --> 00:34:42,699
+Inception module was crazy,
+
+454
+00:34:42,699 --> 00:34:46,118
+these are the Inception modules that Google is now using in
+
+455
+00:34:46,119 --> 00:34:47,329
+their newest Inception net,
+
+456
+00:34:47,329 --> 00:34:50,739
+and the interesting features here are that they have these one by one
+
+457
+00:34:50,739 --> 00:34:55,819
+bottlenecks everywhere and also these asymmetric filters, to again
+
+458
+00:34:55,820 --> 00:35:01,519
+save on computation. So this stuff is not super widely used yet, but it's
+
+459
+00:35:01,519 --> 00:35:05,079
+out there in these GoogLeNet-like architectures, so it's something cool to mention.
+
+460
+00:35:05,079 --> 00:35:14,610
+So to quickly recap on convolutions and how to stack them:
+
+461
+00:35:14,610 --> 00:35:18,530
+instead of having a single convolution with a large filter size,
+
+462
+00:35:18,530 --> 00:35:22,740
+it's usually better to break it up into multiple smaller filters, and that even
+
+463
+00:35:22,739 --> 00:35:26,339
+maybe helps explain the difference between something like VGG, which has
+
+464
+00:35:26,340 --> 00:35:30,059
+many, many three by three filters, and something like AlexNet, which has fewer,
+
+465
+00:35:30,059 --> 00:35:35,119
+larger filters. Another thing that's actually become pretty common, I think, is
+
+466
+00:35:35,119 --> 00:35:38,829
+this idea of one by one bottlenecking; you see that in both versions of
+
+467
+00:35:38,829 --> 00:35:42,579
+GoogLeNet and also in ResNet, and that actually helps you save a lot on parameters;
+
+468
+00:35:42,579 --> 00:35:46,340
+I think that's a useful trick to keep in mind. And this idea of factoring
+
+469
+00:35:46,340 --> 00:35:50,890
+convolutions into these asymmetric filters is, I think, maybe not so widely
+
+470
+00:35:50,889 --> 00:35:54,629
+used right now, but it may become more commonly used in the future, I'm not sure.
+
+471
+00:35:54,630 --> 00:36:00,160
+The basic overarching theme for all of these tricks is that they let you
+
+472
+00:36:00,159 --> 00:36:04,289
+have fewer learnable parameters, less compute, and more
+
+473
+00:36:04,289 --> 00:36:07,739
+nonlinearity, which are all sort of nice features to have in your
+architectures.
+
+474
+00:36:07,739 --> 00:36:18,779
+So, are there any questions about these convolution architecture designs, or
+
+475
+00:36:18,780 --> 00:36:21,300
+were they too obvious?
+
+476
+00:36:21,300 --> 00:36:26,340
+OK, so the next thing is that once you've decided how you want
+
+477
+00:36:26,340 --> 00:36:30,760
+to wire up your stack of convolutions, you actually need to compute them, and
+
+478
+00:36:30,760 --> 00:36:33,630
+there's actually been a lot of work on different ways to implement
+
+479
+00:36:33,630 --> 00:36:37,950
+convolutions. We asked you to implement them on the assignments using for loops, and
+
+480
+00:36:37,949 --> 00:36:43,960
+that, as you may have guessed, doesn't scale too well. So a pretty
+
+481
+00:36:43,960 --> 00:36:47,720
+easy approach that's pretty easy to implement is this idea of the im2col
+
+482
+00:36:47,719 --> 00:36:52,269
+method. The intuition here is that we know matrix multiplication is really
+
+483
+00:36:52,269 --> 00:36:56,809
+fast: on pretty much any computing architecture out there, someone has
+
+484
+00:36:56,809 --> 00:37:00,949
+written a really, really well optimized matrix multiplication routine or library.
+
+485
+00:37:00,949 --> 00:37:06,230
+So the idea of im2col is thinking: well, given that matrix multiplication is
+
+486
+00:37:06,230 --> 00:37:07,400
+really fast,
+
+487
+00:37:07,400 --> 00:37:11,420
+is there some way that we can take this convolution operation and recast it as a
+
+488
+00:37:11,420 --> 00:37:17,800
+matrix multiply? It turns out that this is actually somewhat easy to
+
+489
+00:37:17,800 --> 00:37:22,930
+do once you think about it. The idea is that we have an input volume that's
+
+490
+00:37:22,929 --> 00:37:28,549
+H by W by C, and we have a filter bank of convolutional filters;
+
+491
+00:37:28,550 --> 00:37:32,730
+each one of these is going to be a K by K by C volume, so it has a
+
+492
+00:37:32,730 --> 00:37:36,659
+K by K receptive field and a depth of C to match the
+
+493
+00:37:36,659 --> 00:37:39,989
+input over here, and we're going to have D of these filters. Then we want
+
+494
+00:37:39,989 --> 00:37:44,809
+to turn this into a matrix multiply problem. The idea is that
+
+495
+00:37:44,809 --> 00:37:48,829
+we're going to take the first receptive field
+
+496
+00:37:48,829 --> 00:37:54,019
+of the image, which is going to be this K by K by C region in
+
+497
+00:37:54,019 --> 00:37:58,130
+the input volume, and reshape it into this column of K squared
+
+498
+00:37:58,130 --> 00:38:01,910
+times C elements, and then we're going to repeat this for every possible
+
+499
+00:38:01,909 --> 00:38:05,909
+receptive field in the image: we're going to take this little guy and
+
+500
+00:38:05,909 --> 00:38:09,359
+shift him over all possible regions in the image, and here I'm just saying
+
+501
+00:38:09,360 --> 00:38:12,680
+that there are going to be maybe N different receptive field
+
+502
+00:38:12,679 --> 00:38:18,389
+locations. So now we've taken our image and reshaped it into this giant
+
+503
+00:38:18,389 --> 00:38:25,139
+matrix, N by K squared C. Anyone see a potential
+
+504
+00:38:25,139 --> 00:38:28,139
+problem with this, maybe?
+
+505
+00:38:28,139 --> 00:38:36,829
+Yeah, that's true: this tends to use a lot of memory, right? Any
+
+506
+00:38:36,829 --> 00:38:41,380
+element in this volume that appears in multiple receptive fields is
+
+507
+00:38:41,380 --> 00:38:45,010
+going to be duplicated in multiple of these columns, and this is going to
+
+508
+00:38:45,010 --> 00:38:49,220
+get worse the more overlap there is between your receptive fields. But it
+
+509
+00:38:49,219 --> 00:38:52,839
+turns out that in practice that's actually not too big of a deal and it
+
+510
+00:38:52,840 --> 00:38:57,910
+works fine. Then we're going to run a similar trick on these convolutional
+
+511
+00:38:57,909 --> 00:39:01,699
+filters: if you remember what a convolution is doing, we want to take
+
+512
+00:39:01,699 --> 00:39:06,039
+inner products of each
+
+513
+00:39:06,039 --> 00:39:10,889
+convolutional weight against each receptive field location in the image.
+
+514
+00:39:10,889 --> 00:39:16,420
+Each of these convolutional weights is this K by K by C tensor, so
+
+515
+00:39:16,420 --> 00:39:21,059
+we're going to reshape each of those to be a row of K squared C elements; now we have D
+
+516
+00:39:21,059 --> 00:39:26,420
+filters, so we get a D by K squared C matrix. Now this is great:
+
+517
+00:39:26,420 --> 00:39:31,750
+this guy contains all the receptive fields; we
+
+518
+00:39:31,750 --> 00:39:37,039
+have one column per receptive field in the image, and in this matrix
+
+519
+00:39:37,039 --> 00:39:42,679
+each row is a different filter, so now we can easily compute all of these
+
+520
+00:39:42,679 --> 00:39:49,069
+inner products all at once with a single matrix multiply. I apologize
+
+521
+00:39:49,070 --> 00:39:52,809
+for these dimensions not quite lining up; I probably should swap these to make it more
+
+522
+00:39:52,809 --> 00:39:59,219
+obvious, but I think you get the idea. This gives a D by N result, where
+
+523
+00:39:59,219 --> 00:40:03,659
+D is our number of output filters and N is the number of receptive
+
+524
+00:40:03,659 --> 00:40:07,469
+field locations in the image; then you play a similar trick to take this and
+
+525
+00:40:07,469 --> 00:40:13,000
+reshape it into your 3D output tensor. You can actually extend this
+
+526
+00:40:13,000 --> 00:40:16,219
+to minibatches quite easily: if you have a minibatch of these
+
+527
+00:40:16,219 --> 00:40:24,099
+elements, you just add more rows, with one set of rows per minibatch element. This
+
+528
+00:40:24,099 --> 00:40:28,589
+actually is pretty easy to implement. So, yeah?
+
+529
+00:40:28,590 --> 00:40:35,090
+[question] That depends on your implementation, right; you
+
+530
+00:40:35,090 --> 00:40:39,910
+have to worry about things like memory layout and stuff like that, but sometimes
+
+531
+00:40:39,909 --> 00:40:45,099
+you even do that reshape operation on the GPU, so you can do it in parallel. But
+
+532
+00:40:45,099 --> 00:40:50,089
+as a case study, this is really easy to implement, so if
+
+533
+00:40:50,090 --> 00:40:53,470
+you don't have a convolution routine available and you need to implement one
+
+534
+00:40:53,469 --> 00:40:57,869
+fast, this is probably the one to choose. And if you look at actual Caffe, in
+
+535
+00:40:57,869 --> 00:41:01,119
+earlier versions of Caffe, this is the method they used for doing convolutions.
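+
+(A minimal im2col sketch in numpy, not the lecture's or the homework's code: stride 1, no padding; the homework's fast layers avoid these Python loops with fancier indexing:)
+
+~~~python
+import numpy as np
+
+def conv_im2col(x, w):
+    """Convolution (really cross-correlation) via im2col.
+
+    x: (C, H, W) input volume; w: (D, C, K, K) filter bank.
+    """
+    C, H, W = x.shape
+    D, _, K, _ = w.shape
+    Hn, Wn = H - K + 1, W - K + 1
+    # Lay out every K x K x C receptive field as one column (N columns total).
+    cols = np.empty((C * K * K, Hn * Wn))
+    for i in range(Hn):
+        for j in range(Wn):
+            cols[:, i * Wn + j] = x[:, i:i + K, j:j + K].ravel()
+    # Reshape each filter into a row, so conv becomes one big matrix multiply.
+    rows = w.reshape(D, C * K * K)
+    out = rows @ cols               # (D, N) result: all inner products at once
+    return out.reshape(D, Hn, Wn)   # back to a 3D output volume
+~~~
+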
+
+536
+00:41:01,119 --> 00:41:07,730
+So this is the convolution forward code for the GPU in Caffe, the
+
+537
+00:41:07,730 --> 00:41:12,630
+native GPU convolution; you can see in this red chunk they're calling into the
+
+538
+00:41:12,630 --> 00:41:18,070
+im2col method: it's taking their
+
+539
+00:41:18,070 --> 00:41:22,900
+input image, this is their input tensor, and then they're going
+
+540
+00:41:22,900 --> 00:41:27,050
+to reshape it by calling the im2col method and store it in this
+
+541
+00:41:27,050 --> 00:41:33,519
+column GPU tensor; then they're going to do a matrix-matrix multiply, calling
+
+542
+00:41:33,519 --> 00:41:37,980
+into cuBLAS for the matrix multiply, and then add a bias. So that's
+
+543
+00:41:37,980 --> 00:41:42,840
+how it works, and these things tend to work quite well in practice.
+
+544
+00:41:42,840 --> 00:41:45,850
+As another case study, if you remember, the fast layers we gave you on the
+
+545
+00:41:45,849 --> 00:41:51,500
+assignments actually use this exact same strategy: here we actually do an im2col
+
+546
+00:41:51,500 --> 00:41:55,940
+operation with some crazy numpy tricks, and then we can actually do
+
+547
+00:41:55,940 --> 00:42:00,230
+the convolution inside the fast layers with a single call to the numpy matrix
+
+548
+00:42:00,230 --> 00:42:03,900
+multiplication. As you saw on your homework, this is usually a couple
+
+549
+00:42:03,900 --> 00:42:07,740
+hundred times faster than using for loops, so this actually works pretty well,
+
+550
+00:42:07,739 --> 00:42:18,209
+and it's pretty easy to implement. Any questions about im2col? [question]
+
+551
+00:42:18,210 --> 00:42:24,949
+You have to think about it a little bit, but if you think really hard you'll
+
+552
+00:42:24,949 --> 00:42:28,219
+realize that the backward pass of a convolution is actually also a
+
+553
+00:42:28,219 --> 00:42:33,358
+convolution, which you may have figured out if you were thinking about it on
+
+554
+00:42:33,358 --> 00:42:37,269
+your homework; the backward pass of a convolution is actually also a type of
+
+555
+00:42:37,269 --> 00:42:41,070
+convolution over the upstream gradients, so you can actually use a similar
+
+556
+00:42:41,070 --> 00:42:45,789
+type of im2col method for the backward pass as well. The only trick there
+
+557
+00:42:45,789 --> 00:42:51,259
+is that when you do the backward pass, you need to sum gradients
+
+558
+00:42:51,260 --> 00:42:54,940
+across overlapping receptive fields in the upstream, so you need to be careful
+
+559
+00:42:54,940 --> 00:43:02,889
+about the col2im: you need to sum in the col2im in the backward pass.
+
+560
+00:43:02,889 --> 00:43:06,150
+You can check that the fast layers on the homework implement that
+
+561
+00:43:06,150 --> 00:43:11,050
+too, although actually for the fast layers on the homework the col2im
+
+562
+00:43:11,050 --> 00:43:18,910
+is in Cython; I couldn't find a way to get it fast enough in numpy.
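+
+(A small 1D illustration, under assumed numpy conventions, that the backward pass of a convolution is itself a convolution, with the overlapping receptive fields summed up:)
+
+~~~python
+import numpy as np
+
+x = np.random.randn(10)
+w = np.random.randn(3)
+y = np.correlate(x, w, mode='valid')   # forward pass: y[i] = sum_k x[i+k] w[k]
+
+dy = np.random.randn(y.size)           # pretend upstream gradient
+dx = np.convolve(dy, w, mode='full')   # backward pass: also a convolution
+
+# Numeric check against the definition dL/dx[j] = sum_i dy[i] * w[j - i];
+# the += over slices is exactly the summing across overlapping fields.
+dx_ref = np.zeros_like(x)
+for i in range(dy.size):
+    dx_ref[i:i + w.size] += dy[i] * w
+assert np.allclose(dx, dx_ref)
+~~~
+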
+
+563
+00:43:18,909 --> 00:43:22,710
+There's actually another way that sometimes people use for convolutions, and that's this idea of
+
+564
+00:43:22,710 --> 00:43:27,400
+the Fast Fourier Transform. If you have some memories from a signal
+
+565
+00:43:27,400 --> 00:43:30,700
+processing class or something like that, you might remember this thing called the
+
+566
+00:43:30,699 --> 00:43:34,639
+convolution theorem, which says that if you have two signals and you want to
+
+567
+00:43:34,639 --> 00:43:38,779
+convolve them, either discretely or continuously, with one another, then
+
+568
+00:43:38,780 --> 00:43:44,130
+taking the convolution of these two signals is the same as, or rather the
+
+569
+00:43:44,130 --> 00:43:47,820
+Fourier transform of the convolution is the same as the elementwise product of the
+
+570
+00:43:47,820 --> 00:43:51,859
+Fourier transforms. So if you unpack that and stare at the symbols, I
+
+571
+00:43:51,858 --> 00:43:56,779
+think it'll make sense. And you might also remember, from again a signal
+
+572
+00:43:56,780 --> 00:44:00,240
+processing class or an algorithms class, that there's this amazing thing called the
+
+573
+00:44:00,239 --> 00:44:04,299
+Fast Fourier Transform that actually lets us compute Fourier
+
+574
+00:44:04,300 --> 00:44:08,080
+transforms and inverse Fourier transforms really, really fast;
+
+575
+00:44:08,079 --> 00:44:11,679
+you may have seen there are versions of this in 1D and 2D, and they're all
+
+576
+00:44:11,679 --> 00:44:17,129
+really fast. So we can actually apply this trick to convolutions. The way this
+
+577
+00:44:17,130 --> 00:44:20,660
+works is that first we're going to use the Fast Fourier Transform
+
+578
+00:44:20,659 --> 00:44:24,899
+to compute the Fourier transform of the weights, and also compute the Fourier
+
+579
+00:44:24,900 --> 00:44:30,320
+transform of our activation map; now in Fourier space we just do an elementwise
+
+580
+00:44:30,320 --> 00:44:35,050
+multiplication, which is really fast and efficient, and then we
+
+581
+00:44:35,050 --> 00:44:40,269
+use the Fast Fourier Transform again to do the inverse transform of the output
+
+582
+00:44:40,269 --> 00:44:44,420
+of that elementwise product, and that implements convolutions for us in this
+
+583
+00:44:44,420 --> 00:44:52,550
+kind of cool, fancy, clever way. This has actually been used: some folks
+
+584
+00:44:52,550 --> 00:44:55,940
+at Facebook had a paper about this last year, and they actually released a
+
+585
+00:44:55,940 --> 00:44:57,650
+GPU library to
+
+586
+00:44:57,650 --> 00:45:03,329
+compute these things. But the sad thing about these Fourier transform convolutions is that
+
+587
+00:45:03,329 --> 00:45:07,819
+they actually give you really big speedups over other methods, but really
+
+588
+00:45:07,820 --> 00:45:11,970
+only for large filters; when you're working with these small three by three
+
+589
+00:45:11,969 --> 00:45:15,829
+filters, the overhead of computing the Fourier transform just outweighs
+
+590
+00:45:15,829 --> 00:45:20,449
+the cost of doing the computation directly in the input pixel space,
+
+591
+00:45:20,449 --> 00:45:25,579
+and as we just talked about earlier in the lecture, small convolutions are
+
+592
+00:45:25,579 --> 00:45:30,389
+really nice and appealing and great for lots of reasons. So it's a
+
+593
+00:45:30,389 --> 00:45:33,489
+little bit of a shame that this Fourier trick doesn't work out too well in practice,
+
+594
+00:45:33,489 --> 00:45:38,439
+but if for some reason you do want to compute really large convolutions, then
+
+595
+00:45:38,440 --> 00:45:46,019
+this is something you can try. Yeah?
+
+596
+00:45:46,019 --> 00:46:02,489
+[question] That gets too involved in the details, but I imagine if you think it's a problem, it's probably a
+
+597
+00:46:02,489 --> 00:46:04,639
+problem.
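+
+(The convolution theorem in action, as a 1D numpy sketch: pointwise multiplication in Fourier space equals circular convolution in signal space; in practice you would zero-pad to turn the circular convolution into a linear one:)
+
+~~~python
+import numpy as np
+
+x = np.random.randn(8)
+w = np.random.randn(8)
+
+# FFT of both signals, elementwise product, inverse FFT.
+fft_conv = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(w)))
+
+# Direct circular convolution for comparison.
+direct = np.array([sum(x[m] * w[(n - m) % 8] for m in range(8))
+                   for n in range(8)])
+assert np.allclose(fft_conv, direct)
+~~~
+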
+
+598
+00:46:04,639 --> 00:46:12,900
+Yeah, another thing to point out is that one downside of Fourier
+
+599
+00:46:12,900 --> 00:46:17,430
+transform convolutions is that they don't handle striding too well. For
+
+600
+00:46:17,429 --> 00:46:21,219
+normal convolutions, when you're computing strided convolutions in sort of normal
+
+601
+00:46:21,219 --> 00:46:25,409
+input space, you only compute a small subset of those inner products, so you
+
+602
+00:46:25,409 --> 00:46:28,489
+actually save a lot of computation when you stride the convolutions
+
+603
+00:46:28,489 --> 00:46:32,199
+directly in the input space; but the way you tend to implement strided
+
+604
+00:46:32,199 --> 00:46:36,649
+convolutions in Fourier transform space is you just compute the whole thing and
+
+605
+00:46:36,650 --> 00:46:43,180
+then throw out part of the data, so that ends up not being very efficient.
+
+606
+00:46:43,179 --> 00:46:47,969
+There's another trick that has not really become, I think, too widely
+
+607
+00:46:47,969 --> 00:46:51,989
+known yet, but I really liked it, so I thought I'd talk about it.
+
+608
+00:46:51,989 --> 00:46:55,909
+You may remember from algorithms class something called Strassen's algorithm,
+
+609
+00:46:55,909 --> 00:47:00,789
+right? There's this idea that when you do a naive matrix multiplication of two N
+
+610
+00:47:00,789 --> 00:47:04,869
+by N matrices, if you count up all the multiplications
+
+611
+00:47:04,869 --> 00:47:08,630
+and additions that you need to do, it's going to take about N
+
+612
+00:47:08,630 --> 00:47:12,950
+cubed operations. And Strassen's algorithm is this really crazy thing: you
+
+613
+00:47:12,949 --> 00:47:16,839
+compute all these crazy intermediates, and it somehow magically works out to
+
+614
+00:47:16,840 --> 00:47:22,289
+compute the output asymptotically faster than the naive method. And you know from
+
+615
+00:47:22,289 --> 00:47:26,869
+im2col that we can implement
+
+616
+00:47:26,869 --> 00:47:31,339
+convolution as matrix multiplication, so intuitively you might expect that
+
+617
+00:47:31,340 --> 00:47:35,110
+similar types of tricks might theoretically maybe be applicable to
+
+618
+00:47:35,110 --> 00:47:41,320
+convolution, and it turns out they are. There's this really cool paper that just
+
+619
+00:47:41,320 --> 00:47:46,370
+came out over the summer where these two guys worked out very explicitly
+
+620
+00:47:46,369 --> 00:47:50,670
+these special cases for three by three convolutions, and it involves this...
+
+621
+00:47:50,670 --> 00:47:54,659
+obviously I'm not going to go into the details here, but it's a similar flavor
+
+622
+00:47:54,659 --> 00:47:58,539
+to Strassen: computing very clever intermediates
+
+623
+00:47:58,539 --> 00:48:03,630
+and then recombining them to actually save a lot on the computation. And these
+
+624
+00:48:03,630 --> 00:48:08,220
+guys are actually really intense: they're not just mathematicians, they
+
+625
+00:48:08,219 --> 00:48:11,959
+also wrote highly optimized CUDA kernels to compute these
+
+626
+00:48:11,960 --> 00:48:17,570
+things, and were able to speed up VGG by a factor of two, so that's really
+
+627
+00:48:17,570 --> 00:48:21,890
+impressive. So I think this type of trick might become
+
+628
+00:48:21,889 --> 00:48:26,019
+pretty popular in the future, but for the time being I think it's not very widely
+
+629
+00:48:26,019 --> 00:48:30,650
+used. But these numbers are crazy: especially for small batch sizes, they're getting something like a 6x speedup on VGG.
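+
+(For the flavor of "clever intermediates, recombined": here is Strassen's trick on plain 2x2 matrices, seven multiplications instead of eight; the same identities apply blockwise, which is where the asymptotic win comes from. This is the classic textbook construction, not the paper's convolution algorithm:)
+
+~~~python
+import numpy as np
+
+def strassen_2x2(A, B):
+    """Multiply two 2x2 matrices with 7 scalar multiplications."""
+    (a, b), (c, d) = A
+    (e, f), (g, h) = B
+    p1 = a * (f - h)            # seven cleverly chosen intermediates
+    p2 = (a + b) * h
+    p3 = (c + d) * e
+    p4 = d * (g - e)
+    p5 = (a + d) * (e + h)
+    p6 = (b - d) * (g + h)
+    p7 = (a - c) * (e + f)
+    return np.array([[p5 + p4 - p2 + p6, p1 + p2],
+                     [p3 + p4, p1 + p5 - p3 - p7]])
+
+A, B = np.random.randn(2, 2), np.random.randn(2, 2)
+assert np.allclose(strassen_2x2(A, B), A @ B)
+~~~
+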
+
+630
+00:48:30,650 --> 00:48:35,010
+That's really impressive, and I
+
+631
+00:48:35,010 --> 00:48:38,770
+think it's a really cool method; the downside is that you kind of have to work
+
+632
+00:48:38,769 --> 00:48:43,009
+out these explicit special cases for each different size of convolution, but maybe
+
+633
+00:48:43,010 --> 00:48:45,850
+if we only care about three by three convolutions, that's not such a big deal.
+
+634
+00:48:45,849 --> 00:48:54,719
+So to recap computing convolutions in practice: the really
+
+635
+00:48:54,719 --> 00:48:58,579
+fast, easy, quick and dirty way to implement these things is im2col:
+
+636
+00:48:58,579 --> 00:49:02,869
+matrix multiplication is fast, and it's usually not too hard to implement
+
+637
+00:49:02,869 --> 00:49:06,609
+these things, so if for some reason you really need to implement convolutions
+
+638
+00:49:06,610 --> 00:49:11,400
+yourself, I'd really recommend im2col. FFT is something that, coming from
+
+639
+00:49:11,400 --> 00:49:15,230
+signal processing, you might think would be really cool and really useful, but it
+
+640
+00:49:15,230 --> 00:49:19,719
+turns out that it does give speedups, but only for big filters, so it's not
+
+641
+00:49:19,719 --> 00:49:24,000
+as useful as you might have hoped. But there is hope, because these fast
+
+642
+00:49:24,000 --> 00:49:25,440
+algorithms are really good for small
+
+643
+00:49:25,440 --> 00:49:29,650
+filters, and there already exists code somewhere in the world to do it, so
+
+644
+00:49:29,650 --> 00:49:35,889
+hopefully these things will catch on and become more widely used. So if
+
+645
+00:49:35,889 --> 00:49:41,529
+there are any questions about computing convolutions...
+
+646
+00:49:41,530 --> 00:49:50,940
+OK, so next we're going to talk about some implementation details. First question:
+
+647
+00:49:50,940 --> 00:49:55,710
+have any of you guys ever built your own computer?
+
+648
+00:49:55,710 --> 00:50:01,710
+OK, so you guys are prepared for the answer on this next slide. So who can
+
+649
+00:50:01,710 --> 00:50:07,869
+spot the CPU? Anyone want to point it out?
+
+650
+00:50:07,869 --> 00:50:17,210
+The CPU is this little guy, right; actually a
+
+651
+00:50:17,210 --> 00:50:22,179
+lot of this thing is the cooler, so the CPU itself is a tiny part inside of
+
+652
+00:50:22,179 --> 00:50:28,730
+here; a lot of this is actually the heatsink for cooling it. Next, spot the GPU.
+
+653
+00:50:28,730 --> 00:50:38,320
+Yes, it's the thing that says GeForce on it. So this GPU, for one thing,
+
+654
+00:50:38,320 --> 00:50:43,180
+is much larger than the CPU, so maybe it's more powerful, I don't
+
+655
+00:50:43,179 --> 00:50:48,679
+know, but at least it's taking up more space in the case, so that's kind
+
+656
+00:50:48,679 --> 00:50:54,309
+of an indication that something exciting is happening there. So another question:
+
+657
+00:50:54,309 --> 00:50:57,029
+do you guys play video games?
+
+658
+00:50:57,030 --> 00:51:05,390
+OK, then you probably have opinions about this. It turns out a lot of people in
+
+659
+00:51:05,389 --> 00:51:09,809
+machine learning and deep learning have really strong opinions too, and most
+
+660
+00:51:09,809 --> 00:51:15,639
+people are on the NVIDIA side: NVIDIA is actually much, much more widely used than
+
+661
+00:51:15,639 --> 00:51:21,179
+AMD for GPUs in deep learning, and the reason is that
+
+662
+00:51:21,179 --> 00:51:25,599
+
+646
+00:49:41,530 --> 00:49:50,940
+OK, so next we're going to talk about some
+implementation details. So, first question:
+
+647
+00:49:50,940 --> 00:49:55,710
+have any of you guys ever built your own
+computer?
+
+648
+00:49:55,710 --> 00:50:01,710
+OK, so you guys are prepared for the
+answer on this next slide. So, who can
+
+649
+00:50:01,710 --> 00:50:07,869
+spot the CPU? Anyone want to point it out?
+
+650
+00:50:07,869 --> 00:50:17,210
+The CPU is this little guy, right? So
+actually, a lot of this thing is
+
+651
+00:50:17,210 --> 00:50:22,179
+the cooler; the CPU
+itself is a little tiny part inside of
+
+652
+00:50:22,179 --> 00:50:28,730
+here, and a lot of this is actually the
+heatsink for cooling it. Next: spot the GPU.
+
+653
+00:50:28,730 --> 00:50:38,320
+Yes, it's the thing that says GeForce on
+it. And so this GPU is, for one thing,
+
+654
+00:50:38,320 --> 00:50:43,180
+much larger than the CPU, so
+maybe it's more powerful, I don't
+
+655
+00:50:43,179 --> 00:50:48,679
+know, but at least it's taking up more
+space in the case, so that's kind
+
+656
+00:50:48,679 --> 00:50:54,309
+of an indication that something exciting
+is happening. So, another question:
+
+657
+00:50:54,309 --> 00:50:57,029
+do you guys ever play video games?
+
+658
+00:50:57,030 --> 00:51:05,390
+OK, then you probably have opinions about
+this. It turns out a lot of people in
+
+659
+00:51:05,389 --> 00:51:09,809
+machine learning and deep learning have
+really strong opinions too, and most
+
+660
+00:51:09,809 --> 00:51:15,639
+people are on the NVIDIA side. So NVIDIA is
+actually much, much more widely used than
+
+661
+00:51:15,639 --> 00:51:21,179
+AMD for deep learning on GPUs, and the
+reason is that
+
+662
+00:51:21,179 --> 00:51:25,599
+NVIDIA has really done a lot in the last
+couple of years to really dive into deep
+
+663
+00:51:25,599 --> 00:51:30,710
+learning and make it a really core part
+of their focus. So as a cool example of
+
+664
+00:51:30,710 --> 00:51:34,769
+that, last year at GTC, which is
+
+665
+00:51:34,769 --> 00:51:39,869
+NVIDIA's sort of yearly, big, gigantic
+conference where they announce new products,
+
+666
+00:51:39,869 --> 00:51:44,230
+Jen-Hsun Huang, who is the CEO of NVIDIA
+and actually also a Stanford alum,
+
+667
+00:51:44,230 --> 00:51:49,059
+introduced their latest and greatest
+amazing new GPU, the Titan X,
+
+668
+00:51:49,059 --> 00:51:53,400
+their flagship thing, and the benchmark
+he used to sell it was how fast it
+
+669
+00:51:53,400 --> 00:51:56,800
+could train AlexNet. So this was crazy:
+
+670
+00:51:56,800 --> 00:52:00,140
+this was a gigantic room with like
+hundreds and hundreds of people and
+
+671
+00:52:00,139 --> 00:52:04,279
+journalists, and this gigantic,
+highly polished presentation, and the CEO
+
+672
+00:52:04,280 --> 00:52:07,890
+of NVIDIA was talking about AlexNet
+and convolutions, and I thought that was
+
+673
+00:52:07,889 --> 00:52:11,690
+really exciting. It kind of shows you
+that NVIDIA really cares a lot about
+
+674
+00:52:11,690 --> 00:52:15,300
+getting these things to work, and they've
+pushed a lot of their efforts into
+
+675
+00:52:15,300 --> 00:52:22,150
+making it work. So just to
+give you an idea: a CPU, as you probably
+
+676
+00:52:22,150 --> 00:52:26,900
+know, is really good at fast sequential
+processing, and they tend to have a small
+
+677
+00:52:26,900 --> 00:52:31,019
+number of cores. Your laptop probably
+has maybe between one and four
+
+678
+00:52:31,019 --> 00:52:36,920
+cores, and big ones in a server might
+have up to 16 cores, and these things
+
+679
+00:52:36,920 --> 00:52:39,610
+are really good at computing things
+really, really fast
+
+680
+00:52:39,610 --> 00:52:45,349
+and in sequence. GPUs, on the other hand,
+tend to have many, many more cores; a
+
+681
+00:52:45,349 --> 00:52:49,759
+big one like a Titan X can have up to
+thousands of cores, but each
+
+682
+00:52:49,760 --> 00:52:53,500
+core can do less: they run at a lower clock
+speed and can do less per
+
+683
+00:52:53,500 --> 00:52:59,429
+instruction cycle. So these GPUs
+were originally developed for
+
+684
+00:52:59,429 --> 00:53:05,230
+processing graphics (graphics processing
+units), so they're really good at doing
+
+685
+00:53:05,230 --> 00:53:09,699
+sort of highly parallel operations, where
+you want to do many, many things in
+
+686
+00:53:09,699 --> 00:53:15,460
+parallel, independently. They
+were originally designed for computer
+
+687
+00:53:15,460 --> 00:53:19,590
+graphics, but since then they've sort of
+evolved into a more general computing
+
+688
+00:53:19,590 --> 00:53:23,100
+platform, so there are different
+frameworks that allow you to write
+
+689
+00:53:23,099 --> 00:53:28,929
+generic code to run directly on the GPU.
+From NVIDIA we have CUDA, a framework
+
+690
+00:53:28,929 --> 00:53:33,509
+that lets you write a variant of C and
+actually write code that runs directly
+
+691
+00:53:33,510 --> 00:53:37,990
+on the GPU, and there's a similar
+framework called OpenCL that works on
+
+692
+00:53:37,989 --> 00:53:43,569
+pretty much any computational
+platform. And, I mean, open standards are
+
+693
+00:53:43,570 --> 00:53:48,890
+nice, and it's quite nice that OpenCL
+works everywhere, but in practice CUDA
+
+694
+00:53:48,889 --> 00:53:52,559
+tends to be a lot more performant
+and have a little bit nicer library
+
+695
+00:53:52,559 --> 00:53:57,420
+support, so at least for deep learning,
+most people use CUDA instead. And if
+
+696
+00:53:57,420 --> 00:54:01,309
+you're interested in actually learning
+how to write GPU code yourself,
+
+697
+00:54:01,309 --> 00:54:05,230
+there's a really cool Udacity course;
+it's pretty cool, it has fun
+
+698
+00:54:05,230 --> 00:54:09,409
+assignments and all that, and lets you write
+code to run things on the GPU. Although in
+
+699
+00:54:09,409 --> 00:54:12,730
+practice, if all you want to do is train
+convnets and do research and that sort
+
+700
+00:54:12,730 --> 00:54:16,409
+of thing, you end up usually not having
+to write any of this code yourself; you
+
+701
+00:54:16,409 --> 00:54:20,139
+just rely on external libraries.
+
+702
+00:54:20,139 --> 00:54:33,440
+Right, so CUDA is like this raw,
+low-level thing, and there are higher-level libraries
+
+703
+00:54:33,440 --> 00:54:38,599
+on top, kind of like cuBLAS. Right, so one thing
+that GPUs are really, really good at is
+
+704
+00:54:38,599 --> 00:54:43,420
+matrix multiplication. So here's a
+benchmark; I mean, this is from NVIDIA's
+
+705
+00:54:43,420 --> 00:54:49,550
+website, so it's a little bit biased, but
+this is showing matrix multiplication
+
+706
+00:54:49,550 --> 00:54:54,789
+time as a function of matrix size on a
+pretty beefy CPU; this is a 12-core chip
+
+707
+00:54:54,789 --> 00:55:00,079
+that would live in a server, so that's
+quite a healthy CPU. And this is
+
+708
+00:55:00,079 --> 00:55:04,000
+running the same-size matrix
+multiply on a Tesla K40, which is a
+
+709
+00:55:04,000 --> 00:55:11,000
+pretty beefy GPU, and it's much faster; I
+mean, that's no big surprise, right? And
+
+710
+00:55:11,000 --> 00:55:15,119
+GPUs are also really good at convolutions.
+As we mentioned, NVIDIA has a
+
+711
+00:55:15,119 --> 00:55:19,909
+library called cuDNN that has
+specifically optimized CUDA
+
+712
+00:55:19,909 --> 00:55:26,139
+kernels for convolution. So compared to a
+CPU, I mean, it's WAY faster, and this
+
+713
+00:55:26,139 --> 00:55:30,139
+is actually comparing the im2col
+convolutions from Caffe with the
+
+714
+00:55:30,139 --> 00:55:34,920
+cuDNN convolutions. I think these
+graphs are actually from the first
+
+715
+00:55:34,920 --> 00:55:41,030
+version of cuDNN; version 4 just came out
+a few weeks ago, but this is the only
+
+716
+00:55:41,030 --> 00:55:44,600
+version where they actually had a CPU
+benchmark; since then the benchmarks have
+
+717
+00:55:44,599 --> 00:55:49,699
+only been against previous versions, so
+it's gotten a lot faster since
+
+718
+00:55:49,699 --> 00:55:54,769
+then. But the way this fits in
+is that something like cuBLAS or
+
+719
+00:55:54,769 --> 00:56:00,090
+cuDNN is a C library, so it provides
+functions in C that just sort of
+
+720
+00:56:00,090 --> 00:56:05,309
+abstract away the GPU. As a C library, if
+you have a tensor sitting in
+
+721
+00:56:05,309 --> 00:56:09,429
+memory in C, you can just pass a
+pointer to the cuDNN library and it'll
+
+722
+00:56:09,429 --> 00:56:13,299
+run the convolution on the GPU,
+maybe asynchronously, and return the
+
+723
+00:56:13,300 --> 00:56:19,440
+result. So frameworks like Caffe and Torch
+have all now integrated the cuDNN
+
+724
+00:56:19,440 --> 00:56:23,750
+stuff into their own frameworks; you can
+utilize these efficient convolutions in any
+
+725
+00:56:23,750 --> 00:56:30,340
+of these frameworks now. But the problem
+is that even once we have these
+
+726
+00:56:30,340 --> 00:56:33,430
+really powerful GPUs, training big models
+is still kind of
+
+727
+00:56:33,429 --> 00:56:39,409
+slow. So VGGNet was famously trained for
+something like two to three weeks on four
+
+728
+00:56:39,409 --> 00:56:43,759
+Titans (I think it was Titan Blacks), and those
+aren't cheap. And there was actually a
+
+729
+00:56:43,760 --> 00:56:47,280
+reimplementation of ResNet recently;
+there's a really cool write-up, this
+
+730
+00:56:47,280 --> 00:56:51,839
+really cool blog post describing it here,
+and they actually retrained the ResNet
+
+731
+00:56:51,838 --> 00:56:56,400
+101-layer model, and it also
+took about two weeks to train on four
+
+732
+00:56:56,400 --> 00:57:03,880
+GPUs. So that's not good. And the
+easy way that people
+
+733
+00:57:03,880 --> 00:57:08,269
+split up training across multiple GPUs
+is just to split your minibatch across
+
+734
+00:57:08,269 --> 00:57:14,230
+the GPUs. So normally, especially
+for something like VGG, it takes
+
+735
+00:57:14,230 --> 00:57:17,679
+a lot of memory, so you can't compute
+with very large minibatch sizes on a
+
+736
+00:57:17,679 --> 00:57:23,649
+single GPU. So what you'll do is you have a
+minibatch of images, maybe of size 128 or
+
+737
+00:57:23,650 --> 00:57:24,700
+something like that,
+
+738
+00:57:24,699 --> 00:57:30,338
+then split the minibatch into four equal chunks,
+each GPU computes a forward and backward
+
+739
+00:57:30,338 --> 00:57:35,190
+pass for its chunk of the minibatch, you compute
+parameter gradients on the weights,
+
+740
+00:57:35,190 --> 00:57:39,470
+you sum those gradients
+across all four GPU
+
+741
+00:57:39,469 --> 00:57:44,548
+chunks, and you make an update to your model. So
+this is a really simple way that people
+
+742
+00:57:44,548 --> 00:57:53,599
+tend to implement distribution over GPUs.
+Yeah?
+
+743
+00:57:53,599 --> 00:57:59,089
+Yeah, so that's why they claim that
+they can automate this process and
+
+744
+00:57:59,090 --> 00:58:03,039
+really, really efficiently distribute it,
+which is really exciting, I think, but I
+
+745
+00:58:03,039 --> 00:58:07,820
+haven't played with it much myself. And also, at
+least in Torch, there's a data-parallel
+
+746
+00:58:07,820 --> 00:58:11,059
+layer that you can just drop in and use
+that will sort of automatically do
+
+747
+00:58:11,059 --> 00:58:14,070
+this type of parallelism very easily.
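A tiny sketch of this minibatch-splitting idea, assuming numpy and a caller-supplied `grad` function; the per-shard calls stand in for per-GPU work, and nothing here actually touches a GPU.

~~~python
import numpy as np

def sgd_step_data_parallel(w, X, y, grad, lr=1e-2, n_gpus=4):
    """One synchronous update: shard the minibatch, average the gradients."""
    shards_X = np.array_split(X, n_gpus)
    shards_y = np.array_split(y, n_gpus)
    # In a real system these gradient computations run concurrently, one per GPU.
    grads = [grad(w, xs, ys) for xs, ys in zip(shards_X, shards_y)]
    return w - lr * np.mean(grads, axis=0)  # single synchronized update

# Example with a linear least-squares gradient:
grad = lambda w, X, y: 2 * X.T @ (X @ w - y) / len(y)
X, y = np.random.randn(128, 10), np.random.randn(128)
w = sgd_step_data_parallel(np.zeros(10), X, y, grad)
~~~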
+
+748
+00:58:14,070 --> 00:58:18,930
+A slightly more complex idea for multi-GPU
+training actually comes from Alex of
+
+749
+00:58:18,929 --> 00:58:21,279
+AlexNet fame.
+
+750
+00:58:21,280 --> 00:58:26,670
+I guess that's kind of cool, kind of a
+funny title. But the idea is
+
+751
+00:58:26,670 --> 00:58:31,409
+that we want to actually do data
+parallelism on the lower layers. So on
+
+752
+00:58:31,409 --> 00:58:35,980
+the lower layers, we'll take our image
+minibatch, split it up across two GPUs, and
+
+753
+00:58:35,980 --> 00:58:42,059
+GPU one will compute the
+convolutions for the first part
+
+754
+00:58:42,059 --> 00:58:46,279
+of the minibatch; just this
+convolution part will be
+
+755
+00:58:46,280 --> 00:58:49,960
+distributed equally across the GPUs. But
+once you get to the fully connected
+
+756
+00:58:49,960 --> 00:58:50,760
+layers,
+
+757
+00:58:50,760 --> 00:58:54,800
+he found it's actually more efficient,
+since these are just really big matrix
+
+758
+00:58:54,800 --> 00:58:58,810
+multiplies, to
+have the GPUs work together to
+
+759
+00:58:58,809 --> 00:59:02,869
+compute the matrix multiply. This is
+kind of a cool trick; it's not very
+
+760
+00:59:02,869 --> 00:59:09,480
+commonly used, but I thought it's
+fun to mention. Another idea, from Google:
+
+761
+00:59:09,480 --> 00:59:13,800
+before there was TensorFlow,
+they had this thing called
+
+762
+00:59:13,800 --> 00:59:18,380
+DistBelief, which was their previous
+system, which was entirely CPU-based,
+
+763
+00:59:18,380 --> 00:59:22,630
+which, from the benchmarks a few slides
+ago, you can imagine was going to be
+
+764
+00:59:22,630 --> 00:59:26,250
+really slow. But actually the first
+version of GoogLeNet was all trained
+
+765
+00:59:26,250 --> 00:59:30,800
+in DistBelief on CPU, so they
+had to do massive amounts of
+
+766
+00:59:30,800 --> 00:59:35,800
+distribution on CPU to get these things
+to train. So there's this cool paper
+
+767
+00:59:35,800 --> 00:59:39,530
+from Jeff Dean from a couple of years ago that
+describes this in a lot more detail. But
+
+768
+00:59:39,530 --> 00:59:43,640
+you use data parallelism, where you have
+each machine hold an independent copy of
+
+769
+00:59:43,639 --> 00:59:48,710
+the model, and each machine is computing
+forward and backward on batches of data,
+
+770
+00:59:48,710 --> 00:59:52,659
+but now, in addition, you actually have this
+parameter server that's storing the
+
+771
+00:59:52,659 --> 00:59:55,739
+parameters of the model, and these
+independent workers are
+
+772
+00:59:55,739 --> 01:00:01,209
+communicating with the parameter server
+to make updates to the model. And they
+
+773
+01:00:01,210 --> 01:00:05,740
+contrast this with model parallelism,
+which is where you take one
+
+774
+01:00:05,739 --> 01:00:09,879
+model and you have different
+workers computing different parts of the
+
+775
+01:00:09,880 --> 01:00:14,650
+model. So in DistBelief they
+did a really good job
+
+776
+01:00:14,650 --> 01:00:18,110
+optimizing this to work really well
+across many, many CPUs and many, many
+
+777
+01:00:18,110 --> 01:00:23,170
+machines, but now they have TensorFlow,
+which hopefully should do these things
+
+778
+01:00:23,170 --> 01:00:28,639
+more automatically. And once you're doing
+these updates, there's this
+
+779
+01:00:28,639 --> 01:00:34,949
+distinction between asynchronous SGD and
+synchronous SGD. So synchronous SGD is
+
+780
+01:00:34,949 --> 01:00:39,299
+sort of the naive thing
+you might expect: you have a minibatch, you
+
+781
+01:00:39,300 --> 01:00:42,880
+split it up across multiple workers, each
+worker does forward and backward and
+
+782
+01:00:42,880 --> 01:00:46,710
+computes gradients, then you add up all
+the gradients and make a single model
+
+783
+01:00:46,710 --> 01:00:51,220
+update. This will sort of
+exactly simulate
+
+784
+01:00:51,219 --> 01:00:55,029
+just computing that minibatch on a
+larger machine, but it could be kind of
+
+785
+01:00:55,030 --> 01:00:59,619
+slow, since you need to synchronize across
+machines. This tends not to be too much of a
+
+786
+01:00:59,619 --> 01:01:03,610
+big deal when you're working with
+multiple GPUs on a single node, but once
+
+787
+01:01:03,610 --> 01:01:08,430
+you're distributed across many, many CPUs,
+that synchronization
+
+788
+01:01:08,429 --> 01:01:12,569
+can actually be quite expensive. So
+instead they also have this
+
+789
+01:01:12,570 --> 01:01:17,500
+concept of asynchronous SGD, where each
+worker is just sort of making updates to
+
+790
+01:01:17,500 --> 01:01:21,599
+its own copy of the parameters,
+and those have some notion of
+
+791
+01:01:21,599 --> 01:01:25,480
+eventual consistency, where they
+sometimes periodically synchronize with
+
+792
+01:01:25,480 --> 01:01:29,530
+each other. It seems really
+complicated and hard to debug, but they
+
+793
+01:01:29,530 --> 01:01:35,619
+got it to work, so that's
+pretty cool.
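For contrast, here is a deliberately simplified, lock-free toy of the asynchronous style using Python threads; it only imitates the flavor of stale reads and interleaved writes, not Google's actual parameter-server system.

~~~python
import threading
import numpy as np

w = np.zeros(10)  # shared "parameter server" state
X, y = np.random.randn(512, 10), np.random.randn(512)
grad = lambda w, X, y: 2 * X.T @ (X @ w - y) / len(y)

def async_worker(shard_X, shard_y, steps=100, lr=1e-2):
    global w
    for _ in range(steps):
        g = grad(w, shard_X, shard_y)  # may read slightly stale parameters
        w = w - lr * g                 # no lock: updates can interleave

threads = [threading.Thread(target=async_worker, args=(xs, ys))
           for xs, ys in zip(np.array_split(X, 4), np.array_split(y, 4))]
for t in threads: t.start()
for t in threads: t.join()
print("final loss:", np.mean((X @ w - y) ** 2))  # converges despite races
~~~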
+
+794
+01:01:35,619 --> 01:01:39,430
+One of the really cool pitches (these two
+figures are both in the TensorFlow paper),
+
+795
+01:01:39,429 --> 01:01:42,549
+one of the pitches of TensorFlow, is that it
+should really make this type of
+
+796
+01:01:42,550 --> 01:01:46,510
+distribution much more transparent to
+the user: that if you do happen to have
+
+797
+01:01:46,510 --> 01:01:51,580
+access to a big cluster of GPUs and CPUs
+and whatnot, TensorFlow should
+
+798
+01:01:51,579 --> 01:01:54,840
+automatically be able to figure out the
+best way to do these kinds of
+
+799
+01:01:54,840 --> 01:01:58,970
+distributions, combining data and model
+parallelism, and just do it all for you.
+
+800
+01:01:58,969 --> 01:02:03,399
+So that's really cool, and I think
+that's the really exciting part
+
+801
+01:02:03,400 --> 01:02:11,050
+about TensorFlow. Any questions about
+this distributed training? Yeah?
+
+802
+01:02:11,050 --> 01:02:16,120
+And CNTK? I haven't even taken a look at
+it yet.
+
+803
+01:02:16,119 --> 01:02:22,130
+OK, so next: there are a couple of
+bottlenecks you should be aware of in
+
+804
+01:02:22,130 --> 01:02:27,500
+practice. So usually when
+you're training these things,
+
+805
+01:02:27,500 --> 01:02:30,769
+this distributed stuff is nice and great, but
+you can actually go a long way with just
+
+806
+01:02:30,769 --> 01:02:34,840
+a single GPU on a single machine, and
+there are a lot of bottlenecks that
+
+807
+01:02:34,840 --> 01:02:39,160
+can get in the way. One is the
+communication between the CPU and GPU:
+
+808
+01:02:39,159 --> 01:02:44,759
+actually, in a lot of cases, especially
+when the data is small, the most
+
+809
+01:02:44,760 --> 01:02:48,000
+expensive part of the pipeline is
+copying the data onto the GPU and then
+
+810
+01:02:48,000 --> 01:02:51,579
+copying it back. Once you get things onto
+the GPU, you can do the
+
+811
+01:02:51,579 --> 01:02:55,719
+computation really, really fast and
+efficiently, but the copying is the
+
+812
+01:02:55,719 --> 01:03:01,089
+really slow part. So one idea is you want
+to make sure to avoid the memory copies:
+
+813
+01:03:01,090 --> 01:03:06,570
+one thing that you sometimes see is
+that at each layer of the network there's
+
+814
+01:03:06,570 --> 01:03:10,460
+copying back and forth from CPU to GPU, and
+that'll be really inefficient and slow
+
+815
+01:03:10,460 --> 01:03:14,170
+everything down, so ideally you want the
+whole forward and backward pass to run
+
+816
+01:03:14,170 --> 01:03:17,159
+on the GPU at once.
+
+817
+01:03:17,159 --> 01:03:21,139
+Another thing you'll sometimes see is a
+multithreaded approach, where you'll have
+
+818
+01:03:21,139 --> 01:03:27,849
+a CPU thread that is prefetching data
+into memory in the
+
+819
+01:03:27,849 --> 01:03:28,690
+background,
+
+820
+01:03:28,690 --> 01:03:34,070
+possibly also applying augmentations
+online, and then this background CPU
+
+821
+01:03:34,070 --> 01:03:37,470
+thread will be sort of preparing
+minibatches and possibly also shipping them
+
+822
+01:03:37,469 --> 01:03:41,669
+over to the GPU. You can kind of coordinate
+this loading of data, computing of
+
+823
+01:03:41,670 --> 01:03:44,680
+preprocessing, and shipping of
+
+824
+01:03:44,679 --> 01:03:48,940
+minibatch data to the GPU with actually
+doing the computations, and actually you
+
+825
+01:03:48,940 --> 01:03:51,980
+can get pretty involved, with some
+coordination of all these things in a
+
+826
+01:03:51,980 --> 01:03:57,719
+multithreaded way, and it can give you
+some good speedups. So Caffe in particular,
+
+827
+01:03:57,719 --> 01:04:01,059
+I think, already implements this
+prefetching of data for certain
+
+828
+01:04:01,059 --> 01:04:04,199
+types of data storage; in other
+frameworks you just have to roll your own.
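A sketch of that background-prefetching pattern with Python's standard queue and threading modules; random arrays stand in for real decoded and augmented minibatches.

~~~python
import queue
import threading
import numpy as np

batches = queue.Queue(maxsize=4)  # bounded: prefetcher can't run far ahead

def prefetcher(n_batches=20, batch_size=32):
    for _ in range(n_batches):
        x = np.random.rand(batch_size, 3, 32, 32).astype(np.float32)
        if np.random.rand() < 0.5:          # cheap stand-in augmentation
            x = x[:, :, :, ::-1].copy()     # horizontal flip
        batches.put(x)                      # blocks if training falls behind
    batches.put(None)                       # sentinel: no more data

threading.Thread(target=prefetcher, daemon=True).start()
while (x := batches.get()) is not None:
    pass  # the GPU forward/backward pass would happen here
~~~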
+
+829
+01:04:04,199 --> 01:04:11,839
+Another problem is the disk bottleneck.
+So hard disks are kind
+
+830
+01:04:11,840 --> 01:04:17,820
+of slow: they're cheap and they're big,
+but they're actually not the fastest. So
+
+831
+01:04:17,820 --> 01:04:22,220
+those are hard disks; now
+solid-state drives are much more common,
+
+832
+01:04:22,219 --> 01:04:25,730
+but the problem is solid-state drives
+are, you know, smaller and more expensive.
+
+833
+01:04:25,730 --> 01:04:30,590
+But they're a lot faster, so they get
+used a lot in practice. So,
+
+834
+01:04:30,590 --> 01:04:35,710
+one common feature of both
+hard disks and solid-state drives is
+
+835
+01:04:35,710 --> 01:04:39,889
+that they work best when you're reading data
+sequentially off the disk. So
+
+836
+01:04:39,889 --> 01:04:44,108
+one thing that would
+be really bad, for example, is
+
+837
+01:04:44,108 --> 01:04:48,569
+to have a big folder full of JPEG images,
+because now each of these images could
+
+838
+01:04:48,570 --> 01:04:52,309
+be located in different parts of the
+disk, so it could require
+
+839
+01:04:52,309 --> 01:04:56,619
+a random seek to read any individual JPEG
+image, and also, once you read the
+
+840
+01:04:56,619 --> 01:05:01,150
+JPEG, you have to decompress it into
+pixels; that's quite inefficient. So what
+
+841
+01:05:01,150 --> 01:05:05,079
+you'll see a lot of times in practice is
+that you'll actually preprocess your data
+
+842
+01:05:05,079 --> 01:05:10,059
+by decompressing it and just writing out
+the raw pixels of the entire dataset in one
+
+843
+01:05:10,059 --> 01:05:15,940
+giant contiguous file on disk. That
+takes a lot of disk space, but we do
+
+844
+01:05:15,940 --> 01:05:22,230
+it anyway, because it's all for the good
+of the convnets, right? So,
+
+845
+01:05:22,230 --> 01:05:27,400
+in Caffe we do this with, like,
+LevelDB, which is one commonly used
+
+846
+01:05:27,400 --> 01:05:33,599
+format; I've also used HDF5
+files a lot for this. But the idea is that
+
+847
+01:05:33,599 --> 01:05:39,280
+you want to just get your data all
+sequential on disk and already turned
+
+848
+01:05:39,280 --> 01:05:43,180
+into pixels. Then at training time, if you
+can't store all your data in
+
+849
+01:05:43,179 --> 01:05:46,230
+memory, you have to read off the disk, and
+you want to make that read as fast as
+
+850
+01:05:46,230 --> 01:05:50,679
+possible. And again, with clever amounts
+of prefetching and multithreaded stuff,
+
+851
+01:05:50,679 --> 01:05:54,829
+you might have one thread
+prefetching data off the disk while other
+
+852
+01:05:54,829 --> 01:05:57,460
+computation is happening in the
+background.
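A sketch of that preprocess-once idea using h5py and Pillow, both assumed installed; the file names and image size are hypothetical placeholders.

~~~python
import h5py
import numpy as np
from PIL import Image

image_paths = ["img0.jpg", "img1.jpg"]  # hypothetical file list
labels = [0, 1]

with h5py.File("train.h5", "w") as f:
    # One contiguous uint8 array: decoded pixels, ready for sequential reads.
    dset = f.create_dataset("data", (len(image_paths), 3, 224, 224),
                            dtype="uint8")
    f.create_dataset("label", data=np.asarray(labels, dtype="int64"))
    for i, path in enumerate(image_paths):
        img = Image.open(path).convert("RGB").resize((224, 224))
        dset[i] = np.asarray(img).transpose(2, 0, 1)  # HWC -> CHW

# At train time, sequential slices like f["data"][i:i+256] read quickly.
~~~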
+
+853
+01:05:57,460 --> 01:06:05,019
+Another thing to keep in mind is GPU
+memory bottlenecks. So big GPUs
+
+854
+01:06:05,019 --> 01:06:10,559
+have a lot of memory, but not
+that much. The biggest GPUs you can
+
+855
+01:06:10,559 --> 01:06:15,539
+buy right now, the Titan X and the K40,
+have 12 gigs of memory, and that's
+
+856
+01:06:15,539 --> 01:06:18,139
+pretty much as big as you're going to
+get right now.
+
+857
+01:06:18,139 --> 01:06:22,679
+The next generation should be bigger, but you can
+actually bump up against this limit
+
+858
+01:06:22,679 --> 01:06:26,989
+without too much trouble, especially if
+you're training something like VGG, or
+
+859
+01:06:26,989 --> 01:06:31,608
+if you're running recurrent networks with
+very, very long timesteps. It's
+
+860
+01:06:31,608 --> 01:06:34,929
+actually not too hard to bump up against
+this memory limit; that's something you
+
+861
+01:06:34,929 --> 01:06:35,598
+need to keep in
+
+862
+01:06:35,599 --> 01:06:39,130
+mind when you're training these things.
+And some of these points about
+
+863
+01:06:39,130 --> 01:06:43,450
+efficient convolutions and
+cleverly designing architectures actually
+
+864
+01:06:43,449 --> 01:06:47,068
+help with this memory as well: if you
+can have a bigger, more powerful model
+
+865
+01:06:47,068 --> 01:06:52,268
+that uses less memory, then
+you'll be able to train
+
+866
+01:06:52,268 --> 01:06:58,129
+things faster and use bigger batches, and
+everything is good. And just to give a
+
+867
+01:06:58,130 --> 01:07:01,588
+sense of scale: AlexNet is pretty
+small compared to a lot of the models
+
+868
+01:07:01,588 --> 01:07:05,608
+that are state of the art now, but AlexNet
+with a 256 batch size already takes
+
+869
+01:07:05,608 --> 01:07:09,469
+about 3 gigabytes of GPU memory, so once you
+get to these bigger networks, it's
+
+870
+01:07:09,469 --> 01:07:15,738
+actually not too hard to bump up against
+the 12-gig limit. So another thing we
+
+871
+01:07:15,739 --> 01:07:20,978
+should talk about is floating-point
+precision. So when I'm writing code, a lot
+
+872
+01:07:20,978 --> 01:07:24,788
+of times I like to imagine that, you know,
+these things are just real numbers and
+
+873
+01:07:24,789 --> 01:07:27,960
+they just work, but in practice that's
+not true, and you need to think about
+
+874
+01:07:27,960 --> 01:07:32,889
+things like how many bits of
+floating point you're using. So a lot
+
+875
+01:07:32,889 --> 01:07:37,159
+of the numeric code that
+you might write works with double
+
+876
+01:07:37,159 --> 01:07:43,278
+precision by default; this is using 64
+bits. What's more
+
+877
+01:07:43,278 --> 01:07:47,449
+commonly used for deep learning is
+single precision: this is only
+
+878
+01:07:47,449 --> 01:07:52,710
+32 bits. So the idea is that if each
+number takes fewer bits, then you can
+
+879
+01:07:52,710 --> 01:07:56,469
+store more of those numbers within the
+same amount of memory (that's good), and
+
+880
+01:07:56,469 --> 01:08:00,559
+also, with fewer bits, you need less
+compute to operate on those numbers (that's
+
+881
+01:08:00,559 --> 01:08:05,210
+also good). So in general we would like to
+have smaller data types, because they're
+
+882
+01:08:05,210 --> 01:08:11,150
+faster to compute with and use less memory.
+And as a case study, this was
+
+883
+01:08:11,150 --> 01:08:15,489
+actually even an issue on the homework. So
+you may have noticed that
+
+884
+01:08:15,489 --> 01:08:16,960
+numpy's default data type is
+
+885
+01:08:16,960 --> 01:08:21,289
+64-bit double precision, but for all of
+the models that we provided you on the
+
+886
+01:08:21,289 --> 01:08:25,789
+homework, we had this cast to a 32-bit
+floating-point number, and you can
+
+887
+01:08:25,789 --> 01:08:28,670
+actually go back to the homework and try
+switching between these two, and you'll
+
+888
+01:08:28,670 --> 01:08:32,908
+see that switching to 32-bit
+actually gives you some
+
+889
+01:08:32,908 --> 01:08:39,670
+decent speedups. And the obvious
+question is: if 32 bits are better
+
+890
+01:08:39,670 --> 01:08:42,829
+than 64 bits, then maybe we can use even less
+than that?
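A quick way to see this effect for yourself, assuming numpy; the exact numbers depend on your machine and BLAS library.

~~~python
import time
import numpy as np

# Time a few large matrix multiplies in double vs. single precision.
for dtype in (np.float64, np.float32):
    a = np.random.randn(2048, 2048).astype(dtype)
    b = np.random.randn(2048, 2048).astype(dtype)
    t0 = time.time()
    for _ in range(3):
        a @ b
    print(dtype.__name__, round(time.time() - t0, 3), "seconds")
~~~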
+
+891
+01:08:42,829 --> 01:08:52,199
+So there's this... right:
+
+892
+01:08:52,199 --> 01:09:01,010
+16 bits. You were sort of led to that
+answer, but great. OK, so in addition to 32-bit
+
+893
+01:09:01,010 --> 01:09:05,420
+floating point, there's also a standard
+for 16-bit floating point, which is
+
+894
+01:09:05,420 --> 01:09:09,699
+sometimes called half precision, and
+actually recent versions of cuDNN do
+
+895
+01:09:09,699 --> 01:09:17,199
+support computing things in half precision,
+so that's cool. And actually there are
+
+896
+01:09:17,199 --> 01:09:20,050
+some other existing
+implementations from a company called
+
+897
+01:09:20,050 --> 01:09:23,850
+Nervana who also have these
+16-bit implementations, and these are
+
+898
+01:09:23,850 --> 01:09:28,350
+the fastest convolutions out there right
+now. So there's this nice GitHub
+
+899
+01:09:28,350 --> 01:09:31,850
+repo that has different kinds of
+convnet benchmarks for different types
+
+900
+01:09:31,850 --> 01:09:35,160
+of convolutions and frameworks and
+everything, and pretty much everything
+
+901
+01:09:35,159 --> 01:09:38,319
+winning all these benchmarks right now
+is these 16-bit floating-point
+
+902
+01:09:38,319 --> 01:09:42,279
+operations from Nervana, which is not
+surprising, right? Because you have
+
+903
+01:09:42,279 --> 01:09:47,479
+fewer bits, so it's faster to compute. But
+right now there's actually not yet
+
+904
+01:09:47,479 --> 01:09:51,479
+framework support in things like Caffe or
+Torch for utilizing this 16-bit
+
+905
+01:09:51,479 --> 01:09:57,299
+computation, but it should be coming very
+soon. But the problem is that, even if we
+
+906
+01:09:57,300 --> 01:10:01,420
+can compute fast (it's pretty
+obvious that if you have 16-bit numbers
+
+907
+01:10:01,420 --> 01:10:05,880
+you can compute with them very fast),
+once you get to 16 bits you might
+
+908
+01:10:05,880 --> 01:10:10,380
+actually be worried about numeric
+precision, because 2 to the 16 is
+
+909
+01:10:10,380 --> 01:10:13,550
+not that big of a number anymore; there are
+actually not too many real numbers you
+
+910
+01:10:13,550 --> 01:10:20,360
+can even represent. So there is this
+paper from a couple of years ago that did
+
+911
+01:10:20,359 --> 01:10:25,339
+some experiments with low-precision
+floating point, and they found that
+
+912
+01:10:25,340 --> 01:10:28,710
+(in the experiments they actually used a
+fixed-width, fixed-point
+
+913
+01:10:28,710 --> 01:10:34,819
+implementation) with this
+
+914
+01:10:34,819 --> 01:10:38,659
+sort of naive implementation of
+these low-precision methods, the networks
+
+915
+01:10:38,659 --> 01:10:43,689
+had a hard time converging, probably due
+to low-precision numeric
+
+916
+01:10:43,689 --> 01:10:46,710
+issues that kind of accumulate over
+multiple rounds of multiplication and
+
+917
+01:10:46,710 --> 01:10:50,989
+whatnot. But they found a simple trick,
+which was this idea of stochastic
+
+918
+01:10:50,989 --> 01:10:54,559
+rounding. So all their
+
+919
+01:10:54,560 --> 01:10:55,200
+parameters
+
+920
+01:10:55,199 --> 01:10:59,079
+and activations are stored in 16 bits, but
+when they perform a multiplication they
+
+921
+01:10:59,079 --> 01:11:03,269
+up-convert to a slightly higher-precision
+floating-point value, and then
+
+922
+01:11:03,270 --> 01:11:07,570
+they cast and round that back down
+to the lower precision. And actually doing
+
+923
+01:11:07,569 --> 01:11:11,789
+that rounding in a stochastic way, that
+is, not rounding to the nearest number
+
+924
+01:11:11,789 --> 01:11:16,479
+but probabilistically rounding to
+different numbers depending on how
+
+925
+01:11:16,479 --> 01:11:17,549
+close you are,
+
+926
+01:11:17,550 --> 01:11:21,860
+tends to work better in practice. So
+they found that, for example, when you're
+
+927
+01:11:21,859 --> 01:11:26,710
+using these 16-bit fixed-point
+numbers, with two bits for the integer part
+
+928
+01:11:26,710 --> 01:11:31,170
+and somewhere between 12 and 14 bits for
+
+929
+01:11:31,170 --> 01:11:35,239
+the fractional part, when you use this
+idea of always rounding to the nearest
+
+930
+01:11:35,239 --> 01:11:40,359
+number, these networks tend to diverge, but
+when you use these stochastic rounding
+
+931
+01:11:40,359 --> 01:11:43,599
+techniques, you can actually get
+these networks to converge quite nicely,
+
+932
+01:11:43,600 --> 01:11:47,170
+even with these very low-precision
+
+933
+01:11:47,170 --> 01:11:52,859
+fixed-point numbers.
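A sketch of unbiased stochastic rounding to a fixed-point grid, in the spirit of (but not copied from) the paper being described here.

~~~python
import numpy as np

def stochastic_round(x, frac_bits=14):
    """Round to a grid of 2**-frac_bits, up with probability = remainder."""
    scaled = x * (1 << frac_bits)
    floor = np.floor(scaled)
    # Unbiased: E[stochastic_round(x)] == x, so errors don't accumulate
    # systematically across many multiplications.
    up = np.random.rand(*x.shape) < (scaled - floor)
    return (floor + up) / (1 << frac_bits)

x = np.full(100_000, 0.30000123)
print(stochastic_round(x).mean())          # close to 0.30000123 on average
print((np.round(x * 2**14) / 2**14)[0])    # nearest rounding: fixed bias
~~~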
+
+934
+01:11:52,859 --> 01:11:59,089
+But you might want to ask: 16-bit is great,
+but can we go even lower than that?
+
+935
+01:11:59,090 --> 01:12:04,560
+There was another paper in 2015 that got down
+to 10 and 12 bits. So here, I mean,
+
+936
+01:12:04,560 --> 01:12:08,039
+from the previous paper we already had
+this intuition that maybe, when you're
+
+937
+01:12:08,039 --> 01:12:11,359
+using very low-precision numbers,
+you actually need more
+
+938
+01:12:11,359 --> 01:12:15,909
+precision in some parts of the network
+and lower precision in other parts of
+
+939
+01:12:15,909 --> 01:12:22,149
+the network. So in this paper they were
+able to get away with storing the
+
+940
+01:12:22,149 --> 01:12:27,500
+activations in 10-bit values and
+then computing gradients using 12
+
+941
+01:12:27,500 --> 01:12:34,800
+bits, and they got this to work, which
+is pretty amazing. But does anyone think
+
+942
+01:12:34,800 --> 01:12:36,310
+that that's the limit? Can we go further?
+Yes.
+
+943
+01:12:36,310 --> 01:12:44,180
+There was actually a paper just last
+week (this is actually from the same
+
+944
+01:12:44,180 --> 01:12:49,200
+author as the previous paper), and this is
+crazy; I was amazed by this. And
+
+945
+01:12:49,199 --> 01:12:53,539
+here the idea is that all activations
+and weights of the network use only one
+
+946
+01:12:53,539 --> 01:12:58,819
+bit: either one or negative one. That's
+pretty fast to compute; now you don't
+
+947
+01:12:58,819 --> 01:13:02,429
+even really have to do multiplication,
+you can just do like a bitwise XNOR and
+
+948
+01:13:02,430 --> 01:13:07,240
+count bits to multiply. That's pretty cool. And
+the trick is that on the forward pass
+
+949
+01:13:07,239 --> 01:13:11,199
+all of the weights and activations are
+either one or minus one, so the
+
+950
+01:13:11,199 --> 01:13:15,399
+forward pass is super, super fast and
+efficient, but on the backward pass
+
+951
+01:13:15,399 --> 01:13:20,179
+they actually compute gradients using
+higher precision, and then these higher-
+
+952
+01:13:20,180 --> 01:13:24,150
+precision gradients are used to actually
+make updates to these single-bit
+
+953
+01:13:24,149 --> 01:13:28,059
+parameters. So it's actually a
+really cool paper, and I'd encourage you
+
+954
+01:13:28,060 --> 01:13:33,310
+to check it out. But the pitch is that
+maybe at training time you can afford to
+
+955
+01:13:33,310 --> 01:13:36,600
+use more floating-point precision,
+but then at test time you want your
+
+956
+01:13:36,600 --> 01:13:41,250
+network to be super, super fast and all
+binary. So I think this is a really,
+
+957
+01:13:41,250 --> 01:13:45,010
+really cool idea. I mean, the
+paper just came out two weeks ago, so I
+
+958
+01:13:45,010 --> 01:13:50,460
+don't know how it will pan out, but I think
+it's a pretty cool thing.
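A toy forward pass in that spirit, assuming numpy; this sketches the plus-or-minus-one arithmetic only and leaves out the real paper's training procedure.

~~~python
import numpy as np

binarize = lambda x: np.where(x >= 0, 1.0, -1.0)

def binary_forward(x, W1, W2):
    # A +/-1 matmul needs only additions and subtractions; on real hardware
    # it could be done with XNOR plus a popcount.
    h = binarize(binarize(x) @ binarize(W1))
    return binarize(h) @ binarize(W2)

# Training would keep real-valued "shadow" weights: the higher-precision
# gradients update those, and the binarized copies are recomputed each step.
W1, W2 = np.random.randn(784, 128), np.random.randn(128, 10)
x = np.random.randn(32, 784)
print(binary_forward(x, W1, W2).shape)  # (32, 10)
~~~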
+
+959
+01:13:50,460 --> 01:13:52,199
+So the recap of these
+implementation details
+
+960
+01:13:52,199 --> 01:13:56,960
+is that overall, GPUs are much, much
+faster than CPUs. Sometimes people use
+
+961
+01:13:56,960 --> 01:14:00,739
+distributed training: distributing over
+multiple GPUs in one system is pretty
+
+962
+01:14:00,739 --> 01:14:04,840
+common; if you're Google and using TensorFlow,
+then distributing over multiple
+
+963
+01:14:04,840 --> 01:14:10,239
+nodes is maybe more common. Be aware of
+the potential bottlenecks between the
+
+964
+01:14:10,239 --> 01:14:15,739
+CPU and GPU, between the GPU and the disk,
+and of the GPU memory limit. And also pay
+
+965
+01:14:15,739 --> 01:14:19,510
+attention to floating-point precision: it
+might not be the most glamorous thing,
+
+966
+01:14:19,510 --> 01:14:23,409
+but I think it actually makes huge
+differences in practice, and maybe binary
+
+967
+01:14:23,409 --> 01:14:28,639
+nets will be the next big thing; that'd
+be pretty exciting. So yeah, just to recap
+
+968
+01:14:28,640 --> 01:14:32,690
+everything we talked about today: we
+talked about data augmentation as a trick
+
+969
+01:14:32,689 --> 01:14:37,449
+for improving training when you have small
+datasets and to help prevent overfitting; we
+
+970
+01:14:37,449 --> 01:14:40,859
+talked about transfer learning as a way to
+initialize from existing models to help
+
+971
+01:14:40,859 --> 01:14:44,399
+with your training; we
+talked in a lot of detail about
+
+972
+01:14:44,399 --> 01:14:48,159
+convolutions, both how to compute them
+and how to combine them into efficient models;
+
+973
+01:14:48,159 --> 01:14:52,840
+and we talked about all these
+implementation details. So I think that's
+
+974
+01:14:52,840 --> 01:14:57,319
+all we have. Are there any
+last-minute questions?
+
+975
+01:14:57,319 --> 01:15:02,840
+All right, so I guess we're done a couple of
+minutes early, and we'll hand back the midterms.
+
diff --git a/captions/En/Lecture12_en.srt b/captions/En/Lecture12_en.srt
new file mode 100644
index 00000000..d2bbc809
--- /dev/null
+++ b/captions/En/Lecture12_en.srt
@@ -0,0 +1,5373 @@
+1
+00:00:00,000 --> 00:00:02,990
+Today we're going to go over the four
+major software packages that people
+
+2
+00:00:02,990 --> 00:00:10,919
+commonly use. As usual, a couple of
+administrative things: the milestones
+
+3
+00:00:10,919 --> 00:00:14,798
+were actually due last week, so hopefully
+you turned them in; we will try to take a look
+
+4
+00:00:14,798 --> 00:00:19,089
+at those this week. Also remember that
+assignment 3, the final assignment, is
+
+5
+00:00:19,089 --> 00:00:23,160
+going to be due on Wednesday. So, are you
+guys done already?
+
+6
+00:00:23,160 --> 00:00:30,870
+OK, that's good. If you have late
+days, you should be fine.
+
+7
+00:00:30,870 --> 00:00:34,230
+Another thing that I should
+point out is that if you're actually
+
+8
+00:00:34,229 --> 00:00:37,619
+planning on using Terminal for your
+projects, which I think a lot of you are,
+
+9
+00:00:37,619 --> 00:00:42,049
+then make sure you're backing up
+your code and data and things off of
+
+10
+00:00:42,049 --> 00:00:46,659
+your Terminal instances every once in a while.
+We've had some problems where the
+
+11
+00:00:46,659 --> 00:00:50,529
+instances will crash randomly, and in
+most cases the Terminal folks have been
+
+12
+00:00:50,530 --> 00:00:53,989
+able to get the data back, but it
+sometimes takes a couple of days, and
+
+13
+00:00:53,988 --> 00:00:57,570
+there have been a couple of cases where
+people actually lost data because it was
+
+14
+00:00:57,570 --> 00:01:01,558
+just on Terminal and it crashed. So I
+think if you are planning to use
+
+15
+00:01:01,558 --> 00:01:04,569
+Terminal, then make sure that you have
+some alternative backup strategy for
+
+16
+00:01:04,569 --> 00:01:10,250
+your code and your data. Like I said,
+today we're talking about these four
+
+17
+00:01:10,250 --> 00:01:16,049
+software packages that are commonly used
+for deep learning: Caffe, Torch, Theano, and
+
+18
+00:01:16,049 --> 00:01:20,269
+TensorFlow. And as a little bit of a
+disclaimer at the beginning:
+
+19
+00:01:20,269 --> 00:01:24,179
+personally, I've mostly worked with Caffe
+and Torch, so those are the ones that I know
+
+20
+00:01:24,180 --> 00:01:27,710
+the most about; I'll do my best to give
+you a good flavor for the others as well,
+
+21
+00:01:27,709 --> 00:01:35,939
+but I'm just throwing that disclaimer out
+there. So the first one is Caffe. We saw in
+
+22
+00:01:35,939 --> 00:01:39,509
+the last lecture that Caffe really sprang
+out of this paper at Berkeley that was
+
+23
+00:01:39,510 --> 00:01:44,040
+trying to reimplement AlexNet and use
+AlexNet features for other things, and since
+
+24
+00:01:44,040 --> 00:01:47,550
+then Caffe has really grown into a
+really, really popular, widely used
+
+25
+00:01:47,549 --> 00:01:53,759
+package, especially for convolutional
+neural networks. So Caffe is from Berkeley,
+
+26
+00:01:53,760 --> 00:01:56,859
+as I think a lot of you people
+know,
+
+27
+00:01:56,859 --> 00:02:01,989
+and it's mostly written in C++, and there
+are actually bindings for Caffe, so
+
+28
+00:02:01,989 --> 00:02:04,939
+you can access the nets and whatnot in
+Python and MATLAB; those are super useful.
+
+29
+00:02:04,939 --> 00:02:09,969
+In general, Caffe is really widely used,
+and it's really, really good if you just
+
+30
+00:02:09,969 --> 00:02:15,289
+want to train sort of standard
+feedforward convolutional networks. And
+
+31
+00:02:15,289 --> 00:02:17,489
+actually, Caffe is somewhat different
+from the
+
+32
+00:02:17,490 --> 00:02:21,610
+other frameworks in this respect: you can
+actually train big, powerful models in
+
+33
+00:02:21,610 --> 00:02:26,150
+Caffe without writing any code yourself.
+So,
+
+34
+00:02:26,150 --> 00:02:29,760
+for example, the ResNet image
+classification model that won ImageNet,
+
+35
+00:02:29,759 --> 00:02:33,189
+that won everything last year: you can
+actually train ResNet using Caffe
+
+36
+00:02:33,189 --> 00:02:37,579
+without writing any code, which is pretty
+amazing. The most important tip when you're working with
+
+37
+00:02:37,580 --> 00:02:41,860
+Caffe is that the documentation is
+sometimes out of date and not always
+
+38
+00:02:41,860 --> 00:02:45,980
+perfect, so you need to not be afraid to
+just dive in there and read the source
+
+39
+00:02:45,979 --> 00:02:52,359
+code yourself. It's C++, so hopefully you
+can read that and understand it, but in
+
+40
+00:02:52,360 --> 00:02:56,080
+general the C++ code that they have
+is pretty well structured,
+
+41
+00:02:56,080 --> 00:03:00,270
+pretty well organized, and pretty easy to
+understand. So if you have doubts about
+
+42
+00:03:00,270 --> 00:03:04,459
+how things work in Caffe, your best
+bet is just to go on GitHub and read the
+
+43
+00:03:04,459 --> 00:03:11,229
+source code. So Caffe is this huge
+project with maybe thousands,
+
+44
+00:03:11,229 --> 00:03:14,369
+tens of thousands of lines of code, and
+it's a little bit scary to understand
+
+45
+00:03:14,370 --> 00:03:18,730
+how everything fits together, but there are
+really four major classes in Caffe that
+
+46
+00:03:18,729 --> 00:03:24,310
+you need to know about. The first one is
+the Blob. So Blobs store all of your
+
+47
+00:03:24,310 --> 00:03:27,939
+data and your weights and your
+activations in the network. These
+
+48
+00:03:27,939 --> 00:03:34,870
+Blobs are the things in the network:
+your weights
+
+49
+00:03:34,870 --> 00:03:38,680
+are stored in a Blob, your data, which
+would be like your pixel values, is
+
+50
+00:03:38,680 --> 00:03:43,189
+stored in a Blob, your labels, your
+y's, are stored in a Blob, and also all of
+
+51
+00:03:43,189 --> 00:03:47,319
+your intermediate activations will also
+be stored in Blobs. So Blobs are these
+
+52
+00:03:47,319 --> 00:03:51,069
+N-dimensional tensors, sort of like
+you've seen in numpy, except
+
+53
+00:03:51,069 --> 00:03:56,150
+they actually hold four copies of an
+N-dimensional tensor inside. They have a
+
+54
+00:03:56,150 --> 00:03:57,370
+data
+
+55
+00:03:57,370 --> 00:04:02,450
+version of the tensor, which is
+storing the actual raw data, and they
+
+56
+00:04:02,449 --> 00:04:07,449
+also have a parallel
+tensor, called diffs, that Caffe uses to
+
+57
+00:04:07,449 --> 00:04:12,459
+store gradients with respect to that
+data; that gives you two. And then you
+
+58
+00:04:12,459 --> 00:04:16,280
+actually have four, because there's a CPU
+and a GPU version of each of those things.
+
+59
+00:04:16,279 --> 00:04:21,228
+So between data and diffs, CPU and GPU,
+there are actually four N-dimensional
+
+60
+00:04:21,228 --> 00:04:26,159
+tensors per Blob. The next important
+class that you need to know about in
+
+61
+00:04:26,160 --> 00:04:30,930
+Caffe is the Layer. A Layer is sort of a
+function, similar to the ones you
+
+62
+00:04:30,930 --> 00:04:35,329
+wrote on the homeworks, that receives
+some input Blobs (Caffe calls inputs "bottoms")
+
+63
+00:04:35,329 --> 00:04:41,269
+and then produces output Blobs, which
+Caffe calls "tops". The idea is that your
+
+64
+00:04:41,269 --> 00:04:45,349
+Layer will receive pointers to the bottom
+Blobs with the data already filled in, and
+
+65
+00:04:45,350 --> 00:04:49,229
+then it'll also receive a pointer to the
+top Blobs, and on the forward
+
+66
+00:04:49,228 --> 00:04:53,759
+pass it's expected to fill in the
+values for the data elements of the top
+
+67
+00:04:53,759 --> 00:04:58,959
+Blobs. On the backward pass, the Layers
+will compute gradients, so they'll expect to
+
+68
+00:04:58,959 --> 00:05:03,649
+receive a pointer to the top Blobs with
+the gradients of the activations filled
+
+69
+00:05:03,649 --> 00:05:07,359
+in, and then they'll also receive a
+pointer to the bottom Blobs and fill in
+
+70
+00:05:07,360 --> 00:05:12,650
+the gradients for the bottoms. The Layer
+is a pretty well-structured abstract
+
+71
+00:05:12,649 --> 00:05:17,019
+class; I have the links
+for the source file here,
+
+72
+00:05:17,019 --> 00:05:21,139
+and there are a lot of subclasses that
+implement different types of layers. And,
+
+73
+00:05:21,139 --> 00:05:26,750
+like I said, a common Caffe problem:
+there's no really good list of all the
+
+74
+00:05:26,750 --> 00:05:30,490
+layer types; you pretty much just need to
+look at the code and see what types of
+
+75
+00:05:30,490 --> 00:05:36,280
+cpp files there are. The next thing you
+need to know about is the Net. A
+
+76
+00:05:36,279 --> 00:05:40,859
+Net just combines multiple Layers; a
+Net is basically a directed acyclic graph
+
+77
+00:05:40,860 --> 00:05:44,598
+of Layers, and it is responsible for running
+the forward and backward methods of the
+
+78
+00:05:44,598 --> 00:05:49,519
+Layers in the correct order. You
+probably won't need to touch this
+
+79
+00:05:49,519 --> 00:05:52,560
+class yourself, but it's kind of
+nice to look at to get a flavor of how
+
+80
+00:05:52,560 --> 00:05:56,139
+everything fits together. And the final
+class that you need to know about is the
+
+81
+00:05:56,139 --> 00:06:00,720
+Solver. So, you know, we had this thing
+called a solver on the homework;
+
+82
+00:06:00,720 --> 00:06:04,710
+that was really inspired by Caffe's. A
+Solver is intended to take the
+
+83
+00:06:04,709 --> 00:06:05,288
+net,
+
+84
+00:06:05,288 --> 00:06:08,889
+run the net forward and backward on
+data, actually update the
+
+85
+00:06:08,889 --> 00:06:11,319
+parameters of the network, and handle
+checkpointing and resuming from
+
+86
+00:06:11,319 --> 00:06:15,520
+checkpoints and all that sort of stuff.
+And in Caffe, the Solver is an abstract
+
+87
+00:06:15,519 --> 00:06:20,278
+class, and different update rules are
+implemented by different subclasses: so
+
+88
+00:06:20,278 --> 00:06:24,598
+there is, for example, a stochastic gradient
+descent solver, there's an Adam solver, a
+
+89
+00:06:24,598 --> 00:06:28,209
+Nesterov momentum solver, all of that sort of
+stuff, and again, just to see what kinds of
+
+90
+00:06:28,209 --> 00:06:32,438
+options are available, you should look at
+the source code. This figure kind of gives
+
+91
+00:06:32,439 --> 00:06:35,639
+you a nice overview of how these things
+all fit together: this whole thing
+
+92
+00:06:35,639 --> 00:06:40,069
+on the right would be the Net; the Net
+contains, in the green boxes, Blobs; each
+
+93
+00:06:40,069 --> 00:06:44,250
+Blob contains data and diffs; the red
+boxes are Layers that are connecting
+
+94
+00:06:44,250 --> 00:06:51,038
+Blobs together; and the whole thing
+gets optimized by the Solver.
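To make the Blob and Layer contracts concrete, here is a Python toy that mirrors the naming but invents the implementation; the real classes are C++ and additionally keep GPU copies of each array.

~~~python
import numpy as np

class Blob:
    def __init__(self, shape):
        self.data = np.zeros(shape)   # values (the CPU copy)
        self.diff = np.zeros(shape)   # gradients with respect to those values

class ReLULayer:
    def forward(self, bottom, top):   # fill top.data from bottom.data
        top.data[...] = np.maximum(bottom.data, 0)
    def backward(self, top, bottom):  # fill bottom.diff from top.diff
        bottom.diff[...] = top.diff * (bottom.data > 0)

bottom, top = Blob((2, 3)), Blob((2, 3))
bottom.data[...] = np.random.randn(2, 3)
layer = ReLULayer()
layer.forward(bottom, top)
top.diff[...] = 1.0
layer.backward(top, bottom)
# A Net would chain many such Layers in a DAG; a Solver would then use the
# accumulated diffs on the weight Blobs to update the parameters.
~~~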
+
+95
+00:06:51,038 --> 00:06:55,538
+So Caffe makes heavy use of this funny thing
+called protocol buffers. Have any of you guys
+
+96
+00:06:55,538 --> 00:07:00,938
+ever interned at Google? Then you
+know about these. But protocol
+
+97
+00:07:00,939 --> 00:07:05,099
+buffers are almost like a binary,
+strongly typed JSON; that's how I like to
+
+98
+00:07:05,098 --> 00:07:08,550
+think about them. They're used very widely
+inside Google for serializing data to
+
+99
+00:07:08,550 --> 00:07:14,750
+disk and over the network. So with protocol
+buffers, there's this .proto file that
+
+100
+00:07:14,750 --> 00:07:18,639
+defines the different kinds of fields
+that different types of objects have. So
+
+101
+00:07:18,639 --> 00:07:22,819
+in this example, a person has a
+name, an ID, and an email, and this lives
+
+102
+00:07:22,819 --> 00:07:26,300
+in a .proto file. The .proto file
+
+103
+00:07:26,300 --> 00:07:31,490
+kind of defines the type of a class, and you
+can actually serialize instances to
+
+104
+00:07:31,490 --> 00:07:37,379
+human-readable .prototxt files. So, for
+example, this fills in the name, it gives
+
+105
+00:07:37,379 --> 00:07:40,968
+you the ID, gives you the email, and
+this is an instance of a person that can
+
+106
+00:07:40,968 --> 00:07:45,930
+be saved in this text file. Then
+protobuf includes this compiler that
+
+107
+00:07:45,930 --> 00:07:49,579
+actually lets you generate classes in
+various programming languages to access
+
+108
+00:07:49,579 --> 00:07:55,418
+these data types: after running the
+protobuf compiler on this .proto file, it
+
+109
+00:07:55,418 --> 00:08:01,038
+produces classes that you can import in
+Java and C++ and Python and Go and
+
+110
+00:08:01,038 --> 00:08:05,300
+just about everything. So Caffe
+makes wide use of these
+
+111
+00:08:05,300 --> 00:08:08,270
+protocol buffers, and they use them
+to store pretty much everything in
+
+112
+00:08:08,269 --> 00:08:16,008
+Caffe. So, like I said, to understand Caffe
+you need to read the code,
+
+113
+00:08:16,009 --> 00:08:20,480
+and Caffe has this one giant file called
+caffe.proto
+
+114
+00:08:20,480 --> 00:08:24,470
+that just defines all of the
+protocol buffer types that are used in
+
+115
+00:08:24,470 --> 00:08:29,170
+Caffe. This is a gigantic file; I think
+it's a couple thousand lines long, but
+
+116
+00:08:29,170 --> 00:08:32,200
+it's actually pretty well documented, and
+it is, I think, the most up-to-date
+
+117
+00:08:32,200 --> 00:08:35,890
+documentation of what the layer types
+are, what the options for those layers
+
+118
+00:08:35,889 --> 00:08:39,629
+are, how you specify all the
+options for solvers and layers, and
+
+119
+00:08:39,629 --> 00:08:43,100
+everything like that. So I really encourage
+you to check out this file and read
+
+120
+00:08:43,100 --> 00:08:48,019
+through it if you have any questions
+about how things work in Caffe. And just to
+
+121
+00:08:48,019 --> 00:08:53,120
+give you a flavor: on my left here, this
+shows the NetParameter,
+
+122
+00:08:53,120 --> 00:08:58,519
+which is the type of protocol buffer
+that Caffe uses to represent a Net, and
+
+123
+00:08:58,519 --> 00:09:03,970
+on the right is the SolverParameter,
+which is used to represent Solvers. The
+
+124
+00:09:03,970 --> 00:09:09,009
+SolverParameter, for
+example, takes a reference to a net, and
+
+125
+00:09:09,009 --> 00:09:12,409
+it also includes things like learning
+rate and how often to checkpoint and
+
+126
+00:09:12,409 --> 00:09:19,549
+other things like that. Right, so when
+you're working in Caffe, it's actually
+
+127
+00:09:19,549 --> 00:09:23,729
+pretty cool: you don't need to write any
+code in order to train models. When
+
+128
+00:09:23,730 --> 00:09:27,889
+working with Caffe, you generally have
+this four-step process. First, you
+
+129
+00:09:27,889 --> 00:09:31,960
+convert your data; especially if you
+just have an image classification problem,
+
+130
+00:09:31,960 --> 00:09:34,540
+you don't have to write any code for
+this, you just use one of the existing
+
+131
+00:09:34,539 --> 00:09:40,240
+binaries Caffe ships with. Then you'll define
+your net, which you'll do by just
+
+132
+00:09:40,240 --> 00:09:45,230
+writing or editing one of these prototxts.
+Then you'll define your solver, which again
+
+133
+00:09:45,230 --> 00:09:49,509
+will just live in a prototxt file
+that you can just work with in a text
+
+134
+00:09:49,509 --> 00:09:54,200
+editor, and then you'll pass all of these
+things to an existing binary to train
+
+135
+00:09:54,200 --> 00:09:57,990
+the model, and it'll spit out your trained
+Caffe model to disk, which you can then
+
+136
+00:09:57,990 --> 00:10:02,820
+use for other things. So even if you want
+to train ResNet on ImageNet, you could
+
+137
+00:10:02,820 --> 00:10:06,000
+just follow this simple procedure and
+train a giant network without writing
+
+138
+00:10:06,000 --> 00:10:12,110
+any code; that's really cool. So, step
+one generally is to convert your data.
+
+139
+00:10:12,110 --> 00:10:17,259
+Caffe uses, well, I know we've talked a
+little bit about HDF5 as a format for
+
+140
+00:10:17,259 --> 00:10:21,460
+storing pixels on disk contiguously and
+then reading them efficiently, but
+
+141
+00:10:21,460 --> 00:10:26,940
+by default Caffe uses this other file
+format called LMDB. So
+
+142
+00:10:26,940 --> 00:10:30,570
+if all you have is a bunch of images,
+each image with a label, then you can
+
+143
+00:10:30,570 --> 00:10:31,480
+just use it:
+
+144
+00:10:31,480 --> 00:10:35,370
+Caffe has a script to convert that
+whole dataset into a giant LMDB
+
+145
+00:10:35,370 --> 00:10:42,169
+you can use for training. Just to
+give you an idea of the way this works, it's
+
+146
+00:10:42,169 --> 00:10:46,240
+really easy: you just create a text file
+that has the paths to your images,
+
+147
+00:10:46,240 --> 00:10:49,959
+each followed by its label, and you just
+pass it to the Caffe script, wait a couple of
+
+148
+00:10:49,958 --> 00:10:56,018
+hours if your dataset is big, and you get a
+giant LMDB file on disk. If you're working with
+
+149
+00:10:56,019 --> 00:11:01,860
+something else like HDF5, then you'll
+probably have to create it yourself. Caffe
+
+150
+00:11:01,860 --> 00:11:06,060
+does actually have a couple of options
+for reading data: there's a DataLayer,
+
+151
+00:11:06,059 --> 00:11:11,888
+a WindowDataLayer for detection,
+it actually can read from HDF5, and
+
+152
+00:11:11,889 --> 00:11:14,350
+there's an option for reading stuff
+directly from memory that's especially
+
+153
+00:11:14,350 --> 00:11:18,480
+useful with the Python interface. But, at
+least from my point of view, all of these
+
+154
+00:11:18,480 --> 00:11:22,339
+other methods of reading
+data into Caffe are a little bit
+
+155
+00:11:22,339 --> 00:11:26,120
+second-class citizens in the Caffe
+ecosystem, and LMDB is really the
+
+156
+00:11:26,120 --> 00:11:30,669
+easiest thing to work with. So if you can,
+you should probably try to convert your
+data into LMDB format.
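A sketch of writing that listing file in Python; the paths and labels are hypothetical, and the exact conversion tool and its flags live in your Caffe checkout.

~~~python
# One "path label" line per training image, which Caffe's dataset-conversion
# script (e.g. the convert_imageset tool) consumes to build the LMDB.
samples = [("images/cat1.jpg", 0),
           ("images/dog1.jpg", 1),
           ("images/dog2.jpg", 1)]

with open("train.txt", "w") as f:
    for path, label in samples:
        f.write(f"{path} {label}\n")
~~~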
+
+157
+00:11:30,669 --> 00:11:40,179
+So step two for Caffe is to define your
+net, and
+
+158
+00:11:40,179 --> 00:11:44,609
+like I said, you'll just write a big
+prototxt to define your net. So here, this
+
+159
+00:11:44,610 --> 00:11:48,818
+is just a simple model for logistic
+regression. You can see that I did not
+
+160
+00:11:48,818 --> 00:11:53,948
+follow my own advice, and I'm reading
+data out of an HDF5 file here. Then I
+
+161
+00:11:53,948 --> 00:11:59,278
+have a fully connected layer, which is
+called InnerProduct in Caffe;
+
+162
+00:11:59,278 --> 00:12:03,588
+that fully connected layer
+specifies the number of classes and how
+
+163
+00:12:03,589 --> 00:12:10,399
+to initialize the values, and then I have
+a softmax loss function that reads the
+
+164
+00:12:10,399 --> 00:12:15,458
+labels and produces the loss and gradients
+from the output of the fully connected layer. So a
+
+165
+00:12:15,458 --> 00:12:20,009
+couple of things to point out about this
+file: one, every layer typically
+
+166
+00:12:20,009 --> 00:12:23,588
+includes some Blobs, which store the
+data and the gradients of the weights,
+
+167
+00:12:23,589 --> 00:12:28,680
+and the layer's Blobs and the layer itself
+typically have the same name; that can be
+
+168
+00:12:28,679 --> 00:12:34,269
+a little bit confusing. Another thing is
+that a lot of these layers will have two
+
+169
+00:12:34,269 --> 00:12:39,250
+Blobs, one for the weight and one for the bias,
+and actually in this prototxt right here you'll
+
+170
+00:12:39,250 --> 00:12:43,149
+find the learning rates for those two
+Blobs: so that's the learning rate and
+
+171
+00:12:43,149 --> 00:12:44,769
+regularization for both the weight and
+
+172
+00:12:44,769 --> 00:12:50,198
+bias of that layer. Another thing to note
+is that to specify the number of output
+
+173
+00:12:50,198 --> 00:12:51,568
+classes, it's just the num_output
+
+174
+00:12:51,568 --> 00:12:57,378
+parameter on this fully connected
+layer. And finally, the quick and
+
+175
+00:12:57,379 --> 00:13:01,139
+dirty way to freeze layers in Caffe is
+just to set the learning rate to 0
+
+176
+00:13:01,139 --> 00:13:08,048
+for the Blobs associated with that layer's
+weights or biases. Another thing to point
+
+177
+00:13:08,048 --> 00:13:12,600
+out is that for ResNet and other large
+models like GoogLeNet, this can get
+
+178
+00:13:12,600 --> 00:13:17,110
+really out of hand really quickly.
+Caffe doesn't really let you define
+
+179
+00:13:17,110 --> 00:13:20,989
+compositionality, so for ResNet they
+just repeat the same pattern over and
+
+180
+00:13:20,989 --> 00:13:26,459
+over and over in the prototxt file; the
+ResNet prototxt is almost 7,000 lines
+
+181
+00:13:26,458 --> 00:13:31,219
+long. You could write that by hand, but
+in practice people tend to write
+
+182
+00:13:31,220 --> 00:13:35,470
+little Python scripts to generate these
+things automatically.
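A sketch of that generator trick: plain string templating in Python that emits a repeated block. The field names follow the caffe.proto conventions described earlier, but the layer names and hyperparameters here are invented.

~~~python
def conv_relu_block(i, bottom, num_output=64):
    """Emit one Convolution + ReLU pair of prototxt layer definitions."""
    return f"""
layer {{
  name: "conv{i}" type: "Convolution" bottom: "{bottom}" top: "conv{i}"
  convolution_param {{ num_output: {num_output} kernel_size: 3 pad: 1 }}
}}
layer {{
  name: "relu{i}" type: "ReLU" bottom: "conv{i}" top: "conv{i}"
}}"""

with open("net.prototxt", "w") as f:
    f.write('name: "generated_net"\n')
    bottom = "data"
    for i in range(1, 11):  # ten identical blocks, zero copy-paste
        f.write(conv_relu_block(i, bottom))
        bottom = f"conv{i}"
~~~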
+
+183
+00:13:35,470 --> 00:13:41,879
+So that's a little bit gross. Now, if you want to
+fine-tune a network rather than starting
+
+184
+00:13:41,879 --> 00:13:46,509
+from scratch, then you'll typically
+download some existing prototxt and
+
+185
+00:13:46,509 --> 00:13:50,230
+some existing weights file and work from
+there. The way you should think about
+
+186
+00:13:50,230 --> 00:13:54,139
+it is that the prototxt file that
+we've seen here before defines the
+
+187
+00:13:54,139 --> 00:13:58,159
+architecture of the network, and then all
+the pretrained weights live in this
+
+188
+00:13:58,159 --> 00:14:03,230
+.caffemodel file; that's a binary thing
+and you can't really inspect it. But the
+
+189
+00:14:03,230 --> 00:14:07,869
+way it works is that it's basically
+key-value pairs, where,
+
+190
+00:14:07,869 --> 00:14:13,790
+inside the .caffemodel, it matches these
+names that are scoped to layers. So this
+
+191
+00:14:13,789 --> 00:14:19,389
+fc7 weight would be the
+weight corresponding to this final
+
+192
+00:14:19,389 --> 00:14:24,048
+fully connected layer in AlexNet. So then,
+when you want to fine-tune on your
+
+193
+00:14:24,048 --> 00:14:29,600
+own data: when you start up Caffe and you
+load a model and a prototxt, it
+
+194
+00:14:29,600 --> 00:14:33,459
+just tries to match the key-value pairs
+of names and weights between the .caffe-
+
+195
+00:14:33,458 --> 00:14:35,008
+model and the prototxt.
+
+196
+00:14:35,009 --> 00:14:39,209
+So if the names are the same, then your
+new network gets initialized from the
+
+197
+00:14:39,208 --> 00:14:43,008
+values in the .caffemodel, which is really
+useful and convenient for fine-
+
+198
+00:14:43,009 --> 00:14:49,230
+tuning. But if the names
+don't match, then those layers get
+
+199
+00:14:49,230 --> 00:14:52,980
+initialized from scratch. This is how,
+for example, you can reinitialize the
+
+200
+00:14:52,980 --> 00:14:57,810
+output layer in Caffe. So, to be a little bit
+more concrete: if you've
+
+201
+00:14:57,809 --> 00:15:02,250
+maybe downloaded an ImageNet model, then
+this final fully
+
+202
+00:15:02,250 --> 00:15:06,289
+connected layer that outputs the class
+scores will have a thousand outputs, but
+
+203
+00:15:06,289 --> 00:15:09,480
+now, maybe for some problem you care
+about, you only want to have 10 outputs.
+
+204
+00:15:09,480 --> 00:15:13,149
+You're going to need to reinitialize
+that final layer, initialize it
+
+205
+00:15:13,149 --> 00:15:17,309
+randomly, and fine-tune the network. So
+the way that you do that is you need to
+
+206
+00:15:17,309 --> 00:15:22,088
+change the name of the layer in the
+prototxt file to make sure that it's actually
+
+207
+00:15:22,089 --> 00:15:26,890
+initialized randomly and not read from
+the .caffemodel. And if you
+
+208
+00:15:26,889 --> 00:15:30,919
+forget to do this, then it'll actually
+crash, and it'll give you a weird error
+
+209
+00:15:30,919 --> 00:15:35,419
+message about the shapes not aligning,
+because it'll be trying to stuff this
+
+210
+00:15:35,419 --> 00:15:39,299
+thousand-dimensional weight matrix into
+this ten-dimensional thing from your new
+
+211
+00:15:39,299 --> 00:15:46,129
+file, and it won't work. So the next step
+when working with Caffe is to define the
+
+212
+00:15:46,129 --> 00:15:51,100
+solver. The solver is also just a prototxt
+file; you can see all the options for it
+
+213
+00:15:51,100 --> 00:15:56,620
+in that giant caffe.proto file that I gave a
+link to. It'll look something like this for
+
+214
+00:15:56,620 --> 00:16:00,169
+AlexNet, maybe. So this will define
+your learning rate and your learning
+
+215
+00:16:00,169 --> 00:16:04,809
+rate decay and your regularization, how
+often to checkpoint, everything like that. But
+
+216
+00:16:04,809 --> 00:16:10,169
+these end up being much less
+complex than the prototxts for the
+
+217
+00:16:10,169 --> 00:16:15,069
+networks; this AlexNet one is just maybe
+fourteen lines. Although what you will
+
+218
+00:16:15,070 --> 00:16:18,530
+see sometimes in practice is that if
+people want to have sort of complex
+
+219
+00:16:18,529 --> 00:16:22,299
+training pipelines, where they first want to
+train with one learning rate on certain
+
+220
+00:16:22,299 --> 00:16:25,039
+parts of the network, then they want to train
+with another learning rate on certain parts
+
+221
+00:16:25,039 --> 00:16:28,389
+of the network, then you might end up
+with a cascade of different solver files,
+
+222
+00:16:28,389 --> 00:16:31,490
+and you actually run them all independently,
+where you're sort of fine-tuning your own
+
+223
+00:16:31,490 --> 00:16:38,070
+model in separate stages using different
+solvers. So once you've done all that,
+
+224
+00:16:38,070 --> 00:16:43,550
+then you just train your model. So if you
+followed my advice and just used
+
+225
+00:16:43,549 --> 00:16:49,208
+LMDB and all these things, then you just
+call this binary that exists
+
+226
+00:16:49,208 --> 00:16:55,569
+in Caffe already. Here you just
+pass your solver prototxt and
+
+227
+00:16:55,570 --> 00:16:59,540
+your pretrained weights file if you're
+fine-tuning, and it'll run, maybe for a day,
+
+228
+00:16:59,539 --> 00:17:03,659
+maybe for a long time, just checkpointing
+and saving to disk, and you'll be happy.
+
+229
+00:17:03,659 --> 00:17:08,549
+One thing to point out here is that you
+specify which GPU it runs on with this
+
+230
+00:17:08,549 --> 00:17:11,209
+flag at the end, but you can actually run
+in CPU mode
+
+231
+00:17:11,209 --> 00:17:17,288
+by setting this flag to negative one. And
+actually, sometime in the last
+
+232
+00:17:17,288 --> 00:17:21,048
+year, Caffe added data parallelism to let
+you split up minibatches across
+
+233
+00:17:21,048 --> 00:17:26,318
+multiple GPUs in your system: you can
+actually list multiple GPUs on this flag,
+
+234
+00:17:26,318 --> 00:17:29,710
+and if you just say "all", then Caffe will
+automatically split up minibatches
+
+235
+00:17:29,710 --> 00:17:33,600
+across all the GPUs on your machine. So
+that's really cool: you've done multi-GPU
+
+236
+00:17:33,599 --> 00:17:51,689
+training without writing a single line
+of code. Pretty cool, Caffe. Oh, yeah?
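As a sketch, here is step 4 wrapped in a tiny Python script. It assumes a built caffe binary on your PATH, and the flag spellings follow the usage described above, so double-check them against your own install.

~~~python
import subprocess

subprocess.run([
    "caffe", "train",
    "-solver", "solver.prototxt",         # the solver prototxt
    "-weights", "pretrained.caffemodel",  # optional: initialize for fine-tuning
    "-gpu", "all",                        # or a device id like "0";
                                          # the lecture mentions -1 for CPU
], check=True)
~~~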
The next step when working with Caffe is to define the solver. The solver is also just a prototxt file; you can see all the options for it in that giant proto file I gave a link to. It'll look something like this for AlexNet, maybe: it defines your learning rate, your learning rate decay, your regularization, how often to checkpoint, everything like that. These end up being much less complex than the prototxts for the networks — this AlexNet one is maybe fourteen lines. What you will sometimes see in practice, though, is that if people want complex training pipelines, where they first train with one learning rate in certain parts of the network and then with another learning rate in other parts, you might end up with a cascade of different solver files that get run independently, fine-tuning your own model in separate stages using different solvers.

Once you've done all that, you just train the model. If you followed my advice and just used LMDB and all these prebuilt pieces, you just call this binary that already exists in Caffe: you pass in your solver prototxt, and your pretrained weights file if you're fine-tuning, and it'll run — maybe for a long time — checkpointing and saving to disk, and you'll be happy. One thing to point out here is that you specify which GPU it runs on with this flag, but you can actually run on the CPU by setting the flag to negative one. And sometime in the last year Caffe added data parallelism, to let you split minibatches across multiple GPUs in your system: you can list multiple GPUs on this flag, and if you just say "all", Caffe will automatically split minibatches across all the GPUs on your machine. That's really cool — you've done multi-GPU training without writing a single line of code.

[Question from the audience.] Yeah, I think so. The question is how you'd go about a more complex initialization strategy, where you maybe want to initialize the weights from a pretrained model and use those same weights in multiple parts of your network. The answer is that you probably can't do that with the simple mechanism; you can kind of munge the weights in Python, and that's probably how you'd go about doing it.
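"Munging the weights in Python" could look something like the following sketch — the layer names are hypothetical and the exact surgery is mine, not from the lecture:

~~~python
import caffe

# Sketch: copy one pretrained layer's weights into two places in a new
# network ("net surgery"). All names and paths are placeholders.
src = caffe.Net('pretrained.prototxt', 'pretrained.caffemodel', caffe.TEST)
dst = caffe.Net('new.prototxt', caffe.TEST)

w, b = src.params['fc6'][0].data, src.params['fc6'][1].data
for target in ('branch1_fc', 'branch2_fc'):    # hypothetical layer names
    dst.params[target][0].data[...] = w        # in-place copy into the blob
    dst.params[target][1].data[...] = b
dst.save('initialized.caffemodel')             # use this to start training
~~~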
Right — so I think we've mentioned before that Caffe has this really great Model Zoo: you can download lots of different kinds of pretrained models, on ImageNet and other datasets, and the models up there are really top-notch. You've got AlexNet and VGG, you've got ResNets up there already — lots and lots of really good models. That's a really strong point about Caffe: it's really easy to download someone else's model and run it on your data, or fine-tune it to your data.

Caffe has a Python interface, like I mentioned, but since there are so many things to cover I don't think I can dive into detail here. As is kind of par for the course with Caffe, there's not really great documentation about the Python interface, so you need to read the code. On the whole, the Python interface to Caffe is mostly defined in these two files: this .cpp file uses Boost.Python, if you've ever used that before, to wrap some of the C++ classes and expose them to Python, and then this .py file attaches additional methods and gives you a more Pythonic interface. So if you want to know what kinds of methods and data types are available in the Caffe Python interface, your best bet is to just read through these two files; they're not too long, so it's pretty easy to do.

The Python interface in general is pretty useful. It lets you do crazy weight initialization strategies if you need something more complex than just copying from a pretrained model. It also makes it really easy to take a network and run it forward and backward from numpy arrays, so you can implement things like DeepDream and class visualizations, similar to what you did on the homework; that's quite easy using the Python interface in Caffe, where you just take data and run it forward and backward through different parts of the network. The Python interface is also quite nice if you just want to extract features: say you have some data and some pretrained model, and you want to extract features from some part of the network and save them to disk, maybe to an HDF5 file, for downstream processing — that's quite easy to do with the Python interface.
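Here's a minimal sketch of that feature-extraction workflow with pycaffe and h5py; the paths, layer names, and the random stand-in batch are placeholders:

~~~python
import caffe
import h5py
import numpy as np

# Sketch: run a pretrained net forward and dump one layer's activations
# to HDF5 for downstream processing.
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

batch = np.random.randn(*net.blobs['data'].data.shape)  # stand-in for real data
net.blobs['data'].data[...] = batch
net.forward()
feats = net.blobs['fc7'].data.copy()    # activations from one layer

with h5py.File('features.h5', 'w') as f:
    f.create_dataset('feats', data=feats)
~~~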
Caffe also has a kind of new feature where you can define layers entirely in Python. I've never done it myself, but it seems nice. The downside is that those layers will be CPU-only, and we talked about communication bottlenecks between the CPU and GPU: if you write layers in Python, then on every forward and backward pass you'll be incurring that memory-transfer overhead. One nice place where Python layers could be useful, though, is custom loss functions, so that's maybe something to keep in mind.
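For reference, a Python layer is a `caffe.Layer` subclass referenced from the prototxt with `type: "Python"`. This sketch follows the Euclidean-loss example shipped with Caffe, which fits the custom-loss use case mentioned above:

~~~python
import caffe

# Sketch of a Python layer (CPU-only), usable as a custom loss.
# It expects two bottoms: predictions and targets.
class EuclideanLossLayer(caffe.Layer):
    def setup(self, bottom, top):
        assert len(bottom) == 2, 'need predictions and targets'

    def reshape(self, bottom, top):
        top[0].reshape(1)                    # the loss is a scalar

    def forward(self, bottom, top):
        self.diff = bottom[0].data - bottom[1].data
        top[0].data[0] = (self.diff ** 2).sum() / (2.0 * bottom[0].num)

    def backward(self, top, propagate_down, bottom):
        if propagate_down[0]:
            bottom[0].diff[...] = self.diff / bottom[0].num
~~~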
So, a quick overview of Caffe pros and cons, really from my point of view. If all you want to do is train a simple, basic feedforward network, especially for classification, Caffe makes it really easy to get things up and running: you don't have to write any code yourself, you just use all these prebuilt tools. It has a Python interface, which is quite nice for slightly more complex use cases. But it can be cumbersome when things get really crazy: for these really big networks like ResNet, especially with repeated module patterns, the definitions get tedious, and for things like recurrent networks, where you want to share weights between different parts of the network, it can be kind of cumbersome — it's possible in Caffe, but it's probably not the best thing to use for that. The other big downside, from my point of view, is that when you want to define your own type of layer in Caffe you end up having to write C++ code, which doesn't give you a very quick development cycle; it's kind of painful to write new layers.

So that's our whirlwind tour of Caffe. Any quick questions? [Question about cross-validation in Caffe.] In the train/val prototxt you can define a training phase and a testing phase. Generally you'll write a train/val prototxt and a deploy prototxt; deploy gets used on the task at hand, but the test phase of the train/val prototxt gets used for validation. OK — that's all there is to know about Caffe.
The next one is Torch. Torch is really my personal favorite, so I have a little bit of bias here — just to get that out in the open, I've pretty much been using Torch almost exclusively on my own projects in the last year or so. Torch is originally from NYU; it's written in C and Lua, and it's used a lot at Facebook and especially DeepMind; I think a lot of folks at Twitter use Torch too.

One of the big things that freaks people out, of course, is that you have to write Lua, which I had never heard of or used before starting to work with Torch, but it actually isn't too bad. Lua is a high-level scripting language that's really intended for embedded devices, so it can run efficiently, and it's very similar to JavaScript in a lot of ways. Another cool thing about Lua is that, because it's meant to run on embedded devices, for loops are really fast in Torch. You know how in Python a for loop is going to be really slow? That's actually totally fine to do in Torch, because it uses just-in-time compilation to make these things really fast. Lua is also similar to JavaScript in that it's a functional language: functions are first-class citizens, and it's very common to pass callbacks around to different parts of your code. Lua also has this idea of prototypal inheritance: there's sort of one data structure, which in Lua is the table, which you can think of as very similar to an object in JavaScript, and you can implement things like object-oriented programming using prototypal inheritance in a similar way as you would in JavaScript.

One of the downsides is that the standard library is kind of annoying sometimes — things like handling strings can be cumbersome — and maybe most annoying is that it's one-indexed, so all of your intuition about for loops will be a little bit off for a while. But other than that it's pretty easy to pick up, and I gave a link here to a website claiming that you can learn Lua in 15 minutes; they might be overselling it a little bit, but I think you can pick it up and start writing code pretty fast.
The main idea behind Torch is the Tensor class. You guys have been working in numpy a lot on your assignments, and the way the assignments are structured, the numpy array gives you this really easy way to manipulate data in whatever way you want; you can then use those numpy arrays to build up other abstractions, like neural-net libraries. Really, the numpy array just lets you manipulate data numerically with complete flexibility.

So, if you recall, here's an example of some numpy code that should be very familiar by now: we're just computing a simple forward pass of a two-layer ReLU network. Maybe black wasn't the best choice for the slide, but we're defining some constants, initializing some weights, getting some random data, and doing a matrix multiply, a ReLU, and another matrix multiply. That's very easy to write in numpy, and it has almost a one-to-one translation into Torch tensors. On the right is the exact same code using Torch tensors: we define our batch size, input size, and all that; we define our weights, which are just Torch tensors; we get a random input vector; and we do the forward pass — a matrix multiply between our tensors, then an elementwise maximum (cmax), which is the ReLU, and then we compute scores with another matrix multiply. In general, pretty much any kind of code you'd write in numpy has almost a line-by-line translation into Torch tensors.
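The slide code isn't in the transcript, so here is a sketch of the kind of numpy forward pass being described — the dimensions are arbitrary, not the slide's exact numbers:

~~~python
import numpy as np

# Two-layer ReLU network, forward pass only.
N, D, H, C = 64, 1000, 100, 10      # batch, input, hidden, classes
w1 = np.random.randn(D, H)
w2 = np.random.randn(H, C)

x = np.random.randn(N, D)           # random input data
h = np.maximum(0, x.dot(w1))        # matrix multiply + ReLU
scores = h.dot(w2)                  # second matrix multiply
~~~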
Also, remember that in numpy it's really easy to swap in different data types — we talked about this ad nauseam last lecture. To switch to, say, 32-bit floating point, all you need to do is cast your data to that other data type, and it turns out that's very easy in Torch as well: the data type is this string, and we can easily cast our data to another data type.

But here's the kicker. This next slide is the real reason Torch is infinitely better than numpy, and it's that the GPU is just another data type. When you want to run code on the GPU in Torch, you import another package and you get another data type, torch.CudaTensor; you cast your tensors to this other data type and now they live on the GPU, and running any kind of numerical operation on those tensors just runs on the GPU. So it's really, really easy in Torch to write generic tensor scientific-computing code that runs on a GPU and is really fast. Like I said, you should think of these tensors as similar to numpy arrays, and there's a lot of documentation up on GitHub about the different methods you can use with tensors; the documentation isn't super complete, but it's not bad, so you should take a look at it.

In practice, though, you end up not really using raw tensors too much in Torch; instead you use this other package called nn, for neural networks. nn is a pretty thin wrapper that defines a neural network package just in terms of these tensor objects; you should think of it as a more industrial-strength version of the homework code base, where you have this tensor abstraction and then implement a neural-net library on top of it with a nice, clean interface.

So here's the same two-layer ReLU network using the nn package. We define our network as a Sequential, so it's going to be a stack of sequential operations: first a Linear, which is a fully connected layer from our input dimension to our hidden dimension, then a ReLU, and then another Linear. We can get the weights and gradients, one tensor for each, using this getParameters method: weights will be a single Torch tensor holding all the weights of the network, and gradients will be a single Torch tensor for all of the gradients. We generate some random data; to do a forward pass we just call the forward method on the network with our data, which gives us our scores. To compute the loss we have a separate criterion object, which is our loss function, so we compute the loss by calling the forward method of the criterion. Now we've done our forward pass; for the backward pass we first zero out the gradients, then call backward on the loss function and then backward on the network. This has updated all of the gradients of the network in gradParams, so we can make a gradient step very easily: multiply the gradients by the negative of the learning rate and add them to the weights — a simple gradient descent update. That's all of it — the slide maybe could have been a little clearer, but we have weights, gradients, and a loss function; we get random data, run forward and backward, and make an update.

And as you might expect from looking at the tensor stuff, it's quite easy to make this run on a GPU. To run these networks on the GPU we import a couple of new packages, cutorch and cunn, which are the CUDA versions of everything, and then we just cast our network and our loss function to this other data type; we also cast our data and labels, and now this whole network will run and train on the GPU. So in what was that, like 40 lines of code, we've written a fully connected network that we can train on the GPU.
But one problem here is that we're just using vanilla gradient descent, which is not so great; as you saw on the assignments, other things like Adam and RMSProp tend to work much better in practice. To solve that, Torch gives us the optim package. optim is quite easy to use: again we just import a new package up here, and now what changes is that we need to define this callback function. Before, we were calling forward and backward explicitly ourselves; instead we define this callback that runs the network forward and backward on data and returns the loss and the gradient. Now, to make an update step on our network, we pass this callback to the adam method from the optim package. This is maybe a little bit awkward, but it means we can use any kind of update rule with just a couple of lines changed from what we had before. And again, this is very easy to run on the GPU by just casting everything to CUDA.
As we saw, Caffe implements everything in terms of nets and layers, and it has this really hard distinction between the net and the layer. In Torch we don't really draw that distinction: everything is just a module. The entire network is a module, and each individual layer is also a module. Modules are just classes, defined in Lua, that are implemented using the tensor API, and since these modules are written in Lua they're quite easy to understand.

Here is the fully connected (Linear) layer: this is the constructor, and you can see it's just setting up tensors for the weight and the bias. And because the tensor API in Torch lets us easily run the same code on GPU and CPU, all of these layers are written purely in terms of the tensor API and then happily run on both devices. These modules need to implement a forward and a backward; for the forward, they decided to call it updateOutput, and here's the example of updateOutput for the fully connected layer. They actually need to deal with a couple of different cases here — batched versus non-batched inputs — but other than that it should be quite easy to read. For the backward pass there's a pair of methods: updateGradInput, which receives the upstream gradients and computes the gradients with respect to the input — again, just implemented in the tensor API, so it's very easy to understand, the same type of thing you saw on the homework — and accGradParameters, which computes the gradients with respect to the weights of the network. As you saw in the constructor, the weights and biases are held in instance variables of the module, and accGradParameters receives gradients from upstream and accumulates the gradients of the parameters with respect to those upstream gradients. Again, this is very simple, just using the tensor API.
Torch actually has a ton of different modules available. The documentation can be a little bit out of date, but if you just go on GitHub you can see all the files that give you all the goodies to play with, and these actually get updated a lot — just to point out a couple, these were just added last week. Torch is always adding new modules that you can put in your networks, which is pretty fun.

When the existing modules aren't good enough, it's actually very easy to write your own: because you just implement these things using the tensor API — just implement the forward and the backward — it's not much harder than implementing layers on the homeworks. Here's a small example: a silly module that just takes its input and multiplies it by two. You can see we implement updateOutput and updateGradInput, and now we've implemented a new layer in Torch in just twenty lines of code. It's then very easy to use in other code: just import it, and you can add it to networks and so on. The really cool thing is that because this is just the tensor API, you can do whatever arbitrary thing you want inside the forward and backward: for loops, complicated imperative code, maybe stochastic things for dropout or batch normalization — whatever code you want in the forward and backward pass, you just implement it yourself inside these modules. So it's usually very easy to implement your own new types of layers in Torch, as sketched below.
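The Lua module itself isn't reproduced in the transcript, but since the lecturer compares it to the homework layers, here is the homework-style Python/numpy analogue of that times-two layer — same math, with forward/backward standing in for updateOutput/updateGradInput:

~~~python
import numpy as np

# numpy analogue of the "multiply input by two" Torch module described
# above: a layer is just a forward/backward pair, like on the homework.
class MulByTwo:
    def forward(self, x):
        return 2.0 * x

    def backward(self, grad_output):
        # d(2x)/dx = 2, so just scale the upstream gradient.
        return 2.0 * grad_output

layer = MulByTwo()
x = np.random.randn(4, 5)
out = layer.forward(x)                   # forward pass
dx = layer.backward(np.ones_like(out))   # backward pass
~~~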
Of course, using individual layers on their own isn't so useful; people need to stitch them together into larger networks, and for this Torch uses containers. We already saw one in the previous example: the Sequential container, which is just a stack of modules where each one receives the output of the previous one; that's probably the most commonly used. Another one you might see is ConcatTable: if you have one input and you want to apply different modules to that same input, ConcatTable does that, and you receive the outputs as a list. Another one you might see is ParallelTable: if you have a list of inputs and you want to apply a different module to each element of the list, you can use ParallelTable for that sort of construction.

But when things get really complicated — those containers should in theory make it possible to implement just about any topology you want, but it can be really hairy in practice to wire up complicated things with them — Torch provides another package called nngraph that lets you hook things up in more complicated topologies pretty easily. Here's an example: say we have three inputs, we want to produce one output, and we want to produce it with this pretty simple update rule, corresponding to a type of computational graph we've seen many times in lecture for different problems. You could implement this just fine using Parallel and Sequential and ConcatTable, but it could be kind of a mess, so for things like this it's very common to use nngraph instead. The nngraph code is quite easy: this function is going to build a module using a graph and then return it. We import the nngraph package, and then inside, this is a bit of funny syntax: this is actually not a tensor — this is defining a symbolic variable. We're saying that our graph object is going to receive x, y, and z as inputs, and now we're doing symbolic operations on those inputs: a pointwise addition of x and y, stored in a; a pointwise multiplication of a and z, stored in b; and a pointwise addition of a and b, stored in c. Again, these are not actual tensor objects; they're symbolic references being used to build up this computational graph in the background. Then we can return a module, where we say our module has inputs x, y, and z and output c, and this nn.gModule will actually give us an object conforming to the module API that implements this computation. After we build the module, we can construct concrete Torch tensors and feed them into the module, and it will actually compute the function.
Torch is actually quite good on pretrained models. There's a package called loadcaffe that lets you load up many different types of pretrained models from Caffe and converts them into their Torch equivalents: you can load up the Caffe prototxt and the caffemodel file, and it'll turn into a giant stack of sequential layers. loadcaffe is not super general and only works for certain types of networks, but in particular it will let you load up AlexNet and CaffeNet and VGG, which are probably some of the most commonly used. There are also a couple of different implementations that let you load pretrained GoogLeNet models into Torch, and actually, very recently, Facebook reimplemented residual networks directly in Torch and released pretrained models for that. So between AlexNet, CaffeNet, VGG, GoogLeNet, and ResNet, I think that's probably all the pretrained models most people want to use.

Another point: because Torch is using Lua, we can't use pip to install packages, but there's another very similar thing called luarocks that easily installs and updates packages, and it's quite easy to use. And here's a list of some packages that I find very useful in Torch. As you can probably guess from the names, you can read and write HDF5 files and read and write JSON; there's this funny one from Twitter, autograd, that is a little bit like Theano, which we'll talk about in a bit — I haven't used it, but it's kind of cool to look at; and Facebook has a pretty useful library for Torch that implements FFT convolutions and also implements data parallelism and model parallelism, so that's a pretty nice thing to have.

A very typical workflow in Torch is that you'll have some preprocessing script, often in Python, that preprocesses your data and dumps it into some nice format on disk — usually HDF5 for big things and JSON for little things. Then you'll typically write a training script in Lua that reads from the HDF5, trains and optimizes the model, and saves checkpoints to disk; and then usually I have some evaluate script that loads up a trained model and does something useful with it. A case study for this type of workflow is this project I put up on GitHub a week ago that implements character-level language models in Torch: there's a preprocessing script that converts text files into HDF5 files, a training script that loads from HDF5 and trains these recurrent networks, and a sampling script that loads up the checkpoints and generates text.
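As a sketch of that Python preprocessing step — file names and dataset keys below are placeholders, not the actual project's — encoding a text file into HDF5 plus a JSON vocabulary might look like this:

~~~python
import json
import h5py
import numpy as np

# Encode a text file as an integer array in HDF5, with the character
# vocabulary saved alongside as JSON for the Lua training script.
with open('input.txt') as f:
    text = f.read()

vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
data = np.array([vocab[ch] for ch in text], dtype=np.int32)

with h5py.File('data.h5', 'w') as f:
    f.create_dataset('data', data=data)
with open('vocab.json', 'w') as f:
    json.dump(vocab, f)
~~~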
So, the quick pros and cons I would give for Torch. Lua is a big turnoff for people, but I don't think it's actually that big a deal. It's definitely less plug-and-play than Caffe, so you'll typically end up writing a lot of your own code, which is maybe a little bit more overhead but also gives you more flexibility. It has a lot of modular pieces that are easy to plug together, and because the standard library of modules is all written in Lua, it's quite easy to read and quite easy to understand. There are a lot of pretrained models, which is quite nice. But unfortunately it's a little bit awkward to use for recurrent networks in general: when you want to have multiple modules that share weights with each other, you can actually do this in Torch, but it's kind of brittle and you can run into subtle bugs. That's probably the biggest caveat — recurrent networks can be tricky.

Any questions about Torch? [Inaudible question.] Yeah — but it's not out of the question. [Another question, about for-loop performance.] The question was about how bad for loops are. Python is interpreted, and that's really why for loops are really bad in Python: every for loop is actually doing quite a lot of memory allocation and other things behind the scenes. But if you've ever used JavaScript, loops in JavaScript tend to be pretty fast, because the runtime just compiles the code on the fly down to native code; Lua actually has a similar mechanism, where it'll sort of automatically and magically compile your code down to native code, so your loops can be really fast. That said, writing custom vectorized code can still give you a lot of speedup.

All right — we've got maybe half an hour left to cover two more frameworks, so we're running out of time.
Next up is Theano. Theano is from Yoshua Bengio's group at the University of Montreal, and it's really all about computational graphs. We saw a little bit with nngraph in Torch that computational graphs are this pretty nice way to stitch together big, complicated architectures; Theano really takes this idea of computational graphs and runs with it to the extreme. It also has some high-level libraries, Keras and Lasagne, that we'll touch on as well.

Here's the same computational graph we saw in the nngraph context before, and we can walk through an implementation of it in Theano. You can see that here we're importing theano and the theano.tensor module, and we're defining x, y, and z as symbolic variables; this is actually very similar to the nngraph example we saw just a few slides ago. These are not numpy arrays — they're symbolic objects in the computational graph. Then we can compute the outputs symbolically: x, y, and z are these symbolic things, and we can compute a, b, and c just using overloaded operators, which builds up the computational graph in the background.

Once we've built up our computational graph, we actually want to be able to run certain parts of it on real data, and for that we call this theano.function thing. This says we want a function that will take inputs x, y, and z and produce output c, and it returns an actual Python function that we can evaluate on real data. I'd like to point out that this is really where all the magic in Theano happens: when you call theano.function, it can be doing crazy things — it can simplify your computational graph to make it more efficient, it can do symbolic manipulations, and it can generate native code, so when you call it, Theano sometimes compiles code on the fly to run efficiently on the GPU. All the magic in Theano really comes from this innocent-looking statement in Python; there's a lot going on under the hood. And once we've gotten this magic function through all that crazy stuff, we can just run it on actual numpy arrays: here we instantiate x, y, and z as actual numpy arrays and then call our function, passing in these actual arrays to get the values out. This is doing the same thing as doing these computations explicitly in Python, except that the Theano version could be much more efficient due to all the magic under the hood, and could actually be running on the GPU if you have it configured.
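Here is a runnable sketch of that graph — c = (x + y) + (x + y)·z — using scalar inputs for simplicity (the slide may well use matrices):

~~~python
import theano
import theano.tensor as T

# Build the graph symbolically; nothing is computed yet.
x = T.scalar('x')
y = T.scalar('y')
z = T.scalar('z')

a = x + y            # overloaded operators add nodes to the graph
b = a * z
c = a + b

# All the compilation magic happens in this one call.
f = theano.function(inputs=[x, y, z], outputs=c)

print(f(1.0, 2.0, 3.0))   # evaluate on real data -> 12.0
~~~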
But unfortunately we don't really care about computing things like this — we want to train neural nets. So here's an example of a simple two-layer ReLU network in Theano. The idea is the same: we're going to declare our inputs, but now instead of just x, y, and z we have our input data x, our labels y, and our two weight matrices w1 and w2; we're just setting up these symbolic variables that will be elements in our computational graph. Now, the forward pass looks kind of like numpy, but it's not: these are operations on the symbolic objects that build up the graph in the background. Here we're computing activations with this dot method, which is a matrix multiply between symbolic objects; we're doing a ReLU using this library function; and we're doing another matrix multiply. Then we can compute the probabilities and the loss using a couple of other library functions, and again these are all operations on symbolic objects that are building up the computational graph. Then we just compile a function: it's going to take our data, our labels, and our two weight matrices as inputs, and as outputs it will return the loss, a scalar, and our classification scores, a vector. And now we can run this thing on real data, just like we saw in the previous slide: we instantiate some actual numpy arrays and pass them to the function.

So this is great, but this is only the forward pass; to actually be able to train this network we need to compute gradients, and here we just need to add a couple of lines of code. This is the same as before — we're defining our symbolic variables for our inputs and our weights and so forth, and we're running the same forward pass as before to compute the loss symbolically. The difference is that now we can do symbolic differentiation: these are dw1 and dw2, and we're telling Theano that we want those to be the gradients of the loss with respect to those other symbolic variables, w1 and w2. This is really cool — Theano just lets you take arbitrary gradients of any part of the graph with respect to any other part of the graph, and introduces those as new symbolic variables in the graph, so you can really go crazy with that. But here, in this case, we're just going to return those gradients as outputs: we compile a new function that again takes our input pixels x and our labels y, along with the two weight matrices, and now returns our loss, the classification scores, and also these two gradients.

Now we can actually use this setup to train a very simple neural network — we can implement gradient descent in just a couple of lines using this computational graph. Here we're instantiating actual numpy arrays for the data and the labels, and some random matrices, again actual numpy arrays, for the weights; every time we make this call we get back numpy arrays containing the loss, the scores, and the gradients. And now that we have the gradients, we can just make a simple gradient update on our weights and repeat that in a loop to train our network.
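Pulling those pieces together, here is a sketch of the whole setup just described — the sizes and variable names are mine, not the slide's — including the naive training loop that the next paragraph criticizes:

~~~python
import numpy as np
import theano
import theano.tensor as T

# Symbolic inputs: data, integer labels, and two weight matrices.
x = T.matrix('x')
y = T.ivector('y')
w1 = T.matrix('w1')
w2 = T.matrix('w2')

h = T.maximum(0, x.dot(w1))                 # matmul + ReLU
scores = h.dot(w2)
probs = T.nnet.softmax(scores)
loss = T.nnet.categorical_crossentropy(probs, y).mean()

dw1, dw2 = T.grad(loss, [w1, w2])           # symbolic differentiation
f = theano.function([x, y, w1, w2], [loss, scores, dw1, dw2])

# Naive training loop: the weights live outside the graph as numpy arrays.
N, D, H, C = 64, 1000, 100, 10
xx = np.random.randn(N, D)
yy = np.random.randint(C, size=N).astype('int32')
ww1 = 1e-2 * np.random.randn(D, H)
ww2 = 1e-2 * np.random.randn(H, C)
lr = 1e-2
for t in range(100):
    loss_t, _, g1, g2 = f(xx, yy, ww1, ww2)
    ww1 -= lr * g1                          # gradient step on the CPU
    ww2 -= lr * g2
~~~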
But there's actually a bit of a problem with this, especially if you're running on a GPU — anyone want to guess? The problem is that this is actually incurring a lot of communication overhead between the CPU and the GPU, because every time we call the function and get back these gradients, that's copying the gradients from the GPU back to the CPU, which can be an expensive operation, and then we're making our gradient step as a CPU computation in numpy. It would be really nice if we could make those gradient updates to our parameters directly on the GPU, and the way we do that in Theano is with this cool thing called a shared variable. A shared variable is another part of the graph: it's a value that lives inside the computational graph and persists from call to call.

This is actually quite similar to before: we're defining our same symbolic variables x and y for the data and labels, and now we're defining a couple of these new funky shared variables for our two weight matrices, initializing them with numpy arrays. This is the exact same code as before for computing the forward pass and the symbolic gradients; the difference is in how we define our function. The compiled function now does not receive the weights as inputs — those actually live inside the computational graph — instead we just receive the data and the labels, and we output the loss rather than outputting the gradients explicitly; and instead we provide these update rules that should be run every time the function is called. These update rules are little functions that operate on the symbolic variables: this is just saying that we should make these gradient-descent steps to update w1 and w2 every time we run this computational graph.

So to train this network, all we need to do is call this function repeatedly: every time we call it, the updates make a gradient step on the weights, and we can train the network by just calling this thing repeatedly. In practice, when you're doing this kind of thing in Theano, you'll often define a train function whose call updates the weights, and also an evaluate function that just outputs the scores and doesn't make any updates; you can have multiple of these compiled functions that evaluate different parts of the same graph.
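Here is a sketch of that shared-variable version — sizes are arbitrary — where the gradient step happens inside the graph and so stays on the GPU when one is configured:

~~~python
import numpy as np
import theano
import theano.tensor as T

D, H, C = 1000, 100, 10
x = T.matrix('x')
y = T.ivector('y')
# Shared variables live inside the graph and persist from call to call.
w1 = theano.shared(1e-2 * np.random.randn(D, H), name='w1')
w2 = theano.shared(1e-2 * np.random.randn(H, C), name='w2')

h = T.maximum(0, x.dot(w1))
probs = T.nnet.softmax(h.dot(w2))
loss = T.nnet.categorical_crossentropy(probs, y).mean()
dw1, dw2 = T.grad(loss, [w1, w2])

lr = 1e-2
train = theano.function(
    inputs=[x, y], outputs=loss,
    updates=[(w1, w1 - lr * dw1),    # update rules run on every call,
             (w2, w2 - lr * dw2)])   # inside the graph

for t in range(100):                 # training = just calling it repeatedly
    train(np.random.randn(64, D),
          np.random.randint(C, size=64).astype('int32'))
~~~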
[Question about how gradients are computed.] The question is how it computes gradients, and it does it symbolically — does it parse the AST? Well, it's not actually parsing the AST, because every time you make these calls it's building up this computational graph object, and then you can compute gradients by just adding nodes onto that graph object. Yeah — it needs to know, for each of these basic operators, what the derivative is, and then it's the normal backpropagation that you've seen; but the pitch with Theano is that it works at these very low-level basic operations, like elementwise things and matrix multiplies, and the hope is that it can compile efficient code by combining those and simplifying them symbolically. I'm not sure how well that works, but that's at least what they claim to do.

There are a lot of other advanced things you can do in Theano that we just don't have time to talk about. You can include conditionals directly inside your computational graph using these ifelse and switch commands; you can include loops inside your computational graph using this funny scan function, which I don't really understand, but theoretically it lets you implement recurrent networks quite easily — as you can imagine, to implement a recurrent network in one of these computational graphs, all you're doing is passing the same weight matrix into multiple nodes, and scan lets you do that in a loop and have the loop be an explicit part of the graph. You can also go crazy with derivatives: you can compute derivatives of any part of the graph with respect to any other part, you can compute Jacobians by computing derivatives of derivatives, and you can use the L- and R-operators to efficiently do matrix-vector multiplies between Jacobians and vectors. You can do a lot of pretty cool derivative tricks in Theano — maybe more than in other frameworks — and it also has some support for sparse matrices, it tries to optimize your code on the fly, and some other cool things. Theano does have multi-GPU support: there's a package that I have not used, but that claims to give you data parallelism, so splitting minibatches over multiple GPUs, and there's experimental support for model parallelism, where the computational graph gets divided among the different devices — but the documentation says it's experimental, so it's probably really experimental.

So, you saw when working with Theano that the API is a little bit low-level, and we needed to implement the update rules and everything ourselves. Lasagne is a high-level wrapper around Theano that abstracts away some of those details for you. Again we're sort of defining symbolic matrices, but Lasagne now has these layer functions that automatically set up the shared variables and that sort of thing; we can compute the probabilities and the loss using these convenient things from the library, and Lasagne can actually write the update rules for us, implementing Nesterov momentum and other fancy things. Then when we compile our function, we just pass in these update rules that Lasagne wrote for us, and all of the weight objects are taken care of by Lasagne as well. At the end of the day we just end up with one of these compiled Theano functions, and we use it the same way as before.
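A sketch of that Lasagne version, with arbitrary layer sizes — worth double-checking against the Lasagne docs for your version:

~~~python
import lasagne
import theano
import theano.tensor as T

x = T.matrix('x')
y = T.ivector('y')

# Layer functions set up the shared variables for the weights internally.
l_in = lasagne.layers.InputLayer((None, 1000), input_var=x)
l_hid = lasagne.layers.DenseLayer(
    l_in, num_units=100, nonlinearity=lasagne.nonlinearities.rectify)
l_out = lasagne.layers.DenseLayer(
    l_hid, num_units=10, nonlinearity=lasagne.nonlinearities.softmax)

probs = lasagne.layers.get_output(l_out)
loss = lasagne.objectives.categorical_crossentropy(probs, y).mean()

# Lasagne writes the (Nesterov momentum) update rules for us.
params = lasagne.layers.get_all_params(l_out, trainable=True)
updates = lasagne.updates.nesterov_momentum(
    loss, params, learning_rate=1e-2, momentum=0.9)

train = theano.function([x, y], loss, updates=updates)
~~~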
+
+745
+00:55:25,789 --> 00:55:29,759
+And now we can train our network by just
+using the model.fit method. So this is super
+
+746
+00:55:29,760 --> 00:55:36,570
+high level, and you can't even tell that you're
+using Theano; in fact Keras can use
+
+747
+00:55:36,570 --> 00:55:40,289
+TensorFlow as a backend as well, so you don't
+have to use Theano with it. But there's
+
+748
+00:55:40,289 --> 00:55:44,500
+actually one big problem with this piece
+of code, and I don't know if you have
+
+749
+00:55:44,500 --> 00:55:49,219
+experience with Theano, but this code
+actually crashes, and it crashes in a
+
+750
+00:55:49,219 --> 00:55:54,750
+really bad way. This is the error message:
+we get this giant stack trace, none of
+
+751
+00:55:54,750 --> 00:55:58,380
+which is from any of the code that we
+wrote, and we get this giant ValueError
+
+752
+00:55:58,380 --> 00:56:03,440
+that doesn't make any sense to me. I'm
+not really an expert in Theano, so this
+
+753
+00:56:03,440 --> 00:56:07,039
+was really confusing to me: we wrote
+this simple-looking code in Keras,
+
+754
+00:56:07,039 --> 00:56:11,259
+but because it's using Theano as a
+backend it crapped out and gave us this
+
+755
+00:56:11,260 --> 00:56:15,030
+really confusing error message. That's,
+I think, one of the common pain points
+
+756
+00:56:15,030 --> 00:56:18,730
+and failure cases with anything that
+uses Theano as a backend: debugging can
+
+757
+00:56:18,730 --> 00:56:24,949
+be kind of hard. So like any good developer
+I googled the error, and I found out that I
+
+758
+00:56:24,949 --> 00:56:28,659
+was encoding the
+y variable wrong, and I was
+
+759
+00:56:28,659 --> 00:56:32,579
+supposed to use this other
+function to convert my y variable and
+
+760
+00:56:32,579 --> 00:56:35,690
+make the problem go away, but that was
+not obvious from the error message, so
+
+761
+00:56:35,690 --> 00:56:41,139
+that's something to be
+worried about when using Theano. Theano
+
+762
+00:56:41,139 --> 00:56:44,699
+actually has pretrained models: so we talked
+about Lasagne, and Lasagne
+
+763
+00:56:44,699 --> 00:56:48,539
+actually has a pretty good model zoo giving
+you a lot of the different popular model
+
+764
+00:56:48,539 --> 00:56:52,820
+architectures that you might want: in
+Lasagne you can use AlexNet and GoogLeNet
+
+765
+00:56:52,820 --> 00:56:56,190
+and VGG; I don't think they have
+ResNet yet, but they have quite a lot
+
+766
+00:56:56,190 --> 00:57:00,320
+of useful things there. And there are a
+couple other packages I found, none of
+
+767
+00:57:00,320 --> 00:57:04,550
+which really seemed as good, except I
+mean this one was clearly awesome because it
+
+768
+00:57:04,550 --> 00:57:07,030
+was a CS231n project from last year,
+
+769
+00:57:07,030 --> 00:57:10,330
+but if you're gonna pick one I think
+probably the Lasagne model zoo is
+
+770
+00:57:10,329 --> 00:57:16,139
+really good. So from my one-day
+experience of playing with Theano, the
+
+771
+00:57:16,139 --> 00:57:20,029
+pros and cons that I could see
+were that it's Python and numpy,
+
+772
+00:57:20,030 --> 00:57:20,890
+that's great;
+
+773
+00:57:20,889 --> 00:57:23,920
+this computational graph seems like a
+really powerful idea, especially around
+
+774
+00:57:23,920 --> 00:57:28,760
+computing gradients symbolically and all
+these optimizations, and especially with
+
+775
+00:57:28,760 --> 00:57:32,070
+RNNs, which I think would be much easier to
+implement using this computational graph;
+
+776
+00:57:32,070 --> 00:57:37,570
+but raw Theano is kind of ugly and gross.
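+
+For reference, that Keras snippet, together with the one-hot label
+conversion the confusing error was really asking for, comes out roughly
+like this (our sketch against the 2016-era Keras API; the sizes and
+fake data are illustrative):
+
+~~~python
+import numpy as np
+from keras.models import Sequential
+from keras.layers.core import Dense, Activation
+from keras.optimizers import SGD
+from keras.utils import np_utils
+
+# sequential container with a stack of layers, Torch-style
+model = Sequential()
+model.add(Dense(100, input_dim=784))
+model.add(Activation('relu'))
+model.add(Dense(10))
+model.add(Activation('softmax'))
+
+sgd = SGD(lr=1e-2, momentum=0.9)   # the SGD object does updates for us
+model.compile(loss='categorical_crossentropy', optimizer=sgd)
+
+X = np.random.randn(64, 784)
+y = np.random.randint(0, 10, 64)            # integer labels...
+Y = np_utils.to_categorical(y, 10)          # ...converted to one-hot
+model.fit(X, Y, nb_epoch=5, batch_size=32)  # train with model.fit
+~~~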
+
+777
+00:57:37,570 --> 00:57:41,470
+But Lasagne especially looks pretty good to
+me, and sort of takes away some of the
+
+778
+00:57:41,469 --> 00:57:46,279
+pain. The error messages can be pretty
+painful, as we saw, and big models, from
+
+779
+00:57:46,280 --> 00:57:51,190
+what I've heard, can have really long
+compile times: when we're
+
+780
+00:57:51,190 --> 00:57:54,579
+compiling that function on the fly for
+all these simple examples it pretty
+
+781
+00:57:54,579 --> 00:57:58,159
+much runs instantaneously, but for
+big complicated things like neural
+
+782
+00:57:58,159 --> 00:58:01,969
+Turing machines I've heard stories that
+it could actually take maybe half an
+
+783
+00:58:01,969 --> 00:58:06,239
+hour to compile, and that's not
+good for iterating
+
+784
+00:58:06,239 --> 00:58:10,509
+quickly on your models. Another sort
+of pain point is that the codebase is much
+
+785
+00:58:10,510 --> 00:58:13,470
+fatter than Torch's, and it's doing all
+this complicated stuff in the background,
+
+786
+00:58:13,469 --> 00:58:17,969
+so it's kind of hard to understand and
+debug what's actually happening to your
+
+787
+00:58:17,969 --> 00:58:22,569
+code. And the pretrained models are maybe
+not quite as good as Caffe or Torch, but
+
+788
+00:58:22,570 --> 00:58:30,320
+it looks like Lasagne is pretty good.
+OK, so we've got fifteen minutes now to
+
+789
+00:58:30,320 --> 00:58:38,309
+talk about TensorFlow, although first, if
+there's any questions about Theano I
+
+790
+00:58:38,309 --> 00:58:42,809
+can try. OK, so next is TensorFlow.
+TensorFlow is from Google; it's really
+
+791
+00:58:42,809 --> 00:58:47,829
+cool and shiny and new and everyone's
+excited about it, and it's actually very
+
+792
+00:58:47,829 --> 00:58:51,170
+similar to Theano in a lot of ways:
+they're really taking this idea of a
+
+793
+00:58:51,170 --> 00:58:55,650
+computational graph and building on
+that for everything, so TensorFlow and
+
+794
+00:58:55,650 --> 00:58:59,090
+Theano are actually very closely linked
+in my mind, so much so that Keras
+
+795
+00:58:59,090 --> 00:59:04,760
+can get away with using either
+one as a backend. And one point to
+
+796
+00:59:04,760 --> 00:59:07,200
+make about TensorFlow is that
+it's sort of the first one of these
+
+797
+00:59:07,199 --> 00:59:10,750
+frameworks that was designed from the
+ground up by professional engineers,
+
+798
+00:59:10,750 --> 00:59:14,000
+so a lot of the other frameworks sort of
+spun out of academic research labs, and
+
+799
+00:59:14,000 --> 00:59:17,320
+they're really great and they let you do
+things really well, but they were sort of
+
+800
+00:59:17,320 --> 00:59:23,120
+maintained by grad students;
+Torch especially is maintained by
+
+801
+00:59:23,119 --> 00:59:26,500
+some engineers at Twitter and Facebook
+now, but it was originally an academic
+
+802
+00:59:26,500 --> 00:59:30,070
+project, and of all of these I think
+TensorFlow was the first one that was
+
+803
+00:59:30,070 --> 00:59:35,000
+built from the ground up in
+an industrial place, so maybe theoretically
+
+804
+00:59:35,000 --> 00:59:37,989
+that could lead to better code quality
+or test coverage or something; I don't know,
+
+805
+00:59:37,989 --> 01:00:04,519
+I'm not sure, it seemed pretty scary. So
+here's our favorite two-layer
+
+806
+01:00:04,519 --> 01:00:07,389
+ReLU network that we did in all the
+other frameworks, let's do it in
+
+807
+01:00:07,389 --> 01:00:12,769
+TensorFlow. This is actually really
+similar to Theano: you can see that we're
+
+808
+01:00:12,769 --> 01:00:17,320
+importing tensorflow, and where in Theano
+we had these matrix and vector
+
+809
+01:00:17,320 --> 01:00:21,019
+symbolic variables, in TensorFlow
+they're called placeholders, but it's the
+
+810
+01:00:21,019 --> 01:00:26,380
+same idea: these are just creating input
+nodes in our computational graph. We're
+
+811
+01:00:26,380 --> 01:00:30,650
+also going to define the weight matrices:
+in Theano we had these shared things
+
+812
+01:00:30,650 --> 01:00:34,490
+that lived inside the computational graph;
+same idea in TensorFlow, they're called
+
+813
+01:00:34,489 --> 01:00:40,359
+variables. Just like in
+Theano, we compute our forward pass using
+
+814
+01:00:40,360 --> 01:00:44,610
+these library methods that
+operate symbolically on these things
+
+815
+01:00:44,610 --> 01:00:48,289
+and build up a computational graph, so
+that lets you easily compute the
+
+816
+01:00:48,289 --> 01:00:52,210
+probabilities and the loss and
+everything like that symbolically. This
+
+817
+01:00:52,210 --> 01:00:56,190
+actually, to me, looks a little
+bit more like Keras
+
+818
+01:00:56,190 --> 01:01:00,740
+or Lasagne than raw Theano:
+we're using this gradient descent
+
+819
+01:01:00,739 --> 01:01:04,669
+optimizer and we're telling it to
+minimize the loss, so here we're not
+
+820
+01:01:04,670 --> 01:01:08,970
+explicitly spitting out gradients
+and we're not explicitly writing out
+
+821
+01:01:08,969 --> 01:01:13,489
+the update rules; we're instead using
+this optimizer thing that just sort of adds
+
+822
+01:01:13,489 --> 01:01:19,250
+whatever it needs to into the graph in
+order to minimize that loss. And now, just
+
+823
+01:01:19,250 --> 01:01:23,059
+like in Theano, we can actually
+instantiate it using actual numpy
+
+824
+01:01:23,059 --> 01:01:23,779
+arrays,
+
+825
+01:01:23,780 --> 01:01:29,470
+some small dataset, and then we can
+run it in a loop. So in TensorFlow, when
+
+826
+01:01:29,469 --> 01:01:33,750
+you actually want to run your code,
+you need to wrap it in this
+
+827
+01:01:33,750 --> 01:01:39,199
+session block. At first I didn't understand
+what it was doing, but
+
+828
+01:01:39,199 --> 01:01:42,599
+actually what it's doing is that
+everything outside is just
+
+829
+01:01:42,599 --> 01:01:45,869
+setting up your computational graph, and
+the session is actually doing
+
+830
+01:01:45,869 --> 01:01:48,440
+whatever optimization it needs to
+actually run it.
+
+831
+01:01:48,440 --> 01:01:58,110
+Yeah, so the question
+is what is one-hot: if you remember, in
+
+832
+01:01:58,110 --> 01:02:01,840
+your assignments, when you did like a
+softmax loss function, y was
+
+833
+01:02:01,840 --> 01:02:06,170
+always an integer telling you which
+class you wanted, but in some of these
+
+834
+01:02:06,170 --> 01:02:11,420
+frameworks, instead of an integer, it
+should be a vector where everything is
+
+835
+01:02:11,420 --> 01:02:15,090
+zero except for the one that was the
+correct class. That was actually the
+
+836
+01:02:15,090 --> 01:02:20,420
+bug that tripped me up in Keras back
+there: it was the difference between one-hot
+
+837
+01:02:20,420 --> 01:02:28,710
+and not one-hot, and it turns out
+TensorFlow wants one-hot. Right, so then
+
+838
+01:02:28,710 --> 01:02:34,250
+when we actually want to train this
+network:
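+
+Putting those pieces together, the whole thing comes out roughly like
+this (our own sketch against the 2016-era TensorFlow API, not the
+lecture's exact code; the shapes and the fake one-hot data are
+illustrative):
+
+~~~python
+import numpy as np
+import tensorflow as tf
+
+x = tf.placeholder(tf.float32, shape=[None, 784])  # input graph nodes
+y = tf.placeholder(tf.float32, shape=[None, 10])   # one-hot labels
+
+w1 = tf.Variable(tf.random_normal([784, 100]) * 0.01)  # live in the graph
+w2 = tf.Variable(tf.random_normal([100, 10]) * 0.01)
+
+h = tf.nn.relu(tf.matmul(x, w1))                   # symbolic forward pass
+scores = tf.matmul(h, w2)
+loss = tf.reduce_mean(
+    tf.nn.softmax_cross_entropy_with_logits(scores, y))
+
+# the optimizer adds whatever update nodes it needs into the graph
+train_step = tf.train.GradientDescentOptimizer(1e-2).minimize(loss)
+
+xs = np.random.randn(64, 784).astype(np.float32)   # fake minibatch
+ys = np.zeros((64, 10), dtype=np.float32)
+ys[np.arange(64), np.random.randint(10, size=64)] = 1  # one-hot labels
+
+with tf.Session() as sess:                         # graph setup above,
+    sess.run(tf.initialize_all_variables())        # execution in here
+    for _ in range(100):
+        _, l = sess.run([train_step, loss], feed_dict={x: xs, y: ys})
+~~~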
+
+839
+01:02:34,250 --> 01:02:37,610
+remember, in Theano we compiled this
+function object and then called the
+
+840
+01:02:37,610 --> 01:02:41,940
+function over and over again; the
+equivalent in TensorFlow is that we
+
+841
+01:02:41,940 --> 01:02:46,409
+call the run method on the session
+object, and we tell it which
+
+842
+01:02:46,409 --> 01:02:50,349
+outputs we want it to compute. So
+here we're telling it that we want to
+
+843
+01:02:50,349 --> 01:02:54,769
+compute the train step output and
+the loss output, and we're gonna feed
+
+844
+01:02:54,769 --> 01:02:57,699
+these numpy arrays into these inputs, so
+this is kind of the same idea as Theano
+
+845
+01:02:57,699 --> 01:03:02,210
+except we're just calling the run method
+rather than explicitly compiling a
+
+846
+01:03:02,210 --> 01:03:06,179
+function, and in the process
+of evaluating this train step object
+
+847
+01:03:06,179 --> 01:03:10,690
+it'll actually make a gradient descent step
+on the weights. So then we just run this thing
+
+848
+01:03:10,690 --> 01:03:16,450
+in a loop, and the loss goes down
+and everything is beautiful. So one of
+
+849
+01:03:16,449 --> 01:03:20,519
+the really cool things about TensorFlow
+is this thing called TensorBoard, which
+
+850
+01:03:20,519 --> 01:03:24,880
+lets you really easily visualize what's
+going on in your network. So here is
+
+851
+01:03:24,880 --> 01:03:29,150
+pretty much the same code that we had
+before, except we've added these three
+
+852
+01:03:29,150 --> 01:03:34,280
+little lines; hopefully you can see it, if
+not you'll have to trust me. So here
+
+853
+01:03:34,280 --> 01:03:37,200
+we're computing a scalar summary of the
+loss, and that's giving us a new symbolic
+
+854
+01:03:37,199 --> 01:03:40,929
+variable, the loss summary; and we're computing
+histogram summaries of the weight matrices
+
+855
+01:03:40,929 --> 01:03:46,049
+W1 and W2, which is also getting us new
+symbolic variables, W1 hist and W2
+
+856
+01:03:46,050 --> 01:03:51,390
+hist. Then we're getting another
+symbolic variable called merged that
+
+857
+01:03:51,389 --> 01:03:54,349
+merges all those summaries
+together using some magic I don't
+
+858
+01:03:54,349 --> 01:03:58,929
+fully understand, and we're getting this
+summary writer object that we can use to
+
+859
+01:03:58,929 --> 01:04:03,000
+actually dump out those summaries to
+disk. And now, in our loop, when we're
+
+860
+01:04:03,000 --> 01:04:06,570
+actually running the network, we
+tell it to evaluate the
+
+861
+01:04:06,570 --> 01:04:10,460
+train step and the loss like before, but
+also this merged summary object,
+
+862
+01:04:10,460 --> 01:04:14,190
+and in the process of evaluating this
+merged summary object it'll compute
+
+863
+01:04:14,190 --> 01:04:17,690
+histograms of the
+weights and dump those summaries to disk,
+
+864
+01:04:17,690 --> 01:04:22,019
+and then we tell our writer to actually
+add the summaries; I guess that's where
+
+865
+01:04:22,019 --> 01:04:26,610
+the writing to disk happens. So once you
+run this thing, while it's
+
+866
+01:04:26,610 --> 01:04:28,890
+running, it's sort of
+constantly streaming all this
+
+867
+01:04:28,889 --> 01:04:33,069
+information about what's going on in
+your network to disk, and then you just
+
+868
+01:04:33,070 --> 01:04:37,480
+start up this web server that ships
+with TensorFlow called TensorBoard.
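+
+In code, those extra summary lines look roughly like this (a standalone
+sketch of ours against the 2016-era summary API; the tiny model and the
+log path are made up):
+
+~~~python
+import numpy as np
+import tensorflow as tf
+
+x = tf.placeholder(tf.float32, [None, 4])
+w = tf.Variable(tf.random_normal([4, 2]))
+loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
+train_step = tf.train.GradientDescentOptimizer(1e-2).minimize(loss)
+
+tf.scalar_summary('loss', loss)     # scalar summary of the loss
+tf.histogram_summary('w', w)        # histogram summary of the weights
+merged = tf.merge_all_summaries()   # one node that merges them all
+
+sess = tf.Session()
+sess.run(tf.initialize_all_variables())
+writer = tf.train.SummaryWriter('/tmp/tf_logs', sess.graph_def)
+
+xs = np.random.randn(32, 4).astype(np.float32)
+for step in range(50):
+    summary, _ = sess.run([merged, train_step], feed_dict={x: xs})
+    writer.add_summary(summary, step)  # stream summaries to disk
+# then run `tensorboard --logdir=/tmp/tf_logs` and open the browser
+~~~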
+
+869
+01:04:37,480 --> 01:04:41,420
+And we get these beautiful visualizations
+of what's going on in your network: so
+
+870
+01:04:41,420 --> 01:04:42,539
+here, on the left,
+
+871
+01:04:42,539 --> 01:04:46,230
+remember we were getting a
+scalar summary of the loss, so this
+
+872
+01:04:46,230 --> 01:04:49,360
+actually shows that the loss was going down;
+I mean, it was a big
+
+873
+01:04:49,360 --> 01:04:52,760
+network and a small dataset, but that
+means everything is working. And this
+
+874
+01:04:52,760 --> 01:04:56,860
+over here on the right-hand side is showing
+you histograms over time, showing you the
+
+875
+01:04:56,860 --> 01:05:00,900
+distributions of the values in your
+weight matrices. So this stuff is
+
+876
+01:05:00,900 --> 01:05:04,579
+really really cool, and I think this is a
+really beautiful debugging tool:
+
+877
+01:05:04,579 --> 01:05:09,289
+when I've been working on
+projects in Torch I've written this
+
+878
+01:05:09,289 --> 01:05:11,250
+kind of stuff myself by hand,
+
+879
+01:05:11,250 --> 01:05:14,900
+just kind of dumping JSON blobs out of
+Torch and then writing my own custom
+
+880
+01:05:14,900 --> 01:05:18,369
+visualizers to view
+these kinds of statistics, because they're
+
+881
+01:05:18,369 --> 01:05:21,609
+really useful; and with TensorFlow you
+don't have to write any of that yourself:
+
+882
+01:05:21,610 --> 01:05:25,019
+you just add a couple lines of code to your
+training script, run their thing, and
+
+883
+01:05:25,019 --> 01:05:27,489
+you can get all these beautiful
+visualizations to help your debugging.
+
+884
+01:05:27,489 --> 01:05:35,059
+TensorBoard can also help
+you visualize what your network
+
+885
+01:05:35,059 --> 01:05:39,820
+structure looks like: so here we've
+annotated our variables with these names,
+
+886
+01:05:39,820 --> 01:05:43,510
+and now when we're doing the forward
+pass we can actually scope some of the
+
+887
+01:05:43,510 --> 01:05:47,450
+computations under a namespace, and that
+sort of lets us group together
+
+888
+01:05:47,449 --> 01:05:48,949
+computations that
+
+889
+01:05:48,949 --> 01:05:52,519
+should belong together semantically; other
+than that, it's the
+
+890
+01:05:52,519 --> 01:05:56,949
+same thing that we saw before, and now if
+we run this network and load up Tensor-
+
+891
+01:05:56,949 --> 01:06:00,909
+Board, we can actually get this
+beautiful visualization of what
+
+892
+01:06:00,909 --> 01:06:04,789
+our network actually looks like, and
+we can actually click and look and see
+
+893
+01:06:04,789 --> 01:06:07,820
+what's going on with the scores, and
+really help debug what's going on inside
+
+894
+01:06:07,820 --> 01:06:12,170
+this network. And here you see these
+loss and scores boxes:
+
+895
+01:06:12,170 --> 01:06:15,030
+these are the semantic namespaces that
+we defined during the forward pass,
+
+896
+01:06:15,030 --> 01:06:18,940
+and if we click on the scores, for
+example, it opens up and lets us see all
+
+897
+01:06:18,940 --> 01:06:22,679
+the operations that happen
+inside the computational graph at that
+
+898
+01:06:22,679 --> 01:06:28,108
+node. So I thought this was really cool:
+it lets you really easily debug
+
+899
+01:06:28,108 --> 01:06:31,039
+what's going on inside your networks
+while they're running, without having to write
+
+900
+01:06:31,039 --> 01:06:39,300
+any of the visualization code yourself.
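+
+The scoping trick looks roughly like this (a tiny sketch of ours; the
+scope names mirror the 'scores' and 'loss' boxes on the slide):
+
+~~~python
+import tensorflow as tf
+
+x = tf.placeholder(tf.float32, [None, 784], name='x')
+y = tf.placeholder(tf.float32, [None, 10], name='y')
+w1 = tf.Variable(tf.random_normal([784, 100]), name='w1')
+w2 = tf.Variable(tf.random_normal([100, 10]), name='w2')
+
+with tf.name_scope('scores'):        # collapses to one 'scores' node
+    h = tf.nn.relu(tf.matmul(x, w1))
+    scores = tf.matmul(h, w2)
+
+with tf.name_scope('loss'):          # ...and one 'loss' node
+    loss = tf.reduce_mean(
+        tf.nn.softmax_cross_entropy_with_logits(scores, y))
+~~~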
+
+901
+01:06:39,300 --> 01:06:42,750
+So TensorFlow does have support for multi-GPU:
+it has data parallelism, like you might
+
+902
+01:06:42,750 --> 01:06:45,809
+expect, and I'd like to point out that
+this distribution part is probably one of the
+
+903
+01:06:45,809 --> 01:06:50,460
+other major selling points of Tensor-
+Flow: that it can try to actually
+
+904
+01:06:50,460 --> 01:06:53,338
+distribute the computational graph in
+different ways across different devices,
+
+905
+01:06:53,338 --> 01:06:57,828
+and actually place and distribute that
+graph smartly to minimize communication
+
+906
+01:06:57,829 --> 01:07:02,839
+overhead and so on. So one thing that you
+can do is data parallelism, where you
+
+907
+01:07:02,838 --> 01:07:05,559
+just split your minibatch across
+different devices and run each one
+
+908
+01:07:05,559 --> 01:07:08,409
+forward and backward, and then either
+sum the gradients to do
+
+909
+01:07:08,409 --> 01:07:12,068
+synchronous distributed training, or just
+make asynchronous updates to your
+
+910
+01:07:12,068 --> 01:07:16,730
+parameters and do asynchronous training;
+the white paper claims you can
+
+911
+01:07:16,730 --> 01:07:21,300
+do both of these things in TensorFlow,
+but I didn't try it out. You can
+
+912
+01:07:21,300 --> 01:07:25,000
+also actually do model parallelism in
+TensorFlow as well, which lets you
+
+913
+01:07:25,000 --> 01:07:27,829
+split up the same model and compute
+different parts of the same model on
+
+914
+01:07:27,829 --> 01:07:32,190
+different devices. So here's an example
+of one place where that might be useful:
+
+915
+01:07:32,190 --> 01:07:36,510
+a multi-layer recurrent network, where it
+might actually be a good idea to run
+
+916
+01:07:36,510 --> 01:07:39,900
+different layers of the network on
+different devices, because those things can
+
+917
+01:07:39,900 --> 01:07:42,838
+actually take a lot of memory; so that's
+the type of thing that you can
+
+918
+01:07:42,838 --> 01:07:47,599
+do in TensorFlow
+without too much pain.
+
+919
+01:07:47,599 --> 01:07:51,599
+TensorFlow is also the only one of the
+frameworks that can run in distributed
+
+920
+01:07:51,599 --> 01:07:56,000
+mode: not just spread across one machine
+and multiple GPUs, but actually
+
+921
+01:07:56,000 --> 01:07:58,309
+distribute the training of a model across
+many machines.
+
+922
+01:07:58,309 --> 01:08:04,709
+The caveat here is that that part
+is not open source yet: as of today,
+
+923
+01:08:04,708 --> 01:08:08,328
+the open source release of TensorFlow
+can only do single-machine multi-GPU
+
+924
+01:08:08,329 --> 01:08:13,890
+training, but hopefully soon
+that part will be released; that would be really
+
+925
+01:08:13,889 --> 01:08:16,500
+cool. Right, so here
+
+926
+01:08:16,500 --> 01:08:22,069
+the idea is that TensorFlow
+is aware of communication costs, both
+
+927
+01:08:22,069 --> 01:08:26,489
+between GPU and CPU but also between
+different machines on the network, so
+
+928
+01:08:26,488 --> 01:08:30,118
+that it can try to smartly distribute
+the computational graph across different
+
+929
+01:08:30,118 --> 01:08:33,750
+machines, and across different devices
+within those machines, to compute
+
+930
+01:08:33,750 --> 01:08:37,649
+everything as efficiently as possible. So
+I think that's really cool, and
+
+931
+01:08:37,649 --> 01:08:41,629
+that's something that the other
+frameworks just can't do right now.
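+
+Schematically, pinning parts of the graph to devices looks like this
+(our sketch; the two-GPU split of an imaginary two-layer model is just
+for illustration):
+
+~~~python
+import tensorflow as tf
+
+x = tf.placeholder(tf.float32, [None, 256])
+
+with tf.device('/gpu:0'):                 # first layer lives on GPU 0
+    w1 = tf.Variable(tf.random_normal([256, 256]))
+    h1 = tf.nn.relu(tf.matmul(x, w1))
+
+with tf.device('/gpu:1'):                 # second layer lives on GPU 1
+    w2 = tf.Variable(tf.random_normal([256, 10]))
+    scores = tf.matmul(h1, w2)
+~~~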
+
+932
+01:08:41,630 --> 01:08:46,409
+One pain point with TensorFlow is pretrained
+models: I did a thorough
+
+933
+01:08:46,408 --> 01:08:51,448
+Google search, and the only thing I could
+come up with was a
+
+934
+01:08:51,448 --> 01:08:56,028
+pretrained Inception model, but it's only
+accessible through this Android example,
+
+935
+01:08:56,029 --> 01:08:59,569
+this Android demo, so that's
+somewhere I would have expected
+
+936
+01:08:59,569 --> 01:09:04,219
+clearer documentation, but at
+least you have that one pretrained model.
+
+937
+01:09:04,219 --> 01:09:09,109
+Other than that, I'm not
+really aware of other pretrained models
+
+938
+01:09:09,109 --> 01:09:12,109
+in TensorFlow, but maybe they're
+out there and I
+
+939
+01:09:12,109 --> 01:09:13,230
+just don't know about them;
+
+940
+01:09:13,229 --> 01:09:19,729
+no? OK, so I googled correctly.
+So TensorFlow pros and cons,
+
+941
+01:09:19,729 --> 01:09:23,689
+again from my quick one-day experiment:
+it's really good because it's Python and
+
+942
+01:09:23,689 --> 01:09:27,928
+numpy, that's really cool; like Theano it has
+this idea of a computational graph,
+
+943
+01:09:27,929 --> 01:09:32,289
+which I think is super powerful, and it
+actually takes this idea of
+
+944
+01:09:32,289 --> 01:09:35,948
+computational graphs even farther than
+Theano, really, and things like
+
+945
+01:09:35,948 --> 01:09:40,000
+checkpointing and distributing across
+devices all end up as just
+
+946
+01:09:40,000 --> 01:09:46,380
+nodes inside the computational graph
+for TensorFlow; that's really cool. It also claims
+
+947
+01:09:46,380 --> 01:09:49,520
+to have much faster compile times
+than Theano; I've heard horror
+
+948
+01:09:49,520 --> 01:09:53,670
+stories about neural Turing machines
+taking half an hour to compile, so maybe
+
+949
+01:09:53,670 --> 01:09:59,219
+that should be faster in TensorFlow.
+I've heard TensorBoard looks
+
+950
+01:09:59,219 --> 01:10:03,369
+awesome; that looks amazing, I want to use
+that everywhere.
+
+951
+01:10:03,369 --> 01:10:07,340
+It has really cool data and model
+parallelism, I think much more advanced
+
+952
+01:10:07,340 --> 01:10:11,079
+than the other frameworks, although the
+distributed stuff is still secret sauce
+
+953
+01:10:11,079 --> 01:10:15,689
+at Google, but hopefully it'll come out
+to the rest of us eventually. But I guess,
+
+954
+01:10:15,689 --> 01:10:19,989
+as was said, it's maybe the
+scariest code base to actually dig into
+
+955
+01:10:19,989 --> 01:10:24,409
+and understand what's working under the
+hood. So at least my fear about Tensor-
+
+956
+01:10:24,409 --> 01:10:29,010
+Flow is that if you want to write some kind
+of crazy weird imperative code, and you
+
+957
+01:10:29,010 --> 01:10:32,690
+cannot easily work it into their
+computational graph abstraction, it
+
+958
+01:10:32,689 --> 01:10:38,159
+seems like you could be in a lot of
+trouble, whereas maybe in Torch
+
+959
+01:10:38,159 --> 01:10:40,659
+you can just write whatever imperative
+code you want inside the forward and
+
+960
+01:10:40,659 --> 01:10:44,659
+backward passes of your own custom
+layers; that seems like the biggest
+
+961
+01:10:44,659 --> 01:10:49,979
+worrying point for me about working with
+TensorFlow in practice. Another
+
+962
+01:10:49,979 --> 01:10:52,959
+kind of awkward thing is the lack
+of pretrained models, so that's kind
+
+963
+01:10:52,960 --> 01:11:12,239
+of gross.
+
+964
+01:11:12,239 --> 01:11:22,019
+Even installing it was a little
+bit painful: they claimed to have a
+01:11:22,020 --> 01:11:25,680 +python we all that you can just download +and install with PEP but it broke and I + +966 +01:11:25,680 --> 01:11:29,150 +had to change the filename annually to +get to install and then they had a + +967 +01:11:29,149 --> 01:11:32,479 +broken dependency that I had to update +manually and like download some random + +968 +01:11:32,479 --> 01:11:36,759 +zip file and unpack it and copy some +random files around but it eventually + +969 +01:11:36,760 --> 01:11:41,520 +worked but installation was tough even +on my own machine that I have sudo 2012 + +970 +01:11:41,520 --> 01:11:47,400 +so they should get their act together on +that so I put together this quick + +971 +01:11:47,399 --> 01:11:51,529 +overview table that kind of covers when +I think people would care about on major + +972 +01:11:51,529 --> 01:11:55,529 +points between the frameworks whitewater +languages what kinda preteen models are + +973 +01:11:55,529 --> 01:11:56,210 +available + +974 +01:11:56,210 --> 01:12:05,029 +question + +975 +01:12:05,029 --> 01:12:09,988 +the question is is which of these +support Windows I'm sorry but I don't + +976 +01:12:09,988 --> 01:12:11,769 +know + +977 +01:12:11,770 --> 01:12:16,830 +I think you're on your own + +978 +01:12:16,829 --> 01:12:24,439 +aww you can use AWS from Windows ok + +979 +01:12:24,439 --> 01:12:29,359 +ok so I put together this quick come +quick comparison chart between the + +980 +01:12:29,359 --> 01:12:32,198 +frameworks that I think covers some of +the major bullet points that people care + +981 +01:12:32,198 --> 01:12:37,460 +about talking about what language is +whether they have free trade models what + +982 +01:12:37,460 --> 01:12:41,300 +kind of parallelism you have and how +readable as the source code and whether + +983 +01:12:41,300 --> 01:12:47,029 +they get our hands so I had a couple of +use cases in let's see we've got holy + +984 +01:12:47,029 --> 01:12:52,939 +crap we got 250 slides and we still have +two minutes left so let's let's do let's + +985 +01:12:52,939 --> 01:12:56,710 +play a little game suppose that all you +wanted to do was extracted aleksandr BGG + +986 +01:12:56,710 --> 01:12:58,619 +features which framework would you pick + +987 +01:12:58,619 --> 01:13:06,969 +yeah me too let's say all we wanted to +do was find to an Alex net on on some + +988 +01:13:06,969 --> 01:13:19,189 +new data yeah let's say we want to do +image captioning with fine-tuning ok I + +989 +01:13:19,189 --> 01:13:22,889 +heard a good distribution so this is my +thought process I'm not saying this is + +990 +01:13:22,890 --> 01:13:26,289 +the right answer but the way I think +about this is that for this problem we + +991 +01:13:26,289 --> 01:13:30,969 +need preteen models preteen models were +looking at Cafe or torture lasagna we + +992 +01:13:30,969 --> 01:13:36,239 +need our hands so kathy is pretty much +out even though people have done have + +993 +01:13:36,238 --> 01:13:39,869 +implemented the stuff there is just kind +of painful so I'd probably use torture + +994 +01:13:39,869 --> 01:13:44,869 +maybe lasagna about semantic +segmentation we want to classify every + +995 +01:13:44,869 --> 01:13:49,880 +pixel right so here we want to read an +input image and instead of giving a + +996 +01:13:49,880 --> 01:13:57,900 +label to the whole output image we want +to label every pixel independently ok + +997 +01:13:57,899 --> 01:14:01,969 +that's good so again my thought process +was that we need a preteen model here + +998 +01:14:01,969 --> 
+
+998
+01:14:01,969 --> 01:14:06,800
+most likely, and here we're talking
+about kind of a weird use case where we
+
+999
+01:14:06,800 --> 01:14:10,739
+might need to define some of our own
+layers, so if this layer happens to
+
+1000
+01:14:10,738 --> 01:14:14,738
+exist in Caffe it would be a good fit;
+otherwise we'd have to write it ourselves, and
+
+1001
+01:14:14,738 --> 01:14:23,109
+writing this thing ourselves seems least
+painful in Torch. What about object detection? No
+
+1002
+01:14:23,109 --> 01:14:24,329
+idea?
+
+1003
+01:14:24,329 --> 01:14:30,750
+Yes, OK, Caffe is an idea. So my thought
+process: again we're looking at pretrained
+
+1004
+01:14:30,750 --> 01:14:33,149
+models, so we need Caffe,
+
+1005
+01:14:33,149 --> 01:14:38,069
+Torch or Lasagne; and with
+detection you could need a lot of
+
+1006
+01:14:38,069 --> 01:14:41,609
+funky imperative code that might be
+possible to put in a computational
+
+1007
+01:14:41,609 --> 01:14:47,799
+graph but seems scary to me, so Caffe +
+Python is one choice, and some of the
+
+1008
+01:14:47,800 --> 01:14:52,529
+systems we talked about actually went
+this route; I've actually done a
+
+1009
+01:14:52,529 --> 01:14:56,939
+similar project like this, and I chose
+Torch, and it worked out well for me. What
+
+1010
+01:14:56,939 --> 01:14:59,809
+if you want to do language modeling, like
+you wanna do funky RNNs and you
+
+1011
+01:14:59,810 --> 01:15:06,270
+want to play with the recurrence rule?
+Torch? What do you guys think? Yeah, I
+
+1012
+01:15:06,270 --> 01:15:09,550
+would actually not use Torch for this at
+all: here, if we just want to do
+
+1013
+01:15:09,550 --> 01:15:13,650
+language modeling and do funky kinds of
+recurrence relationships, then we're not
+
+1014
+01:15:13,649 --> 01:15:17,109
+talking about images at all, this is just
+on text, so we don't need any pretrained
+
+1015
+01:15:17,109 --> 01:15:22,309
+models, and we really want to play with
+this recurrence relationship and easily
+
+1016
+01:15:22,310 --> 01:15:25,430
+work with recurrent networks, so there
+I think that maybe Theano or Tensor-
+
+1017
+01:15:25,430 --> 01:15:32,570
+Flow might be a good choice. What if you
+want to implement batch norm?
+
+1018
+01:15:32,569 --> 01:15:39,769
+...OK, OK, slides, sorry about that. Right, so
+here, if you don't want to derive
+
+1019
+01:15:39,770 --> 01:15:42,230
+the gradient yourself, you could rely on these
+
+1020
+01:15:42,229 --> 01:15:46,899
+computational graph things like Theano or
+TensorFlow, but because of the way that those
+
+1021
+01:15:46,899 --> 01:15:50,089
+things work, as you saw on the homework, for
+batch norm you can actually simplify the
+
+1022
+01:15:50,090 --> 01:15:54,900
+gradient quite a lot, and I'm not sure if
+these computational graph frameworks
+
+1023
+01:15:54,899 --> 01:15:57,589
+would correctly simplify the gradient
+down to this most efficient form.
+
+1024
+01:15:57,590 --> 01:16:09,489
+Question?
+
+1025
+01:16:09,488 --> 01:16:13,009
+I think the question is how easy
+is it to combine,
+
+1026
+01:16:13,010 --> 01:16:18,860
+like, a Torch model with a Theano model,
+and I think it seems painful, but at
+
+1027
+01:16:18,859 --> 01:16:22,819
+least in Theano you can use Lasagne
+to access pretrained models, so
+
+1028
+01:16:22,819 --> 01:16:26,498
+plugging together a Lasagne model with
+something else I think theoretically
+
+1029
+01:16:26,498 --> 01:16:31,748
+maybe should be easier.
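+
+For reference, the hand-simplified batch norm gradient alluded to above
+(the version you derive on the homework) fits in a few lines of numpy;
+this is our own sketch, not lecture code:
+
+~~~python
+import numpy as np
+
+def batchnorm_backward_alt(dout, x, gamma, eps=1e-5):
+    # dout: upstream gradient, x: inputs, gamma: learned scale
+    N = x.shape[0]
+    mu = x.mean(axis=0)
+    var = x.var(axis=0)
+    xhat = (x - mu) / np.sqrt(var + eps)
+    # all the intermediate terms collapse into one expression
+    dx = (gamma / (N * np.sqrt(var + eps))) * (
+        N * dout - dout.sum(axis=0) - xhat * (dout * xhat).sum(axis=0))
+    dgamma = (dout * xhat).sum(axis=0)
+    dbeta = dout.sum(axis=0)
+    return dx, dgamma, dbeta
+~~~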
+
+1030
+01:16:31,748 --> 01:16:35,429
+So here, if you have some really
+good knowledge about how exactly you
+
+1031
+01:16:35,429 --> 01:16:38,179
+want the backward pass to be computed,
+and you want to implement it yourself to
+
+1032
+01:16:38,179 --> 01:16:43,300
+be efficient, then you probably want to use
+Torch: you can just implement that backward
+
+1033
+01:16:43,300 --> 01:16:46,949
+pass yourself. So my recommendations on
+frameworks are that if you just wanna do
+
+1034
+01:16:46,948 --> 01:16:51,248
+feature extraction, or maybe
+fine-tuning of existing models, or just a
+
+1035
+01:16:51,248 --> 01:16:54,929
+vanilla straightforward transfer learning
+task, then Caffe is probably the right way
+
+1036
+01:16:54,929 --> 01:16:58,649
+to go: it's really easy to use, you don't
+have to write any code. If you want to
+
+1037
+01:16:58,649 --> 01:17:02,738
+work with pretrained models, but
+maybe do weird stuff with them
+
+1038
+01:17:02,738 --> 01:17:07,209
+and not just fine-tune them, you
+might have a better time in Lasagne or
+
+1039
+01:17:07,210 --> 01:17:11,328
+Torch, where it's easier to kind of
+mess with the structure of pretrained
+
+1040
+01:17:11,328 --> 01:17:14,788
+models. If you really
+want to write your own layers, for
+
+1041
+01:17:14,788 --> 01:17:18,788
+whatever reason, and you don't think they
+can easily fit into these computational
+
+1042
+01:17:18,788 --> 01:17:22,948
+graphs, then you probably should use Torch. If
+you really want to use RNNs, and
+
+1043
+01:17:22,948 --> 01:17:26,138
+maybe other types of fancy things that
+depend on the computational graph, then
+
+1044
+01:17:26,139 --> 01:17:30,090
+maybe Theano or TensorFlow.
+Also, if you have a
+
+1045
+01:17:30,090 --> 01:17:33,449
+gigantic model and you need to
+distribute it across an entire cluster, and
+
+1046
+01:17:33,448 --> 01:17:36,169
+you have access to Google's internal
+code base, then you should use TensorFlow,
+
+1047
+01:17:36,170 --> 01:17:39,989
+although, like I said, hopefully that
+part will be released for the rest of us
+
+1048
+01:17:39,988 --> 01:17:44,889
+soon. And also, if you wanna use
+TensorBoard, you've got to use TensorFlow.
+
+1049
+01:17:44,890 --> 01:17:48,810
+So that's pretty much my
+quick whirlwind tour of all the
+
+1050
+01:17:48,810 --> 01:17:58,210
+frameworks, so any last-minute
+questions? A question about
+
+1051
+01:17:58,210 --> 01:18:02,630
+speed: so there's actually a really nice
+page that benchmarks the
+
+1052
+01:18:02,630 --> 01:18:06,039
+speed of all the different
+frameworks, and right now the one that
+
+1053
+01:18:06,039 --> 01:18:10,488
+wins is none of these: the one that wins
+is this thing called Neon from Nervana
+
+1054
+01:18:10,488 --> 01:18:15,049
+Systems. These guys are crazy: they
+
+1055
+01:18:15,050 --> 01:18:20,119
+actually wrote their own custom
+assembler for NVIDIA hardware; they
+
+1056
+01:18:20,119 --> 01:18:22,448
+were not happy with NVIDIA's
+
+1057
+01:18:22,448 --> 01:18:26,500
+toolchain, so they reverse engineered the
+hardware, wrote their own assembler, and then
+
+1058
+01:18:26,500 --> 01:18:30,948
+implemented all these kernels in
+assembly themselves. So these guys are
+
+1059
+01:18:30,948 --> 01:18:35,859
+crazy and their stuff is really really
+fast:
+
+1060
+01:18:35,859 --> 01:18:39,309
+their stuff is actually the fastest
+right now, but I've
+
+1061
+01:18:39,310 --> 01:18:42,510
+never really used their framework myself,
+and I think it's a little less commonly
+
+1062
+01:18:42,510 --> 01:18:47,010
+used, although for the ones that are
+using cuDNN the speed is roughly the
+
+1063
+01:18:47,010 --> 01:18:52,030
+same right now; I think TensorFlow is quite
+a bit slower than the others for some
+
+1064
+01:18:52,029 --> 01:18:55,609
+silly reasons that I think will be
+cleaned up in subsequent releases, but at
+
+1065
+01:18:55,609 --> 01:18:58,729
+least fundamentally there's no reason
+it should be
+
+1066
+01:18:58,729 --> 01:19:04,209
+slower than the others.
+
+1067
+01:19:04,210 --> 01:19:07,319
+People are packing up already...
+
+1068
+01:19:07,319 --> 01:19:24,279
+alright.
+
+1069
+01:19:24,279 --> 01:19:27,198
+That's actually not crazy: there were
+quite a few teams last
+
+1070
+01:19:27,198 --> 01:19:29,738
+year that actually used it for
+their projects and it was fine.
+
+1071
+01:19:29,738 --> 01:19:34,658
+Yeah, I should also mention that there
+are other frameworks;
+
+1072
+01:19:34,658 --> 01:19:45,359
+I just think these are the
+most common. Question?
+
+1073
+01:19:45,359 --> 01:19:52,299
+So the question is about graphing in
+Torch: Torch actually has an IPython
+
+1074
+01:19:52,300 --> 01:19:56,770
+kernel, so you can actually use
+iTorch notebooks, that's kind of cool, and
+
+1075
+01:19:56,770 --> 01:20:00,150
+you can actually do some simple
+graphing in iTorch notebooks, but in
+
+1076
+01:20:00,149 --> 01:20:04,899
+practice what I usually do is
+run my Torch model, dump the data
+
+1077
+01:20:04,899 --> 01:20:09,899
+out to JSON or HDF5, and visualize it in
+Python, which is a little bit
+
+1078
+01:20:09,899 --> 01:20:19,359
+painful, but you can
+get the job done.
+
+1079
+01:20:19,359 --> 01:20:23,309
+The question is whether TensorBoard
+lets you dump the raw data so you can plot
+
+1080
+01:20:23,310 --> 01:20:28,300
+it yourself: they're actually
+dumping all this stuff into some log
+
+1081
+01:20:28,300 --> 01:20:33,050
+files in a temp directory; I'm not sure
+how easy those are to parse, but you could
+
+1082
+01:20:33,050 --> 01:20:45,900
+try it; it could be easy or not, I'm not sure.
+Question: the question is whether there
+
+1083
+01:20:45,899 --> 01:20:49,899
+are other third-party tools like
+TensorBoard for monitoring networks; there
+
+1084
+01:20:49,899 --> 01:20:53,269
+might be some out there, but I've never
+really used them, I just wrote my own in
+
+1085
+01:20:53,270 --> 01:20:58,159
+the past. Any other questions?
+
+1086
+01:20:58,158 --> 01:21:00,319
+Alright, I think that's it.
+
diff --git a/captions/En/Lecture13_en.srt b/captions/En/Lecture13_en.srt
new file mode 100644
index 00000000..bd3e4fa5
--- /dev/null
+++ b/captions/En/Lecture13_en.srt
@@ -0,0 +1,4558 @@
+1
+00:00:00,000 --> 00:00:06,878
+So, our administrative points for
+today: assignment 3 is due tonight. Who's
+
+2
+00:00:06,878 --> 00:00:14,399
+done? Was it easier than
+assignment 2? OK, that's good; hopefully that
+
+3
+00:00:14,400 --> 00:00:18,320
+gives you more time to work on your
+projects. Also remember your
+
+4
+00:00:18,320 --> 00:00:22,500
+milestones were turned in
+last week, so we're in the
+
+5
+00:00:22,500 --> 00:00:25,028
+process of looking through the
+milestones to make sure those are OK, and
+
+6
+00:00:25,028 --> 00:00:28,609
+also we're working on assignment 2
+grading, so we should have that done
+
+7
+00:00:28,609 --> 00:00:32,289
+sometime this week or early next week.
+
+8
+00:00:32,289 --> 00:00:36,329
+Last time we had a whirlwind tour of
+all the common software
+
+9
+00:00:36,329 --> 00:00:40,058
+packages that people use for deep
+learning, and we saw a lot of code on
+
+10
+00:00:40,058 --> 00:00:43,468
+slides and a lot of stepping through
+code, and hopefully you found it
+
+11
+00:00:43,469 --> 00:00:48,730
+useful for your projects. Today we're
+going to talk about two other topics:
+
+12
+00:00:48,729 --> 00:00:53,308
+we're gonna talk about segmentation, and
+within segmentation there are two
+
+13
+00:00:53,308 --> 00:00:57,488
+subproblems, semantic and instance
+segmentation; we're also going to talk
+
+14
+00:00:57,488 --> 00:01:01,509
+about soft attention, and within soft
+attention, again, there are sort of two
+
+15
+00:01:01,509 --> 00:01:07,069
+buckets that we've divided things into.
+But first, before we go into these
+
+16
+00:01:07,069 --> 00:01:12,849
+details, there's something else I
+want to bring up briefly. This is the
+
+17
+00:01:12,849 --> 00:01:16,769
+image classification errors; I think at
+this point in the class you've seen this
+
+18
+00:01:16,769 --> 00:01:23,079
+type of figure many times, right: in
+2012 AlexNet, in 2013 ZF crushed it,
+
+19
+00:01:23,079 --> 00:01:29,118
+more recently GoogLeNet, and later ResNet
+sort of went and won the classification
+
+20
+00:01:29,118 --> 00:01:37,400
+challenge in 2015. But it turns out, as of
+today, there is a new ImageNet result:
+
+21
+00:01:37,400 --> 00:01:41,140
+this paper came out last night,
+
+22
+00:01:41,140 --> 00:01:48,609
+so Google actually now has state of the
+art on ImageNet, with 3.08% top-5 error,
+
+23
+00:01:48,609 --> 00:01:55,560
+which is crazy, and the way they do this
+is with this thing that they call
+
+24
+00:01:55,560 --> 00:01:59,900
+Inception-v4. This is a little bit of a
+monster, so I don't want to go into too
+
+25
+00:01:59,900 --> 00:02:05,280
+much detail, but you can see that it's
+this really deep network that has these
+
+26
+00:02:05,280 --> 00:02:11,150
+repeated modules. So here there's the
+stem; the stem is this guy over here. A
+
+27
+00:02:11,150 --> 00:02:14,789
+couple interesting things to point out
+about this architecture: they actually
+
+28
+00:02:14,789 --> 00:02:18,979
+use some valid convolutions, which
+means they have no padding, so that makes
+
+29
+00:02:18,979 --> 00:02:22,229
+all the math more complicated, but
+they're smart and figured things out.
+
+30
+00:02:22,229 --> 00:02:27,299
+They also have an interesting feature
+here: they actually have, in parallel, a
+
+31
+00:02:27,300 --> 00:02:31,459
+strided convolution and also max pooling,
+so they kind of do these two operations
+
+32
+00:02:31,459 --> 00:02:34,900
+in parallel to downsample images and
+then concatenate the outputs. And
+
+33
+00:02:34,900 --> 00:02:39,909
+another thing is they're really
+going all out on these efficient
+
+34
+00:02:39,909 --> 00:02:43,389
+convolution tricks that we talked about
+a couple lectures ago: as you can see,
+
+35
+00:02:43,389 --> 00:02:47,518
+they actually have these asymmetric
+filters, like one-by-seven and seven-by-
+
+36
+00:02:47,519 --> 00:02:51,750
+one convolutions, and they also make heavy
+use of these one-by-one convolutional
+
+37
+00:02:51,750 --> 00:02:56,449
+bottlenecks to reduce computational
+costs. So this is just the stem of the
+network.
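+
+To make those two tricks concrete, here's a schematic sketch of ours
+(not Google's code; the channel counts are made up) in 2016-era
+TensorFlow:
+
+~~~python
+import tensorflow as tf
+
+x = tf.placeholder(tf.float32, [None, 32, 32, 64])
+
+# asymmetric factorization: 1x7 then 7x1 covers a 7x7 extent, cheaper
+w_a = tf.Variable(tf.random_normal([1, 7, 64, 64]) * 0.01)
+w_b = tf.Variable(tf.random_normal([7, 1, 64, 64]) * 0.01)
+h = tf.nn.conv2d(x, w_a, strides=[1, 1, 1, 1], padding='SAME')
+h = tf.nn.conv2d(h, w_b, strides=[1, 1, 1, 1], padding='SAME')
+
+# parallel downsampling: strided conv and max pool, then concatenate
+w_s = tf.Variable(tf.random_normal([3, 3, 64, 64]) * 0.01)
+conv_branch = tf.nn.conv2d(h, w_s, strides=[1, 2, 2, 1], padding='SAME')
+pool_branch = tf.nn.max_pool(h, ksize=[1, 3, 3, 1],
+                             strides=[1, 2, 2, 1], padding='SAME')
+out = tf.concat(3, [conv_branch, pool_branch])  # old concat(dim, values)
+~~~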
+
+38
+00:02:56,449 --> 00:03:01,939
+And actually each of these parts is
+sort of different: so they've got four
+
+39
+00:03:01,939 --> 00:03:07,769
+of these Inception modules, then this
+downsampling module, then seven of
+
+40
+00:03:07,769 --> 00:03:11,599
+these guys, and then another down-
+sampling module, and then three more of
+
+41
+00:03:11,599 --> 00:03:16,889
+these guys, and then finally they have
+dropout and a fully connected layer
+
+42
+00:03:16,889 --> 00:03:20,919
+for the class labels. Another thing to
+point out is, again, there's no sort of
+
+43
+00:03:20,919 --> 00:03:24,859
+fully connected hidden layers here; they just
+have this global averaging to compute
+
+44
+00:03:24,860 --> 00:03:29,320
+the final feature vector. And another
+cool thing they did in this paper was
+
+45
+00:03:29,319 --> 00:03:34,900
+Inception-ResNets: they propose this
+residual version of the Inception
+
+46
+00:03:34,900 --> 00:03:39,579
+architecture, which is also pretty big
+and scary; the stem is the same as before,
+
+47
+00:03:39,579 --> 00:03:43,950
+and now these repeated
+Inception blocks that they repeat
+
+48
+00:03:43,949 --> 00:03:48,289
+throughout the network actually
+have these residual connections, so
+
+49
+00:03:48,289 --> 00:03:51,409
+that's kind of cool, they're kind of
+jumping on this residual idea,
+
+50
+00:03:51,409 --> 00:03:55,609
+and now improved state of the art on Image-
+Net. So again they have many repeated
+
+51
+00:03:55,610 --> 00:04:00,880
+modules, and when you add this thing all
+up it's about 75 layers, assuming I did
+
+52
+00:04:00,879 --> 00:04:07,939
+the math right last night. They
+also show that between their new
+
+53
+00:04:07,939 --> 00:04:12,680
+version 4 of the Inception
+GoogLeNet and their residual
+
+54
+00:04:12,680 --> 00:04:17,079
+version of GoogLeNet, actually
+both of them perform about the same: so
+
+55
+00:04:17,079 --> 00:04:22,909
+this is top-5 error as a function
+of epochs on ImageNet, and you can see that
+
+56
+00:04:22,910 --> 00:04:28,070
+the Inception-ResNet
+actually converges a bit faster, but
+
+57
+00:04:28,069 --> 00:04:33,180
+over time both of them sort of
+converge to about the same value, so
+
+58
+00:04:33,180 --> 00:04:38,340
+that's kind of interesting, that's
+kind of cool. Another thing that's kind
+
+59
+00:04:38,339 --> 00:04:42,369
+of interesting to point out is
+the raw numbers on the x-axis here:
+
+60
+00:04:42,370 --> 00:04:46,030
+these are epochs on ImageNet, so these
+things are being trained for a hundred
+
+61
+00:04:46,029 --> 00:04:52,089
+and sixty epochs on ImageNet, so that's
+a lot of training time. But that's
+
+62
+00:04:52,089 --> 00:04:55,469
+enough of current events, and
+let's go back to our regularly scheduled
+
+63
+00:04:55,470 --> 00:05:02,710
+programming. So today... oh yeah, question? I
+don't know, I think it might be in the
+
+64
+00:05:02,709 --> 00:05:11,789
+paper, but I didn't read it carefully. Any
+other questions? The question was about
+
+65
+00:05:11,790 --> 00:05:16,600
+dropout, whether it's only in the last layer;
+I'm not sure, again I didn't read the
+
+66
+00:05:16,600 --> 00:05:21,620
+paper too carefully yet, but the
+link is here, you should check it out.
+
+67
+00:05:21,620 --> 00:05:29,600
+OK, so today we're going to talk about
+two other topics that are
+
+68
+00:05:29,600 --> 00:05:33,970
+pretty common things in research
+these days: those are segmentation,
+
+69
+00:05:33,970 --> 00:05:37,490
+which is this sort of classic computer
+vision topic, and also this idea of
+
+70
+00:05:37,490 --> 00:05:41,550
+attention, which I think has
+been a really popular thing to work on
+
+71
+00:05:41,550 --> 00:05:46,060
+in deep learning over the past year
+especially. So first we're gonna talk
+
+72
+00:05:46,060 --> 00:05:50,889
+about segmentation. You may
+remember this slide from a couple
+
+73
+00:05:50,889 --> 00:05:53,649
+lectures ago when we talked about object
+detection; it was talking about the
+
+74
+00:05:53,649 --> 00:05:58,000
+different tasks that people work on in
+computer vision, and we spent a lot of
+
+75
+00:05:58,000 --> 00:06:02,259
+time in the class talking about
+classification, and back in that lecture we
+
+76
+00:06:02,259 --> 00:06:03,750
+talked about different models for
+
+77
+00:06:03,750 --> 00:06:08,339
+localization and for object detection,
+but today we're actually gonna focus on
+
+78
+00:06:08,339 --> 00:06:12,239
+this idea of segmentation that we
+skipped over in that previous
+
+79
+00:06:12,240 --> 00:06:18,189
+lecture. So within segmentation there are
+sort of two different subtasks that we
+
+80
+00:06:18,189 --> 00:06:21,870
+need to define, and people
+actually work on these things a
+
+81
+00:06:21,870 --> 00:06:26,389
+little bit separately. The first task is
+this idea called semantic segmentation:
+
+82
+00:06:26,389 --> 00:06:32,370
+here we have an input
+image and we have some fixed number of
+
+83
+00:06:32,370 --> 00:06:38,000
+classes, things like buildings and trees
+and ground and cow and whatever kind of
+
+84
+00:06:38,000 --> 00:06:42,629
+semantic labels you want; usually you have
+some small fixed number of classes, and
+
+85
+00:06:42,629 --> 00:06:46,199
+typically you'll have some background
+class for things that don't fit
+
+86
+00:06:46,199 --> 00:06:51,360
+into these classes. And then the task is
+that we want to take as input an image
+
+87
+00:06:51,360 --> 00:06:55,240
+and then we want to label every pixel in
+that image with one of these semantic
+
+88
+00:06:55,240 --> 00:06:59,850
+classes. So here we have taken this input
+image of these three cows in the field,
+
+89
+00:06:59,850 --> 00:07:05,490
+and the ideal output is this image where,
+instead of RGB values, we actually
+
+90
+00:07:05,490 --> 00:07:11,228
+have one class label per pixel; we can do
+this on other images and maybe segment
+
+91
+00:07:11,228 --> 00:07:16,789
+out the trees and the sky and the road and
+the grass. So this type of task is pretty
+
+92
+00:07:16,790 --> 00:07:19,950
+cool; I think it gives you sort of a
+higher level of understanding of what's
+
+93
+00:07:19,949 --> 00:07:23,029
+going on in images, compared to just
+putting a single label on the whole
+
+94
+00:07:23,029 --> 00:07:28,668
+image. And this is actually a very old
+problem in computer vision that
+
+95
+00:07:28,668 --> 00:07:32,649
+predates sort of the deep learning
+revolution: this figure actually comes
+
+96
+00:07:32,649 --> 00:07:37,259
+from a computer vision paper back in
+2007 that didn't use any deep learning
+
+97
+00:07:37,259 --> 00:07:43,728
+at all; people had other methods for this
+a couple years ago. The other task that
+
+98
+00:07:43,728 --> 00:07:48,949
+people work on is this thing, and the
+thing to point out here is that this
+
+99
+00:07:48,949 --> 00:07:54,310
+setup is not aware of instances. So
+this image actually has four
+
+100
+00:07:54,310 --> 00:07:58,329
+cows: there are actually three cows
+standing up and one cow kind of lying on
+
+101
+00:07:58,329 --> 00:08:02,300
+the grass taking a nap, but here in this
+output it's not really clear how many
+
+102
+00:08:02,300 --> 00:08:07,560
+cows there are; the different cows'
+pixels actually overlap, so here in
+
+103
+00:08:07,560 --> 00:08:11,540
+the output there is no notion that
+there are different cows:
+
+104
+00:08:11,540 --> 00:08:15,480
+in this output we're just labeling every
+pixel, so it's maybe not as informative
+
+105
+00:08:15,480 --> 00:08:20,009
+as you might like, and that could
+actually lead to some problems for some
+
+106
+00:08:20,009 --> 00:08:23,409
+downstream applications. So to overcome
+this,
+
+107
+00:08:23,410 --> 00:08:28,080
+people have also separately worked on this
+other problem called instance segmentation;
+
+108
+00:08:28,079 --> 00:08:32,039
+this also sometimes gets called
+simultaneous detection and segmentation.
+
+109
+00:08:32,039 --> 00:08:37,879
+So here the problem is, again, similar
+to before: we have some set of classes
+
+110
+00:08:37,879 --> 00:08:43,370
+that we're trying to recognize, and given
+an input image we want to output all
+
+111
+00:08:43,370 --> 00:08:48,370
+instances of those classes, and for each
+instance we want to segment out the pixels
+
+112
+00:08:48,370 --> 00:08:52,970
+that belong to that instance. So here in
+this input image there are
+
+113
+00:08:52,970 --> 00:08:57,509
+actually three different people, the
+two parents and the kid, and now in
+
+114
+00:08:57,509 --> 00:09:00,860
+the output we actually distinguish
+between those different people in the
+
+115
+00:09:00,860 --> 00:09:05,279
+input image: those three
+people are now shown in different colors to
+
+116
+00:09:05,279 --> 00:09:09,360
+indicate different instances, and
+for each of those instances we're going
+
+117
+00:09:09,360 --> 00:09:14,009
+to label all the pixels in the input
+image that belong to that instance. So
+
+118
+00:09:14,009 --> 00:09:18,639
+for these two tasks of semantic segmentation
+and instance segmentation, people
+
+119
+00:09:18,639 --> 00:09:22,409
+have actually worked on them a little bit
+separately. So first we're gonna talk
+
+120
+00:09:22,409 --> 00:09:27,269
+about some models for semantic
+segmentation: remember, this is the
+
+121
+00:09:27,269 --> 00:09:30,399
+task where you want to just label all the
+pixels in the image and you don't care
+
+122
+00:09:30,399 --> 00:09:38,230
+about instances. So here the idea is
+actually pretty simple: given some input
+
+123
+00:09:38,230 --> 00:09:43,269
+image, this is again the image with the
+cows, we're gonna take some little patch of
+
+124
+00:09:43,269 --> 00:09:48,720
+the input image and extract this patch,
+which sort of gives local information in the
+
+125
+00:09:48,720 --> 00:09:53,340
+image; then we're gonna take this patch
+and we're gonna feed it through some
+
+126
+00:09:53,340 --> 00:09:57,230
+convolutional neural network, and this could
+be any of the architectures that we've
+
+127
+00:09:57,230 --> 00:10:01,070
+talked about so far in the class, and now
+
+128
+00:10:01,070 --> 00:10:04,890
+this convolutional neural network will
+actually classify the center pixel of the
+
+129
+00:10:04,889 --> 00:10:10,080
+patch. So this neural network is just doing
+classification, and that's something we know
+
+130
+00:10:10,080 --> 00:10:14,379
+how to do: this thing is just going to say
+that the center pixel of this patch is a cow.
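+
+The brute-force version of this patch-by-patch labeling, before the
+efficiency tricks described next, would look something like this (our
+own sketch; `classify_patch` is a hypothetical stand-in for a trained
+classification convnet):
+
+~~~python
+import numpy as np
+
+def classify_patch(patch):
+    # stand-in: a real CNN would return a class id for the center pixel
+    return int(patch.mean() > 0.5)
+
+def label_image(img, patch_size=101):
+    H, W = img.shape
+    r = patch_size // 2
+    padded = np.pad(img, r, mode='edge')  # edge pixels get patches too
+    labels = np.zeros((H, W), dtype=np.int32)
+    for i in range(H):          # one CNN call per pixel: very expensive,
+        for j in range(W):      # hence the fully convolutional trick
+            labels[i, j] = classify_patch(
+                padded[i:i + patch_size, j:j + patch_size])
+    return labels
+
+print(label_image(np.random.rand(16, 16)).shape)  # (16, 16)
+~~~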
+
+131
+00:10:14,379 --> 00:10:19,769
+Then we can imagine
+taking this network that works on
+
+132
+00:10:19,769 --> 00:10:20,710
+patches
+
+133
+00:10:20,710 --> 00:10:26,019
+and labels the center pixel, and we just
+run it over the entire image, and that
+
+134
+00:10:26,019 --> 00:10:33,269
+will give us a label for each pixel in
+the image. Now this is actually a very
+
+135
+00:10:33,269 --> 00:10:36,699
+expensive operation, right, because
+there are many, many patches in the image,
+
+136
+00:10:36,700 --> 00:10:40,120
+and it would be super expensive to
+run this network independently for all
+
+137
+00:10:40,120 --> 00:10:44,139
+of them, so in practice people use the
+same trick that we saw in object
+
+138
+00:10:44,139 --> 00:10:48,639
+detection, where you run this thing
+fully convolutionally and get all the
+
+139
+00:10:48,639 --> 00:10:54,220
+outputs for the whole image all at once.
+The problem here is that if your
+
+140
+00:10:54,220 --> 00:10:58,879
+convolutional network contains any kind of
+downsampling, either through pooling or
+
+141
+00:10:58,879 --> 00:11:02,899
+through strided convolutions, then
+your output image will
+
+142
+00:11:02,899 --> 00:11:07,139
+actually have a smaller spatial size than
+your input image, so that's
+
+143
+00:11:07,139 --> 00:11:09,929
+something that people need to work
+around when they're using this type of
+
+144
+00:11:09,929 --> 00:11:14,629
+approach. So, any questions on this
+kind of basic setup for semantic
+
+145
+00:11:14,629 --> 00:11:28,208
+segmentation? Yeah,
+
+146
+00:11:28,208 --> 00:11:32,979
+the question is whether the patch
+just doesn't give you enough
+
+147
+00:11:32,980 --> 00:11:37,800
+information in some cases, and that's
+true, so sometimes for these
+
+148
+00:11:37,799 --> 00:11:41,688
+networks people actually have a separate
+offline refinement stage, where they take
+
+149
+00:11:41,688 --> 00:11:44,980
+this output and then feed it to some
+kind of graphical model to clean up
+
+150
+00:11:44,980 --> 00:11:48,028
+the output a little bit; sometimes
+that can help boost your
+
+151
+00:11:48,028 --> 00:11:52,838
+performance a little more, but just this
+sort of input-output model setup tends to
+
+152
+00:11:52,839 --> 00:12:09,600
+work pretty well and is something easy
+to implement. Yeah, I'm not
+
+153
+00:12:09,600 --> 00:12:13,019
+sure exactly, probably
+pretty big, maybe a couple hundred
+
+154
+00:12:13,019 --> 00:12:19,919
+pixels, that order of magnitude. So one
+extension that people have used on this
+
+155
+00:12:19,919 --> 00:12:23,289
+basic approach is this idea of
+multiscale processing: actually sometimes a
+
+156
+00:12:23,289 --> 00:12:28,230
+single scale isn't enough. So here we're
+going to take our input image and we'll
+
+157
+00:12:28,230 --> 00:12:33,009
+actually resize it to multiple different
+sizes; this is sort of a common trick
+
+158
+00:12:33,009 --> 00:12:36,688
+that people use in computer vision a lot
+called an image pyramid: you just take
+
+159
+00:12:36,688 --> 00:12:41,458
+the same image and you resize it
+to many different scales, and now for
+
+160
+00:12:41,458 --> 00:12:44,528
+each of these scales we're gonna run it
+through a convolutional neural network
+
+161
+00:12:44,528 --> 00:12:49,568
+that is going to predict these pixel-
+wise labels for these different images
+
+162
+00:12:49,568 --> 00:12:52,969
+at these different resolutions.
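+
+In pseudo-numpy, the pyramid idea looks roughly like this (our sketch;
+`predict_labels` is a hypothetical stand-in for the shared convnet):
+
+~~~python
+import numpy as np
+from scipy.ndimage import zoom
+
+def predict_labels(img):
+    # stand-in: a real fully convolutional CNN returning per-pixel
+    # scores at a reduced resolution (pretend 4x downsampled here)
+    return img[::4, ::4]
+
+def multiscale_predict(img, scales=(1.0, 0.5, 0.25)):
+    H, W = img.shape
+    outs = []
+    for s in scales:
+        scaled = zoom(img, s)          # one level of the image pyramid
+        out = predict_labels(scaled)   # same network at every scale
+        # offline upsampling of the response back to the input size
+        outs.append(zoom(out, (H / out.shape[0], W / out.shape[1])))
+    return np.stack(outs)              # stack the per-scale responses
+
+print(multiscale_predict(np.random.rand(64, 64)).shape)  # (3, 64, 64)
+~~~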
+
+163
+00:12:52,970 --> 00:12:56,249
+So another thing to point out here, along
+the lines of your question: if each
+
+164
+00:12:56,249 --> 00:12:59,639
+of these networks actually has the same
+architecture, then each of these outputs
+
+165
+00:12:59,639 --> 00:13:04,490
+will have a different effective
+receptive field in the input, due to the
+
+166
+00:13:04,490 --> 00:13:08,720
+image pyramid. So now that we've gotten
+these differently sized pixel labels for
+
+167
+00:13:08,720 --> 00:13:13,660
+the image, we can take all of
+them and just do some offline
+
+168
+00:13:13,659 --> 00:13:18,129
+upsampling to upsample those responses
+to the same size as the input image, so
+
+169
+00:13:18,129 --> 00:13:24,319
+now we've gotten our three outputs of
+different sizes, upsampled, and we stack
+
+170
+00:13:24,318 --> 00:13:29,139
+them. And this is actually a paper
+from LeCun back in 2013, so they
+
+171
+00:13:29,139 --> 00:13:33,709
+actually also have this separate
+offline processing step where they do
+
+172
+00:13:33,708 --> 00:13:39,119
+this idea of a bottom-up segmentation
+using these superpixel
+
+173
+00:13:39,120 --> 00:13:41,370
+methods. So these are these sort of more
+classic computer vision, image processing
+
+174
+00:13:41,370 --> 00:13:45,470
+type methods that actually look at the
+differences between adjacent pixels in
+
+175
+00:13:45,470 --> 00:13:48,589
+images and then try to merge them
+together to give you these coherent
+
+176
+00:13:48,589 --> 00:13:52,900
+regions where there is not much change
+in the image. So this method actually
+
+177
+00:13:52,899 --> 00:13:56,519
+runs the image offline
+through these other, more traditional
+
+178
+00:13:56,519 --> 00:14:02,230
+image processing techniques to get
+either a set of superpixels, or trees
+
+179
+00:14:02,230 --> 00:14:06,629
+saying which pixels ought to be merged
+together in the image, and they have this
+
+180
+00:14:06,629 --> 00:14:09,519
+somewhat complicated process for merging
+all these different things together,
+
+181
+00:14:09,519 --> 00:14:13,028
+because now we've gotten this sort of
+low-level information saying which
+
+182
+00:14:13,028 --> 00:14:14,110
+pixels in the image
+
+183
+00:14:14,110 --> 00:14:18,909
+actually are similar to each other, based
+on sort of color and gradient information,
+
+184
+00:14:18,909 --> 00:14:22,439
+and we've got these outputs of different
+resolutions from the convolutional
+
+185
+00:14:22,440 --> 00:14:25,810
+neural networks telling us semantically
+what the labels are at different points
+
+186
+00:14:25,809 --> 00:14:29,929
+in the image, and they actually explore
+a couple different ideas for
+
+187
+00:14:29,929 --> 00:14:33,870
+merging these things together to give
+you your final output. This actually
+
+188
+00:14:33,870 --> 00:14:38,419
+also addresses one of the
+earlier questions about the convnet not
+
+189
+00:14:38,419 --> 00:14:43,809
+being enough on its own: using these
+external superpixel methods or the
+
+190
+00:14:43,809 --> 00:14:47,729
+segmentation trees is another thing that
+sort of gives you additional information
+
+191
+00:14:47,730 --> 00:14:55,649
+about maybe larger context in the input
+images. So, any questions about this
+
+192
+00:14:55,649 --> 00:15:03,879
+model? OK. So another sort of cool
+idea that people have used for semantic
+
+193
+00:15:03,879 --> 00:15:08,299
+segmentation is this idea of
+iterative refinement. We actually saw
+192
+00:14:55,649 --> 00:15:03,879
+So another sort of cool idea that people have used for semantic
+
+193
+00:15:03,879 --> 00:15:08,299
+segmentation is this idea of iterative refinement. So we actually saw
+
+194
+00:15:08,299 --> 00:15:12,809
+this a few lectures ago when we briefly mentioned pose estimation, but
+
+195
+00:15:12,809 --> 00:15:17,149
+the idea is that we're gonna have an input image, here they separated out the
+
+196
+00:15:17,149 --> 00:15:20,929
+three channels, and we're gonna run that thing through our favorite sort of
+
+197
+00:15:20,929 --> 00:15:24,929
+convolutional neural network to predict these low resolution patches,
+
+198
+00:15:24,929 --> 00:15:30,309
+rather to predict this low resolution segmentation of the image. And now we're
+
+199
+00:15:30,309 --> 00:15:34,899
+gonna take that output from the CNN together with a downsampled version of
+
+200
+00:15:34,899 --> 00:15:38,829
+the original image, and we'll just repeat the process again. So this allows the
+
+201
+00:15:38,830 --> 00:15:43,990
+network to, one, increase its effective receptive field of the output,
+
+202
+00:15:43,990 --> 00:15:48,399
+and also to perform more processing on the input image, and then we can
+
+203
+00:15:48,399 --> 00:15:54,009
+repeat this process again. So this is kinda cool: if these three
+
+204
+00:15:54,009 --> 00:15:54,769
+convolutional
+
+205
+00:15:54,769 --> 00:15:58,249
+networks actually share weights, then this becomes a recurrent convolutional
+
+206
+00:15:58,249 --> 00:16:03,489
+network, where it's sort of operating on the same input over and over through time, but
+
+207
+00:16:03,489 --> 00:16:07,528
+actually each of these update steps is a whole convolutional network. That's
+
+208
+00:16:07,528 --> 00:16:10,139
+actually a very similar idea to the recurrent networks that we saw
+
+209
+00:16:10,139 --> 00:16:18,789
+previously. And the idea behind this paper, which was in 2014, is that if you
+
+210
+00:16:18,789 --> 00:16:22,558
+actually do more iterations of the same type of thing, then hopefully it allows
+
+211
+00:16:22,558 --> 00:16:28,219
+the network to sort of iteratively refine its outputs. So here, if we have
+
+212
+00:16:28,220 --> 00:16:32,220
+this raw input image, then after one iteration you can see that actually
+
+213
+00:16:32,220 --> 00:16:35,959
+there's quite a bit of noise, especially on the boundaries of the objects, but as
+
+214
+00:16:35,958 --> 00:16:39,359
+we run for two and three iterations through this recurrent convolutional
+
+215
+00:16:39,360 --> 00:16:42,769
+network, it actually allows the network to clean up a lot of that sort of
+
+216
+00:16:42,769 --> 00:16:46,989
+low-level noise and produce much cleaner and nicer results.
+
+217
+00:16:46,989 --> 00:16:51,119
+So I thought that was quite a cool idea, sort of merging together this
+
+218
+00:16:51,119 --> 00:16:55,199
+idea of recurrent networks and sharing weights over time with this idea of
+
+219
+00:16:55,198 --> 00:17:03,479
+convolutional networks to process images.
+
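A loose sketch of the refinement loop described above, under the assumption of a hypothetical `refine_net` stand-in that takes the (downsampled) image plus the previous label map and returns new labels; reusing the same function each step is what makes it behave like a recurrent convolutional network:

~~~python
import numpy as np
from scipy.ndimage import zoom

def iterative_refinement(image, refine_net, num_iters=3):
    """Each iteration feeds a downsampled image together with the
    previous label map through the *same* network (shared weights).
    `refine_net` is a hypothetical stand-in: (image, labels) -> labels."""
    small = zoom(image, 0.5, order=1)        # downsampled copy of the input
    labels = np.zeros_like(small)            # initial (empty) segmentation
    for _ in range(num_iters):
        labels = refine_net(small, labels)   # reuse the same weights each step
    return labels

# Toy usage: a fake "network" that nudges labels toward an image threshold.
fake_net = lambda im, lab: 0.5 * lab + 0.5 * (im > im.mean())
print(iterative_refinement(np.random.rand(16, 16), fake_net).shape)  # (8, 8)
~~~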
+220
+00:17:03,480 --> 00:17:07,470
+So another very well-known paper for semantic segmentation is this one from Berkeley
+
+221
+00:17:07,470 --> 00:17:12,419
+that was published at CVPR last year. So here it's a very similar model as
+
+222
+00:17:12,419 --> 00:17:16,850
+before: we're going to take an input image and run it through some number of
+
+223
+00:17:16,849 --> 00:17:22,259
+convolutions and eventually extract some feature map for the pixels. But in
+
+224
+00:17:22,259 --> 00:17:26,638
+contrast, the previous methods all rely on this sort of hard-coded upsampling
+
+225
+00:17:26,638 --> 00:17:31,138
+to actually produce the final segmentation for the image, but in this
+
+226
+00:17:31,138 --> 00:17:34,668
+paper they proposed that, well, we're deep learning people, we want to
+
+227
+00:17:34,669 --> 00:17:39,149
+learn everything, so we're gonna learn the upsampling as part of the network. So
+
+228
+00:17:39,148 --> 00:17:43,298
+their network includes, as the last layer, this learnable upsampling
+
+229
+00:17:43,298 --> 00:17:50,798
+layer that actually upsamples the feature map in a learnable way. So yes,
+
+230
+00:17:50,798 --> 00:17:55,179
+they have this upsampling at the end, and the way their model kind of
+
+231
+00:17:55,179 --> 00:17:59,940
+looks is that they have, at the time it was an AlexNet, so they have their
+
+232
+00:17:59,940 --> 00:18:04,090
+input image running through many phases of convolution and pooling, and
+
+233
+00:18:04,089 --> 00:18:08,028
+eventually they produce this pool5 output; they have a quite
+
+234
+00:18:08,028 --> 00:18:12,048
+downsampled image, quite a downsampled spatial size compared to the input image,
+
+235
+00:18:12,048 --> 00:18:16,999
+and then their learnable upsampling layer upsamples that back to the
+
+236
+00:18:16,999 --> 00:18:19,460
+original size of the input image.
+
+237
+00:18:19,460 --> 00:18:25,909
+Another cool feature of this paper was this idea of skip connections. So they
+
+238
+00:18:25,909 --> 00:18:30,489
+don't use only these pool5 features, they actually use the
+
+239
+00:18:30,489 --> 00:18:34,598
+convolutional features from different layers in the network, which sort of
+
+240
+00:18:34,598 --> 00:18:39,200
+exist at different scales. So you can imagine that once you're in the pool4
+
+241
+00:18:39,200 --> 00:18:42,649
+layer of AlexNet, that's actually a bigger feature map than the pool5,
+
+242
+00:18:42,648 --> 00:18:48,069
+and pool3 is even bigger than pool4. So the intuition is that these lower
+
+243
+00:18:48,069 --> 00:18:52,148
+convolutional layers might actually help you capture finer-grained structure in the
+
+244
+00:18:52,148 --> 00:18:56,408
+input image, since they have a smaller receptive field. So in practice they
+
+245
+00:18:56,409 --> 00:18:59,889
+take these different convolutional feature maps and apply a separate
+
+246
+00:18:59,888 --> 00:19:03,428
+learned upsampling to each of these feature maps, and then combine them all
+
+247
+00:19:03,429 --> 00:19:09,070
+to produce the final output. And in the results they show that actually adding these
+
+248
+00:19:09,069 --> 00:19:15,408
+skip connections tends to help a lot with these low-level details. So over
+
+249
+00:19:15,409 --> 00:19:19,979
+here on the left, these are the results that only use these pool5 outputs,
+
+250
+00:19:19,979 --> 00:19:24,919
+and you can see that it's sort of gotten the rough idea of a person on a bicycle,
+
+251
+00:19:24,919 --> 00:19:29,330
+but it's kinda blobby and missing a lot of the fine details around the edges. But
+
+252
+00:19:29,329 --> 00:19:31,819
+then when you add in these skip connections from these lower
+
+253
+00:19:31,819 --> 00:19:35,468
+convolutional layers, that gives you a lot more fine-grained information about
+
+254
+00:19:35,469 --> 00:19:39,940
+the spatial locations of things in the image. So adding those
+
+255
+00:19:39,940 --> 00:19:43,919
+skip connections in the lower layers really helps you clean up the boundaries in some cases for these outputs.
+
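A minimal sketch of the skip-connection fusion described above: score maps from several layers (think pool3/pool4/pool5) are upsampled to the input resolution and summed. Note the assumption: the real model learns its upsampling as a transposed convolution, while this sketch uses fixed interpolation for brevity.

~~~python
import numpy as np
from scipy.ndimage import zoom

def fuse_skip_predictions(score_maps, out_hw):
    """Upsample per-layer score maps to the input resolution and sum
    them, a rough stand-in for the skip-connection fusion; in the
    actual model the upsampling is learned, not fixed interpolation."""
    H, W = out_hw
    fused = np.zeros((H, W) + score_maps[0].shape[2:])
    for m in score_maps:
        fused += zoom(m, (H / m.shape[0], W / m.shape[1], 1), order=1)
    return fused

# Toy usage: three score maps at 1/8, 1/16, 1/32 of a 32x32 input, 2 classes.
maps = [np.random.rand(4, 4, 2), np.random.rand(2, 2, 2), np.random.rand(1, 1, 2)]
print(fuse_skip_predictions(maps, (32, 32)).shape)  # (32, 32, 2)
~~~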
+256
+00:19:43,919 --> 00:19:51,159
+Question? So the question is how to
+
+257
+00:19:51,159 --> 00:19:55,070
+measure classification accuracy. I think the two metrics people typically use for this
+
+258
+00:19:55,069 --> 00:19:58,829
+are, first, just classification accuracy, since you're classifying every pixel, classification
+
+259
+00:19:58,829 --> 00:20:03,968
+metrics work; also, sometimes people use intersection over union. So for each
+
+260
+00:20:03,969 --> 00:20:09,058
+class you compute: what is the region of the image that I predicted as that class,
+
+261
+00:20:09,058 --> 00:20:12,368
+and what was the ground truth region of the image that had that class,
+
+262
+00:20:12,368 --> 00:20:17,158
+and then compute an intersection over union between those two. I'm not sure
+
+263
+00:20:17,159 --> 00:20:20,510
+which metric this paper used in particular.
+
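The per-class intersection-over-union described in that answer is a few lines of numpy; a minimal sketch:

~~~python
import numpy as np

def class_iou(pred, gt, cls):
    """Intersection over union for one class: compare the region
    predicted as `cls` with the ground-truth region for `cls`."""
    p, g = (pred == cls), (gt == cls)
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    return inter / union if union > 0 else float("nan")

pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 1], [1, 0]])
print(class_iou(pred, gt, 1))  # 2 / 3
~~~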
+264
+00:20:20,509 --> 00:20:26,609
+So this idea of learnable upsampling is actually really cool, and since
+
+265
+00:20:26,609 --> 00:20:30,839
+this paper it has been applied in a lot of other contexts, because we've seen
+
+266
+00:20:30,839 --> 00:20:35,839
+that we can downsample our feature maps in a variety of different ways, but being
+
+267
+00:20:35,839 --> 00:20:39,689
+able to upsample them inside the network could actually be very useful
+
+268
+00:20:39,690 --> 00:20:44,750
+and a very valuable thing to do. So this sometimes gets called deconvolution;
+
+269
+00:20:44,750 --> 00:20:48,980
+that's not a very good term, and we'll talk about that in a couple minutes, but
+
+270
+00:20:48,980 --> 00:20:54,130
+it's a very common term. So just to recap, sort of, when you're doing a normal
+
+271
+00:20:54,130 --> 00:20:59,870
+stride one, three by three convolution, we have this picture
+
+272
+00:20:59,869 --> 00:21:04,489
+that should be pretty familiar by now: given our four by four input, we
+
+273
+00:21:04,490 --> 00:21:08,710
+have some three by three filter, and we plop that three by three filter over
+
+274
+00:21:08,710 --> 00:21:10,059
+part of the input, take a dot
+
+275
+00:21:10,059 --> 00:21:14,539
+product, and that gives us one element of the output. And now, because this thing
+
+276
+00:21:14,539 --> 00:21:19,240
+is stride one, to compute the next element of the output we move the filter
+
+277
+00:21:19,240 --> 00:21:22,599
+over one slot in the input, again compute a dot product, and that gives us
+
+278
+00:21:22,599 --> 00:21:29,409
+our one element in the output. And now, for a stride two convolution, it's a
+
+279
+00:21:29,410 --> 00:21:32,360
+very similar type of idea, where now the
+
+280
+00:21:32,359 --> 00:21:36,099
+output is going to be a downsampled version, a two by two output for a
+
+281
+00:21:36,099 --> 00:21:40,459
+four by four input. And again it's the same idea: we take our filter, we plop it
+
+282
+00:21:40,460 --> 00:21:44,279
+down on the image, compute a dot product, that gives us one element of the output; the
+
+283
+00:21:44,279 --> 00:21:48,450
+only difference is that now we slide the convolutional filter over two slots in
+
+284
+00:21:48,450 --> 00:21:53,610
+the input to compute one element of the output. The deconvolution layer
+
+285
+00:21:53,609 --> 00:21:57,439
+actually does something a little bit different. So here we want to take a low
+
+286
+00:21:57,440 --> 00:22:02,490
+resolution input and produce a higher resolution output. So this would be maybe
+
+287
+00:22:02,490 --> 00:22:08,309
+a three by three deconvolution with a stride of two and a pad of one. So here,
+
+288
+00:22:08,309 --> 00:22:12,659
+this is a little bit weird: you know, in a normal convolution you imagine you
+
+289
+00:22:12,660 --> 00:22:16,750
+have your three by three filter and you take dot products with the input, but here
+
+290
+00:22:16,750 --> 00:22:21,000
+you want to imagine taking your three by three filter and just copying it over to
+
+291
+00:22:21,000 --> 00:22:26,230
+the output. The only difference is that the weight, like this one scalar value
+
+292
+00:22:26,230 --> 00:22:27,579
+of the weight in your input,
+
+293
+00:22:27,579 --> 00:22:31,788
+gives you a weight for that filter when you stamp it
+
+294
+00:22:31,788 --> 00:22:38,298
+into the output. And now, when we stride this thing along, we're gonna step one step
+
+295
+00:22:38,298 --> 00:22:43,298
+over in the input and two steps over in the output. Now we're going to take
+
+296
+00:22:43,298 --> 00:22:47,798
+the same learned convolutional filter and we're gonna plop it down in the
+
+297
+00:22:47,798 --> 00:22:53,378
+output. But now we're taking the same convolutional filter and
+
+298
+00:22:53,378 --> 00:22:56,928
+we're plopping it down twice in the output, the difference being that for the
+
+299
+00:22:56,929 --> 00:23:02,139
+red box, that convolutional filter is weighted by this red scalar value in the
+
+300
+00:23:02,138 --> 00:23:06,148
+input, and for the blue box, that convolutional filter is weighted by the
+
+301
+00:23:06,148 --> 00:23:10,978
+blue scalar value in the input. And where these regions overlap, you
+
+302
+00:23:10,979 --> 00:23:16,590
+just add. So this kind of allows you to learn an upsampling inside the network.
+
+303
+00:23:16,589 --> 00:23:23,118
+So if you remember from implementing convolutions on the
+
+304
+00:23:23,118 --> 00:23:27,999
+assignment, this idea of sort of spatially striding and adding in
+
+305
+00:23:27,999 --> 00:23:31,348
+overlapping regions should remind you of the backward pass for a normal
+
+306
+00:23:31,348 --> 00:23:36,729
+convolution. And it turns out that these are completely equivalent: this
+
+307
+00:23:36,729 --> 00:23:40,440
+deconvolution forward pass is exactly the same as the normal convolution
+
+308
+00:23:40,440 --> 00:23:44,840
+backward pass, and the deconvolution backward pass is the same
+
+309
+00:23:44,839 --> 00:23:50,238
+as the normal convolution forward pass. So because of that, actually, the term
+
+310
+00:23:50,239 --> 00:23:54,989
+deconvolution is maybe not so great, and if you have a signal processing
+
+311
+00:23:54,989 --> 00:23:58,700
+background, you may have seen that deconvolution already has a very
+
+312
+00:23:58,700 --> 00:24:03,308
+well-defined meaning, and that is the inverse of convolution: a
+
+313
+00:24:03,308 --> 00:24:07,470
+deconvolution should undo a convolution operation, which is quite different from what this is actually doing.
+
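The mechanics just described translate directly into a few lines of numpy; a minimal sketch of a stride-two transposed convolution (padding/cropping omitted for clarity):

~~~python
import numpy as np

def conv_transpose2d(x, w, stride=2):
    """Transposed ("fractionally strided") convolution exactly as
    described above: each input scalar weights a copy of the filter,
    copies are stamped into the output `stride` apart, and overlapping
    regions are summed. This matches the backward pass of a normal
    strided convolution."""
    k = w.shape[0]
    H, W = x.shape
    out = np.zeros((stride * (H - 1) + k, stride * (W - 1) + k))
    for i in range(H):
        for j in range(W):
            out[i * stride:i * stride + k, j * stride:j * stride + k] += x[i, j] * w
    return out

x = np.array([[1.0, 2.0], [3.0, 4.0]])   # low-resolution input
w = np.ones((3, 3))                      # 3x3 filter
print(conv_transpose2d(x, w))            # 5x5 output; overlaps add up
~~~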
+314
+00:24:07,470 --> 00:24:11,909
+So probably better names for this, instead of
+
+315
+00:24:11,909 --> 00:24:17,609
+deconvolution, that you'll sometimes see, will be convolution transpose, or
+
+316
+00:24:17,608 --> 00:24:22,148
+backwards strided convolution, or fractionally strided convolution, or
+
+317
+00:24:22,148 --> 00:24:27,148
+up-convolution. So I think those are kind of weird names; I think deconvolution is
+
+318
+00:24:27,148 --> 00:24:30,988
+popular just because it's easiest to say, even though it may be less
+
+319
+00:24:30,989 --> 00:24:35,369
+technically correct. Although, actually, if you read papers you'll see that some
+
+320
+00:24:35,368 --> 00:24:38,699
+people get angry about this, so
+
+321
+00:24:38,700 --> 00:24:43,539
+it's more proper to say convolution transpose instead of deconvolution, and
+
+322
+00:24:43,539 --> 00:24:47,529
+this other paper really wants it to be called fractionally strided convolution.
+
+323
+00:24:47,529 --> 00:24:51,750
+So I think the community is still deciding on the right terminology
+
+324
+00:24:51,750 --> 00:24:55,240
+here, but I kind of agree with them that deconvolution is probably not very
+
+325
+00:24:55,240 --> 00:25:00,309
+technically correct. And this paper in particular, they felt very
+
+326
+00:25:00,309 --> 00:25:04,139
+strongly about this issue, and they had a one-page appendix to the paper
+
+327
+00:25:04,140 --> 00:25:09,230
+actually explaining why convolution transpose is the proper term. So if you're
+
+328
+00:25:09,230 --> 00:25:11,849
+interested, then I would really recommend checking that out; it's a pretty good
+
+329
+00:25:11,849 --> 00:25:16,289
+explanation, actually. So, any questions about this?
+
+330
+00:25:16,289 --> 00:25:26,299
+Yeah, so the question is how much faster is this relative to a patch-based
+
+331
+00:25:26,299 --> 00:25:29,930
+thing. The answer is that in practice nobody even thinks to run this thing in
+
+332
+00:25:29,930 --> 00:25:34,820
+a fully patch-based mode, that would just be way too slow. So actually all
+
+333
+00:25:34,819 --> 00:25:36,000
+of the papers that I've seen
+
+334
+00:25:36,000 --> 00:25:39,109
+do some kind of fully convolutional thing in one way or another.
+
+335
+00:25:39,109 --> 00:25:44,729
+Actually, there is sort of another trick, instead of upsampling, that people
+
+336
+00:25:44,730 --> 00:25:49,309
+sometimes use, and that is: suppose that your network is actually
+
+337
+00:25:49,309 --> 00:25:52,599
+gonna downsample by a factor of four. Then one thing you can do is take your
+
+338
+00:25:52,599 --> 00:25:57,199
+input image, shift it by one pixel, and now run it through the network again, and
+
+339
+00:25:57,200 --> 00:26:00,710
+you get another output, and you repeat this for sort of four different one-
+
+340
+00:26:00,710 --> 00:26:04,870
+pixel shifts of the input. And now you've gotten four output maps, and you can sort
+
+341
+00:26:04,869 --> 00:26:08,339
+of interleave those to reconstruct the original input map. So that's
+
+342
+00:26:08,339 --> 00:26:12,279
+another trick that people sometimes use to get around that problem, but I think
+
+343
+00:26:12,279 --> 00:26:19,740
+that this learned upsampling is quite a bit cleaner.
+
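A toy numpy sketch of that shift-and-interleave trick, assuming a hypothetical `net` stand-in that downsamples by a factor `f` in each dimension; the f*f shifted outputs are interleaved back into a full-resolution map:

~~~python
import numpy as np

def shift_and_stitch(image, net, f=2):
    """Run the (stub) network on every one-pixel shift of the input,
    then interleave the small output maps into one full-resolution
    map, as described above."""
    H, W = image.shape
    out = np.zeros((H, W))
    for dy in range(f):
        for dx in range(f):
            shifted = np.roll(np.roll(image, -dy, axis=0), -dx, axis=1)
            small = net(shifted)                  # (H//f, W//f) output
            out[dy::f, dx::f] = small
    return out

# Toy "network": downsample by striding (factor 2).
net = lambda im: im[::2, ::2]
img = np.arange(16.0).reshape(4, 4)
print(np.allclose(shift_and_stitch(img, net), img))  # True for this toy net
~~~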
+344
+00:26:19,740 --> 00:26:28,440
+So, I think that one is really nice, it just rolls off the tongue. I think backward strided, I
+
+345
+00:26:28,440 --> 00:26:33,799
+think fractionally strided convolution is actually pretty cool, right. I think
+
+346
+00:26:33,799 --> 00:26:36,928
+it's the longest name, but it's really descriptive, right: normally, with a
+
+347
+00:26:36,929 --> 00:26:40,910
+strided convolution, you move x elements in the input and you move
+
+348
+00:26:40,910 --> 00:26:45,808
+one element in the output, and here you're moving half an element in the
+
+349
+00:26:45,808 --> 00:26:48,940
+input, which corresponds to moving one element in the
+
+350
+00:26:48,940 --> 00:26:55,140
+output. So that captures the idea quite nicely. So I'm not sure what I'll call it
+
+351
+00:26:55,140 --> 00:27:02,790
+when I use it in a paper, we'll have to see about that. But so, despite
+
+352
+00:27:02,789 --> 00:27:06,440
+the concerns about people calling it deconvolution, people just
+
+353
+00:27:06,440 --> 00:27:10,980
+call it that anyway. So there was this paper from ICCV that takes this idea
+
+354
+00:27:10,980 --> 00:27:16,319
+of this deconvolution, slash, fractionally strided convolution idea
+
+355
+00:27:16,319 --> 00:27:21,428
+and sort of pushes it to the extreme. So here they took what amounts to
+
+356
+00:27:21,429 --> 00:27:28,170
+two entire VGG networks. So this is the exact same setup as before: we want an input
+
+357
+00:27:28,170 --> 00:27:33,720
+and output pixel-wise predictions for the semantic segmentation task, but here
+
+358
+00:27:33,720 --> 00:27:40,220
+we initialize with VGG, and over here is an upside-down VGG, and it trains for six
+
+359
+00:27:40,220 --> 00:27:44,509
+days on a Titan X. So this thing is pretty slow, but it actually got really, really good
+
+360
+00:27:44,509 --> 00:27:51,160
+results, and I think it's also a very beautiful figure. So that's pretty
+
+361
+00:27:51,160 --> 00:27:54,308
+much all that I have to say about semantic segmentation, if there's any
+
+362
+00:27:54,308 --> 00:27:59,799
+questions about that? Yeah.
+
+363
+00:27:59,799 --> 00:28:04,909
+The question is, how was this figure made? The answer is, I took a screenshot from
+
+364
+00:28:04,910 --> 00:28:09,090
+their paper, so I don't know. But you could try TensorFlow; we saw in the last
+
+365
+00:28:09,089 --> 00:28:15,069
+lecture that it lets you make figures, but they're not as nice as this. Yeah?
+
+366
+00:28:15,069 --> 00:28:22,579
+The question is about training data. Yes, there exist datasets with this kind of thing,
+
+367
+00:28:22,579 --> 00:28:28,449
+where, I think a common one is the PASCAL segmentation dataset. So it
+
+368
+00:28:28,450 --> 00:28:31,380
+just has ground truth: you have an image, and they have
+
+369
+00:28:31,380 --> 00:28:37,780
+every pixel labeled. Yeah, it's kind of expensive to get that data, so the
+
+370
+00:28:37,779 --> 00:28:43,049
+datasets tend to be a little smaller. But in practice there's a famous interface
+
+371
+00:28:43,049 --> 00:28:46,299
+called LabelMe where you can upload an image and then sort of draw contours
+
+372
+00:28:46,299 --> 00:28:49,240
+around different regions of the image, and then you
+
+373
+00:28:49,240 --> 00:28:54,140
+can convert those contours into sort of these segmentation masks. That's how you
+
+374
+00:28:54,140 --> 00:29:02,130
+tend to label these things. OK, if there are no more questions, then I think we'll
+
+375
+00:29:02,130 --> 00:29:07,290
+move on to instance segmentation. So just to recap, instance segmentation is this
+
+376
+00:29:07,289 --> 00:29:11,089
+generalization where we not only want to label the pixels of the image, but we
+
+377
+00:29:11,089 --> 00:29:15,089
+also want to distinguish instances. So we're going to
+
+378
+00:29:15,089 --> 00:29:18,419
+detect the different instances of our classes, and for each one we want to
+
+379
+00:29:18,420 --> 00:29:25,320
+label the pixels of that instance. So these models end up
+
+380
+00:29:25,319 --> 00:29:28,419
+looking a lot like the detection models that we talked about a few lectures ago.
+
+381
+00:29:28,420 --> 00:29:34,150
+So, one of the earliest papers that I know of that did this, actually, I should also
+
+382
+00:29:34,150 --> 00:29:38,040
+point out that this is, I think, a much more recent task; this idea of
+
+383
+00:29:38,039 --> 00:29:42,319
+semantic segmentation has been used in computer vision for a long, long time, but
+
+384
+00:29:42,319 --> 00:29:45,409
+I think this idea of instance segmentation has gotten a lot more
+
+385
+00:29:45,410 --> 00:29:50,970
+popular especially in the last couple of years. So this paper from 2014
+
+386
+00:29:50,970 --> 00:29:53,890
+took this, I think they call it simultaneous detection and segmentation,
+
+387
+00:29:53,890 --> 00:29:59,600
+or SDS, that's kind of a nice name, and this is actually very similar to the R-CNN
+
+388
+00:29:59,599 --> 00:30:03,839
+model that we saw for detection. So here we're gonna take an input image,
+
+389
+00:30:03,839 --> 00:30:09,399
+and if you remember, in R-CNN we rely on these external region proposals, which
+
+390
+00:30:09,400 --> 00:30:12,269
+are these sort of offline computer vision
+
+391
+00:30:12,269 --> 00:30:16,538
+things that compute predictions on where it thinks objects in the image might
+
+392
+00:30:16,538 --> 00:30:17,658
+be located.
+
+393
+00:30:17,659 --> 00:30:21,419
+Well, it turns out that there are other methods for proposing segments instead
+
+394
+00:30:21,419 --> 00:30:25,419
+of boxes, so we just download one of those existing segment proposal methods
+
+395
+00:30:25,419 --> 00:30:30,879
+and use that instead. Now, for each of these
+
+396
+00:30:30,878 --> 00:30:35,398
+proposed segments, we can extract a bounding box by just fitting a box around
+
+397
+00:30:35,398 --> 00:30:40,298
+the segment, and then crop out that chunk of the input image and run it
+
+398
+00:30:40,298 --> 00:30:47,108
+through a box CNN to extract features for that box. Then, in parallel, we'll run it
+
+399
+00:30:47,108 --> 00:30:52,358
+through a region CNN. So again, we take that chunk from the input
+
+400
+00:30:52,358 --> 00:30:57,168
+image and crop it out, but here, because we actually have this proposal
+
+401
+00:30:57,169 --> 00:31:01,320
+for the segment, we're going to mask out the background region using the mean
+
+402
+00:31:01,319 --> 00:31:05,700
+color of the dataset. So this is kind of a hack that lets you take these kind
+
+403
+00:31:05,700 --> 00:31:09,838
+of weirdly shaped inputs and feed them into a CNN: you just mask out the background
+
+404
+00:31:09,838 --> 00:31:14,479
+part with the mean color. So then we take these masked inputs and run them
+
+405
+00:31:14,479 --> 00:31:18,769
+through a separate region CNN. Now we've gotten two different feature vectors, one
+
+406
+00:31:18,769 --> 00:31:22,739
+sort of incorporating the whole box, and one incorporating only the
+
+407
+00:31:22,739 --> 00:31:26,328
+proposed foreground pixels. We concatenate these things, and then, just
+
+408
+00:31:26,328 --> 00:31:30,638
+like in R-CNN, we make a classification to decide what class this segment should actually be.
+
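A small numpy sketch of the two crops described above; the feature extractors themselves are omitted, and `box_and_masked_crops` is just a hypothetical helper name showing the box crop and the mean-color masking hack:

~~~python
import numpy as np

def box_and_masked_crops(image, seg_mask, mean_color):
    """Crop the bounding box of a proposed segment (input to the "box
    CNN") and a second copy where background pixels are replaced by the
    dataset mean color (input to the "region CNN")."""
    ys, xs = np.where(seg_mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    box_crop = image[y0:y1, x0:x1].copy()
    masked = box_crop.copy()
    masked[~seg_mask[y0:y1, x0:x1]] = mean_color   # the masking hack
    return box_crop, masked

img = np.random.rand(6, 6, 3)
mask = np.zeros((6, 6), dtype=bool)
mask[2:5, 1:4] = True
mask[2, 3] = False                                  # a non-rectangular segment
box, region = box_and_masked_crops(img, mask, mean_color=img.mean(axis=(0, 1)))
print(box.shape, region.shape)  # (3, 3, 3) (3, 3, 3)
~~~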
+409
+00:31:30,638 --> 00:31:37,128
+And then they also have this region refinement step, where we
+
+410
+00:31:37,128 --> 00:31:42,108
+want to refine the proposed regions a little bit. So I don't know how well
+
+411
+00:31:42,108 --> 00:31:45,218
+you remember the R-CNN framework, but this is actually very similar to R-CNN,
+
+412
+00:31:45,219 --> 00:31:52,909
+just applied to this simultaneous detection and segmentation task. So for this
+
+413
+00:31:52,909 --> 00:31:56,950
+region refinement step, there's actually a follow-up paper that
+
+414
+00:31:56,950 --> 00:32:03,288
+proposes a pretty nice way to do it. So here is a paper from the same folks at
+
+415
+00:32:03,288 --> 00:32:07,578
+Berkeley at the following conference, and here we want to take this input,
+
+416
+00:32:07,578 --> 00:32:12,940
+which is this proposed segment, and we want to clean it up
+
+417
+00:32:12,940 --> 00:32:17,778
+somehow. So we're actually gonna take a very similar type
+
+418
+00:32:17,778 --> 00:32:20,230
+of multiscale approach that we saw in
+
+419
+00:32:20,230 --> 00:32:24,839
+the semantic segmentation model a while ago. So here we're going to take
+
+420
+00:32:24,839 --> 00:32:30,139
+our image, crop out the box corresponding to that segment, and
+
+421
+00:32:30,140 --> 00:32:34,350
+then pass it through an AlexNet, and we're going to extract convolutional
+
+422
+00:32:34,349 --> 00:32:37,849
+features from several different layers of that AlexNet. For each of those
+
+423
+00:32:37,849 --> 00:32:42,139
+feature maps we'll upsample them and combine them together, and now we'll
+
+424
+00:32:42,140 --> 00:32:48,370
+produce this proposed figure-ground segmentation. So this
+
+425
+00:32:48,369 --> 00:32:52,308
+is actually kind of a funny output, but it's really easy to predict: the
+
+426
+00:32:52,308 --> 00:32:55,910
+idea is that in this output image we're just gonna have a logistic
+
+427
+00:32:55,910 --> 00:33:00,990
+classifier at each independent pixel. So given these features, we just have a
+
+428
+00:33:00,990 --> 00:33:04,410
+whole bunch of independent logistic classifiers that are predicting how
+
+429
+00:33:04,410 --> 00:33:08,250
+likely each pixel of this output is to be in the foreground or in the
+
+430
+00:33:08,250 --> 00:33:13,390
+background. And they show that this type of multiscale refinement step
+
+431
+00:33:13,390 --> 00:33:16,610
+actually cleans up the outputs of the previous system and gives
+
+432
+00:33:16,609 --> 00:33:27,899
+quite nice results. Question?
+
+433
+00:33:27,900 --> 00:33:34,390
+Was the upsampling a fractionally strided convolution? I think it was instead some kind of
+
+434
+00:33:34,390 --> 00:33:37,870
+fixed upsampling, like a bilinear interpolation or something like that, or
+
+435
+00:33:37,869 --> 00:33:41,449
+maybe even a nearest neighbor, just something fixed and not learnable. But I could
+
+436
+00:33:41,450 --> 00:33:44,170
+be wrong, and you could definitely imagine swapping in some learnable
+
+437
+00:33:44,170 --> 00:33:46,250
+thing there too.
+
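A sketch of that refinement head under stated assumptions: hypothetical feature maps from several layers are upsampled (fixed interpolation here, as discussed in the answer above), concatenated, and a single logistic classifier with stand-in weights `w`, `b` is applied independently at every pixel:

~~~python
import numpy as np
from scipy.ndimage import zoom

def figure_ground(feature_maps, out_hw, w, b):
    """Upsample multi-layer feature maps to a common size, concatenate
    them, and apply one logistic classifier per pixel. `w` and `b`
    stand in for learned weights."""
    H, W = out_hw
    ups = [zoom(f, (H / f.shape[0], W / f.shape[1], 1), order=1) for f in feature_maps]
    feats = np.concatenate(ups, axis=-1)      # (H, W, D_total)
    logits = feats @ w + b                    # same classifier at each pixel
    return 1.0 / (1.0 + np.exp(-logits))      # P(pixel is foreground)

fmaps = [np.random.rand(8, 8, 4), np.random.rand(4, 4, 6)]
probs = figure_ground(fmaps, (16, 16), w=np.random.randn(10), b=0.0)
print(probs.shape)  # (16, 16)
~~~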
+438
+00:33:46,250 --> 00:33:52,980
+OK, so this actually is very similar to R-CNN, but in the detection
+
+439
+00:33:52,980 --> 00:33:57,049
+lecture we saw that R-CNN was just the start of the story, there are all these
+
+440
+00:33:57,049 --> 00:34:03,329
+faster versions, right? So it turns out that a similar intuition from faster R-
+
+441
+00:34:03,329 --> 00:34:08,090
+CNN has actually been applied to this instance segmentation problem as well. So
+
+442
+00:34:08,090 --> 00:34:12,050
+this is work from Microsoft Research, and this model actually won the COCO
+
+443
+00:34:12,050 --> 00:34:16,860
+instance segmentation challenge this year. So they took their giant
+
+444
+00:34:16,860 --> 00:34:20,000
+ResNet, and they stuck this model on top of it, and they crushed
+
+445
+00:34:20,000 --> 00:34:25,489
+everyone else in the COCO instance segmentation challenge. So this
+
+446
+00:34:25,489 --> 00:34:28,668
+actually is very similar to faster R-CNN. So we're going to take our
+
+447
+00:34:28,668 --> 00:34:34,148
+input image, and just like in fast and faster R-CNN, our input image will
+
+448
+00:34:34,148 --> 00:34:37,730
+be pretty high resolution, and we'll get this giant convolutional feature map
+
+449
+00:34:37,730 --> 00:34:44,260
+over our high resolution image. And then from this high resolution image we're actually
+
+450
+00:34:44,260 --> 00:34:48,700
+going to propose our own region proposals. In the previous method we
+
+451
+00:34:48,699 --> 00:34:52,319
+relied on these external segment proposals, but here we're just gonna
+
+452
+00:34:52,320 --> 00:34:56,870
+learn our own region proposals, just like faster R-CNN. So here we just stick a
+
+453
+00:34:56,869 --> 00:35:00,859
+couple extra convolutional layers on top of our convolutional feature map,
+
+454
+00:35:00,860 --> 00:35:04,740
+and each one of those is going to predict several regions of interest in
+
+455
+00:35:04,739 --> 00:35:11,109
+the image, using this idea of boxes that we saw in the detection work.
+
+456
+00:35:11,110 --> 00:35:15,200
+The difference is that now, once we have these region proposals, we're
+
+457
+00:35:15,199 --> 00:35:18,559
+gonna segment them using a very similar approach that we just saw on the last
+
+458
+00:35:18,559 --> 00:35:24,380
+slide. So for each of these proposed regions, we're going to use this ROI, what
+
+459
+00:35:24,380 --> 00:35:28,579
+they call ROI warping or pooling, and squish them all down to a fixed square
+
+460
+00:35:28,579 --> 00:35:33,000
+size, and then run each of them through a convolutional neural network to produce
+
+461
+00:35:33,000 --> 00:35:36,710
+these coarse figure-ground segmentation masks, like we just saw on the previous
+
+462
+00:35:36,710 --> 00:35:41,909
+slide. So now, at this point, we've gotten our image, we've got a
+
+463
+00:35:41,909 --> 00:35:45,859
+bunch of region proposals, and now for each region proposal we have a rough
+
+464
+00:35:45,860 --> 00:35:49,240
+idea of which part of that box is foreground and which part is background.
+
+465
+00:35:49,239 --> 00:35:54,489
+Now we're going to take this idea of masking: now that we've predicted the
+
+466
+00:35:54,489 --> 00:35:57,709
+foreground and background for each of these segments, we're going to mask out the
+
+467
+00:35:57,710 --> 00:36:02,889
+predicted background and only keep the pixels from the predicted foreground, and
+
+468
+00:36:02,889 --> 00:36:07,179
+pass those through another couple layers to actually classify
+
+469
+00:36:07,179 --> 00:36:13,629
+that segment as our different categories. So this entire thing can
+
+470
+00:36:13,630 --> 00:36:18,380
+just be learned jointly, end to end, with the idea that we've got these three
+
+471
+00:36:18,380 --> 00:36:22,490
+semantically interpretable outputs at intermediate layers of our network, and
+
+472
+00:36:22,489 --> 00:36:26,589
+each of them we can just supervise with ground truth data. So for these regions of
+
+473
+00:36:26,590 --> 00:36:29,900
+interest, we know where the ground truth objects are in the image,
+
+474
+00:36:29,900 --> 00:36:34,349
+so we can provide supervision on those outputs; for these segmentation masks, we
+
+475
+00:36:34,349 --> 00:36:37,929
+know what the true foreground and background are, so we can give supervision
+
+476
+00:36:37,929 --> 00:36:42,759
+there; and we obviously know the classes of those different segments. So
+
+477
+00:36:42,760 --> 00:36:46,760
+we just provide supervision at different layers of the network and try to trade
+
+478
+00:36:46,760 --> 00:36:50,420
+off all those different loss terms and hopefully get the thing to converge. But
+
+479
+00:36:50,420 --> 00:36:53,670
+this actually was trained end to end, fine-tuned, and it
+
+480
+00:36:53,670 --> 00:36:59,809
+works really, really well. So here is the results figure that they have to show.
+
+481
+00:36:59,809 --> 00:37:04,519
+So these results are, at least to me, really impressive. So for example,
+
+482
+00:37:04,519 --> 00:37:09,159
+this input image has all these different people sitting in this room, and the
+
+483
+00:37:09,159 --> 00:37:12,539
+predicted outputs do a really good job of separating out all those different
+
+484
+00:37:12,539 --> 00:37:15,360
+people, even though they overlap, and there's a lot of them, and they're very
+
+485
+00:37:15,360 --> 00:37:16,500
+close.
+
+486
+00:37:16,500 --> 00:37:20,699
+Same with these cars, maybe that's a little easier, but especially this people
+
+487
+00:37:20,699 --> 00:37:24,629
+one I was pretty impressed by. But you can see it's not perfect: this potted
+
+488
+00:37:24,630 --> 00:37:28,840
+plant, it thought it was bigger here than it really was, and it confused this chair on
+
+489
+00:37:28,840 --> 00:37:32,230
+the right for a person, and it missed a person there. But overall these results
+
+490
+00:37:32,230 --> 00:37:36,300
+are very impressive, and like I said, this model won the COCO segmentation
+
+491
+00:37:36,300 --> 00:37:43,250
+challenge this year. So the overview of segmentation is that we've got these
+
+492
+00:37:43,250 --> 00:37:47,519
+two different tasks, semantic segmentation and instance segmentation.
+
+493
+00:37:47,519 --> 00:37:52,210
+For semantic segmentation it's very common to use this
+
+494
+00:37:52,210 --> 00:37:56,800
+conv-deconv approach, and then for instance segmentation you end up with
+
+495
+00:37:56,800 --> 00:38:02,180
+these pipelines that look more similar to object detection. So if there are any
+
+496
+00:38:02,179 --> 00:38:08,338
+last-minute questions about segmentation, I can try to answer those now. Super
+
+497
+00:38:08,338 --> 00:38:14,329
+clear, I guess. So we're gonna move on to another pretty cool,
+
+498
+00:38:14,329 --> 00:38:18,150
+exciting topic, and that's attention models. So this is something that I think
+
+499
+00:38:18,150 --> 00:38:24,550
+has gotten a lot of attention in the last year in the community. So as a kind of case
+
+500
+00:38:24,550 --> 00:38:29,780
+study, we're gonna talk about this model, I need a citation here, OK, but as
+
+501
+00:38:29,780 --> 00:38:32,349
+a sort of a case study,
+
+502
+00:38:32,349 --> 00:38:35,190
+we're going to talk about the idea of attention as applied to image captioning.
+
+503
+00:38:35,190 --> 00:38:39,530
+So I think this model was previewed in the recurrent networks lecture, but
+
+504
+00:38:39,530 --> 00:38:43,740
+I want to step into a lot more detail here. But first, as a recap, just
+
+505
+00:38:43,739 --> 00:38:47,029
+so we're on the same page, hopefully you know how image captioning works by
+
+506
+00:38:47,030 --> 00:38:51,540
+now, since the homework is due in a few hours. But we're going to take our input
+
+507
+00:38:51,539 --> 00:38:54,869
+image and run it through a convolutional net and get some features;
+
+508
+00:38:54,869 --> 00:38:58,869
+those features will be used maybe to initialize the first hidden state of our
+
+509
+00:38:58,869 --> 00:39:03,780
+recurrent network. Then, from our start token, or first word, and that hidden
+
+510
+00:39:03,780 --> 00:39:06,609
+state, we're going to produce this distribution over words in our
+
+511
+00:39:06,608 --> 00:39:11,940
+vocabulary; then, to generate a word, we'll just sample from that distribution, and we'll
+
+512
+00:39:11,940 --> 00:39:16,429
+just sort of repeat this process over time to generate captions. The
+
+513
+00:39:16,429 --> 00:39:20,199
+problem here is that this network only sort of gets one chance to look at the
+
+514
+00:39:20,199 --> 00:39:23,899
+input image, and when it does, it's looking at the entire input image all at
+
+515
+00:39:23,900 --> 00:39:29,970
+once. And it might be cooler if it actually had the ability to, one, look at
+
+516
+00:39:29,969 --> 00:39:33,809
+the input image multiple times, and also if it could focus on different parts of
+
+517
+00:39:33,809 --> 00:39:41,969
+the input image as it ran. So one pretty cool paper that came out last year was
+
+518
+00:39:41,969 --> 00:39:46,409
+this one called Show, Attend and Tell; the original one was Show and Tell, so they added
+
+519
+00:39:46,409 --> 00:39:51,289
+the 'attend' part. And the idea is pretty straightforward: we're going to take
+
+520
+00:39:51,289 --> 00:39:54,750
+our input image, and we're still gonna run it through a convolutional network,
+
+521
+00:39:54,750 --> 00:39:58,440
+but instead of extracting the features from the last fully connected layer,
+
+522
+00:39:58,440 --> 00:40:01,659
+instead we're gonna pull features from one of the earlier
+
+523
+00:40:01,659 --> 00:40:05,549
+convolutional layers, and that's going to give us this grid of features
+
+524
+00:40:05,550 --> 00:40:09,160
+rather than a single feature vector. So because these are coming from a
+
+525
+00:40:09,159 --> 00:40:13,460
+convolutional layer, you can
+
+526
+00:40:13,460 --> 00:40:17,320
+think of this as a 2D spatial grid of features, where
+
+527
+00:40:17,320 --> 00:40:21,130
+each point in the grid gives you features corresponding to some part of
+
+528
+00:40:21,130 --> 00:40:26,890
+the input image. So now, again, we'll use these features to initialize the
+
+529
+00:40:26,889 --> 00:40:30,099
+hidden state of our network in some way, and now here's where things get
+
+530
+00:40:30,099 --> 00:40:34,400
+different: now we're going to use our hidden state to compute not a
+
+531
+00:40:34,400 --> 00:40:38,220
+distribution over words, but instead a distribution over these different
+
+532
+00:40:38,219 --> 00:40:43,459
+positions in our convolutional feature map. So again, this would
+
+533
+00:40:43,460 --> 00:40:47,050
+probably be implemented with maybe an affine
+
+534
+00:40:47,050 --> 00:40:51,260
+layer or two and then some softmax to give you a distribution, but we just end
+
+535
+00:40:51,260 --> 00:40:54,410
+up with this L-dimensional vector giving us a probability distribution
+
+536
+00:40:54,409 --> 00:41:01,019
+over these different locations in our input. And now we take this probability
+
+537
+00:41:01,019 --> 00:41:05,780
+distribution and actually use it to reweight, to get a weighted sum of those
+
+538
+00:41:05,780 --> 00:41:10,810
+feature vectors at the different points in our grid. So once we take this
+
+539
+00:41:10,809 --> 00:41:15,849
+weighted combination of features, that takes our grid and summarizes it down to
+
+540
+00:41:15,849 --> 00:41:22,420
+a single vector, and this sort of z vector summarizes the input
+
+541
+00:41:22,420 --> 00:41:26,909
+image in some way, and due to this
+
+542
+00:41:26,909 --> 00:41:30,619
+probability distribution, it gives the network the capacity to focus on
+
+543
+00:41:30,619 --> 00:41:35,299
+different parts of the image as it goes. So now this weighted feature vector that's
+
+544
+00:41:35,300 --> 00:41:39,730
+produced from the input features gets fed in together with the first word, and now,
+
+545
+00:41:39,730 --> 00:41:43,960
+when we make a recurrence in the recurrent network, we actually have three input parts:
+
+546
+00:41:43,960 --> 00:41:49,139
+we have our previous hidden state, we have this attended feature vector, and we
+
+547
+00:41:49,139 --> 00:41:52,929
+have this first word. And now all of these together are used to produce our
+
+548
+00:41:52,929 --> 00:41:56,929
+new hidden state, and from this hidden state we're actually going to
+
+549
+00:41:56,929 --> 00:42:01,419
+produce two outputs: we're going to produce a new distribution over
+
+550
+00:42:01,420 --> 00:42:04,940
+the locations in our input image, and we're also going to produce our standard
+
+551
+00:42:04,940 --> 00:42:08,599
+distribution over words. So these would probably be implemented as just a couple
+
+552
+00:42:08,599 --> 00:42:13,679
+of affine layers on top of the hidden state, and now this process repeats:
+
+553
+00:42:13,679 --> 00:42:17,739
+given this new probability distribution, we go back to the input feature grid
+
+554
+00:42:17,739 --> 00:42:22,949
+and compute a new summarization vector for the image.
+
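A minimal numpy sketch of one such attention step, under stated assumptions: the grid is a conv feature map flattened to L feature vectors, and a single hypothetical affine map `Wa` stands in for the layer(s) that turn the hidden state into scores over the L locations:

~~~python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def attention_step(h, grid, Wa):
    """Map the hidden state to scores over the L grid positions,
    softmax into a distribution p, and form the context vector z as the
    p-weighted sum of the grid's feature vectors."""
    p = softmax(Wa @ h)   # (L,) distribution over grid locations
    z = p @ grid          # (D,) weighted sum = summarization vector
    return p, z

# Toy shapes: a 7x7x512 conv5 map flattened to L = 49 feature vectors.
grid = np.random.randn(7 * 7, 512)
h = np.random.randn(128)
p, z = attention_step(h, grid, Wa=np.random.randn(49, 128))
print(p.shape, z.shape, p.sum())  # (49,) (512,) ~1.0
~~~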
+555
+00:42:22,949 --> 00:42:25,618
+We take that vector together with the next word in the sentence to compute the new hidden
+
+556
+00:42:25,618 --> 00:42:34,929
+state and produce outputs. OK, so that spoiled it a little bit, but then we will actually
+
+557
+00:42:34,929 --> 00:42:50,109
+repeat this process over time to generate captions. Yeah, so the question is,
+
+558
+00:42:50,110 --> 00:42:54,190
+where does this feature grid come from? And the answer is, when
+
+559
+00:42:54,190 --> 00:42:57,510
+you're doing an AlexNet, for example, you have conv1, conv2, conv3,
+
+560
+00:42:57,510 --> 00:43:01,670
+conv4, conv5, and by the time you get to conv5 the shape of that tensor is
+
+561
+00:43:01,670 --> 00:43:05,960
+now something like seven by seven by five hundred and twelve. So that
+
+562
+00:43:05,960 --> 00:43:11,050
+corresponds to a seven by seven spatial grid over the input, and at each grid
+
+563
+00:43:11,050 --> 00:43:15,450
+position there's a 512-dimensional feature vector. So those are just pulled
+
+564
+00:43:15,449 --> 00:43:27,858
+out of one of the convolutional layers of the network. Question?
+
+565
+00:43:27,858 --> 00:43:33,219
+So the question is about these probability distributions. So we're actually
+
+566
+00:43:33,219 --> 00:43:37,899
+producing two different probability distributions at every time step. The
+
+567
+00:43:37,900 --> 00:43:42,400
+first is one of these D vectors in blue, so those are probability distributions over
+
+568
+00:43:42,400 --> 00:43:46,920
+words in your vocabulary, like we did in normal image captioning; and also at
+
+569
+00:43:46,920 --> 00:43:50,759
+every time step we'll produce a second probability distribution over these
+
+570
+00:43:50,759 --> 00:43:55,170
+locations in the input image, that's telling us where we want to
+
+571
+00:43:55,170 --> 00:43:59,690
+look at the next time step. This is actually quite fun: if you were tuned in to
+
+572
+00:43:59,690 --> 00:44:05,200
+the last lecture, we had this quiz wanting to see, like, what framework you'd want to use
+
+573
+00:44:05,199 --> 00:44:09,679
+for this, and we talked about maybe how Torch would be a good choice, or
+
+574
+00:44:09,679 --> 00:44:16,288
+TensorFlow, and I think this qualifies as a crazy RNN. So I
+
+575
+00:44:16,289 --> 00:44:19,749
+wanted to maybe talk in a little bit more detail about how these attention vectors,
+
+576
+00:44:19,748 --> 00:44:24,308
+how these summarization vectors, get produced. So this paper actually talks
+
+577
+00:44:24,309 --> 00:44:29,278
+about two different methods for generating these vectors. So the idea, as
+
+578
+00:44:29,278 --> 00:44:33,559
+we saw in the last slide, is that we'll take our input image and get this grid
+
+579
+00:44:33,559 --> 00:44:38,019
+of features coming from one of the convolutional layers in our network, and
+
+580
+00:44:38,018 --> 00:44:41,899
+then at each time step our network will produce this probability distribution
+
+581
+00:44:41,900 --> 00:44:45,789
+over locations. So this would be a fully connected layer and a softmax to
+
+582
+00:44:45,789 --> 00:44:50,329
+normalize it. And now the idea is that we want to take this grid of feature
+
+583
+00:44:50,329 --> 00:44:54,249
+vectors together with these probability distributions and produce a single
+
+584
+00:44:54,248 --> 00:44:59,798
+D-dimensional vector that summarizes the input image. And the paper
+
+585
+00:44:59,798 --> 00:45:04,159
+actually explores two different ways of solving this problem. So the easy way is
+
+586
+00:45:04,159 --> 00:45:08,969
+to use what they call soft attention. So here our D-dimensional
+
+587
+00:45:08,969 --> 00:45:13,518
+vector z will just be a weighted sum of all the elements in the grid, where
+
+588
+00:45:13,518 --> 00:45:18,028
+each vector is just weighted by its predicted probability.
+
+589
+00:45:18,028 --> 00:45:23,318
+This is actually very easy to implement, it sort of acts just as another layer
+
+590
+00:45:23,318 --> 00:45:28,599
+in a neural network, and these gradients, like the derivative of this context
+
+591
+00:45:28,599 --> 00:45:32,588
+vector with respect to our predicted probabilities p, are quite nice and easy
+
+592
+00:45:32,588 --> 00:45:36,818
+to compute. So we can actually train this thing just using normal gradient
+
+593
+00:45:36,818 --> 00:45:40,019
+descent and backpropagation.
+
+594
+00:45:40,019 --> 00:45:44,559
+But they actually explore another option for computing this
+
+595
+00:45:44,559 --> 00:45:48,210
+feature vector, and that's something called hard attention. So instead of
+
+596
+00:45:48,210 --> 00:45:52,630
+having this weighted sum, we might want to select just a single element of that
+
+597
+00:45:52,630 --> 00:45:57,940
+grid to attend to. So one simple thing to do is
+
+598
+00:45:57,940 --> 00:46:02,440
+just to pick the element of the grid with the highest probability and just
+
+599
+00:46:02,440 --> 00:46:07,269
+pull out the feature vector corresponding to that argmax position.
+
+600
+00:46:07,269 --> 00:46:13,150
+The problem is, now, if you think about this argmax case, if
+
+601
+00:46:13,150 --> 00:46:16,829
+you think about this derivative, the derivative with respect to our
+
+602
+00:46:16,829 --> 00:46:18,360
+distribution p,
+
+603
+00:46:18,360 --> 00:46:22,980
+it turns out that this is not very friendly for backpropagation anymore. So
+
+604
+00:46:22,980 --> 00:46:29,059
+imagine, in the argmax case, suppose that p_a were actually the largest
+
+605
+00:46:29,059 --> 00:46:33,119
+element in our input; now what happens if we change p_a just a little
+
+606
+00:46:33,119 --> 00:46:40,130
+bit, right? So if p_a is the argmax and then we just jiggle the probability
+
+607
+00:46:40,130 --> 00:46:44,869
+distribution just a little bit, then p_a will still be the argmax, so we'll still
+
+608
+00:46:44,869 --> 00:46:49,400
+select the same vector from the input. Which means that actually the derivative
+
+609
+00:46:49,400 --> 00:46:53,990
+of this vector z with respect to our predicted probabilities is going to be 0
+
+610
+00:46:53,989 --> 00:46:58,689
+almost everywhere. So that's very bad, because now we can't really use
+
+611
+00:46:58,690 --> 00:47:02,970
+backpropagation anymore to train this thing. So it turns out that they propose
+
+612
+00:47:02,969 --> 00:47:06,549
+another method, based on reinforcement learning, to actually train the model in
+
+613
+00:47:06,550 --> 00:47:12,710
+this context where you want to select a single element. But that's a little bit
+
+614
+00:47:12,710 --> 00:47:16,260
+more complex, so we're not gonna talk about that in this lecture; just be
+
+615
+00:47:16,260 --> 00:47:18,900
+aware that that is something that you'll see: the difference between soft
+
+616
+00:47:18,900 --> 00:47:26,010
+attention and hard attention, where you actually pick one.
+
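A tiny numpy demonstration of the contrast just described: the soft context vector moves smoothly when the distribution p is jiggled, while the hard (argmax) selection does not change at all, which is the zero-gradient problem that pushes the hard model toward reinforcement learning:

~~~python
import numpy as np

grid = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # L=3 feature vectors
p = np.array([0.6, 0.3, 0.1])                          # attention distribution

z_soft = p @ grid              # soft: weighted sum, smooth in p
z_hard = grid[np.argmax(p)]    # hard: pick the argmax element only

# Jiggle p a little: the soft vector changes, the hard one does not,
# so d(z_hard)/dp is zero almost everywhere.
p2 = np.array([0.58, 0.32, 0.10])
print(z_soft, p2 @ grid)              # slightly different
print(z_hard, grid[np.argmax(p2)])    # identical
~~~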
+617
+00:47:26,010 --> 00:47:30,450
+So now we can look at some pretty results from this model. Since we're actually generating a
+
+618
+00:47:30,449 --> 00:47:34,480
+probability distribution over grid locations at every time step, we can
+
+619
+00:47:34,480 --> 00:47:38,519
+visualize that probability distribution as we generate each word of our
+
+620
+00:47:38,519 --> 00:47:44,039
+generated caption. So for this input image that shows a bird, they ran both
+
+621
+00:47:44,039 --> 00:47:48,279
+their hard attention model and their soft attention model, and in this case both
+
+622
+00:47:48,280 --> 00:47:51,650
+produced the caption 'a bird flying over a body of water', period.
+
+623
+00:47:51,650 --> 00:47:57,090
+And for these two models, they visualize what that probability distribution looks
+
+624
+00:47:57,090 --> 00:48:01,690
+like for these two different models. So the top shows the soft attention, and you can
+
+625
+00:48:01,690 --> 00:48:04,849
+see that it's sort of diffuse, since it's averaging probabilities from every
+
+626
+00:48:04,849 --> 00:48:09,309
+location in the image; and the bottom is just showing the one single element
+
+627
+00:48:09,309 --> 00:48:16,289
+that it pulled out. And it actually has quite nice semantic meaning: you
+
+628
+00:48:16,289 --> 00:48:19,779
+can see that, when the model, especially the soft attention on the top,
+
+629
+00:48:19,780 --> 00:48:23,340
+I think these are very nice results, that when it's talking about the bird and talking
+
+630
+00:48:23,340 --> 00:48:26,610
+about flying, it sort of focuses right on the bird, and then when it's talking
+
+631
+00:48:26,610 --> 00:48:30,820
+about the water, it kinda focuses on everything else. So another thing to
+
+632
+00:48:30,820 --> 00:48:34,269
+point out is that it didn't receive any supervision at training time for which
+
+633
+00:48:34,269 --> 00:48:38,869
+parts of the image it should be attending to; it just made up its own mind to
+
+634
+00:48:38,869 --> 00:48:43,289
+attend to those parts based on whatever would help it caption things better. And
+
+635
+00:48:43,289 --> 00:48:46,480
+it's pretty cool that we actually get these interpretable results just out of
+
+636
+00:48:46,480 --> 00:48:51,920
+this captioning task. We can look at a couple other results, because
+
+637
+00:48:51,920 --> 00:48:56,340
+they're fun. We can see that, when we have the woman throwing the
+
+638
+00:48:56,340 --> 00:49:01,079
+frisbee in the park, at 'frisbee' it's looking at the frisbee; talking about the dog, it actually recognizes the
+
+639
+00:49:01,079 --> 00:49:05,259
+dog. And especially interesting is this guy in the bottom right: when it
+
+640
+00:49:05,260 --> 00:49:08,790
+generates the word 'trees', it's actually focusing on all the stuff in the
+
+641
+00:49:08,789 --> 00:49:13,440
+background and not just the giraffe. And again, these are just coming out with no
+
+642
+00:49:13,440 --> 00:49:22,179
+supervision, just based on the captioning task. Question? Yes, so the
+
+643
+00:49:22,179 --> 00:49:27,440
+question is, when would you prefer hard versus soft attention. So there are,
+
+644
+00:49:27,440 --> 00:49:31,380
+I think, sort of two motivations that people usually give for wanting to even
+
+645
+00:49:31,380 --> 00:49:33,530
+do attention at all in the first place.
+
+646
+00:49:33,530 --> 00:49:37,580
+One of those is just to give nice interpretable outputs, and I think you get
+
+647
+00:49:37,579 --> 00:49:42,710
+nice interpretable outputs in either case, at least theoretically, maybe the
+
+648
+00:49:42,710 --> 00:49:46,130
+hard attention figure here wasn't quite as pretty. But the other motivation for
+
+649
+00:49:46,130 --> 00:49:49,970
+using attention is to relieve computational burden: especially when you
+
+650
+00:49:49,969 --> 00:49:54,989
+have a very, very large input, it might be computationally expensive to actually
+
+651
+00:49:54,989 --> 00:49:58,619
+process that whole input on every time step, and it might be more efficient
+
+652
+00:49:58,619 --> 00:50:02,869
+computationally if we can just focus on one part of the input at each time step
+
+653
+00:50:02,869 --> 00:50:07,380
+and only process a small subset of it. So with soft attention, because
+
+654
+00:50:07,380 --> 00:50:10,730
+we're doing this sort of averaging over all positions, we don't get any
+
+655
+00:50:10,730 --> 00:50:14,369
+computational savings, we're still processing the whole input on every time
+
+656
+00:50:14,369 --> 00:50:17,799
+step; but with hard attention, we actually do get a computational savings,
+
+657
+00:50:17,800 --> 00:50:22,680
+since we're explicitly picking out some small subset of the input. So I
+
+658
+00:50:22,679 --> 00:50:26,289
+think that's the big benefit; also, hard attention takes reinforcement
+
+659
+00:50:26,289 --> 00:50:41,420
+learning and it's fancier, makes you look smarter. That kind of question, yeah. So
+
+660
+00:50:41,420 --> 00:50:46,150
+the question is, how does this work at all, and I think the answer is it's
+
+661
+00:50:46,150 --> 00:50:49,789
+really learning sort of correlation structure in the input, right? It's
+
+662
+00:50:49,789 --> 00:50:54,779
+seen many examples of images with dogs, and it's seen many sentences with dogs, but
+
+663
+00:50:54,780 --> 00:50:57,480
+for those different images with dogs, the dogs tend to appear in different
+
+664
+00:50:57,480 --> 00:51:01,349
+positions in the input, and I guess it turns out, through the optimization
+
+665
+00:51:01,349 --> 00:51:05,659
+procedure, that actually putting more weight on the places where the dog
+
+666
+00:51:05,659 --> 00:51:10,399
+actually exists helps the captioning task in some way. So I don't
+
+667
+00:51:10,400 --> 00:51:14,460
+think there's a very good answer, it just happens to work. Also, I'm
+
+668
+00:51:14,460 --> 00:51:18,500
+not sure, so obviously these are figures from a
+
+669
+00:51:18,500 --> 00:51:23,300
+paper, not like random results, so I'm not sure how well it works on random images.
+
+670
+00:51:23,300 --> 00:51:31,870
+But another thing to really point out about this, especially this model of soft
+
+671
+00:51:31,869 --> 00:51:35,739
+attention, is that it's sort of constrained to this fixed grid from the
+
+672
+00:51:35,739 --> 00:51:41,199
+convolutional feature map. Like, we're getting these nice diffuse-
+
+673
+00:51:41,199 --> 00:51:44,449
+looking things, but those are just sort of blurring out this
+
+674
+00:51:44,449 --> 00:51:48,210
+distribution, and the model does not really have the capacity to look at
+
+675
+00:51:48,210 --> 00:51:52,220
+arbitrary regions of the input; it's only allowed to look at these fixed grid
+
+676
+00:51:52,219 --> 00:51:55,959
+regions.
+
+677
+00:51:55,960 --> 00:51:59,690
+I should also point out that this idea of soft attention was not really
+
+678
+00:51:59,690 --> 00:52:04,789
+introduced in this paper: I think the first paper that really had this notion
+
+679
+00:52:04,789 --> 00:52:09,159
+of soft attention came from machine translation. So here it's a similar
+
+680
+00:52:09,159 --> 00:52:13,299
+motivation: we want to take some input sentence, here in Spanish, and then
+
+681
+00:52:13,300 --> 00:52:17,960
+produce an output sentence in English, and this would be done with a recurrent
+
+682
+00:52:17,960 --> 00:52:22,179
+neural network sequence-to-sequence model, where we would first read in our
+
+683
+00:52:22,179 --> 00:52:26,588
+input sentence with a recurrent network and then generate an output sequence,
+
+684
+00:52:26,588 --> 00:52:29,269
+very similar to what we would do in captioning.
+
+685
+00:52:29,269 --> 00:52:33,119
+But in this paper they wanted to actually have attention over the input
+
+686
+00:52:33,119 --> 00:52:38,599
+sentence as they generated their output. So the exact mechanism is a little bit
+
+687
+00:52:38,599 --> 00:52:43,080
+different, but the intuition is the same: now, when we generate this first
+
+688
+00:52:43,079 --> 00:52:47,469
+word, 'my', we want to compute our distribution not over
+
+689
+00:52:47,469 --> 00:52:52,000
+regions in an image, but instead over words in the input sentence. So we're
+
+690
+00:52:52,000 --> 00:52:55,289
+gonna get a distribution that hopefully will focus on the first word in the Spanish
+
+691
+00:52:55,289 --> 00:52:59,170
+sentence, and then we'll take some features from each word, then reweight
+
+692
+00:52:59,170 --> 00:53:03,780
+them and feed them back in at the next time step, and this process would repeat
+
+693
+00:53:03,780 --> 00:53:08,820
+at every time step of the network. So this idea of soft attention is very
+
+694
+00:53:08,820 --> 00:53:12,230
+easily applicable not only to image captioning but also to machine
+
+695
+00:53:12,230 --> 00:53:18,990
+translation. Question? The question is, how do you do this for
+
+696
+00:53:18,989 --> 00:53:23,409
+variable-length sentences, and that's something I glossed
+
+697
+00:53:23,409 --> 00:53:26,980
+over a little bit, but the idea is you use what's called content-based
+
+698
+00:53:26,980 --> 00:53:31,559
+addressing. So for the image captioning, we know ahead of time that there is
+
+699
+00:53:31,559 --> 00:53:35,579
+this fixed, maybe seven by seven, grid, so we just produce a probability
+
+700
+00:53:35,579 --> 00:53:40,440
+distribution directly. Instead, in this
+
+701
+00:53:40,440 --> 00:53:45,320
+model, as the encoder reads the input sentence, it's producing some vector that
+
+702
+00:53:45,320 --> 00:53:49,300
+encodes each word in the input sentence. So now, in the decoder, instead
+
+703
+00:53:49,300 --> 00:53:52,900
+of directly producing a probability distribution, it's
+
+704
+00:53:52,900 --> 00:53:57,000
+going to spit out sort of a vector that will get dot-producted with each of
+
+705
+00:53:57,000 --> 00:54:02,159
+those encoded vectors in the input, and then those dot products get
+
+706
+00:54:02,159 --> 00:54:06,940
+renormalized and converted to a distribution.
+
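The content-based addressing just described fits in a few lines; a minimal sketch, where `query` stands in for the vector the decoder emits and `encodings` for the per-word encoder outputs, so it works for any sentence length:

~~~python
import numpy as np

def content_based_attention(query, encodings):
    """Dot the decoder's query vector with the encoding of every input
    word, then normalize the scores into a distribution with a
    softmax."""
    scores = encodings @ query            # (T,) one score per input word
    e = np.exp(scores - scores.max())
    return e / e.sum()

enc = np.random.randn(5, 64)   # encodings of a 5-word input sentence
q = np.random.randn(64)        # query produced by the decoder this step
p = content_based_attention(q, enc)
print(p.shape, p.sum())        # (5,) ~1.0
~~~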
+711
+00:54:22,360 --> 00:54:24,230
+There have been a couple papers
+that actually want to do
+
+712
+00:54:24,230 --> 00:54:28,179
+speech transcription, where they read in
+an audio signal and then output the
+
+713
+00:54:28,179 --> 00:54:32,589
+words in English; there have been a
+couple papers that use soft attention
+
+714
+00:54:32,590 --> 00:54:37,130
+over the input audio sequence to help
+with that task. There's been at
+
+715
+00:54:37,130 --> 00:54:41,300
+least one paper on using soft attention
+for video captioning, so here you read in
+
+716
+00:54:41,300 --> 00:54:45,260
+some sequence of frames and then you
+output some sequence of words, and you
+
+717
+00:54:45,260 --> 00:54:49,110
+want to have attention over the
+frames of the input sequence as you're
+
+718
+00:54:49,110 --> 00:54:53,050
+generating your caption. You can see
+that maybe for this little video
+
+719
+00:54:53,050 --> 00:54:57,240
+sequence they output "someone is trying
+to fish in a pot", and when they generate
+
+720
+00:54:57,239 --> 00:55:01,169
+the word "someone" it actually attends much
+more to the second frame in the video
+
+721
+00:55:01,170 --> 00:55:05,590
+sequence, and when it generates the word
+"frying" it attends much more to the last
+
+722
+00:55:05,590 --> 00:55:11,480
+element in the video sequence. There have
+also been a couple papers for the task
+
+723
+00:55:11,480 --> 00:55:16,059
+of question answering. So here the
+setup is that you read in a natural
+
+724
+00:55:16,059 --> 00:55:20,590
+language question and you also read in
+an image, and the model needs to
+
+725
+00:55:20,590 --> 00:55:22,870
+produce an answer about that question,
+
+726
+00:55:22,869 --> 00:55:28,139
+to produce the answer to that question in
+natural language. And there have been a
+
+727
+00:55:28,139 --> 00:55:31,869
+couple papers that explore the idea of
+spatial attention over the image in
+
+728
+00:55:31,869 --> 00:55:35,420
+order to help with this problem of
+question answering. Another thing to
+
+729
+00:55:35,420 --> 00:55:38,860
+point out is that some of these papers
+have great names: so there was
+
+730
+00:55:38,860 --> 00:55:43,000
+Show and Tell, there was Show, Attend
+and Tell, there was Listen, Attend and
+
+731
+00:55:43,000 --> 00:55:45,039
+Spell,
+
+732
+00:55:45,039 --> 00:55:49,999
+and this one is Ask, Attend and
+Answer. So I really enjoy the
+
+733
+00:55:49,998 --> 00:55:56,808
+creativity with naming in this
+line of work. And this idea of soft
+
+734
+00:55:56,809 --> 00:55:59,910
+attention is pretty easy to implement, so
+a lot of people have just applied it to
+
+735
+00:55:59,909 --> 00:56:05,899
+tons of tasks. But remember we saw this
+problem with this sort of implementation
+
+736
+00:56:05,900 --> 00:56:09,709
+of soft attention, and that's that we
+cannot attend to arbitrary regions in
+
+737
+00:56:09,708 --> 00:56:14,038
+the input; instead we're constrained and
+can only attend to this fixed grid given
+
+738
+00:56:14,039 --> 00:56:18,699
+by the convolutional feature map. So the
+question is whether we can overcome this
+
+739
+00:56:18,699 --> 00:56:23,559
+restriction and still attend
+to arbitrary input regions somehow, in a
+
+740
+00:56:23,559 --> 00:56:28,089
+different way. And I think a
+
+741
+00:56:28,088 --> 00:56:32,900
+precursor to this type of work is this
+paper from Alex Graves back in 2013. So
+
+742
+00:56:32,900 --> 00:56:38,249
+here he wanted to read as input a natural
+language sentence and then generate as
+
+743
+00:56:38,248 --> 00:56:43,598
+output actually an image that would be
+handwriting, like writing out
+
+744
+00:56:43,599 --> 00:56:48,528
+that sentence in handwriting. And
+the way that he actually has attention
+
+745
+00:56:48,528 --> 00:56:53,418
+over this output image is kind of a cool
+way where now he's actually predicting
+
+746
+00:56:53,418 --> 00:56:57,608
+the parameters of some Gaussian mixture
+model over the output image, and then
+
+747
+00:56:57,608 --> 00:57:02,739
+uses that to actually attend to
+arbitrary parts of the output image, and
+
+748
+00:57:02,739 --> 00:57:07,028
+this actually works really, really well.
+So on the right, some of these are
+
+749
+00:57:07,028 --> 00:57:12,259
+actually written by people and the rest
+of them were written by his
+
+750
+00:57:12,259 --> 00:57:16,269
+network. So can you tell the difference
+between the generated and the real
+
+751
+00:57:16,268 --> 00:57:24,418
+handwriting? I couldn't. It turns out
+that the top one is real and the
+
+752
+00:57:24,418 --> 00:57:31,049
+bottom four are all generated by the
+network.
+
+753
+00:57:31,050 --> 00:57:35,580
+Yeah, maybe the real ones have more
+variance between the letters or
+
+754
+00:57:35,579 --> 00:57:39,380
+something like that. But these results
+work really well, and actually he has an
+
+755
+00:57:39,380 --> 00:57:42,820
+online demo that you can go and try
+that runs in your browser: you can just
+
+756
+00:57:42,820 --> 00:57:46,800
+type in words and it will generate the
+handwriting for you, which is kind of fun.
+
+757
+00:57:46,800 --> 00:57:52,840
+Another paper that we saw
+already is DRAW, which sort of takes
+
+758
+00:57:52,840 --> 00:57:56,500
+this idea of arbitrary attention
+and then extends it to a couple more
+
+759
+00:57:56,500 --> 00:58:01,050
+real-world problems, not just handwriting
+generation. So one task they consider is
+
+760
+00:58:01,050 --> 00:58:05,960
+image classification: here we want to
+classify these digits, but in the process
+
+761
+00:58:05,960 --> 00:58:09,920
+of classifying we're actually going to
+attend to arbitrary regions of the input
+
+762
+00:58:09,920 --> 00:58:14,639
+image in order to help with this
+classification task. So this is
+
+763
+00:58:14,639 --> 00:58:17,909
+kind of cool: it sort of learns on its
+own that it needs to attend to these
+
+764
+00:58:17,909 --> 00:58:22,710
+digits in order to help with image
+classification. And with DRAW they also
+
+765
+00:58:22,710 --> 00:58:27,849
+consider the idea of generating
+arbitrary output images, with a similar
+
+766
+00:58:27,849 --> 00:58:31,589
+sort of motivation as the handwriting
+generation, where we're going to have
+
+767
+00:58:31,590 --> 00:58:35,740
+arbitrary attention over the output
+image and just generate this output
+
+768
+00:58:35,739 --> 00:58:42,589
+bit by bit. And I think we saw this video
+before, but it's really cool: this is
+
+769
+00:58:42,590 --> 00:58:47,190
+the DRAW network from DeepMind. So you
+can see that here we're
+
+770
+00:58:47,190 --> 00:58:51,200
+doing a classification task; it sort of
+learns to attend to arbitrary regions in
+
+771
+00:58:51,199 --> 00:58:55,439
+the input, and when we generate, we're
+going to attend to arbitrary regions in
+
+772
+00:58:55,440 --> 00:58:59,579
+the output to actually generate these
+digits. So it can generate multiple
+
+773
+00:58:59,579 --> 00:59:04,000
+digits at a time, and it can actually
+generate these house numbers.
+
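The transcript describes Graves' model as predicting the parameters of a mixture of Gaussians and using them to attend. Below is a 1D sketch in that spirit: predicted mixture weights, widths, and locations define soft, real-valued attention weights over a sequence of positions. Parameter names loosely follow the 2013 paper, but this is an illustrative sketch, not the paper's exact formulation.

~~~python
import numpy as np

def gaussian_window(alphas, betas, kappas, seq_len):
    """Mixture-of-Gaussians attention window over positions 0..seq_len-1.
    alphas: (K,) mixture weights, betas: (K,) widths, kappas: (K,) locations,
    all assumed to be predicted by the network at each output step.
    phi[t] = sum_k alpha_k * exp(-beta_k * (kappa_k - t)^2)
    """
    u = np.arange(seq_len)
    phi = (alphas[:, None] *
           np.exp(-betas[:, None] * (kappas[:, None] - u) ** 2)).sum(axis=0)
    return phi                     # (seq_len,) soft weights, not tied to a grid

# Example: two components attending near positions 3 and 10
w = gaussian_window(np.array([0.7, 0.3]), np.array([0.5, 0.5]),
                    np.array([3.0, 10.0]), seq_len=15)
~~~

Because the locations `kappas` are real-valued, the attended region can sit anywhere and slide continuously, unlike the fixed conv-grid attention discussed earlier.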
+774
+00:59:04,000 --> 00:59:10,639
+So this is really cool, and
+as you can see, the region it
+
+775
+00:59:10,639 --> 00:59:13,920
+was attending to was actually growing
+and shrinking over time and sort of
+
+776
+00:59:13,920 --> 00:59:17,430
+moving continuously over the image, and
+it was definitely not constrained to a
+
+777
+00:59:17,429 --> 00:59:21,690
+fixed grid like we saw with Show,
+Attend and Tell. The way that this
+
+778
+00:59:21,690 --> 00:59:26,840
+paper works is a little bit
+weird, and some follow-up work from Deep-
+
+779
+00:59:26,840 --> 00:59:34,260
+Mind I think was actually more clear.
+Why is the sky... oh, is my focus ok? All
+
+780
+00:59:34,260 --> 00:59:38,630
+right, so there's this follow-up paper
+that uses a very similar
+
+781
+00:59:38,630 --> 00:59:43,500
+attention mechanism, called spatial
+transformer networks, but I think it's much
+
+782
+00:59:43,500 --> 00:59:44,500
+easier to understand
+
+783
+00:59:44,500 --> 00:59:49,039
+and presented in a very clean way. So
+the idea is that we have this
+
+784
+00:59:49,039 --> 00:59:53,369
+input image, our favorite bird, and
+then we want to have this sort of
+
+785
+00:59:53,369 --> 00:59:57,589
+continuous set of variables telling us
+where we want to attend. You might
+
+786
+00:59:57,590 --> 01:00:01,579
+imagine that we have the corner or the
+center and width and height of some box
+
+787
+01:00:01,579 --> 01:00:06,170
+over the region we want to attend to, and
+then we want to have some function that
+
+788
+01:00:06,170 --> 01:00:10,240
+takes our input image and these
+continuous attention coordinates and
+
+789
+01:00:10,239 --> 01:00:14,919
+then produces some fixed-size output, and
+we want to be able to do this in a
+
+790
+01:00:14,920 --> 01:00:21,840
+differentiable way. So this seems
+kind of hard, right? You can imagine that at
+
+791
+01:00:21,840 --> 01:00:26,250
+least with the idea of cropping,
+these inputs cannot really be continuous:
+
+792
+01:00:26,250 --> 01:00:30,590
+they need to be sort of pixel values, so
+they're constrained to integers, and it's not
+
+793
+01:00:30,590 --> 01:00:34,550
+really clear exactly how we can make
+this function continuous or differentiable.
+
+794
+01:00:34,550 --> 01:00:39,210
+And they actually came up with a very
+nice solution, and the idea is that we're
+
+795
+01:00:39,210 --> 01:00:44,679
+going to write down a parametrized function
+that will map from coordinates of pixels
+
+796
+01:00:44,679 --> 01:00:50,469
+in the output to coordinates of pixels
+in the input. So here we're going to say
+
+797
+01:00:50,469 --> 01:00:54,839
+that this upper right-hand pixel
+in the output has
+
+798
+01:00:54,840 --> 01:00:59,700
+the coordinates (xt, yt) in the output, and
+we're going to compute the
+
+799
+01:00:59,699 --> 01:01:04,480
+corresponding coordinates (xs, ys) in
+the input image using this parametrized
+
+800
+01:01:04,480 --> 01:01:08,900
+affine function. So that's a nice
+differentiable function that we can
+
+801
+01:01:08,900 --> 01:01:13,349
+differentiate with respect to the
+affine transform coordinates. Then we can
+
+802
+01:01:13,349 --> 01:01:17,059
+repeat this process again for, maybe,
+the upper left-hand pixel in the
+
+803
+01:01:17,059 --> 01:01:21,219
+output image: we can use this parametrized
+function to map to the coordinates of
+
+804
+01:01:21,219 --> 01:01:27,199
+that pixel in the input. Now we can repeat
+this for all pixels in our output, which
+
+805
+01:01:27,199 --> 01:01:31,689
+gives us something called a sampling
+grid.
+
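A sketch of the parametrized affine mapping just described: for every output pixel, a 2x3 matrix theta gives the source coordinates in the input. Normalized [-1, 1] coordinates are a common convention and an assumption here, not something the transcript specifies.

~~~python
import numpy as np

def affine_sampling_grid(theta, out_h, out_w):
    """For each output pixel (xt, yt), compute input coordinates (xs, ys)
    via a differentiable affine map. theta: (2, 3) transform parameters.
    Returns the (out_h, out_w, 2) sampling grid."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, out_h),
                         np.linspace(-1, 1, out_w), indexing="ij")
    ones = np.ones_like(xs)
    coords = np.stack([xs, ys, ones], axis=-1)   # homogeneous (x, y, 1)
    return coords @ theta.T                      # (out_h, out_w, 2)

# Identity transform: every output pixel samples its own location
theta = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
grid = affine_sampling_grid(theta, 7, 7)
~~~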
+806
+01:01:31,690 --> 01:01:36,480
+So the idea is that this will be
+our output image, and then for each pixel
+
+807
+01:01:36,480 --> 01:01:41,610
+in the output, the sampling grid tells us
+where in the input that pixel should
+
+808
+01:01:41,610 --> 01:01:47,590
+come from. How many of you have taken a
+computer graphics course? Not many. So
+
+809
+01:01:47,590 --> 01:01:52,510
+this looks kind of like texture
+mapping, doesn't it? They take this
+
+810
+01:01:52,510 --> 01:01:56,300
+idea from texture mapping in computer
+graphics and just use bilinear
+
+811
+01:01:56,300 --> 01:01:57,720
+interpolation to compute the output once
+we have the sampling grid.
+
+812
+01:01:57,719 --> 01:02:02,669
+So now that we have this,
+this allows our network to actually
+
+813
+01:02:02,670 --> 01:02:07,450
+attend to arbitrary parts of the input
+in a nice differentiable way, where our
+
+814
+01:02:07,449 --> 01:02:11,789
+network will now just predict these
+transform coordinates theta, and that
+
+815
+01:02:11,789 --> 01:02:16,639
+will allow the whole thing to attend to
+arbitrary regions of the input image. So
+
+816
+01:02:16,639 --> 01:02:20,199
+they put this thing all together into a
+nice little self-contained module that
+
+817
+01:02:20,199 --> 01:02:24,608
+they call a spatial transformer. So the
+spatial transformer receives some input,
+
+818
+01:02:24,608 --> 01:02:29,679
+which you can think of as our raw
+input image, and then actually runs this
+
+819
+01:02:29,679 --> 01:02:33,949
+small localization network, which could
+be a small fully connected network or a
+
+820
+01:02:33,949 --> 01:02:38,409
+very shallow convolutional network, and
+this localization network will
+
+821
+01:02:38,409 --> 01:02:44,500
+actually produce as output these affine
+transform coordinates theta. Now these
+
+822
+01:02:44,500 --> 01:02:48,829
+affine transform coordinates will be
+used to compute a sampling grid: now
+
+823
+01:02:48,829 --> 01:02:51,750
+that we've predicted this affine
+transform from the localization
+
+824
+01:02:51,750 --> 01:02:56,280
+network, we map the coordinates of
+each pixel in the
+
+825
+01:02:56,280 --> 01:03:02,280
+output back to the input, and this is a
+nice, smooth, differentiable function. Now
+
+826
+01:03:02,280 --> 01:03:06,230
+once we have the sampling grid, we can
+just apply bilinear interpolation to
+
+827
+01:03:06,230 --> 01:03:11,309
+compute the values of the pixels in the
+output. And if you think about
+
+828
+01:03:11,309 --> 01:03:15,588
+what this thing is doing, it's clear that
+every single part of this network is, one,
+
+829
+01:03:15,588 --> 01:03:21,159
+continuous and, two, differentiable, so this
+thing can be trained jointly without any
+
+830
+01:03:21,159 --> 01:03:26,579
+crazy reinforcement learning stuff, which
+is quite nice. Although one sort of
+
+831
+01:03:26,579 --> 01:03:31,789
+caveat to know about bilinear sampling: if
+you know how bilinear sampling works, it
+
+832
+01:03:31,789 --> 01:03:36,449
+means that every pixel in the output is
+going to be a weighted average of four
+
+833
+01:03:36,449 --> 01:03:41,639
+pixels in the input, so those gradients
+are actually very local. So this is
+
+834
+01:03:41,639 --> 01:03:45,549
+continuous and differentiable, which is nice,
+but I don't think you get a whole lot of
+
+835
+01:03:45,550 --> 01:03:50,300
+gradient signal through the
+bilinear sampling.
+
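A sketch of the bilinear sampling step, matching the grid convention from the previous sketch: each output pixel is a weighted average of the four input pixels around its real-valued sample location, which is exactly why the gradients are so local.

~~~python
import numpy as np

def bilinear_sample(image, grid):
    """image: (H, W) grayscale input; grid: (H_out, W_out, 2) with
    normalized (xs, ys) in [-1, 1]. No batch/channel handling: a sketch."""
    h, w = image.shape
    # Map normalized coordinates to pixel coordinates
    x = (grid[..., 0] + 1) * 0.5 * (w - 1)
    y = (grid[..., 1] + 1) * 0.5 * (h - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2); x1 = x0 + 1
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2); y1 = y0 + 1
    wx, wy = x - x0, y - y0
    # Weighted average of the four neighboring input pixels
    return (image[y0, x0] * (1 - wx) * (1 - wy) +
            image[y0, x1] * wx * (1 - wy) +
            image[y1, x0] * (1 - wx) * wy +
            image[y1, x1] * wx * wy)
~~~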
+836
+01:03:50,300 --> 01:03:54,410
+But once you have this nice spatial
+transformer module, we can just insert it into
+
+837
+01:03:54,409 --> 01:03:58,739
+existing networks to sort of let them
+learn to attend to things. So they
+
+838
+01:03:58,739 --> 01:04:03,739
+consider this classification task, very
+similar to the DRAW paper, where they
+
+839
+01:04:03,739 --> 01:04:08,118
+want to classify these warped versions
+of MNIST digits. They actually
+
+840
+01:04:08,119 --> 01:04:09,519
+consider several other,
+
+841
+01:04:09,519 --> 01:04:13,610
+more complicated transforms, not just
+these affine transforms: you can also
+
+842
+01:04:13,610 --> 01:04:18,260
+imagine other mappings from your
+output pixels back to input pixels. On
+
+843
+01:04:18,260 --> 01:04:21,470
+the previous slide we showed an affine
+transform, but they also consider
+
+844
+01:04:21,469 --> 01:04:25,339
+projective transforms and also thin
+plate splines. The idea is you just
+
+845
+01:04:25,340 --> 01:04:28,970
+want some parametrized,
+differentiable function, and you could go
+
+846
+01:04:28,969 --> 01:04:34,829
+crazy with that part. So here on the left
+the network is just trying to classify
+
+847
+01:04:34,829 --> 01:04:38,380
+these digits that are warped: on the
+left we have different versions of
+
+848
+01:04:38,380 --> 01:04:43,340
+warped digits; this middle column is
+showing the different thin plate
+
+849
+01:04:43,340 --> 01:04:47,460
+splines that it's using to attend to a
+part of the image; and then the right
+
+850
+01:04:47,460 --> 01:04:51,590
+shows the output of the spatial
+transformer module, which has not only
+
+851
+01:04:51,590 --> 01:04:56,250
+attended to that region but also
+unwarped it, corresponding to those splines.
+
+852
+01:04:56,250 --> 01:05:01,730
+And on the right they're using an
+affine transform: the right is using an
+
+853
+01:05:01,730 --> 01:05:05,559
+affine transform, not thin plate splines.
+You can see that this is actually doing
+
+854
+01:05:05,559 --> 01:05:09,369
+more than just attending to the input; it's
+actually transforming the input as well.
+
+855
+01:05:09,369 --> 01:05:14,849
+So for example in this middle column,
+this is a four, but it's actually rotated
+
+856
+01:05:14,849 --> 01:05:19,069
+by something like ninety
+degrees, so by using this affine
+
+857
+01:05:19,070 --> 01:05:23,140
+transform the network can not only
+attend to the four but also rotate it into
+
+858
+01:05:23,139 --> 01:05:27,839
+the proper position for the downstream
+classification network. And this is all
+
+859
+01:05:27,840 --> 01:05:31,930
+very cool, and similar to the
+soft attention, we don't need
+
+860
+01:05:31,929 --> 01:05:35,949
+explicit supervision: it can just decide
+for itself where it wants to attend in
+
+861
+01:05:35,949 --> 01:05:41,710
+order to solve problems. So these guys
+have a fancy video as well, which is very
+
+862
+01:05:41,710 --> 01:05:53,860
+impressive. So this is the transformer
+module that we just unpacked, and here
+
+863
+01:05:53,860 --> 01:05:58,930
+we're actually showing... right now this is
+actually running a classification task,
+
+864
+01:05:58,929 --> 01:06:03,389
+but we're varying the input continuously.
+You can see that for these different inputs
+
+865
+01:06:03,389 --> 01:06:08,429
+the network learns to attend to the
+digit, and then actually canonicalizes that
+
+866
+01:06:08,429 --> 01:06:13,169
+digit to sort of a fixed, known pose.
+
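Putting the pieces together, here is how the self-contained module described above might compose, reusing the two earlier sketches (`affine_sampling_grid` and `bilinear_sample`); the tiny localization function below is a hypothetical stand-in for the small fully connected or shallow conv network mentioned in the lecture.

~~~python
import numpy as np

def spatial_transformer(image, localization_fn, out_hw=(28, 28)):
    """Localization net predicts theta; theta defines a sampling grid;
    bilinear sampling produces a fixed-size output. Every step is
    differentiable, so the module trains jointly with the host network.
    Assumes affine_sampling_grid and bilinear_sample from the earlier
    sketches are in scope."""
    theta = localization_fn(image)                  # (2, 3) affine parameters
    grid = affine_sampling_grid(theta, *out_hw)     # where to sample from
    return bilinear_sample(image, grid)             # differentiable "crop"

# A trivial stand-in "localization net" that always predicts identity
identity_loc = lambda img: np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
out = spatial_transformer(np.random.rand(64, 64), identity_loc)
~~~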
+867
+01:06:13,170 --> 01:06:18,500
+And as we vary that input and move it around
+the image, the network still does a good
+job of locking onto the digit. And on the
+
+868
+01:06:18,500 --> 01:06:23,059
+right you can see that sometimes it can
+fix rotations as well, right? So
+
+869
+01:06:23,059 --> 01:06:26,809
+here on the left we're actually rotating
+that digit, and the network actually
+
+870
+01:06:26,809 --> 01:06:31,619
+learns to un-rotate the digit and
+canonicalize the pose. And again, both
+
+871
+01:06:31,619 --> 01:06:36,420
+with affine transforms or thin plate
+splines. This is using even crazier
+
+872
+01:06:36,420 --> 01:06:40,389
+warping with projective transforms; you
+can see that it does a really good job
+
+873
+01:06:40,389 --> 01:06:48,099
+of learning to attend and also to
+unwarp. And they do quite a lot of other
+
+874
+01:06:48,099 --> 01:06:52,829
+experiments: instead of classification,
+they learn to add together two warped
+
+875
+01:06:52,829 --> 01:06:58,369
+digits, which is a kind of weird task, but
+it works. So their network receives
+
+876
+01:06:58,369 --> 01:07:05,389
+two input images and outputs
+the sum, and it learns, even though
+
+877
+01:07:05,389 --> 01:07:08,679
+this is kind of a weird task, that it needs
+to attend to and unwarp
+
+878
+01:07:08,679 --> 01:07:15,659
+those images. So this is during
+optimization. Right, then there's a task
+
+879
+01:07:15,659 --> 01:07:20,009
+called co-localization: the idea is that
+the network is going to receive two
+
+880
+01:07:20,010 --> 01:07:25,560
+images as input, maybe two different
+images of fours, and the task is to say
+
+881
+01:07:25,559 --> 01:07:31,179
+whether or not those images are
+the same or different, and then also,
+
+882
+01:07:31,179 --> 01:07:34,750
+using spatial transformers, they end
+up learning to localize those things as
+
+883
+01:07:34,750 --> 01:07:38,139
+well. You can see that over the course
+of training it learns to actually
+
+884
+01:07:38,139 --> 01:07:42,239
+localize these things very, very
+precisely, even with clutter in the
+
+885
+01:07:42,239 --> 01:07:50,479
+image; these networks still learn to
+localize very, very precisely. So that's a
+
+886
+01:07:50,480 --> 01:07:58,280
+recent paper from DeepMind that is
+pretty cool.
+
+887
+01:07:58,280 --> 01:08:11,519
+So any other last-minute questions about
+spatial transformers? Yeah, so the
+
+888
+01:08:11,519 --> 01:08:13,989
+question is what is the
+task these things are
+
+889
+01:08:13,989 --> 01:08:17,420
+doing, and at least in the vanilla
+version it's just classification: it
+
+890
+01:08:17,420 --> 01:08:21,810
+receives this sort of input, which could
+be warped or cluttered or whatnot, and
+
+891
+01:08:21,810 --> 01:08:26,060
+all that it needs to do is classify that
+digit, and sort of in the process of
+
+892
+01:08:26,060 --> 01:08:29,839
+learning to classify, it also learns to
+attend to the correct part. So that's
+
+893
+01:08:29,838 --> 01:08:40,189
+a really cool feature of
+this work. Right, so sort of my overview
+
+894
+01:08:40,189 --> 01:08:44,588
+of attention is that we have soft
+attention, which is really easy to
+
+895
+01:08:44,588 --> 01:08:49,119
+implement, especially in this context of
+fixed input positions where we just
+
+896
+01:08:49,119 --> 01:08:53,039
+produce distributions over input
+positions; we reweight, and we feed those
+
+897
+01:08:53,039 --> 01:08:56,850
+vectors back into the network somehow, and
+this is really easy to implement in many
+
+898
+01:08:56,850 --> 01:08:59,930
+different contexts, and it has been
+implemented for a lot of different tasks.
+
+899
+01:08:59,930 --> 01:09:04,770
+When you want to attend to arbitrary
+regions, then you need to get a little
+
+900
+01:09:04,770 --> 01:09:09,130
+bit fancier, and I think spatial
+transformers are a very nice, elegant
+
+901
+01:09:09,130 --> 01:09:13,949
+way of attending to arbitrary regions of
+input images. There are a lot of papers
+
+902
+01:09:13,949 --> 01:09:17,889
+that actually work on hard attention, and
+this is quite a bit more challenging due
+
+903
+01:09:17,890 --> 01:09:21,579
+to this problem with the gradients, so
+hard attention papers typically use
+
+904
+01:09:21,579 --> 01:09:26,199
+reinforcement learning, and we didn't
+really talk about that today. So any
+
+905
+01:09:26,199 --> 01:09:39,429
+other questions about attention? Ok, sure.
+
+906
+01:09:39,429 --> 01:09:51,958
+The question is about the grid in the
+captioning model, before we got to
+
+907
+01:09:51,958 --> 01:09:56,649
+transformers: yeah, those attention
+maps are produced over this grid-
+
+908
+01:09:56,649 --> 01:10:01,299
+based thing, and in that network in
+particular I think it was a 14 by 14
+
+909
+01:10:01,300 --> 01:10:04,550
+grid, so it's actually quite a lot of
+locations, but it's still constrained to
+
+910
+01:10:04,550 --> 01:10:22,800
+that fixed grid. There were a couple of
+questions about interpolating between
+
+911
+01:10:22,800 --> 01:10:26,279
+soft attention and hard attention. So yeah,
+one thing you might imagine is you train
+
+912
+01:10:26,279 --> 01:10:29,929
+the network in a soft way, and then
+during training you kind of penalize it to make its
+
+913
+01:10:29,929 --> 01:10:32,949
+distribution sharper and sharper and
+sharper, and then at test time you just
+
+914
+01:10:32,948 --> 01:10:37,938
+switch over and use hard attention
+instead. I can't remember
+
+915
+01:10:37,939 --> 01:10:43,130
+which paper did that, but I'm pretty
+sure I've seen that idea somewhere. But
+
+916
+01:10:43,130 --> 01:10:46,099
+in practice I think training with hard
+attention tends to work better than the
+
+917
+01:10:46,099 --> 01:10:51,800
+sharpening approach, but it's definitely
+something you could try. Ok, if there are no
+
+918
+01:10:51,800 --> 01:10:54,179
+other questions, then I think we're done a
+couple minutes early today, so get your
+homework done.
+
diff --git a/captions/En/Lecture14_en.srt b/captions/En/Lecture14_en.srt
new file mode 100644
index 00000000..f35caf2e
--- /dev/null
+++ b/captions/En/Lecture14_en.srt
@@ -0,0 +1,5240 @@
+1
+00:00:00,000 --> 00:00:04,990
+Administrative: I think everyone should be
+done with assignment 3 by now; if you're not done, I
+
+2
+00:00:04,990 --> 00:00:07,790
+think you're late and you're in trouble.
+
+3
+00:00:07,790 --> 00:00:11,280
+Milestone grades will be out very soon;
+we're still going through them, and they
+
+4
+00:00:11,279 --> 00:00:13,779
+are basically, I think, done, but we have
+to double-check a few things. I will send
+
+5
+00:00:13,779 --> 00:00:14,199
+them out.
+
+6
+00:00:14,199 --> 00:00:18,820
+Ok, so in terms of reminding you where we
+are in the class: last time we looked very
+
+7
+00:00:18,820 --> 00:00:22,629
+briefly at segmentation, and we looked at
+some soft attention models. Soft attention
+
+8
+00:00:22,629 --> 00:00:25,829
+models are a way of selectively paying
+attention to different parts of the
+
+9
+00:00:25,829 --> 00:00:28,028
+image as you're processing it with
+something like a recurrent neural
+
+10
+00:00:28,028 --> 00:00:32,020
+network, so that you selectively pay
+attention to some parts of the scene and
+ +11 +00:00:32,020 --> 00:00:35,450 +enhance those features and will start +about special transformer which is this + +12 +00:00:35,450 --> 00:00:38,929 +very nice way of basically in a +different way cropping parts of an image + +13 +00:00:38,929 --> 00:00:43,769 +or some features either in a hydrogen or +in any kind of warped shape aren't in + +14 +00:00:43,770 --> 00:00:48,579 +like that so very interesting kind of PC +that you can slot internal network + +15 +00:00:48,579 --> 00:00:52,049 +architectures so today we'll talk about +videos + +16 +00:00:52,049 --> 00:00:56,229 +specifically now in image classification +you should be familiar by now whether + +17 +00:00:56,229 --> 00:00:59,390 +the basic combat set up you have an +image that comes in a reprocessing it to + +18 +00:00:59,390 --> 00:01:03,239 +for example classified in a case of +videos we won't have just a single image + +19 +00:01:03,238 --> 00:01:07,728 +but will have multiple frames so this is +an image of 32 by 32 will actually have + +20 +00:01:07,728 --> 00:01:13,829 +an entire video frames so 32 by 32 +mighty purty a sometime extent ok so + +21 +00:01:13,829 --> 00:01:17,340 +before I dive into how we approach these +problems with I'd like to talk about + +22 +00:01:17,340 --> 00:01:21,170 +very briefly about how we used to +address them for calm asking about using + +23 +00:01:21,170 --> 00:01:25,629 +pcr-based methods so some of the most +popular features right before coming to + +24 +00:01:25,629 --> 00:01:30,019 +work today became very popular where +these dense trajectory features + +25 +00:01:30,019 --> 00:01:34,140 +developed by hanging at all and I just +like to give you a brief taste of + +26 +00:01:34,140 --> 00:01:36,989 +exactly how these features worked +because it's kind of interesting and + +27 +00:01:36,989 --> 00:01:39,609 +they inspire some of the later +developments in terms of how come to + +28 +00:01:39,609 --> 00:01:43,429 +show that works actually operating +videos so in this trajectory is what we + +29 +00:01:43,430 --> 00:01:47,140 +do is we have this video the playing and +we're going to be detecting these key + +30 +00:01:47,140 --> 00:01:50,709 +points that are good to track in a video +and then we're going to be tracking them + +31 +00:01:50,709 --> 00:01:54,679 +and you end up with all these little +track let's that we actually track + +32 +00:01:54,680 --> 00:01:57,759 +across the video and then lots of +features about those track let's and + +33 +00:01:57,759 --> 00:02:01,868 +about the surrounding features that +accumulated just crimes so just to give + +34 +00:02:01,868 --> 00:02:06,549 +you an idea about how this worked there +are basically three steps roughly we + +35 +00:02:06,549 --> 00:02:10,868 +detect feature points at different +scales in the image I'll tell me briefly + +36 +00:02:10,868 --> 00:02:11,960 +about how that's done + +37 +00:02:11,960 --> 00:02:16,810 +then go to track those features over +time using optical flow methods optical + +38 +00:02:16,810 --> 00:02:20,270 +flow method solved explain very briefly +they basically give you a motion field + +39 +00:02:20,270 --> 00:02:23,800 +from one thing to another and they tell +you how the scene moved from one frame + +40 +00:02:23,800 --> 00:02:28,070 +to an extent Xtreme and then we're going +to extract a whole bunch of features but + +41 +00:02:28,069 --> 00:02:30,609 +importantly we're not just going to +extract those feature set fixed + +42 +00:02:30,610 --> 00:02:33,930 +positions in the image but we're +actually 
going to be struck me + +43 +00:02:33,930 --> 00:02:37,700 +speechless and the local coordinate +system every single track let and so + +44 +00:02:37,699 --> 00:02:41,869 +these histogram of greedy insist gotta +flows and and be resource we're going to + +45 +00:02:41,870 --> 00:02:45,610 +be extracted them in the coordinate +system off a track wit and so hard here + +46 +00:02:45,610 --> 00:02:49,200 +we saw histograms gradients and +two-dimensional images are basically + +47 +00:02:49,199 --> 00:02:51,750 +generalizations of that too + +48 +00:02:51,750 --> 00:02:54,780 +videos and so that's the kind of things +that people used to encode the + +49 +00:02:54,780 --> 00:03:01,009 +spatio-temporal bombings in terms of the +key point detection part there's been + +50 +00:03:01,009 --> 00:03:04,239 +quite a lot of work on exactly how to +detect good features and videos to track + +51 +00:03:04,240 --> 00:03:07,930 +and intuitively you don't want to track +a video that are too smooth because he + +52 +00:03:07,930 --> 00:03:11,580 +can't log onto any visual feature as +there are ways for basically getting a + +53 +00:03:11,580 --> 00:03:16,620 +set of points that are easy to track and +a video so there are some papers on this + +54 +00:03:16,620 --> 00:03:19,509 +so you detect a bunch of features like +this + +55 +00:03:19,509 --> 00:03:23,039 +optical flow algorithms on these videos + +56 +00:03:23,659 --> 00:03:28,060 +take a frame and a second frame and it +will solve for a motion field + +57 +00:03:28,060 --> 00:03:32,409 +displacement vector at every single +position in to where it traveled for how + +58 +00:03:32,409 --> 00:03:35,919 +the free moved as I hear some examples +of optical flow results + +59 +00:03:36,439 --> 00:03:42,270 +basically here every single pixel is +colored by a direction in which that + +60 +00:03:42,270 --> 00:03:46,260 +part of the image is currently moving +into video so for example this girl has + +61 +00:03:46,259 --> 00:03:49,939 +all yellow meaning that you probably +translating horizontally or something + +62 +00:03:49,939 --> 00:03:53,680 +like that the two most common methods +for using optical flow for computing it + +63 +00:03:53,680 --> 00:03:58,069 +at least me one of the most common ones +here as blocks from boxing Malik that's + +64 +00:03:58,069 --> 00:04:00,949 +the one that is kind of like a +defaulting to use so if you are + +65 +00:04:00,949 --> 00:04:03,399 +computing optical flow in your own +project I would encourage you to use + +66 +00:04:03,400 --> 00:04:08,950 +this large displacement optical flow +method so using this optical flow we + +67 +00:04:08,949 --> 00:04:12,199 +have all these key points using optical +flow we know also have the move as we + +68 +00:04:12,199 --> 00:04:15,859 +end up tracking these lil truckloads of +may be roughly fifteen frames at a time + +69 +00:04:15,860 --> 00:04:20,509 +so we end up with these half a second +roughly track lets through the video and + +70 +00:04:20,509 --> 00:04:21,519 +then we encode + +71 +00:04:21,519 --> 00:04:26,129 +regions around this track what's with +all these descriptors and then went to + +72 +00:04:26,129 --> 00:04:29,710 +accumulate all these visual Peterson two +histograms and people used to play with + +73 +00:04:29,709 --> 00:04:34,668 +different kinds of like how do you +exactly truncate video specially because + +74 +00:04:34,668 --> 00:04:37,359 +we're going to have a histogram an +independent histogram and every one of + +75 +00:04:37,360 --> 00:04:40,389 +these 
business and then we're going to +basically create all these histograms + +76 +00:04:40,389 --> 00:04:45,220 +urban with all these visual features and +all of this thing goes into an SVM and + +77 +00:04:45,220 --> 00:04:48,050 +what kind of the rock a layout in terms +of how people address these problems in + +78 +00:04:48,050 --> 00:04:55,720 +the past your truck just think of it as +is going to be fifteen frames and it's + +79 +00:04:55,720 --> 00:05:01,639 +just XY positions so a 15 X Y +coordinates the strangled and then we + +80 +00:05:01,639 --> 00:05:07,168 +extract in the local coordinate system +now in terms of how we actually approach + +81 +00:05:07,168 --> 00:05:13,859 +these problems with that works she never +called Alex net on the very first layer + +82 +00:05:13,860 --> 00:05:17,560 +will receive an image thatís for +example 227 227 by three and + +83 +00:05:17,560 --> 00:05:22,310 +reprocessing it with 96 filters that are +11 by 11 applied it's right for and so + +84 +00:05:22,310 --> 00:05:27,978 +we saw that with Alex net this results +in the 5555 by ninety six volume in + +85 +00:05:27,978 --> 00:05:30,468 +which we actually have all these +responses of all the filters at every + +86 +00:05:30,468 --> 00:05:34,788 +single spatial position so now what +would be a reasonable approach if you + +87 +00:05:34,788 --> 00:05:38,158 +wanted to generalize accomplish all that +work into a case we don't just have a + +88 +00:05:38,158 --> 00:05:42,579 +220 somebody turns 23 but you may be +happening frames that you like to encode + +89 +00:05:42,579 --> 00:05:47,278 +so you have an entire block of 227 227 +battery by 15 that's coming in to + +90 +00:05:47,278 --> 00:05:50,180 +accomplish all that work you're trying +to echo the both the spatial and + +91 +00:05:50,180 --> 00:05:54,209 +temporal patterns and inside this little +block of volume so would be like one + +92 +00:05:54,209 --> 00:05:57,379 +idea for how to change accomplish all +that work + +93 +00:05:57,379 --> 00:06:00,379 +generalize it to this case + +94 +00:06:03,899 --> 00:06:27,609 +and arrange them as like two blocks ok +that's interesting I would expect that + +95 +00:06:27,610 --> 00:06:33,870 +do not work very very well so the +problem with that is kind of interesting + +96 +00:06:33,870 --> 00:06:36,850 +basically all these neurons are looking +at only a single frame and then by the + +97 +00:06:36,850 --> 00:06:39,720 +end of the comment that you end up with +you on that are looking at a larger and + +98 +00:06:39,720 --> 00:06:43,310 +larger regions and your challenge so +eventually these neurons with see all of + +99 +00:06:43,310 --> 00:06:46,470 +your input but they would not be able to +very easily relate + +100 +00:06:47,589 --> 00:06:52,589 +like little special control patch in +this image so I'm not sure actually + +101 +00:06:52,589 --> 00:07:04,149 +really good idea did you turn them into +it I think so we'll get to some of those + +102 +00:07:04,149 --> 00:07:07,149 +that do something like that + +103 +00:07:09,930 --> 00:07:25,199 +take 45 channels effectively and you +could put a comment on that so that's + +104 +00:07:25,199 --> 00:07:28,919 +something that all get to I think you +could do that I don't think it's the + +105 +00:07:28,918 --> 00:07:44,049 +best idea as a yes so you're saying that +things in one slice of this time are you + +106 +00:07:44,050 --> 00:07:48,379 +want to extract similar kinds of +features in one time then a different + +107 +00:07:48,379 --> 00:07:48,990 +time + 
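The discussion that follows extends convolution filters along the time axis, as in the 11x11xT filters about to be described. As a concrete preview of what a filter with a small temporal extent does, here is a minimal space-time convolution; the shapes, the lack of padding and stride handling, and the single-channel, single-filter simplification are all illustrative assumptions.

~~~python
import numpy as np

def conv3d_single(volume, kernel):
    """A filter slides over space AND time, carving out an activation
    volume instead of an activation map.
    volume: (T_in, H, W), e.g. a stack of grayscale frames
    kernel: (t, k, k), a k x k spatial filter extended t frames in time
    """
    T, H, W = volume.shape
    t, k, _ = kernel.shape
    out = np.zeros((T - t + 1, H - k + 1, W - k + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                out[z, y, x] = np.sum(volume[z:z+t, y:y+k, x:x+k] * kernel)
    return out

# 15 frames of 32x32 input, one 3x11x11 space-time filter
act = conv3d_single(np.random.rand(15, 32, 32), np.random.rand(3, 11, 11))
~~~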
+108 +00:07:48,990 --> 00:07:52,829 +similar to the motivation of doing it +sharing with specially because peter is + +109 +00:07:52,829 --> 00:07:55,909 +here are useful down there as well so +you have the same kind of property where + +110 +00:07:55,910 --> 00:07:58,910 +you'd like to share weights and time not +only in space + +111 +00:07:59,689 --> 00:08:03,550 +ok so building on top of that idea of +the basic thing that people usually do + +112 +00:08:03,550 --> 00:08:06,400 +when they want to apply commercial +networks and videos as they extend these + +113 +00:08:06,399 --> 00:08:10,138 +filters not only to don't only have +filters in space but you also have these + +114 +00:08:10,139 --> 00:08:14,840 +filters and extend them small amounts in +time so before we have 11 Bielema + +115 +00:08:14,839 --> 00:08:15,750 +filters + +116 +00:08:15,750 --> 00:08:21,709 +1111 by tea filters where Tia some small +temporal extent so say for example we + +117 +00:08:21,709 --> 00:08:28,759 +can use a to up to 15 in this particular +case he was 30 2011 by three filters and + +118 +00:08:28,759 --> 00:08:33,979 +then by three because we have RGB and so +basically these filters are now you're + +119 +00:08:33,979 --> 00:08:36,969 +thinking of sliding filters not only in +space and carving out an entire + +120 +00:08:36,969 --> 00:08:40,469 +activation map but you're actually +sliding filters not only in space but + +121 +00:08:40,469 --> 00:08:44,450 +also in time and they have a small +finite temporal extent in time and you + +122 +00:08:44,450 --> 00:08:48,379 +end up carving out an entire activation +volume ok so you're introducing this + +123 +00:08:48,379 --> 00:08:51,909 +time to mention into all your kernels +and to all the are dying stages have an + +124 +00:08:51,909 --> 00:08:55,899 +additional time to mention along which +were performing the convolutions so + +125 +00:08:55,899 --> 00:08:59,659 +that's usually how people extract the +features and then you get this property + +126 +00:08:59,659 --> 00:09:04,009 +where safety is three here and so so +then when we do the spatial temporal + +127 +00:09:04,009 --> 00:09:07,230 +competition we end up with this +parameter sharing scheme going in time + +128 +00:09:07,230 --> 00:09:11,639 +as well as you mentioned so basically +what extent all the filters and time and + +129 +00:09:11,639 --> 00:09:14,360 +then we do convolutions not only in +space but also in time + +130 +00:09:14,360 --> 00:09:18,800 +wind up with activation volume +activation maps so some of these + +131 +00:09:18,799 --> 00:09:22,818 +approaches were proposed quite early on +for example one of the earlier ones + +132 +00:09:22,818 --> 00:09:28,238 +for activity recognition is maybe from +2010 so the idea here was that this is + +133 +00:09:28,239 --> 00:09:31,798 +just a couple of work but instead of +getting a single input of sixty by 40 + +134 +00:09:31,798 --> 00:09:36,108 +pics also we are getting in fact seven +frames of sixty by forty and then their + +135 +00:09:36,109 --> 00:09:40,119 +conclusions are three deconvolution as +we refer to them so these filters for + +136 +00:09:40,119 --> 00:09:44,220 +example might be sold by seven but now +by three as well as we end up with a 3d + +137 +00:09:44,220 --> 00:09:49,499 +calm and the three conditions are +applied at every single stage here + +138 +00:09:50,649 --> 00:09:55,208 +similar paper also from 2011 but the +same idea we have a block of friends + +139 +00:09:55,208 --> 00:09:59,518 +coming in and you promised them 
with 3d +completions three-dimensional filters at + +140 +00:09:59,519 --> 00:10:03,229 +every single point in this commercial +network so this isn't 2011 + +141 +00:10:04,948 --> 00:10:08,748 +very similar idea also so these are from +before actually Alex next these + +142 +00:10:08,749 --> 00:10:12,889 +approaches are kind of like smaller know +that work accomplished all that work so + +143 +00:10:12,889 --> 00:10:16,829 +the first kind of large-scale +application of this was from this + +144 +00:10:16,828 --> 00:10:19,828 +awesome paper in 2014 by capacity at all + +145 +00:10:20,830 --> 00:10:27,540 +this is for processing videos so the +model here on the very right that week + +146 +00:10:27,539 --> 00:10:31,159 +we called slow fusion that is the same +idea that I presented so far these are + +147 +00:10:31,159 --> 00:10:35,750 +three-dimensional competitions happening +in both space and time and so that's + +148 +00:10:35,750 --> 00:10:38,879 +slow fusion as we refer to it because +you're slowly using this temporal + +149 +00:10:38,879 --> 00:10:43,649 +information just as before we were +slowly using the spatial information now + +150 +00:10:43,649 --> 00:10:47,100 +there are other ways that you could also +why are up comedy show networks and just + +151 +00:10:47,100 --> 00:10:51,769 +to give you some context historically +this is Google research and Alex let's + +152 +00:10:51,769 --> 00:10:55,039 +just came out and everyone was super +excited because they work extremely well + +153 +00:10:55,039 --> 00:11:00,579 +images and I was in the video analysis +team at Google and we wanted to run on + +154 +00:11:00,580 --> 00:11:04,060 +the YouTube videos and but it was not +quite clear exactly how to generalize + +155 +00:11:04,059 --> 00:11:07,809 +you know commercial networks and then +just to videos so we explored several + +156 +00:11:07,809 --> 00:11:11,389 +kinds of architecture stuff how you can +actually wear this up so floats no + +157 +00:11:11,389 --> 00:11:17,889 +fusion as a 3d called kind of approach +early fusion is this idea that someone + +158 +00:11:17,889 --> 00:11:21,230 +described earlier where you take a chunk +of friends and just woke up need them + +159 +00:11:21,230 --> 00:11:25,430 +long channels you might end up with a +227 227 by like 45 + +160 +00:11:25,429 --> 00:11:29,500 +everything is just stocked up and you do +a single column over it so it's kind of + +161 +00:11:29,500 --> 00:11:35,200 +like your filters on the very first call +later have a large temporal extent but + +162 +00:11:35,200 --> 00:11:38,780 +from then on everything else is +two-dimensional competition in fact we + +163 +00:11:38,779 --> 00:11:42,139 +call it early because he refused the +temporal information very early on in + +164 +00:11:42,139 --> 00:11:45,879 +the very first letter from then on +everything just to call you can imagine + +165 +00:11:45,879 --> 00:11:49,490 +architecture is likely convolution so +here the ideas would take to Alex nets + +166 +00:11:49,490 --> 00:11:53,169 +we place them say ten things apart so +they both computed independently on + +167 +00:11:53,169 --> 00:11:57,169 +these 10 points apart and then we must +be much later in the fully connected + +168 +00:11:57,169 --> 00:12:00,620 +layers and then we had a single claim +baseline that is only looking at a + +169 +00:12:00,620 --> 00:12:03,830 +single frame of the video so you can +play with exactly how the white wire up + +170 +00:12:03,830 --> 00:12:08,440 +these models look Asian model you can +imagine 
that they've had three + +171 +00:12:08,440 --> 00:12:13,130 +dimensional colonels now the first layer +you can actually visualize them and + +172 +00:12:13,129 --> 00:12:16,210 +these are the kinds of features you end +up learning on videos these are + +173 +00:12:16,210 --> 00:12:18,990 +basically features that were familiar +with except they're moving because now + +174 +00:12:18,990 --> 00:12:22,680 +these filters are also extended a small +amount and time to have these little + +175 +00:12:22,679 --> 00:12:26,049 +moving blobs and some of them are static +and some of them are moving and they're + +176 +00:12:26,049 --> 00:12:30,729 +basically detecting motion on the very +first layer and so you end up a nice + +177 +00:12:30,730 --> 00:12:31,960 +moving bombings + +178 +00:12:31,960 --> 00:12:48,090 +question is how much we'll get to that +and I think the answer is probably yes + +179 +00:12:48,090 --> 00:12:53,269 +just as in spatial it works better if +smaller filters and you have more depth + +180 +00:12:53,269 --> 00:12:56,370 +at the same applies I think in time and +we'll see an architecture that does that + +181 +00:12:56,370 --> 00:13:07,220 +mean but expecting + +182 +00:13:08,190 --> 00:13:13,580 +classifying so we have a video and were +still classifying number of categories + +183 +00:13:13,580 --> 00:13:17,970 +at every single frame but now you're not +only function that single frame but also + +184 +00:13:17,970 --> 00:13:23,740 +a small number of frames alot on both +sides so maybe your prediction is + +185 +00:13:23,740 --> 00:13:28,539 +actually a function of safety drinks a +half a second video to end up with fun + +186 +00:13:28,539 --> 00:13:32,909 +moving pictures in this paper also +released video they said over one + +187 +00:13:32,909 --> 00:13:36,639 +million videos and 500 classes just a +given context for why this is actually + +188 +00:13:36,639 --> 00:13:41,759 +it's kind of difficult to work with +videos and right now I think because + +189 +00:13:41,759 --> 00:13:45,480 +problem right now i think is that +there's not too many very large-scale + +190 +00:13:45,480 --> 00:13:49,820 +datasets like millions of very varied +images that you see an image that there + +191 +00:13:49,820 --> 00:13:53,230 +are no really good equivalent of that in +the video domain and so we tried with + +192 +00:13:53,230 --> 00:13:56,730 +this for status and back in 2013 but I +don't think it actually we fully achieve + +193 +00:13:56,730 --> 00:14:00,519 +that and I think we're still not seeing +very good really lost the assassin + +194 +00:14:00,519 --> 00:14:03,579 +videos and that's partly why we're also +slightly discouraging some of you from + +195 +00:14:03,580 --> 00:14:08,050 +working on this on projects because you +can't retrain these very powerful + +196 +00:14:08,049 --> 00:14:12,969 +features because the data sets are just +not quite there another kind of + +197 +00:14:12,970 --> 00:14:16,100 +interesting things that you see and this +is why we also sometimes caution people + +198 +00:14:16,100 --> 00:14:21,490 +from working on videos and getting very +elaborate very quickly with them because + +199 +00:14:21,490 --> 00:14:24,490 +sometimes people think they have videos +and get very excited if they want to do + +200 +00:14:24,490 --> 00:14:27,810 +3d color shows Alice teams and they just +think about all the possibilities that + +201 +00:14:27,809 --> 00:14:31,469 +opened up for them but actually turns +out that single frame methods are a very + +202 +00:14:31,470 
--> 00:14:34,820 +strong baseline and I would always +encourage you to run that first so don't + +203 +00:14:34,820 --> 00:14:37,710 +worry about the motion in your video and +just try single frame that works first + +204 +00:14:37,710 --> 00:14:40,990 +so for example in this paper we found +that a single from baseline was about + +205 +00:14:40,990 --> 00:14:44,610 +59.3% classification accuracy in our +dataset + +206 +00:14:44,610 --> 00:14:48,600 +and then we tried our best to actually +take into account small local motion but + +207 +00:14:48,600 --> 00:14:54,440 +we ended up bumping down by 11.6% so all +this extra work all the extra computer + +208 +00:14:54,440 --> 00:14:57,529 +and then you ended up with relatively +small gains I'm going to try to tell you + +209 +00:14:57,528 --> 00:15:02,088 +why that might be a basically video is +not always as useful as you might + +210 +00:15:02,089 --> 00:15:07,230 +intuitively think and so here are some +examples of kind of predictions that we + +211 +00:15:07,230 --> 00:15:11,800 +are different data sets of sports and +our predictions and I think this kind of + +212 +00:15:11,799 --> 00:15:15,528 +highlight slightly why adding video +might not be as helpful in some settings + +213 +00:15:15,528 --> 00:15:19,740 +in particular here if you're trying to +distinguish sports and think about it + +214 +00:15:19,740 --> 00:15:23,930 +trying to distinguish say tennis from +swimming or something like that it turns + +215 +00:15:23,929 --> 00:15:26,729 +out that you actually don't need very +fine local motion information if you're + +216 +00:15:26,730 --> 00:15:29,610 +trying to distinguish tennis from +swimming right lots of blue stuff lots + +217 +00:15:29,610 --> 00:15:33,350 +of red stuff like the images actually +have a huge amount of information and so + +218 +00:15:33,350 --> 00:15:36,240 +you're putting in a lot of additional +parameters and trying to go after these + +219 +00:15:36,240 --> 00:15:40,959 +local motions but and most in most +classes actually be local motions are + +220 +00:15:40,958 --> 00:15:44,289 +not very important they're only +important if you have very fine-grained + +221 +00:15:44,289 --> 00:15:47,919 +categories where the small motion +actually really matters a lot as a lot + +222 +00:15:47,919 --> 00:15:52,419 +of you if you have videos you'll be +inclined to use spatial temporal crazy + +223 +00:15:52,419 --> 00:15:56,860 +video networks but I think very hard +about is that locomotion extremely + +224 +00:15:56,860 --> 00:15:59,980 +important and you're setting because if +it isn't you might end up with results + +225 +00:15:59,980 --> 00:16:04,070 +like this where he put in a lot of work +and it may not work well let's look at + +226 +00:16:04,070 --> 00:16:10,180 +some other video classification that +works so this is April 2015 its + +227 +00:16:10,179 --> 00:16:14,698 +relatively popular it's called sea 3d +and the idea here was basically your + +228 +00:16:14,698 --> 00:16:18,528 +network has this very nice architecture +its three-month recalled and two by two + +229 +00:16:18,528 --> 00:16:22,110 +pool throughout the idea here is that +cool let's do the exact same thing but + +230 +00:16:22,110 --> 00:16:25,169 +extend everything in time so going back +to your point you want very small + +231 +00:16:25,169 --> 00:16:29,069 +filters so this is everything is three +my tree might recall to buy to buy to + +232 +00:16:29,070 --> 00:16:33,100 +pool throughout the architecture so it's +a very simple kind of big 
VGG-net in 3D
+kind of approach, and it works
+reasonably well; you can look at this
+paper for reference.
+
+235
+00:16:38,429 --> 00:16:42,389
+Another class of approaches that actually
+works quite well is from Karen Simonyan
+
+236
+00:16:42,389 --> 00:16:43,778
+in 2014.
+
+237
+00:16:43,778 --> 00:16:48,299
+Simonyan, by the way, is the same
+person who came up with VGGNet; he
+
+238
+00:16:48,299 --> 00:16:51,828
+also has a very nice paper on video
+classification, and the idea here is that
+
+239
+00:16:51,828 --> 00:16:54,299
+he didn't want to do three-dimensional
+convolutions, because it's kind of
+
+240
+00:16:54,299 --> 00:16:55,219
+painful to have to
+
+241
+00:16:55,220 --> 00:17:00,360
+implement and fine-tune them and so on, so he
+only used two-dimensional convolutions. But the idea
+
+242
+00:17:00,360 --> 00:17:05,179
+here is that we have two convnets: one is
+looking at an image, and the other one is
+
+243
+00:17:05,179 --> 00:17:10,298
+looking at the optical flow of the video. So
+both of these are just images, but the
+
+244
+00:17:10,298 --> 00:17:14,699
+optical flow basically tells you how
+things are moving in the image,
+
+245
+00:17:14,699 --> 00:17:19,120
+and so both of these are just kind of
+VGG-like or AlexNet-like
+
+246
+00:17:19,119 --> 00:17:23,139
+networks: you run one of
+them on the image, and you extract
+
+247
+00:17:23,140 --> 00:17:28,059
+optical flow with, say, the Brox method
+from before, and then you fuse that
+
+248
+00:17:28,058 --> 00:17:31,720
+information very late, at the end. So both
+of these come up with some idea about
+
+249
+00:17:31,720 --> 00:17:34,850
+what they are seeing in terms of the
+classes in the video, and then you fuse
+
+250
+00:17:34,849 --> 00:17:37,859
+them, and there are different ways of
+fusing them. So they found, for example,
+
+251
+00:17:37,859 --> 00:17:42,979
+that if you just use the spatial convnet,
+only looking at images, you get some
+
+252
+00:17:42,980 --> 00:17:47,120
+performance; if you use a convnet on just the
+optical flow, it actually performs even
+
+253
+00:17:47,119 --> 00:17:49,558
+slightly better than just looking at the
+raw images;
+
+254
+00:17:49,558 --> 00:17:54,178
+optical flow actually, here in this case,
+contains a lot of information; and then
+
+255
+00:17:54,179 --> 00:17:58,538
+if you fuse them, you end up even better. Now
+an interesting point to make here, by the
+
+256
+00:17:58,538 --> 00:18:01,879
+way, is that if you have this kind of
+architecture, especially here,
+
+257
+00:18:01,880 --> 00:18:05,700
+convnets with three-by-three filters,
+you might wonder: I mean,
+
+258
+00:18:05,700 --> 00:18:10,038
+why does it help to
+actually put in optical flow? You'd
+
+259
+00:18:10,038 --> 00:18:13,158
+imagine that in the end-to-end framework
+we're hoping that these convnets learn
+
+260
+00:18:13,159 --> 00:18:16,049
+everything from scratch; in particular,
+they should be able to learn something
+
+261
+00:18:16,048 --> 00:18:20,599
+that simulates the computation of
+optical flow. And it turns out
+
+262
+00:18:20,599 --> 00:18:24,230
+that that might not be the case, because
+sometimes when you compare, video
+
+263
+00:18:24,230 --> 00:18:29,440
+networks with optical flow added still
+work better. And so I think the
+
+264
+00:18:29,440 --> 00:18:34,169
+reason for that probably comes back
+to data.
+
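A minimal sketch of the two-stream pattern just described, with late fusion done by averaging softmax scores from the two streams; the lecture notes there are different ways to fuse, and this is only one simple choice. The score vectors stand in for the outputs of the two convnets.

~~~python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def two_stream_predict(rgb_scores, flow_scores):
    """Late fusion for a two-stream network: one convnet scores the raw
    frame, a second convnet scores the optical-flow image, and the class
    predictions are fused at the very end by averaging.
    rgb_scores, flow_scores: (num_classes,) raw scores from each stream.
    """
    return 0.5 * (softmax(rgb_scores) + softmax(flow_scores))

probs = two_stream_predict(np.random.rand(10), np.random.rand(10))
~~~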
+265
+00:18:34,169 --> 00:18:37,900
+Since we only have a small amount of data,
+I think you actually probably don't have
+
+266
+00:18:37,900 --> 00:18:42,730
+enough data to learn very good
+optical-flow-like features, and so that
+
+267
+00:18:42,730 --> 00:18:45,599
+would be my best answer for why
+actually handing optical flow to the
+
+268
+00:18:45,599 --> 00:18:48,819
+network is probably helping in many
+cases. If you guys are working on your
+
+269
+00:18:48,819 --> 00:18:51,839
+project with videos, I would encourage
+you to actually try this kind of
+
+270
+00:18:51,839 --> 00:18:52,779
+architecture:
+
+271
+00:18:52,779 --> 00:18:57,480
+compute optical flow, then pretend that it's
+an image and put a convnet on it;
+
+272
+00:18:57,480 --> 00:19:01,808
+that seems like a relatively reasonable
+approach. Ok, so so far we've only talked
+
+273
+00:19:01,808 --> 00:19:06,339
+about a little bit of local information
+in time, right? So we have these little
+
+274
+00:19:06,339 --> 00:19:07,398
+pieces,
+
+275
+00:19:07,398 --> 00:19:10,069
+blocks of half a second, and we've tried to
+take advantage of them to do better
+
+276
+00:19:10,069 --> 00:19:13,739
+classification. But what happens if you
+have videos that actually have much
+
+277
+00:19:13,739 --> 00:19:14,489
+longer
+
+278
+00:19:14,489 --> 00:19:19,700
+temporal dependencies that you'd
+like to model? So it's not only that the
+
+279
+00:19:19,700 --> 00:19:22,319
+local motion is important, but actually
+there are some events throughout the
+
+280
+00:19:22,319 --> 00:19:25,548
+video that happen at much larger time
+scales than your network sees, and they actually
+
+281
+00:19:25,548 --> 00:19:29,618
+matter: so event two happening after event
+one can be very indicative of some class,
+
+282
+00:19:29,618 --> 00:19:33,999
+and you'd like to actually model that.
+So what are the kinds of
+
+283
+00:19:33,999 --> 00:19:39,659
+approaches that you might think of for
+trying to, you know, how would
+
+284
+00:19:39,659 --> 00:19:42,659
+you actually model these kinds of
+much longer-term events?
+
+285
+00:19:44,618 --> 00:19:54,009
+Ok, so an attention model perhaps: maybe
+you'd like to have attention, since you're
+
+286
+00:19:54,009 --> 00:19:56,729
+trying to classify this entire video, so
+maybe you'd like to have attention over
+
+287
+00:19:56,729 --> 00:19:58,129
+different parts of the video.
+
+288
+00:19:58,128 --> 00:20:12,689
+Yeah, that's a good idea. I see, so you're
+saying that we have these multiscale
+
+289
+00:20:12,690 --> 00:20:16,479
+approaches where we process images at a
+very low detail level, but also sometimes
+
+290
+00:20:16,479 --> 00:20:20,298
+we resize the images and process them at
+a global level, so maybe with frames
+
+291
+00:20:20,298 --> 00:20:23,710
+we can actually speed up the video
+and put a convnet on that. I don't think
+
+292
+00:20:23,710 --> 00:20:28,048
+that's very common, but it's a
+sensible idea, I think. Yeah, so the
+
+293
+00:20:28,048 --> 00:20:33,618
+problem roughly is that basically this
+extent is maybe ten times too short; it
+
+294
+00:20:33,618 --> 00:20:37,019
+doesn't span many seconds. So how do
+we make architectures that are a
+
+295
+00:20:37,019 --> 00:20:40,179
+function of much longer time scales in
+their prediction?
+
+296
+00:20:42,150 --> 00:20:48,300
+Yes, so the one idea here is we have this
+video and we have different classes that
+
+297
+00:20:48,299 --> 00:20:50,599
+we'd like to predict at every single
+point in time, but we want that
+
+298
+00:20:50,599 --> 00:20:54,849
+prediction to be a function not only of a
+little chunk of 15 frames but actually of
+
+299
+00:20:54,849 --> 00:20:59,149
+a much longer time extent. So the idea
+that is sensible is you actually use a
+
+300
+00:20:59,150 --> 00:21:01,769
+recurrent neural network somewhere in the
+architecture, because recurrent
+
+301
+00:21:01,769 --> 00:21:04,990
+networks allow you to have infinite
+context, in principle, over everything
+
+302
+00:21:04,990 --> 00:21:08,579
+that has happened before, up until
+that time. And if you go back to
+
+303
+00:21:08,579 --> 00:21:12,119
+this paper that I was already showing you,
+from 2011, it turns out that they have an
+
+304
+00:21:12,119 --> 00:21:16,289
+entire section where they take this idea,
+and they actually have an LSTM that
+
+305
+00:21:16,289 --> 00:21:21,109
+does exactly that. This is a paper from
+2011 using 3D conv and LSTMs, so way
+
+306
+00:21:21,109 --> 00:21:25,899
+before these were cool, in 2011, and
+this paper basically has it all:
+
+307
+00:21:25,900 --> 00:21:29,920
+they model little local motion with 3D
+conv, and they model global motion
+
+308
+00:21:29,920 --> 00:21:34,860
+with LSTMs, so they put an LSTM
+on top of the fully connected layers:
+
+309
+00:21:34,859 --> 00:21:37,849
+they strung together fully connected
+layers with this recurrence, and then
+
+310
+00:21:37,849 --> 00:21:40,939
+when you're predicting classes at every
+single frame you have infinite context.
+
+311
+00:21:40,940 --> 00:21:45,930
+This paper is, I think, quite ahead of
+its time, and it basically has it all,
+
+312
+00:21:45,930 --> 00:21:49,900
+except it's only cited about 65 times; I'm
+not sure why it was not more popular. I think
+
+313
+00:21:49,900 --> 00:21:54,680
+it's basically a way-ahead-of-its-time
+paper that recognized both of these
+
+314
+00:21:54,680 --> 00:21:59,380
+patterns way before they became popular.
+So since then there are
+
+315
+00:21:59,380 --> 00:22:02,990
+several more recent papers that actually
+kind of take a similar approach. So in
+
+316
+00:22:02,990 --> 00:22:07,190
+2015, Jeff Donahue et al. from
+Berkeley: the idea here is that you have a
+
+317
+00:22:07,190 --> 00:22:08,610
+video, and you'd like to again
+
+318
+00:22:08,609 --> 00:22:11,819
+classify every single frame, and they
+have these convnets that look at
+
+319
+00:22:11,819 --> 00:22:14,809
+individual frames, but then they also have
+an LSTM that strings this
+
+320
+00:22:14,809 --> 00:22:19,389
+together temporally. A similar idea also
+comes from a paper from, I think, Google,
+
+321
+00:22:19,390 --> 00:22:24,160
+and the idea here is that they have
+optical flow and images processed by
+
+322
+00:22:24,160 --> 00:22:28,930
+convnets, and then again you have an LSTM
+that merges that over time. So again,
+
+323
+00:22:28,930 --> 00:22:34,680
+this combination of local and
+global. So far we've looked at kind of
+
+324
+00:22:34,680 --> 00:22:37,789
+two architectural patterns for
+video classification that
+
+325
+00:22:37,789 --> 00:22:43,170
+actually take into account temporal
+information: modeling local motion, which
+
+326
+00:22:43,170 --> 00:22:47,289
+is, for example, the 3D conv or the use of
+optical flow, or modeling more global motion,
+
+327
+00:22:47,289 --> 00:22:51,059
+where we have RNNs stringing together
+sequences over many time steps, or a fusion
+
+328
+00:22:51,059 --> 00:22:54,418
+of the two.
+
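A skeletal sketch of the local-plus-global pattern summarized above: a per-frame feature extractor (standing in for a convnet) feeds a recurrent step that carries context across the whole video. A plain tanh RNN stands in for the LSTM to keep the sketch short, and `frame_features` and the weight matrices are placeholders.

~~~python
import numpy as np

def classify_video(frames, frame_features, W_xh, W_hh, W_hy):
    """Returns one class-score vector per frame, each a function of the
    entire video seen so far (the "infinite context" in the transcript).
    frame_features: callable mapping a frame to a (D,) feature vector.
    W_xh: (H, D), W_hh: (H, H), W_hy: (C, H) weight matrices.
    """
    h = np.zeros(W_hh.shape[0])
    scores = []
    for frame in frames:
        x = frame_features(frame)            # local, per-frame convnet features
        h = np.tanh(W_xh @ x + W_hh @ h)     # global context carried over time
        scores.append(W_hy @ h)              # per-frame class scores
    return scores
~~~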
[00:22:59]
Now I'd like to make the point that there's another, cleaner, very nice and interesting idea that I saw in a recent paper, and that I like much more. Here's basically the rough picture of what things look like right now: we have some video, and at the bottom we have a 3D conv (say it's using optical flow, or raw 3D convs on chunks of frames, or both), and then we have an LSTM on top, or something like that, doing the long-term modeling. What's kind of not very nice, unsettling, about this is that there's an ugly asymmetry between these components: you have these neurons inside the 3D conv that are only a function of some small local chunk of the video, and then you have these neurons at the very top that are a function of everything in the video, because they're recurrent units that are a function of everything that came before. So it's this kind of unsettling asymmetry. There's a paper with a very clever idea from a few weeks ago that is much nicer and more homogeneous, where everything is uniform and simple. I don't know if anyone can think of what we could do to make everything much cleaner. I couldn't, because I didn't come up with this idea, but I thought it was cool when I read it.

[A student suggests putting the recurrence before the convnet.] So, before the convnet even starts processing the images? I'm not sure what that would give you. You would have temporal information and convnets on top of it, so you would certainly have neurons that are a function of everything, but it's not clear what the LSTM would be doing in that case; it would likely just be blurring the pixels, since that's probably too low-level a place for that kind of processing.

[Another student suggests a network that operates at different temporal resolutions, one branch looking at every frame and another looking at, say, every third frame.] I'd say your idea is similar to what someone pointed out earlier, where you take the video and work on multiple scales of it: you speed the video up and you slow it down, and you have 3D convs on the different frame rates, or something like that. It's a sensible idea.

[Another question.] Can you do background subtraction so you only look at the parts that are interesting? I think that's a reasonable idea, but it kind of goes against this idea of end-to-end learning, because you're introducing an explicit computation that you believe is useful.
[00:25:30]
[A student asks about sharing weights between the 3D conv and the RNN.] That's interesting. I'm not a hundred percent sure, because the RNN is just a hidden state vector with matrix multiplies and things like that, whereas in conv layers we have this spatial structure; I'm not actually sure how the sharing would work.

OK, so the idea is that we're going to get rid of the RNN on top. We're going to basically take the convnet and make every single neuron in it a small recurrent neural network; every single neuron in the convnet becomes recurrent. The way this works (I think it's beautiful, but their figure is kind of ugly, so much so that it makes little sense; let me try to explain it in a slightly different way) is this. We have a conv layer somewhere in the neural network; it takes input from below, the output of the previous conv layer or something, and we do convolutions over that to compute the output of this layer. The idea is that we're going to make every single conv layer a kind of recurrent layer, and the way we do that is: just as before, we take the input from below and do convs over it, but we also take our own previous output, this layer's output from the previous time step, in addition to the current input at this time step, and we do convolutions over both this one and that one. So we have these activations from the current input and activations from our previous output, and we add them up or something like that; we do a recurrent-network-like merge of those two to produce our output. So we're a function of the current input, but also a function of our own previous activations, if that makes sense.
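As a rough sketch of what was just described (convolve over the input from below and over this layer's own previous output, then merge the two), here is a minimal vanilla-RNN-style version in PyTorch. The additive merge and the tanh nonlinearity are my assumptions, chosen to mirror a vanilla RNN; the text resumes with the GRU variant the paper actually uses.

~~~python
import torch
import torch.nn as nn

class ConvRNNCell(nn.Module):
    """A conv layer made recurrent: its output at time t depends on the layer
    below at time t AND on this layer's own output at time t-1."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        p = k // 2                                          # 'same' padding for odd k
        self.conv_x = nn.Conv2d(in_ch, out_ch, k, padding=p)   # conv over the input from below
        self.conv_h = nn.Conv2d(out_ch, out_ch, k, padding=p)  # conv over our previous output

    def forward(self, x, h_prev):
        return torch.tanh(self.conv_x(x) + self.conv_h(h_prev))  # additive, RNN-like merge

cell = ConvRNNCell(3, 16)
h = torch.zeros(1, 16, 32, 32)                              # initial state, W x H x depth
for t in range(10):                                         # unroll over video frames
    h = cell(torch.rand(1, 3, 32, 32), h)
~~~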
[00:27:52]
What's very nice about this is that we're in fact only using two-dimensional convolutions here; there is no 3D conv anywhere, because both of these inputs are width by height by depth. The previous layer's conv volume is just width by height by depth, and our own output from the previous time step is width by height by depth. These are all two-dimensional convolutions, but we end up with a recurrent process here. One way to see this is with the recurrent neural networks we looked at: you have this recurrence where you're computing a hidden state that's a function of your previous state and the current input x, and we looked at many different ways of actually wiring up that recurrence. There's the vanilla RNN, or the LSTM, or the GRU; the GRU is a simpler version of the LSTM, if you recall, but it almost always has performance similar to an LSTM. The GRU just has slightly different update formulas for performing that recurrence. What they do in this paper is basically take the GRU, because it's a simpler version of the LSTM that works just as well, but every single matrix multiply is replaced with a convolution. You can imagine that every matrix multiply here just becomes a conv: we convolve over our input and convolve over our previous output, the "before" and the "below", and then we combine them with the recurrence exactly as in the GRU to get our activations. So before, things looked like this [a convnet with an RNN on top], and now they just look like that: we don't have some parts of the network with infinite temporal extent and some parts finite; we just have this RNN-convnet where every single layer is recurrent, computing what it did before but also as a function of its own previous output. So this whole convnet is a function of everything, and it's very uniform: you just take your conv layers and make them recurrent. Maybe that's the simplest way to put it.
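A hedged sketch of that construction: the textbook GRU update equations with every matrix multiply swapped for a 2D convolution. The gate wiring below follows the standard GRU; the actual paper's details may differ.

~~~python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """GRU recurrence where every matrix multiply is a 2D convolution, so the
    hidden state keeps its W x H x depth spatial structure."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        self.conv_zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)  # update + reset gates
        self.conv_h  = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)      # candidate state

    def forward(self, x, h_prev):
        zr = torch.sigmoid(self.conv_zr(torch.cat([x, h_prev], dim=1)))
        z, r = zr.chunk(2, dim=1)                                  # per-pixel gates
        h_tilde = torch.tanh(self.conv_h(torch.cat([x, r * h_prev], dim=1)))
        return (1 - z) * h_prev + z * h_tilde                      # standard GRU blend
~~~

Stacking these cells gives the homogeneous architecture being described: every layer carries its own recurrent state, so every neuron is a function of the whole video so far.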
[00:29:57]
So, to summarize: if you'd like to use spatio-temporal convolutional networks in your projects, and you're very excited because you have videos, the first thing to do is stop, and think about whether or not you really need to process motion, and whether motion is really important for your classification task. If you really think motion is important to you, then think about whether you need to model local motion, or whether global motion is what matters. Based on that, you get a hint of what you should try, though you always have to compare against a single-frame baseline, I would say. Then you should try using optical flow, because it seems that, especially with smaller amounts of data, it actually is very important; it's a very nice signal to explicitly encode, telling the network that optical flow is a useful feature to look at. And you can try this recurrent convnet I just showed you, but I think it's too recent and experimental, so I'm actually not sure I can fully endorse it or say that it works; it seems like a very nice idea, but it hasn't been proven yet. So that's kind of the rough layout of how people process videos in the field. Let me know if there are any questions, because Justin is going to come up next.

[Question: has this been used at all for NLP?] Good question. I don't think so; I'm not a super-duper expert on NLP, but I haven't seen this idea there, so I would guess that it hasn't.

[Question about combining video with audio.] I would say that's definitely something people would want to do. You don't see too many papers that do both of them, just because people like to isolate problems and tackle them separately rather than jointly; but certainly at a company, trying to get something working in a real system, you would do something like that. You'd probably do it with a late-fusion approach, where you have whatever works best on video and whatever works best on audio, and then you merge them somewhere later, somehow. That's something you can do with convnets and neural networks very simply, because you just have a layer that looks at the outputs of both at some point, and then you classify as a function of both.

So we're going to switch over now; I guess we have to get it set up here... hopefully it works.

[00:32:29]
OK, so I guess we're going to switch gears completely and entirely and talk about unsupervised learning. I'd like to make a little bit of a contrast here. First we're going to talk about some basic definitions of unsupervised learning, and then we're going to talk about two different ways that unsupervised learning has recently been attacked by deep learning people: in particular, we're going to talk about autoencoders and then this idea of adversarial networks. And I guess I need my clicker. Right, so pretty much everything we've seen in this class so far is supervised learning.
[00:33:03]
The basic setup behind pretty much all supervised learning problems is that we assume each data point in our dataset has two distinct parts: we have our data x, and then we have some label or output y that we want to produce from that input. Our whole goal in supervised learning is to learn some function that takes in our input x and produces this output or label y. If you really think about it, pretty much everything we've seen in this class is some instance of this supervised learning setup: for image classification, x is an image and y is a label; for object detection, x is an image and y is maybe the set of objects in the image that you want to find; y could be a caption, when we looked at captioning; x could be a video, and then y could be either a label or a caption or pretty much anything. So I just want to make the point that supervised learning is this very, very powerful and generic framework that encompasses everything we've done in the class so far; the other point is that supervised learning actually produces systems that work really well in practice and is very useful for practical applications.

Unsupervised learning, I think, is a little bit more of an open research question at this point in time. It's really cool, and I think it's really important for solving AI in general, but at this point it's maybe a little bit more of a research-focused type of area. It's also a little bit less well defined. In unsupervised learning we generally assume that we just have data: we only have x, we don't have any y. The goal of unsupervised learning is to do something with that data x, and the "something" really depends on the problem. In general, we hope we can discover some type of latent structure in the data x without explicitly knowing anything about the labels. One classical example that you might have seen in previous machine learning classes is clustering, something like k-means, where we're given a bunch of points and we discover structure by grouping them into clusters.
[00:35:13]
Another classical example of unsupervised learning is something like principal component analysis, where x is just a cloud of data points and we want to discover some low-dimensional representation of that input data. So unsupervised learning is this really cool area, but a little bit more problem-specific and a little bit less well defined than supervised learning.

Two architectures in particular that deep learning people have used for unsupervised learning: first, this idea of an autoencoder. We'll talk about traditional autoencoders, which have a very, very long history, and we'll also talk about variational autoencoders, which are this new, cool Bayesian twist on them. We'll also talk about generative adversarial networks, which are this really nice idea that lets you generate images by sampling from a model of natural images.

The idea of an autoencoder is pretty simple. We have our input x, which is some data, and we're going to pass this input through some kind of encoder network to produce some features, some latent features z. You could think of this stage a little bit like a learnable principal component analysis: we take our input data and convert it into some other feature representation. Many times these x's will be images, like the CIFAR-10 images shown here. This encoder network could be something very complicated: for something like PCA it's just a simple linear transform, but in general it might be a fully connected network. Originally, maybe five or ten years ago, this was often a single-layer fully connected network with sigmoid units; now it's often a deep network with ReLU units, and it could also be something like a convolutional network.

We also have this idea that z, the features we learn, is usually smaller in size than x: we want z to be some kind of useful features about the data x. We don't want the network to just transform the data into some useless representation; we want to force it to actually crush the data down and summarize its statistics in some useful way that could hopefully serve downstream processing. The problem is that we don't really have any explicit labels to use for this downstream processing.
[00:37:39]
So instead we need to invent some kind of surrogate task that we can set up using just the data itself. The surrogate task we often use for autoencoders is this idea of reconstruction: since we don't have any y's to learn a mapping to, we'll instead just try to reproduce the data x from those features z. Especially if those features are smaller in size, hopefully that will force the network to summarize the useful statistics of the input data and to discover features that are useful not only for reconstruction but, more generally, might be useful for other tasks if we later get some supervised data. Again, this decoder network could be pretty complicated: when autoencoders first came about, it was often just a simple linear network or a small sigmoid network, but now it can be a deep network, and oftentimes it will be one of those upconvolutional networks we saw a few slides ago: it takes your features, which again are smaller in size than your input data, and blows them back up in size to reproduce your original data. And I'd like to make the point that these things are actually pretty easy to train. Here on the right is a quick example I just cooked up in Torch: a four-layer encoder, which is a convolutional network, and a four-layer decoder, which is an upconvolutional network, and you can see that it actually learns to reconstruct the data pretty well.
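The demo in the lecture was written in Torch; here is a rough PyTorch sketch of the same recipe, under my own assumptions: a small conv encoder, an "upconvolutional" (transposed-conv) decoder, and an L2 reconstruction loss. The layer sizes are illustrative, not those of the demo.

~~~python
import torch
import torch.nn as nn

# Assumes 32x32x3 inputs (e.g. CIFAR-10); all sizes are illustrative.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # 32x32 -> 16x16
    nn.Conv2d(16, 8, 3, stride=2, padding=1), nn.ReLU(),   # 16x16 -> 8x8 latent code z
)
decoder = nn.Sequential(                                   # upconvs blow z back up to image size
    nn.ConvTranspose2d(8, 16, 4, stride=2, padding=1), nn.ReLU(),  # 8x8 -> 16x16
    nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),             # 16x16 -> 32x32
)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, 3, 32, 32)            # stand-in minibatch of images
for step in range(100):
    x_hat = decoder(encoder(x))          # encode, then decode back to image size
    loss = ((x_hat - x) ** 2).mean()     # L2 reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()
~~~

The tied-weight variant discussed next would reuse the encoder's weights, transposed, in the decoder instead of giving the decoder its own parameters.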
[00:39:54]
Another thing you sometimes see is that the encoder and decoder networks will share weights, just as a regularization strategy, with the intuition that these are opposite operations, so it might make sense to use the same weights for both. As a concrete example, if you think about a fully connected network: maybe your input data has some dimension D, and your latent data z has some smaller dimension H. If the encoder is just a fully connected layer, then its weights are a D-by-H matrix; and when we want to do the decoding and try to reconstruct the original data, that's a mapping back from H to D, so we can reuse the same weights in these two layers by just taking the transpose of the matrix.

When we're training this thing, we need some kind of loss function to compare our reconstructed data with our original data, and oftentimes we'll use a simple L2 Euclidean loss. Once we've chosen our encoder network, our decoder network, and our loss function, we can train this thing just like any other normal neural network: we get some data, we pass it through the encoder, we pass it through the decoder, we compute the loss, we backpropagate, and everything's good.

Once we've trained this thing, oftentimes we'll take this decoder network that we spent so much time learning and just throw it away, which seems kind of weird; but the reason is that reconstruction on its own is not such a useful task. Instead, we want to apply these networks to some actually useful task, which is probably a supervised learning task. The setup is that we've learned this encoder network, which hopefully, from all this unsupervised data, has learned to compress the data and extract some useful features; then we use this encoder network to initialize part of a larger supervised network. Now, if we actually do have access to maybe some smaller dataset that has some labels, then hopefully most of the work could have been done by this unsupervised training at the beginning, and we can just use that to initialize the bigger network and then fine-tune the whole thing with hopefully a very small amount of supervised data. This is one of the dreams of unsupervised feature learning: you have these really, really large datasets with no labels (you can just go on Google and download images forever; it's really easy to get a lot of images), but the labels are expensive to collect, so you'd want some system that can take advantage of both a huge amount of unsupervised data and a small amount of supervised data. Autoencoders are at least one thing that has been proposed that has this nice property, but in practice I think it tends not to work too well, which is a little bit unfortunate, because it's such a beautiful idea.
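A sketch of the fine-tuning setup just described, assuming the `encoder` from the earlier snippet has been pretrained on unlabeled data; the single linear head and the ten classes are placeholders of mine.

~~~python
import torch.nn as nn

# Reuse the (hypothetical) pretrained `encoder` from the previous sketch to
# initialize a supervised classifier, then fine-tune on a small labeled set.
classifier = nn.Sequential(
    encoder,                      # weights come from the unsupervised pretraining
    nn.Flatten(),                 # 8 channels x 8x8 spatial -> 512-dim vector
    nn.Linear(8 * 8 * 8, 10),     # new, randomly initialized head (e.g. 10 classes)
)
# ...then train `classifier` with a cross-entropy loss on the small labeled
# dataset, optionally with a lower learning rate for the pretrained layers.
~~~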
[00:42:12]
Another thing I should point out, almost as a side note: if you go back and read the literature on these things from the mid-2000s, from the last ten years, people had this funny thing called greedy layer-wise pre-training that they used for training autoencoders. The idea was that at the time, in 2006, training very deep networks was challenging; you can find quotes in papers from then saying that even with maybe four or five hidden layers it was extremely challenging, in those days, to train networks. To get around that problem, they instead had this paradigm where they would train just one layer at a time, and they used this thing, which I don't want to get into too much, called the restricted Boltzmann machine, which is a type of graphical model, and they would use these restricted Boltzmann machines to train these little layers one at a time. First we'd have our input image and a first layer of weights W1 (this might be something like PCA or some other kind of fixed transform), and then we'd hopefully learn, using a restricted Boltzmann machine, some kind of relationship between those first-layer features and some higher-level features; once we'd learned that layer, we'd learn another restricted Boltzmann machine on top of those features, connecting them to the next level of features. Using this type of approach let them train just one layer at a time, in this sort of greedy way, and that let them hopefully find a really good initialization for the larger network. After this greedy pre-training stage, they would stick the whole thing together into this giant autoencoder and then fine-tune the autoencoder jointly.

Nowadays we don't really need to do this: with things like ReLU, proper initialization, batch normalization, and slightly fancier optimizers, this type of thing is not really necessary anymore. As an example, on the previous slide we saw this four-layer convolutional/deconvolutional autoencoder that I trained on CIFAR-10, and that was just trained directly, using all these modern neural network techniques; you don't have to mess around with layer-wise training.
[00:44:25]
So this is not something that really gets done anymore, but I thought we should at least mention it, since you'll probably encounter the idea if you read back in the literature about these things. The basic idea of an autoencoder is, I think, pretty simple: it's this beautiful idea where we can just use a lot of unsupervised data to hopefully learn some nice features. Unfortunately it doesn't work all that well, but that's OK; maybe there's some other nice type of task we would want to do with unsupervised data. Question first?

[Question: what exactly is going on here?] Right, so you could think about this as maybe a three-layer neural network where our input is going to be the same as the output: we're just hoping that this is a neural network that will learn the identity function. In order to learn the identity function, we have some loss function at the end, something like an L2 loss, that encourages our input and output to be the same. Now, learning the identity function is probably a really easy thing to do, but we're going to force the network to not take the easy route: rather than just regurgitating the data and learning the identity function the easy way, we're going to bottleneck the representation through this hidden layer in the middle. So it's going to learn the identity function, but in the middle of the network it's going to have to squeeze down and summarize and compress the data, and hopefully that compression will give rise to features that are useful for other tasks. Does that make it a little bit more clear?

[Question.] OK, so the claim was that PCA is just the answer to this problem. It's true that PCA is optimal in certain senses: if your encoder and your decoder are each just a single linear transform, then indeed PCA is optimal in some sense. But if your encoder and decoder are potentially larger, more complicated functions, maybe multi-layer neural networks, then PCA is no longer necessarily the right solution. Another point to make is that PCA is only optimal in certain senses, particularly with respect to L2 reconstruction.
[00:46:56]
But in practice we don't actually care about reconstruction; we're just hoping this thing will learn useful features for other tasks. So in practice, and we'll see this a bit later, people don't always use L2 anymore, because L2 is maybe not quite the right loss for actually learning good features. Yeah?

[Question about the restricted Boltzmann machine.] The RBM is this kind of generative model of the data, where you imagine you have sort of two sets of units and you want to do generative modeling of the two things together. You need to get into quite a lot of math to figure out exactly what the loss function is, but it ends up being something like the likelihood of the data under these latent states that you don't observe; and that's actually a cool idea that we will sort of revisit in the variational autoencoder.

So one of the problems with this traditional autoencoder is that it only hopes to learn features. That's a cool thing, but there's this other thing we would like: not just to learn features, but also to be able to generate new data. A cool task we could potentially learn from unsupervised data is this: hopefully our model could slurp in a bunch of images, and in doing so it learns what natural images look like; then, after it has learned this distribution, it could hopefully spit out fake images that look like the original images but are fake. This is maybe not a task that is directly applicable to things like classification, but it seems like an important capability for AI: humans are pretty good at looking at data, summarizing it, and getting the idea of what it looks like, so hopefully, if our models can also do this sort of task, they will have learned some useful summarization, some useful statistics, of the data.

The variational autoencoder is this kind of neat twist on the original autoencoder that hopefully lets us actually generate novel images from our learned model of the data. Here we need to dive into a little bit of Bayesian statistics. This is something we haven't really talked about at all in this class up to this point, but there's this whole other side of machine learning that doesn't do neural networks and deep learning, but instead thinks really hard about probability distributions, about how probability distributions can fit together to generate datasets, and then reasons probabilistically about your data.
[00:49:13]
This type of paradigm is really nice because it lets you state explicit probabilistic assumptions about how you think your data was generated; then, given those probabilistic assumptions, you try to fit a model to the data that follows your assumptions. With the variational autoencoder, we assume this particular method by which our data was generated: we assume there exists out in the world some prior distribution that generates these latent states z, and we assume some conditional distribution, so that once we have the latent states, we can sample from it to generate the data. So the variational autoencoder really imagines that our data was generated by this pretty simple process: first we sample from some prior distribution to get our z, then we sample from this conditional to get our x. The intuition is that x is something like an image, and z summarizes some useful stuff about that image: if these were CIFAR images, then maybe the latent state z could encode something like the class of the image, whether it's a frog or a deer or a cat, and it might also contain variables for how that cat is oriented, or what color it is, or something like that. So this is a pretty simple idea, but it makes a lot of sense as a story for how images might be generated.

The problem now is that we want to estimate these parameters theta, of both the prior and the conditional, without actually having access to the latent states z, and that's a challenging problem. To make it simple, we're going to do something you see a lot in Bayesian statistics and just assume that the prior is a unit Gaussian, which is easy to handle. The conditional will also be Gaussian, but it's going to be a little bit fancier: we'll assume it's a Gaussian with diagonal covariance and some mean, and the way we're going to get those is by computing them with a neural network.
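In the usual VAE notation, the assumed generative story reads as follows (this is the standard formulation; the slide's exact symbols may differ):

$$
z \sim p(z) = \mathcal{N}(0, I),
\qquad
x \mid z \sim p_\theta(x \mid z) = \mathcal{N}\big(\mu_\theta(z),\; \mathrm{diag}\,\sigma_\theta^2(z)\big),
$$

where $\mu_\theta(z)$ and $\sigma_\theta^2(z)$ are the two outputs of the decoder network described next.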
[00:51:27]
So suppose we had the latent state z for some piece of data. We assume that that latent state goes into some decoder network, which could be some big, complicated neural network, and that neural network spits out two things: the mean of the data x, and the variance of the data x. You should notice that this looks very much like the top half of a normal autoencoder: we have this latent state and some network operating on it, but now, instead of directly spitting out the data, it spits out the mean of the data and the variance of the data. Other than that, this looks very much like the decoder of a normal autoencoder; and, thinking back to the normal autoencoder, this decoder network might be a simple fully connected thing, or it might be a very big, powerful deconvolutional network, and both of those are pretty common.

Now, the problem is this. Bayes' rule, given the prior and the conditional, tells us the posterior: if we want to actually use this model, we need to be able to estimate the latent state from the input data, and the way we do that is by writing down this posterior distribution, the probability of the latent state z given our observed data; using Bayes' rule, we can flip it around and write it in terms of our prior over z and our conditional given z. After we apply Bayes' rule, we can look at the three terms on the right-hand side. The conditional: we just use our decoder network, so we easily have access to that. The prior: again, we have access to that, because we assumed it's a unit Gaussian, which is easy to handle. But this denominator, the probability of x: it turns out, if you work through the math and write it out, this ends up being a giant intractable integral over the entire latent state space. That's completely intractable; there's no way you could ever perform that integral, and even approximating it would be a giant disaster.
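Written out in the usual notation, the Bayes' rule identity and the troublesome denominator are:

$$
p_\theta(z \mid x) \;=\; \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)},
\qquad
p_\theta(x) \;=\; \int p_\theta(x \mid z)\, p(z)\, dz.
$$

The two factors in the numerator are the decoder and the prior; the integral in the denominator ranges over the entire latent space, which is what makes the posterior intractable.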
[00:53:40]
So instead, we won't even try to evaluate that integral; we'll introduce some encoder network that tries to directly perform inference for us. This encoder network takes in a data point and spits out a distribution over the latent state space. Again, looking back at the original autoencoder from a few slides ago, this looks very much like the bottom half of a traditional autoencoder, where we take in data; but now, instead of directly spitting out the latent state, we spit out a mean and a variance of the latent state. And again, this encoder network might be some fully connected network or maybe some deep convolutional network. The intuition is that this encoder network is a separate, totally different function, but we're going to try to train it in such a way that it approximates this posterior distribution that we don't actually have access to.

When we put all the pieces together, we can stitch this up, and it gives rise to the variational autoencoder. We have this input data point x; we pass it through our encoder network, and the encoder network spits out a distribution over the latent states. Once we have this distribution over the latent states, you could imagine sampling from it to get some latent state of high probability for that input. Once we have a concrete example of a latent state, we can pass it through the decoder network, which spits out a distribution over the data; and once we have this distribution over the data, we can sample from it to actually get something that hopefully looks like the original data point. So this ends up looking very much like a normal autoencoder: we take our input data, run it through the encoder to get some latent state, and pass that into the decoder to try to reconstruct the original data. And when you go about training this thing, it's actually trained in a very similar manner to a normal autoencoder, with a forward pass and a backward pass; the only difference is in the loss function.
[00:56:01]
At the top we have a reconstruction loss: rather than being a simple L2, we want the output distribution to place high probability on the true input data. And we also have this loss term coming in the middle: we want the generated distribution over the latent states to be very similar to the prior distribution we stated at the very beginning. Once you put these pieces together, you can just train this thing like a normal autoencoder, with a normal forward pass and backward pass; the only difference is where you put the losses and how you interpret them. Any questions about the setup? We went through it kind of fast.

[Question: why choose a diagonal covariance?] The answer is: because it's really easy to work with. Actually, people have tried slightly fancier things too, I think, so that's something you can play around with.

OK, so once we've actually trained this kind of variational autoencoder, we can use it to generate new data that looks kind of like the original dataset. The idea is: remember, we wrote down this prior, which might be a unit Gaussian or maybe something a little bit fancier, but at any rate it's some distribution we can easily sample from; for a unit Gaussian, it's very easy to draw random samples. So to generate new data, we start by just following the data-generation process we had imagined: first we sample from our prior distribution over the latent states; then we pass that through the decoder network we learned during training, and the decoder spits out a distribution over data points, in terms of both a mean and a covariance; and once we have a mean and a covariance (this is just a diagonal Gaussian), we can easily sample from it again to generate an actual data point.
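Here is a minimal PyTorch sketch of the whole pipeline as described: a diagonal-Gaussian encoder, a sampled latent state, a decoder, and a reconstruction term plus a KL term pulling the encoder's distribution toward the unit-Gaussian prior. The fully connected sizes, the L2-style reconstruction, and the closed-form Gaussian KL are standard choices of mine, not details from the slides; the sampling line is the "trick in the paper" mentioned a bit later, the reparameterization trick.

~~~python
import torch
import torch.nn as nn

D, H, Z = 784, 400, 2                     # e.g. flattened MNIST, 2-D latent space
enc = nn.Sequential(nn.Linear(D, H), nn.ReLU(), nn.Linear(H, 2 * Z))  # -> (mu, log_var) of z
dec = nn.Sequential(nn.Linear(Z, H), nn.ReLU(), nn.Linear(H, D))      # z -> mean of x
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(64, D)                     # stand-in minibatch
for step in range(100):
    mu, log_var = enc(x).chunk(2, dim=1)  # encoder outputs a distribution over z
    z = mu + (0.5 * log_var).exp() * torch.randn_like(mu)   # reparameterized sample
    x_hat = dec(z)
    recon = ((x_hat - x) ** 2).sum(dim=1).mean()            # reconstruction term
    # KL( q(z|x) || N(0, I) ), available in closed form for two Gaussians:
    kl = (-0.5 * (1 + log_var - mu ** 2 - log_var.exp()).sum(dim=1)).mean()
    loss = recon + kl
    opt.zero_grad()
    loss.backward()
    opt.step()

samples = dec(torch.randn(16, Z))         # generation: sample z from the prior, decode
~~~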
[00:58:17]
Now, once you've trained this thing, another thing you can do is scan out the latent space: rather than sampling from the latent distribution, you instead densely sweep the z's across the latent space, to get an idea of what type of structure the network has learned. This figure is doing exactly that on MNIST: here we trained a variational autoencoder where the latent state is just two-dimensional, and now we can densely explore this two-dimensional latent space and, for each point in it, pass the point through the decoder and use it to generate an image. You can see that it has actually discovered this beautiful structure that smoothly interpolates between the different digit classes: up here at the left you see sixes that kind of morph into zeros; as you go down, you see sixes that turn into sevens and maybe nines; the eights are hanging out in the middle somewhere, and the ones are down here. So this latent space has actually learned this beautiful disentanglement of the data, in this very nice unsupervised way. We can also run this thing on a faces dataset, and it's the same sort of story: we train this two-dimensional variational autoencoder, and once we've trained it, we densely sample from that latent space to try to see what it has learned.

[Question: can you force specific latent variables to have some exact meaning?] Yes, there has been some follow-up work that does exactly that. There's a paper on deep inverse graphics networks from MIT that has exactly this setup: they want to learn a renderer as a neural network, to render 3D images of things, and they force some of the variables in the latent space to correspond to the 3D angles of the object, and maybe the class and the pose of the object, and the rest of the variables are left to learn whatever the network wants. They have some cool experiments where they can do exactly as you said: by setting those specific latent variables, they can render and actually rotate the object. Those are pretty cool; that's a lot fancier than these faces, but these faces are still pretty cool too. You can see it sort of interpolating between different faces in this very nice way.

And I think there's actually a very nice motivation here: one of the reasons we pick a diagonal Gaussian is that it has the probabilistic interpretation that the different variables in our latent space should actually be independent.
[01:00:29]
I think that helps to explain why there actually is this very nice separation between the axes when you end up sampling from the latent space: it's due to this probabilistic independence assumption embedded in the prior. So this idea of a prior is very powerful, and it lets you bake those kinds of assumptions directly into the model.

I wrote down a bunch of math here, and I don't think we really have time to go through it, but the idea is this. Classically, when you're training generative models, there's this thing called maximum likelihood, where you want to maximize the likelihood of your data under the model, and pick the model that makes your data most likely. But it turns out that if you just try to run maximum likelihood using the generative process we imagined for the variational autoencoder, you run into a giant problem: you end up needing to marginalize the joint distribution, which becomes that giant intractable integral over the entire latent state space, and that's not something we can do. So instead, the variational autoencoder does this thing called variational inference, which is a pretty cool idea; the math is here in case you want to go through it. The idea is that instead of maximizing the log probability of the data directly, we cleverly insert an extra term and break the log likelihood up into two different parts (this is an exact equivalence that you can work through on your own). We can write the log likelihood in terms of a term that we call the ELBO, plus another term, which is a KL divergence between two distributions; and we know that a KL divergence between distributions is always non-negative, so this second term has to be non-negative, which means the ELBO term is a lower bound on the log likelihood of our data. Notice that in the process of writing down this ELBO, we introduced an additional parameter phi, which we can interpret as the parameters of this encoder network that approximates the hard posterior distribution. So now, instead of trying to directly maximize the log likelihood of our data, we'll try to maximize this variational lower bound.
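In the usual notation, the exact decomposition being described is:

$$
\log p_\theta(x)
\;=\;
\underbrace{\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{ELBO } \mathcal{L}(x;\,\theta,\phi)}
\;+\;
\underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)}_{\geq\, 0},
$$

so $\log p_\theta(x) \geq \mathcal{L}(x; \theta, \phi)$ for any choice of the encoder parameters $\phi$, and maximizing the ELBO over $\theta$ and $\phi$ pushes up a lower bound on the log likelihood.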
860
01:02:39,539 --> 01:02:43,769
And because the ELBO is a lower bound of the log likelihood, maximizing the ELBO

861
01:02:43,769 --> 01:02:49,059
will also have the effect of raising up the log likelihood. And these

862
01:02:49,059 --> 01:02:53,360
two terms of the ELBO actually have this beautiful interpretation: the

863
01:02:53,360 --> 01:02:57,849
one at the front is the expectation over the latent states —

864
01:02:57,849 --> 01:03:01,889
over the latent state space — of the probability of x given the latent state.

865
01:03:01,889 --> 01:03:05,559
So if you think about it, that's actually a data reconstruction term:

866
01:03:05,559 --> 01:03:08,789
that's saying that if we average over all possible latent states, we

867
01:03:08,789 --> 01:03:13,639
should end up with something that is similar to our original data. And

868
01:03:13,639 --> 01:03:17,940
this other term is actually a regularization term: this is the KL

869
01:03:17,940 --> 01:03:22,059
divergence between our approximate posterior and the prior, so this

870
01:03:22,059 --> 01:03:27,019
is a regularizer trying to force those two things together. So in practice,

871
01:03:27,019 --> 01:03:31,590
this first term you can approximate by sampling, using this trick in the paper

872
01:03:31,590 --> 01:03:35,600
that I won't get into, and this other term, because everything is Gaussian

873
01:03:35,599 --> 01:03:38,489
here, you can just compute the KL divergence explicitly.

874
01:03:38,489 --> 01:03:44,509
So I think this is the most math-heavy slide in the class, so that's kind

875
01:03:44,510 --> 01:03:50,020
of fun. It looks scary, but it's actually just

876
01:03:50,019 --> 01:03:54,150
exactly this one core idea: we have a reconstruction term, and then you have

877
01:03:54,150 --> 01:03:59,050
this penalty pulling you back towards the prior. Any questions

878
01:03:59,050 --> 01:04:08,840
on the variational autoencoder? So in

879
01:04:08,840 --> 01:04:12,180
general, the idea of an autoencoder is that we want to force a network to try

880
01:04:12,179 --> 01:04:16,089
to reconstruct our data, and hopefully it will learn sort of useful

881
01:04:16,090 --> 01:04:19,470
representations of the data. For traditional autoencoders this is used for feature

882
01:04:19,469 --> 01:04:23,569
learning, but once we move to variational autoencoders we make this thing Bayesian,

883
01:04:23,570 --> 01:04:29,440
so we can actually generate samples that are similar to our data. So then this

884
01:04:29,440 --> 01:04:32,690
idea of generating samples from my data is really cool, and everyone loves

885
01:04:32,690 --> 01:04:37,119
looking at these kinds of pictures, so there's another idea: maybe we can

886
01:04:37,119 --> 01:04:41,100
generate really cool samples without all this scary Bayesian math. And it turns

887
01:04:41,099 --> 01:04:45,219
out that there's this idea called a generative adversarial network that is a

888
01:04:45,219 --> 01:04:49,799
sort of different idea, a different twist, that lets you still generate samples

889
01:04:49,800 --> 01:04:52,560
that look like your data, but sort of a little bit more explicitly, without

890
01:04:52,559 --> 01:04:54,340
having to worry about divergences and priors and this sort of stuff.
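Before moving on, here is a minimal numpy sketch of the two ELBO terms just recapped — reconstruction plus a closed-form KL for a diagonal-Gaussian encoder against a unit-Gaussian prior — with the reparameterization trick used for sampling; the function names are illustrative, not the lecture's code.

~~~python
import numpy as np

# Sketch of the two ELBO terms for a diagonal-Gaussian encoder and a
# unit-Gaussian prior; `mu` and `logvar` are the encoder's outputs.
def vae_loss(x, x_recon, mu, logvar):
    recon = np.sum((x - x_recon) ** 2)                         # reconstruction term
    kl = -0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar))  # KL(q(z|x) || N(0, I))
    return recon + kl

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
# so the sampling step stays differentiable with respect to mu, logvar.
def sample_z(mu, logvar, rng=np.random.default_rng()):
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * logvar) * eps
~~~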
891
01:04:54,340 --> 01:04:58,920
The idea is that we're gonna have a generator network. Well, first we're

892
01:04:58,920 --> 01:05:02,780
gonna start with some random noise that is probably drawn from a unit Gaussian or

893
01:05:02,780 --> 01:05:07,060
something like that, and then we're going to have a generator network, and this

894
01:05:07,059 --> 01:05:11,079
generator network actually looks very much like the decoder in the variational

895
01:05:11,079 --> 01:05:15,849
autoencoder, or like the second half of a normal autoencoder, in that we're

896
01:05:15,849 --> 01:05:20,449
taking this random noise and we're gonna spit out an image that is going to be

897
01:05:20,449 --> 01:05:26,379
some fake, not-real image that we're just generating using this trained network. Then

898
01:05:26,380 --> 01:05:29,410
we're also going to hook up a discriminator network that is going to

899
01:05:29,409 --> 01:05:32,679
look at this fake image and try to decide whether or not that

900
01:05:32,679 --> 01:05:34,769
generated image is real or fake.

901
01:05:34,769 --> 01:05:38,679
So this second network is just doing this binary classification

902
01:05:38,679 --> 01:05:42,949
task, where it receives an input and it just needs to say whether or not it's

903
01:05:42,949 --> 01:05:46,739
a real image or not. That's just sort of a

904
01:05:46,739 --> 01:05:49,739
classification task that you can hook up like anything else.

905
01:05:50,730 --> 01:05:55,349
So then we can train this whole thing jointly, all together,

906
01:05:55,960 --> 01:06:01,179
where our generator network will receive mini-batches of random noise and it'll

907
01:06:01,179 --> 01:06:06,629
spit out fake images, and our discriminator network will receive mini-

908
01:06:06,630 --> 01:06:12,640
batches of partially these fake images and partially real images from a dataset, and

909
01:06:12,639 --> 01:06:16,039
it will try to make this classification, to say which

910
01:06:16,039 --> 01:06:21,358
are real and which are fake. And so this is sort of now another way that we can

911
01:06:21,358 --> 01:06:25,880
hook up this kind of supervised-learning-ish problem without needing any labels. So we

912
01:06:25,880 --> 01:06:30,390
hook this thing up and we train the whole thing jointly, and we can look at some

913
01:06:30,389 --> 01:06:34,730
examples from the original generative adversarial networks paper. And so these

914
01:06:34,730 --> 01:06:38,840
are fake images that are generated by the network, and you can see that

915
01:06:38,840 --> 01:06:41,829
it's done a very good job of generating fake digits; they look like real

916
01:06:41,829 --> 01:06:46,549
digits. And here, this middle column is showing actually the

917
01:06:46,550 --> 01:06:50,080
nearest neighbor in the training set of those digits, to hopefully let you know

918
01:06:50,079 --> 01:06:53,599
that it doesn't just memorize the training set. So for example this two has

919
01:06:53,599 --> 01:06:57,389
a little dot and then this guy doesn't have a dot, so it's not just memorizing

920
01:06:57,389 --> 01:07:01,079
training data. And it also does a pretty good job of generating faces.

921
01:07:01,079 --> 01:07:05,849
So that's pretty cool.
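A toy numpy sketch of the joint training objectives just described; the generator and discriminator here are random stand-ins for trained conv nets, and a real implementation would alternate gradient steps on the two losses.

~~~python
import numpy as np

rng = np.random.default_rng(0)
W_g = rng.standard_normal((64, 784)) * 0.1   # toy generator weights
w_d = rng.standard_normal(784) * 0.1         # toy discriminator weights

def generator(z):                # noise -> fake "image" (flattened vector)
    return np.tanh(z @ W_g)

def discriminator(x):            # image -> probability that input is real
    return 1.0 / (1.0 + np.exp(-(x @ w_d)))

z = rng.standard_normal((32, 64))      # mini-batch of random noise
fake = generator(z)
real = rng.random((32, 784))           # stand-in for a batch of real images

# Discriminator objective: classify real images as 1 and fakes as 0
d_loss = -np.mean(np.log(discriminator(real) + 1e-8)
                  + np.log(1.0 - discriminator(fake) + 1e-8))
# Generator objective: make the discriminator call its fakes "real"
g_loss = -np.mean(np.log(discriminator(fake) + 1e-8))
~~~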
922
01:07:05,849 --> 01:07:10,440
But as people who have worked in machine learning know, these digit and face

923
01:07:10,440 --> 01:07:16,869
datasets tend to be pretty easy to generate samples from, and when we apply this

924
01:07:16,869 --> 01:07:21,840
to CIFAR, then our samples don't quite look as nice and clean. So here

925
01:07:21,840 --> 01:07:25,108
it's clearly got some idea about CIFAR data — it's making blue stuff and green

926
01:07:25,108 --> 01:07:32,429
stuff — but they don't really look like real objects, so that's a problem.

927
01:07:32,429 --> 01:07:35,599
So some follow-up work on generative adversarial networks has tried to make

928
01:07:35,599 --> 01:07:38,529
these architectures bigger and more powerful, to hopefully be able to

929
01:07:38,530 --> 01:07:44,080
generate better samples on these more complex datasets. So one idea is this

930
01:07:44,079 --> 01:07:48,949
idea of multiscale processing: rather than generating the image all at once,

931
01:07:48,949 --> 01:07:53,919
we're actually gonna generate our image at multiple scales. So first

932
01:07:53,920 --> 01:07:58,170
we're gonna have a generator that receives noise and then generates

933
01:07:58,170 --> 01:08:03,670
a low-resolution image, and then we'll upsample that low-res guy and apply a second

934
01:08:03,670 --> 01:08:04,200
generator

935
01:08:04,199 --> 01:08:08,230
that receives a new batch of random noise and computes some delta on top of

936
01:08:08,230 --> 01:08:12,070
the low-res image. Then we'll upsample that again and repeat the process

937
01:08:12,070 --> 01:08:16,810
several times until we've actually finally generated our

938
01:08:16,810 --> 01:08:22,219
final result. So this is again a very similar idea to the

939
01:08:22,219 --> 01:08:25,329
original generative adversarial network; we're just generating at multiple scales

940
01:08:25,329 --> 01:08:30,199
simultaneously. And the training here is a little bit more complex: you actually have a

941
01:08:30,199 --> 01:08:35,710
discriminator at each scale, and hopefully that helps things. So

942
01:08:35,710 --> 01:08:39,039
when we look at the samples from this trained guy, they actually look a lot better. So here

943
01:08:39,039 --> 01:08:43,869
they actually trained a separate model per class on CIFAR-10: here they've

944
01:08:43,869 --> 01:08:48,599
trained this adversarial network on just planes from CIFAR, and you can see

945
01:08:48,600 --> 01:08:51,460
that they're starting to look like real planes, so that's getting

946
01:08:51,460 --> 01:08:52,210
somewhere.

947
01:08:52,210 --> 01:08:56,689
These look almost like real cars, and these maybe look kind of like real

948
01:08:56,689 --> 01:09:04,278
birds. So in the following year, people actually threw away this multiscale idea

949
01:09:04,279 --> 01:09:09,339
and just used a simpler, better, more principled conv net. So here the idea

950
01:09:09,338 --> 01:09:14,318
is: forget about this multiscale stuff and just use batch norm, don't use

951
01:09:14,319 --> 01:09:17,739
fully connected layers — sort of all these architectural constraints that have

952
01:09:17,738 --> 01:09:22,759
become best practice in the last couple of years — just use those, and it turns out

953
01:09:22,759 --> 01:09:27,969
that your adversarial nets can work really well.
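A sketch of a generator following these DCGAN-style constraints (batch norm, no fully connected layers, strided transposed convolutions), written with PyTorch-style layers; the sizes are illustrative, not the paper's exact architecture.

~~~python
import torch.nn as nn

# Illustrative DCGAN-style generator: no fully connected layers, batch
# norm after each upsampling step, strided transposed convolutions.
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 256, 4, stride=1, padding=0),  # z (100x1x1) -> 256x4x4
    nn.BatchNorm2d(256), nn.ReLU(True),
    nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),  # -> 128x8x8
    nn.BatchNorm2d(128), nn.ReLU(True),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),   # -> 64x16x16
    nn.BatchNorm2d(64), nn.ReLU(True),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),     # -> 3x32x32 image
    nn.Tanh(),
)
~~~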
954
01:09:27,969 --> 01:09:33,088
So here their generator is this pretty simple, pretty small convolutional network, and

955
01:09:33,088 --> 01:09:38,539
the discriminator is again just a simple network with batch normalization and all

956
01:09:38,539 --> 01:09:42,180
these other bells and whistles. And once you hook this thing up, they get some

957
01:09:42,180 --> 01:09:47,810
amazing samples in this paper. So these are generated bedrooms from the network,

958
01:09:47,810 --> 01:09:53,450
and these actually are pretty impressive results; these look almost like real data,

959
01:09:53,449 --> 01:09:57,529
so you can see that it's done a really good job of capturing

960
01:09:57,529 --> 01:10:00,920
really detailed structure about bedrooms: there's a bed, there's a window,

961
01:10:00,920 --> 01:10:07,710
there's a light switch. So these are really amazing samples. But it

962
01:10:07,710 --> 01:10:12,579
turns out that rather than just generating samples, we can play the same

963
01:10:12,579 --> 01:10:16,260
trick as the variational autoencoder and actually try to play

964
01:10:16,260 --> 01:10:16,670
around

965
01:10:16,670 --> 01:10:21,739
in the latent space, because these adversarial networks are receiving this

966
01:10:21,738 --> 01:10:25,579
noise input, and we can cleverly try to move around that noise input and

967
01:10:25,579 --> 01:10:29,920
try to change the type of things that these networks generate. So one example

968
01:10:29,920 --> 01:10:36,050
that we can try is interpolating between bedrooms. So here

969
01:10:36,050 --> 01:10:40,119
the idea is that for these images on the left-hand side, we've drawn

970
01:10:40,119 --> 01:10:43,550
a random point from our noise distribution and then used it to generate

971
01:10:43,550 --> 01:10:47,690
an image, and on the right-hand side we've done the same: we generate

972
01:10:47,689 --> 01:10:51,259
another random point from our noise distribution and use it to generate an

973
01:10:51,260 --> 01:10:57,710
image. So now these two guys on the opposite sides are sort of

974
01:10:57,710 --> 01:11:01,760
two points on a line, and we want to interpolate in the latent space between

975
01:11:01,760 --> 01:11:08,210
those two latent vectors, and along that line we're gonna use

976
01:11:08,210 --> 01:11:11,859
the generator to generate images, and hopefully this will interpolate between

977
01:11:11,859 --> 01:11:16,439
the latent states of those two guys. And you can see — this is pretty crazy —

978
01:11:16,439 --> 01:11:22,169
that these bedrooms morph sort of in a very nice, smooth, continuous way

979
01:11:22,170 --> 01:11:28,020
from one bedroom to another. And one thing to point out is that this

980
01:11:28,020 --> 01:11:32,300
morphing is actually happening in kind of a nice semantic way: if you imagine what

981
01:11:32,300 --> 01:11:35,460
this would look like in pixel space, then it would just be kind of this

982
01:11:35,460 --> 01:11:39,100
fading effect, and it would not look very good at all. But here you can see that

983
01:11:39,100 --> 01:11:42,690
actually the shapes of these things and the colors are sort of continuously

984
01:11:42,689 --> 01:11:50,119
deforming from one side to the other, which is quite fun.
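A minimal sketch of the interpolation experiment, assuming a trained `generator` callable is available (hypothetical here):

~~~python
import numpy as np

# Walk along the line between two noise vectors and decode each point;
# the images should morph smoothly if the latent space is well behaved.
def interpolate(generator, z_left, z_right, steps=8):
    images = []
    for t in np.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * z_left + t * z_right   # linear path in latent space
        images.append(generator(z))
    return images
~~~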
985
01:11:50,119 --> 01:11:53,939
So another experiment they have in this paper is actually using vector math to play with the

986
01:11:53,939 --> 01:11:58,069
type of things that these networks generate. So here the idea is that they

987
01:11:58,069 --> 01:12:02,189
generated a whole bunch of random samples from the noise distribution, then

988
01:12:02,189 --> 01:12:05,789
pushed them all through the generator to generate a whole bunch of samples, and

989
01:12:05,789 --> 01:12:09,698
then, using their own human intelligence, they tried to make some

990
01:12:09,698 --> 01:12:14,500
semantic judgments about what those random samples look like, and then grouped

991
01:12:14,500 --> 01:12:18,050
them into a couple of meaningful semantic categories. So here

992
01:12:18,050 --> 01:12:21,739
would be three images that were generated from the network

993
01:12:21,738 --> 01:12:25,529
that all kind of look like a smiling woman, and those are human-provided

994
01:12:25,529 --> 01:12:26,819
labels.

995
01:12:26,819 --> 01:12:30,309
Here in the middle are three samples from the network of a neutral woman

996
01:12:30,310 --> 01:12:35,010
that's not smiling, and here on the right are three samples of a man that

997
01:12:35,010 --> 01:12:40,289
is not smiling. So each of these guys was produced from some latent state vector,

998
01:12:40,289 --> 01:12:45,729
so we'll just average those latent state vectors to compute this sort of

999
01:12:45,729 --> 01:12:51,269
average latent state of smiling woman, neutral woman, and neutral man. Now once

1000
01:12:51,270 --> 01:12:55,220
we have these latent state vectors we can do some vector math: we can take the

1001
01:12:55,220 --> 01:13:01,050
smiling woman, subtract the neutral woman, and add the neutral man. So what would

1002
01:13:01,050 --> 01:13:06,070
that give you? You'd hope that it would give you a smiling man, and this is what

1003
01:13:06,069 --> 01:13:12,649
it generates. So this actually does kinda look like a smiling man; that's

1004
01:13:12,649 --> 01:13:19,199
pretty amazing. We can do another experiment: we can take a man with

1005
01:13:19,199 --> 01:13:25,099
glasses and a man without glasses; we take the man with glasses, subtract the man

1006
01:13:25,100 --> 01:13:31,140
without glasses, and add a woman with no glasses. This is confusing

1007
01:13:31,140 --> 01:13:38,630
stuff. So what would this little equation give us?

1008
01:13:38,630 --> 01:13:47,369
Look at that. So that's pretty crazy. So even though

1009
01:13:47,369 --> 01:13:51,279
we're not sort of forcing an explicit prior on this latent space, these

1010
01:13:51,279 --> 01:13:54,869
adversarial networks have somehow still managed to learn some really nice, useful

1011
01:13:54,869 --> 01:13:59,960
representations there. So, also very quickly, I think there's a pretty cool

1012
01:13:59,960 --> 01:14:04,220
paper that just came out a week or two ago that puts all of these ideas

1013
01:14:04,220 --> 01:14:07,820
together. We covered a lot of different ideas in this lecture, so

1014
01:14:07,819 --> 01:14:11,239
let's just stick them all together. So first we're gonna take a variational

1015
01:14:11,239 --> 01:14:15,659
autoencoder as our starting point.
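A sketch of the latent-vector arithmetic experiment described a moment ago, with toy stand-ins for the human-grouped latent codes:

~~~python
import numpy as np

rng = np.random.default_rng(1)

def mean_latent(latents):
    # average the latent codes of samples a human grouped together
    return np.mean(np.stack(latents), axis=0)

# toy stand-ins for the latent codes behind each human-labeled group
smiling_woman = mean_latent([rng.standard_normal(100) for _ in range(3)])
neutral_woman = mean_latent([rng.standard_normal(100) for _ in range(3)])
neutral_man = mean_latent([rng.standard_normal(100) for _ in range(3)])

# smiling woman - neutral woman + neutral man ~ smiling man
z = smiling_woman - neutral_woman + neutral_man
# image = generator(z)   # decode with the trained generator
~~~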
1016
01:14:15,659 --> 01:14:20,130
This will have the normal losses from the variational autoencoder, but we saw that these adversarial

1017
01:14:20,130 --> 01:14:24,220
networks give really amazing samples, so why don't we add an adversarial network

1018
01:14:24,220 --> 01:14:29,630
to the variational autoencoder? So we do that. So now, in addition to having our

1019
01:14:29,630 --> 01:14:33,710
variational autoencoder, we also have this discriminator network that's

1020
01:14:33,710 --> 01:14:35,949
trying to tell the difference between the

1021
01:14:35,949 --> 01:14:40,689
real data and the samples from the variational autoencoder. But that's not

1022
01:14:40,689 --> 01:14:47,099
cool enough, so why don't we also download AlexNet and then pass these

1023
01:14:47,100 --> 01:14:47,930
two images

1024
01:14:47,930 --> 01:14:53,730
through AlexNet and extract AlexNet features for both the original image and for our

1025
01:14:53,729 --> 01:14:59,079
generated image. And now, in addition to having a similar pixel loss here and

1026
01:14:59,079 --> 01:15:02,340
fooling the discriminator, we're also hoping to generate samples that have

1027
01:15:02,340 --> 01:15:06,900
similar AlexNet features, as measured by an L2 loss. And once you stick all these

1028
01:15:06,899 --> 01:15:10,859
things together, hopefully you'll get some really beautiful samples, right? So

1029
01:15:10,859 --> 01:15:17,069
here are the examples from the paper. So they just trained the entire

1030
01:15:17,069 --> 01:15:21,109
thing on ImageNet, and I think these are actually quite nice

1031
01:15:21,109 --> 01:15:26,029
samples. And if you contrast this with the multiscale samples on CIFAR that we

1032
01:15:26,029 --> 01:15:29,609
saw before — for those samples, remember, they were actually training a separate

1033
01:15:29,609 --> 01:15:34,380
model per class on CIFAR, and those beautiful bedroom samples that you

1034
01:15:34,380 --> 01:15:35,760
saw were again

1035
01:15:35,760 --> 01:15:40,270
training one model that's specific to bedrooms — here they actually trained

1036
01:15:40,270 --> 01:15:45,050
one model on all of ImageNet. And still, like, these aren't real images, but they're

1037
01:15:45,050 --> 01:15:50,489
definitely getting towards realistic-looking images. So I think these

1038
01:15:50,489 --> 01:15:54,170
are pretty cool, and I also think it's kind of fun to just take all these things and

1039
01:15:54,170 --> 01:16:00,020
stick them together and hopefully get some really nice samples. I think

1040
01:16:00,020 --> 01:16:02,460
that's pretty much all we have to say about unsupervised learning, so are

1041
01:16:02,460 --> 01:16:05,460
there any questions?

1042
01:16:07,100 --> 01:16:17,110
What is going on here?

1043
01:16:18,680 --> 01:16:23,500
Yeah, so the question is: are you maybe linearizing the bedroom space?

1044
01:16:23,500 --> 01:16:28,079
And that's maybe one way to think about it. Here, remember, we're just

1045
01:16:28,079 --> 01:16:30,729
sampling from noise and passing it through

1046
01:16:30,729 --> 01:16:35,319
the generator, and the generator has

1047
01:16:35,319 --> 01:16:40,630
just decided to use these different noise channels in nice ways,

1048
01:16:40,630 --> 01:16:44,510
such that if you interpolate between the noise, you end up interpolating between the images

1049
01:16:44,510 --> 01:16:49,110
in sort of a nice, smooth way. So hopefully that lets you know that it's

1050
01:16:49,109 --> 01:16:51,799
not just sort of memorizing training examples; it's actually learning to

1051
01:16:51,800 --> 01:17:00,310
generalize from them in a nice way. Right, so just to recap everything we talked

1052
01:17:00,310 --> 01:17:04,430
about today: we gave you a lot of really useful practical tips for working with

1053
01:17:04,430 --> 01:17:08,470
videos, and then I gave you a lot of very non-practical tips for generating

1054
01:17:08,470 --> 01:17:16,119
beautiful images. So I think this stuff is really cool, but I'm not sure what the

1055
01:17:16,119 --> 01:17:19,840
uses are other than generating images — but it's cool, so it's fun. And definitely

1056
01:17:19,840 --> 01:17:24,640
stick around next time, because we'll have a guest lecture from Jeff Dean, so if

1057
01:17:24,640 --> 01:17:27,310
you're watching on the internet, maybe you might wanna come to class for that

1058
01:17:27,310 --> 01:17:31,500
one. So I think that's everything we have today, and see you guys later.

diff --git a/captions/En/Lecture15_en.srt b/captions/En/Lecture15_en.srt
new file mode 100644
index 00000000..711039c0
--- /dev/null
+++ b/captions/En/Lecture15_en.srt
@@ -0,0 +1,4238 @@

1
00:00:00,000 --> 00:00:03,370
I'd like to point out that while what I'll be presenting today is partly my work in

2
00:00:03,370 --> 00:00:06,919
collaboration with others, sometimes I'm presenting work done by people in my

3
00:00:06,919 --> 00:00:10,929
group that I wasn't really involved in, but it's joint work with many, many people;

4
00:00:10,929 --> 00:00:14,740
you'll see lots of names throughout the talk, so take that with a grain of salt.

5
00:00:14,740 --> 00:00:20,920
So what I'm gonna tell you about is kind of how Google got to where it is today

6
00:00:20,920 --> 00:00:26,310
in terms of using deep learning in a lot of different places. The project that

7
00:00:26,309 --> 00:00:30,608
I'm involved in actually started in 2011, when Andrew Ng was spending one day a

8
00:00:30,609 --> 00:00:36,340
week at Google, and I happened to bump into him in the micro-kitchen, and I said,

9
00:00:36,340 --> 00:00:39,420
oh, what are you working on? And he was like, I don't know, I haven't figured it out yet, but

10
00:00:39,420 --> 00:00:44,170
neural nets are interesting. And it turns out I'd done my undergrad

11
00:00:44,170 --> 00:00:49,120
thesis on parallel training of neural nets, like, ages ago — I don't want

12
00:00:49,119 --> 00:00:50,250
to tell you how long ago —

13
00:00:50,250 --> 00:00:56,350
back kind of in the first exciting period of neural nets, and I always kind of really

14
00:00:56,350 --> 00:01:00,660
liked the computational model they provided, but at that time it was a

15
00:01:00,659 --> 00:01:03,599
little too early: we didn't have big enough datasets or the amount of

16
00:01:03,600 --> 00:01:08,879
computation to really make them sing. And Andrew kind of said, oh, it would be interesting

17
00:01:08,879 --> 00:01:13,579
to train really big neural nets, and I was like, OK, that sounds fun. So we kind of collaboratively

18
00:01:13,579 --> 00:01:20,209
started the Brain project to push the size and scale of neural net training,

19
00:01:20,209 --> 00:01:24,059
and in particular we were really interested in using big datasets and
20
00:01:24,060 --> 00:01:27,890
large amounts of computation to tackle perception problems and language problems.

21
00:01:27,890 --> 00:01:34,400
Andrew then went off to found Coursera and kind of moved away from

22
00:01:34,400 --> 00:01:39,719
Google, but since then we've been doing a lot of interesting work, both in

23
00:01:39,719 --> 00:01:43,408
research areas and in a lot of different domains. You know, one of the nice things

24
00:01:43,409 --> 00:01:46,859
about neural nets is they're incredibly applicable to many, many different kinds

25
00:01:46,859 --> 00:01:52,478
of problems, as I'm sure you've seen in this class. And we've also deployed production

26
00:01:52,478 --> 00:01:56,530
systems using neural nets in a pretty wide variety of different products. I'll kind

27
00:01:56,530 --> 00:02:00,049
of give you a sampling of some of the research, some of the production aspects,

28
00:02:00,049 --> 00:02:04,579
some of the systems that we've built underneath the covers, including kind of

29
00:02:04,578 --> 00:02:08,030
some of the implementation stuff that we do in TensorFlow to make these kinds

30
00:02:08,030 --> 00:02:12,959
of models run fast. I'll focus on neural nets, but a lot of the techniques are

31
00:02:12,959 --> 00:02:13,349
more

32
00:02:13,349 --> 00:02:17,699
broadly applicable than just neural nets: you can train lots of different kinds of

33
00:02:17,699 --> 00:02:22,159
reinforcement learning algorithms or other kinds of machine learning

34
00:02:22,159 --> 00:02:29,099
models. OK — I'll come back to some of that if it comes up. One

35
00:02:29,099 --> 00:02:32,560
of the things I really like about the team we've put together is that we have

36
00:02:32,560 --> 00:02:36,479
a really broad mix of different kinds of expertise. So we have people who are really

37
00:02:36,479 --> 00:02:40,709
experts at machine learning research — you know, people like Geoffrey Hinton and other

38
00:02:40,710 --> 00:02:45,820
people — and we have large-scale distributed-systems builders; I kind of

39
00:02:45,819 --> 00:02:50,169
consider myself more in that mold. And then we have people with

40
00:02:50,169 --> 00:02:54,989
a mix of those skills, and often on some of the projects we work on, you collectively

41
00:02:54,990 --> 00:03:00,870
put together people with these different kinds of expertise, and collectively you

42
00:03:00,870 --> 00:03:03,580
do something that none of you could do individually, because often you need both

43
00:03:03,580 --> 00:03:09,670
kind of large-scale systems thinking and machine learning ideas. So that's always

44
00:03:09,669 --> 00:03:13,539
fun, and you often kind of pick up and learn new things from other people.

45
00:03:13,539 --> 00:03:22,280
Here's a quick outline — actually this is from a while back — so you can kind of

46
00:03:22,280 --> 00:03:26,080
see the progress of how Google has been applying deep learning across lots of

47
00:03:26,080 --> 00:03:28,540
different areas. This is sort of when we started the project, and we

48
00:03:28,539 --> 00:03:32,209
started collaborating with the speech team a bit, and started doing some

49
00:03:32,210 --> 00:03:37,830
kind of early computer vision kinds of problems, and as we had success, some

50
00:03:37,830 --> 00:03:42,770
of the other teams at Google would say, hey, I have a problem too, and they
51
00:03:42,770 --> 00:03:46,550
would come to us, or we would go to them and say, hey, we think this could help

52
00:03:46,550 --> 00:03:50,610
with your particular problem. And over time we've kind of gradually — not so

53
00:03:50,610 --> 00:03:54,670
gradually — expanded the set of teams and areas that we've been applying these

54
00:03:54,669 --> 00:03:58,539
kinds of models to, and you see the breadth of

55
00:03:58,539 --> 00:04:03,689
different kinds of areas; it's not like it's only computer vision problems. So

56
00:04:03,689 --> 00:04:08,150
that's kinda nice; we're continuing to grow, which is good. And

57
00:04:08,150 --> 00:04:12,920
part of the reason for that broad spectrum of things is that you can

58
00:04:12,919 --> 00:04:18,229
really think of these as nice, really universal systems that you can put

59
00:04:18,230 --> 00:04:21,359
lots of different kinds of inputs into and get lots of different kinds of

60
00:04:21,358 --> 00:04:22,129
outputs

61
00:04:22,129 --> 00:04:27,300
out of, with, you know, slight differences in the model you try. But in

62
00:04:27,300 --> 00:04:32,270
general, the same fundamental techniques work pretty well across all these

63
00:04:32,269 --> 00:04:36,990
different domains, and that gives us results, as you've heard about in

64
00:04:36,990 --> 00:04:40,400
this class, in lots of different areas: now pretty much any computer vision

65
00:04:40,399 --> 00:04:46,219
problem, any speech problem; these days it's starting to be more the case in lots of

66
00:04:46,220 --> 00:04:51,880
language understanding areas; and lots of other areas of science, like drug

67
00:04:51,879 --> 00:04:54,519
discovery, are starting to have interesting neural models that are better

68
00:04:54,519 --> 00:05:05,930
than the alternatives. Yeah, I like them; they're good. Along the way we've kind of built

69
00:05:05,930 --> 00:05:10,040
two different generations of our underlying system software for training

70
00:05:10,040 --> 00:05:14,640
and deploying neural nets. The first was called DistBelief; we published a paper about

71
00:05:14,639 --> 00:05:20,479
it at NIPS 2012. It had the advantage that it was really scalable: one

72
00:05:20,480 --> 00:05:23,759
of the first uses we put it to was doing some unsupervised training, which I'll

73
00:05:23,759 --> 00:05:27,319
tell you about in a minute, which used 16,000 cores for training. It

74
00:05:27,319 --> 00:05:31,209
handled a lot of parameters, and it was good for production use, but it wasn't super

75
00:05:31,209 --> 00:05:35,819
flexible for research: it was kinda hard to express kind of weird or more

76
00:05:35,819 --> 00:05:38,949
esoteric kinds of models, reinforcement learning algorithms would be hard to express,

77
00:05:38,949 --> 00:05:43,349
and it had this kind of much more layer-driven approach with up-and-down

78
00:05:43,350 --> 00:05:48,770
messages. It worked well for what it did, but we kind of took a step back

79
00:05:48,769 --> 00:05:52,639
about a year and a little bit ago and started building our second-generation

80
00:05:52,639 --> 00:05:57,339
system, TensorFlow, which is based on what we learned from the first generation and

81
00:05:57,339 --> 00:06:02,289
what we learned from other sorts of available open-source packages, and

82
00:06:02,290 --> 00:06:06,620
we think it's retained a lot of the good features of DistBelief but also made it
83
00:06:06,620 --> 00:06:13,329
pretty flexible for a wide variety of research uses. We open-sourced it, which you've

84
00:06:13,329 --> 00:06:19,120
probably heard about. One of the really nice properties of neural nets: I grabbed

85
00:06:19,120 --> 00:06:23,459
this from a particular paper because it had graphs on both scaling the size of the

86
00:06:23,459 --> 00:06:27,819
training data and how accuracy increases, and also scaling the size of the neural

87
00:06:27,819 --> 00:06:30,279
net and how accuracy increases.

88
00:06:30,279 --> 00:06:33,109
The exact details aren't important; you can find these kinds of trends in hundreds

89
00:06:33,110 --> 00:06:37,509
of papers. But one of the really nice properties is, if you have more data and

90
00:06:37,509 --> 00:06:42,180
you can make your model bigger, generally scaling both of those things is even

91
00:06:42,180 --> 00:06:47,019
better than scaling just one of them. You need a really big model in order to

92
00:06:47,019 --> 00:06:49,810
capture the more subtle trends that appear in larger and larger

93
00:06:49,810 --> 00:06:54,180
datasets. You know, any neural net will capture the obvious trends, or

94
00:06:54,180 --> 00:06:57,370
obvious kinds of patterns, but the more subtle ones are the ones where you need a

95
00:06:57,370 --> 00:07:04,189
bigger model to capture them, and that extra accuracy is what you're after, and it

96
00:07:04,189 --> 00:07:09,579
requires a lot more computation. So we focus a lot on scaling the computation

97
00:07:09,579 --> 00:07:17,689
we need to be able to train big models on big datasets. So one of the first

98
00:07:17,689 --> 00:07:22,699
things we did in this project was we said, oh, unsupervised learning is gonna be

99
00:07:22,699 --> 00:07:28,879
really important, and we had a big focus on that initially. Quoc and others

100
00:07:28,879 --> 00:07:34,870
said, what would happen if we did unsupervised learning on random YouTube

101
00:07:34,870 --> 00:07:38,519
frames? So the idea is we're going to take ten million random YouTube frames — single

102
00:07:38,519 --> 00:07:42,990
frames from a bunch of random videos — and we're going to essentially train an

103
00:07:42,990 --> 00:07:47,418
autoencoder. Everyone knows what an autoencoder is? It's like a standard multi-level

104
00:07:47,418 --> 00:07:51,788
autoencoder, you know: at this level we're just trying to reconstruct the image,

105
00:07:51,788 --> 00:07:54,459
at the next one we're trying to reconstruct this representation from the representation

106
00:07:54,459 --> 00:08:01,629
below, and so on. And we used sixteen thousand cores; we didn't have GPUs in the

107
00:08:01,629 --> 00:08:07,459
datacenter at the time, so we compensated by throwing more CPUs at it. We

108
00:08:07,459 --> 00:08:11,870
used async SGD, which I'll talk about in a minute, for optimization. It actually

109
00:08:11,870 --> 00:08:17,189
had a lot of parameters because it was not convolutional; this was prior to conv nets

110
00:08:17,189 --> 00:08:20,199
being all the rage. So we said, well, we'll have local receptive fields, but

111
00:08:20,199 --> 00:08:24,168
they won't be convolutional, and we'll learn, like, a separate representation for

112
00:08:24,168 --> 00:08:28,269
this part of the image and this part of the image,

113
00:08:28,269 --> 00:08:31,038
which is kind of an interesting twist.

114
00:08:31,038 --> 00:08:37,330
I think it'd actually be an interesting experiment to redo this work with convolutional parameter sharing; that'd be kind of cool. In

115
00:08:37,330 --> 00:08:40,590
any case, the representation it learned at the top, after like nine layers

116
00:08:40,590 --> 00:08:45,580
of these non-convolutional local receptive fields, had 60,000 neurons at the top level.

117
00:08:45,580 --> 00:08:50,750
One of the things we thought might happen is it would learn kind of

118
00:08:50,750 --> 00:08:54,799
high-level feature detectors. So in particular, training on pixels,

119
00:08:54,799 --> 00:08:58,929
could it learn high-level concepts? We had a dataset that was half faces and

120
00:08:58,929 --> 00:09:04,349
half not faces, and we looked around for neurons that were good

121
00:09:04,350 --> 00:09:08,120
selectors of whether or not the image contained a face, and we

122
00:09:08,120 --> 00:09:13,850
found several such neurons. For the best one, those are some of the sample

123
00:09:13,850 --> 00:09:19,610
images that caused that neuron to get the most excited. And then if you look

124
00:09:19,610 --> 00:09:24,240
around for what stimulus will cause the neuron to get the most excited, there's

125
00:09:24,240 --> 00:09:32,669
creepy face guy. And that's kind of interesting: we had no labels on

126
00:09:32,669 --> 00:09:38,399
the images in the dataset at all that we were training on, and a neuron in this

127
00:09:38,399 --> 00:09:43,029
model has picked up on the fact that faces are things: I'm gonna get excited

128
00:09:43,029 --> 00:09:48,399
when I see kind of a Caucasian face head-on. It's YouTube, so we also have a

129
00:09:48,399 --> 00:09:55,179
cat neuron, on a dataset with half cats and half not cats; this is "average tabby," I

130
00:09:55,179 --> 00:10:03,019
call him. And then you can take that unsupervised model and start a

131
00:10:03,019 --> 00:10:07,659
supervised training task. In particular, at this time we were training on the

132
00:10:07,659 --> 00:10:11,669
ImageNet twenty-thousand-class task, which is not the one that most ImageNet

133
00:10:11,669 --> 00:10:14,939
results are reported on — that's one thousand classes. This one is trying to

134
00:10:14,940 --> 00:10:21,490
distinguish an image from one of about 20,000 classes; it's a much harder task. And

135
00:10:21,490 --> 00:10:26,340
then we trained, and then looked around at what kinds of images cause different

136
00:10:26,340 --> 00:10:29,300
particular neurons to get excited. You see they're picking up on very high-level

137
00:10:29,299 --> 00:10:33,819
concepts — you know, yellow flowers only, or waterfowl —

138
00:10:34,620 --> 00:10:41,080
things like that. And this pretraining actually increased the state-of-the-art accuracy

139
00:10:41,080 --> 00:10:44,080
on that particular task by a fair amount at the time.

140
00:10:45,129 --> 00:10:50,500
Then we kind of lost our excitement about unsupervised learning, because

141
00:10:50,500 --> 00:10:54,860
supervised learning worked so darn well. And so we started working with the speech

142
00:10:54,860 --> 00:11:00,100
team, who at the time had a non-neural-net-based acoustic model,

143
00:11:00,100 --> 00:11:06,570
essentially trying to go from a small segment of audio data, like a hundred and

144
00:11:06,570 --> 00:11:09,420
fifty milliseconds, to try to predict what sound is being uttered in

145
00:11:09,419 --> 00:11:17,809
the middle 10 milliseconds. And so we just decided to try fully

146
00:11:17,809 --> 00:11:21,879
connected neural nets, and then predict one of fourteen thousand triphones at the top.

147
00:11:22,549 --> 00:11:27,939
And that worked fairly well: we basically could train it pretty quickly, and it gave a

148
00:11:27,940 --> 00:11:31,530
huge reduction in word error rate. One of the people on the speech team said

149
00:11:31,529 --> 00:11:34,339
it was like the biggest single improvement they'd seen in their 20 years of

150
00:11:34,340 --> 00:11:47,970
research, and that launched as part of the Android-based voice search system in 2012. So

151
00:11:47,970 --> 00:11:51,990
one of the things we often do is find that we have a lot of data for some

152
00:11:51,990 --> 00:11:57,149
tasks but not very much data for other tasks, and so for that we often

153
00:11:57,149 --> 00:12:02,949
deploy systems that make use of multitask and transfer learning in

154
00:12:02,950 --> 00:12:09,030
various ways. So let's look at an example where we use this in speech. Obviously

155
00:12:09,029 --> 00:12:13,110
with English we have a lot of data, and we get a really nice low word error

156
00:12:13,110 --> 00:12:17,350
rate. For Portuguese, on the other hand, at that time we didn't have

157
00:12:17,350 --> 00:12:21,310
that much training data — we had 100 hours of Portuguese — so the word error rate is a

158
00:12:21,309 --> 00:12:27,129
lot worse, which is bad. So one of the first and most simple things you can do,

159
00:12:27,129 --> 00:12:30,620
which is kind of what you do when you take a model that has been pretrained on

160
00:12:30,620 --> 00:12:33,509
ImageNet and apply it to some other problem where you don't have as much data, is

161
00:12:33,509 --> 00:12:37,610
you just start training with those weights rather than totally random weights,

162
00:12:37,610 --> 00:12:41,700
and that actually improves your word error rate for Portuguese. It does,

163
00:12:41,700 --> 00:12:45,210
because there are enough similarities in the kinds of features you want for speech in

164
00:12:45,210 --> 00:12:50,570
general, regardless of language. A more complicated thing you can do is actually

165
00:12:50,570 --> 00:12:55,390
jointly train models that share parameters across all languages, or in

166
00:12:55,389 --> 00:12:56,360
this case all

167
00:12:56,360 --> 00:13:04,680
European languages, I think, is what we used. And so there you see we're

168
00:13:04,679 --> 00:13:07,939
jointly training on all this data, and we actually got a pretty significant

169
00:13:07,940 --> 00:13:13,310
improvement, even over just copying the English weights into the Portuguese model. And,

170
00:13:13,309 --> 00:13:17,739
surprisingly, we actually got a small improvement in English, because in total,

171
00:13:17,740 --> 00:13:20,889
across all the other languages, we actually almost doubled the amount of

172
00:13:20,889 --> 00:13:25,399
training data we were able to use in this model compared to just the English

173
00:13:25,399 --> 00:13:30,379
data alone. So basically, languages without much data all improved a lot;

174
00:13:30,379 --> 00:13:35,850
languages with a lot of data improved even a little bit. And then we had a

175
00:13:35,850 --> 00:13:39,350
language-specific top layer, with a little bit of fiddling to figure out

176
00:13:39,350 --> 00:13:44,620
whether it makes sense to have language-specific top layers, one or a few.
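A minimal sketch of the warm-start idea described above — copy the shared weights, re-initialize the language-specific top layer — with illustrative shapes (the 14,000-triphone output size is taken from the talk; everything else is a toy stand-in):

~~~python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a high-resource (English) acoustic model's weights.
english = {"hidden": rng.standard_normal((440, 2560)) * 0.01,
           "output": rng.standard_normal((2560, 14000)) * 0.01}

# Warm-start the low-resource (Portuguese) model: keep the shared lower
# layers, re-initialize the language-specific top layer, then continue
# training on the smaller Portuguese dataset.
portuguese = dict(english)
portuguese["output"] = rng.standard_normal((2560, 14000)) * 0.01
~~~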
177
00:13:44,620 --> 00:13:47,620
These are the kinds of human-guided choices you end up making.

178
00:13:48,269 --> 00:13:53,149
The production speech models have evolved a lot from those really simple

179
00:13:53,149 --> 00:13:57,778
feedforward models: LSTMs came in to deal with time, and

180
00:13:57,778 --> 00:14:02,490
convolutions to make them invariant to different frequencies. So there

181
00:14:02,490 --> 00:14:06,769
was a paper published here; you don't necessarily need to understand all

182
00:14:06,769 --> 00:14:11,459
the details, but there's a lot more complexity in this kind of model, and it's

183
00:14:11,458 --> 00:14:15,088
using much more sophisticated recurrent and convolutional models.

184
00:14:15,089 --> 00:14:22,100
A recent trend has been that you can use LSTMs completely end-to-end, and so rather

185
00:14:22,100 --> 00:14:26,730
than having an acoustic model and then a language model that kind of takes the

186
00:14:26,730 --> 00:14:30,550
output of the acoustic model and is trained somewhat separately, you can go

187
00:14:30,549 --> 00:14:34,879
directly from audio waveforms to producing a transcript, a character at a

188
00:14:34,879 --> 00:14:38,120
time, and I think that's going to be a really big trend,

189
00:14:38,809 --> 00:14:44,169
both in speech and more generally. Today

190
00:14:44,169 --> 00:14:49,338
a lot of systems are kind of composed of a bunch of subsystems, each

191
00:14:49,339 --> 00:14:54,350
perhaps with some machine-learned pieces and some kind of hand-coded pieces, and then

192
00:14:54,350 --> 00:14:58,000
usually a big pile of glue code to glue it all together, and

193
00:14:58,509 --> 00:15:04,600
often all the separately developed pieces have impedance mismatches in optimization,

194
00:15:04,600 --> 00:15:08,800
right? Like, you optimize your subsystem in the context of some metric, but that

195
00:15:08,799 --> 00:15:12,699
metric might not be the right thing for the final task you care about, which

196
00:15:12,700 --> 00:15:22,370
might be "transcribe this correctly." So having a much bigger single system, like a

197
00:15:22,370 --> 00:15:25,649
single neural net that goes directly from audio waveform all the way to the end

198
00:15:25,649 --> 00:15:29,929
objective you care about — transcription — and that you can optimize end-to-end,

199
00:15:29,929 --> 00:15:34,579
with not a lot of hand-written code in the middle, is going

200
00:15:34,580 --> 00:15:37,440
to be a big trend. I think you'll see that here; you'll see that in

201
00:15:37,440 --> 00:15:46,250
translation, and in a lot of other kinds of domains. OK, so convolutions: we

202
00:15:46,250 --> 00:15:48,919
have tons of vision problems that we've been using various kinds of

203
00:15:48,919 --> 00:15:54,849
convolutional models for. You know, the big excitement around convolutional

204
00:15:54,850 --> 00:15:59,220
neural nets: well, first it started with Yann LeCun and check reading; then that

205
00:15:59,220 --> 00:16:05,110
kind of subsided for a while, and then Alex Krizhevsky, Ilya Sutskever, and

206
00:16:05,110 --> 00:16:10,200
Geoff Hinton's paper in 2012, which blew the other competitors out of

207
00:16:10,200 --> 00:16:16,470
the water in the ImageNet 2012 challenge using a conv net. I think that put

208
00:16:16,470 --> 00:16:20,500
those things on everyone's map again, saying, well, we should be using

209
00:16:20,500 --> 00:16:24,399
these things for vision because they work really well. And the next year

210
00:16:24,399 --> 00:16:28,100
something like twenty of the entries were conv nets;

211
00:16:28,100 --> 00:16:34,550
previously it was just Alex. We've had a bunch of people at Google

212
00:16:34,549 --> 00:16:38,529
looking at various kinds of architectures for doing better and

213
00:16:38,529 --> 00:16:41,829
better ImageNet classification. The Inception architecture has this

214
00:16:41,830 --> 00:16:45,889
complicated module of different-size convolutions that are all kind of

215
00:16:45,889 --> 00:16:50,419
concatenated together, and then you replicate those modules a bunch of times,

216
00:16:50,419 --> 00:16:51,319
and

217
00:16:51,320 --> 00:16:55,810
you end up with a very deep neural net that turned out to be quite good at image

218
00:16:56,789 --> 00:17:01,870
classification. There have been some slight additions to that, and slight changes to

219
00:17:01,870 --> 00:17:07,740
make it even more accurate. Have you seen a slide like that in class?

220
00:17:07,740 --> 00:17:17,120
OK, so I was lazy and just took my slides from a folder. Have I ever told

221
00:17:17,119 --> 00:17:19,549
the story about Andrej sitting down and labeling ImageNet?

222
00:17:19,549 --> 00:17:26,559
So Andrej decided, since he was helping to administer the ImageNet contest, he

223
00:17:26,559 --> 00:17:31,269
would sit down and subject himself to hours of training, training, training on the

224
00:17:31,269 --> 00:17:38,099
tough splits — like, is this an Australian Shepherd dog? I don't know. And

225
00:17:38,099 --> 00:17:41,449
he convinced one of his labmates to do it too, but they weren't as diligent:

226
00:17:41,450 --> 00:17:45,309
Andrej did about a hundred and twenty hours of training on images,

227
00:17:45,980 --> 00:17:52,380
and his labmate got tired after 12 hours or something. So he got 5.1 percent error,

228
00:17:52,380 --> 00:17:55,380
and his labmate got, I think, 12 percent

229
00:17:56,269 --> 00:18:12,918
error. But Andrej trained madly all over the weekend and came

230
00:18:12,919 --> 00:18:19,690
back a hundred and twelve hours later, or whatever. Anyway, there's a great

231
00:18:19,690 --> 00:18:23,220
blog post about it; I encourage you to check it out. He has a lot of parameters:

232
00:18:23,220 --> 00:18:34,279
typical humans are, like, you know, 80 trillion connections or something. One

233
00:18:34,279 --> 00:18:37,918
point about these models: models with a small number of parameters fit well

234
00:18:37,919 --> 00:18:43,440
on mobile devices — so Andrej doesn't fit well on a mobile phone — but the general

235
00:18:43,440 --> 00:18:47,029
trend, other than Andrej, is smaller numbers of parameters compared to AlexNet,

236
00:18:47,029 --> 00:18:52,509
mostly because AlexNet had these two giant fully connected layers at the top that

237
00:18:52,509 --> 00:18:57,000
have a lot of parameters, and later work has been able to get away without those, for

238
00:18:57,000 --> 00:19:02,220
the most part. And so they use, you know, a smaller number of parameters but

239
00:19:02,220 --> 00:19:07,829
more floating-point operations, reusing the convolutional parameters more, which is
240
00:19:07,829 --> 00:19:12,379
good for putting them on phones. We released, as part of the TensorFlow

241
00:19:12,380 --> 00:19:18,549
release, a pretrained Inception model which you can use; there's a tutorial about

242
00:19:18,548 --> 00:19:24,089
it. There it is classifying Grace Hopper: it thinks it's a military uniform, which is not

243
00:19:24,089 --> 00:19:29,859
terribly inaccurate. One of the nice things about these models is they're

244
00:19:29,859 --> 00:19:32,589
really good at doing very fine-grained classifications. I think one of the things

245
00:19:32,589 --> 00:19:35,959
that's in Andrej's blog is that the computer models are actually much, much

246
00:19:35,960 --> 00:19:40,880
better than people at distinguishing exact breeds of dogs, but humans are

247
00:19:40,880 --> 00:19:42,179
better at

248
00:19:42,179 --> 00:19:49,150
often picking out something small: you know, if the label is ping-pong ball and it's

249
00:19:49,150 --> 00:19:52,190
like a giant scene of people playing ping-pong, humans are better at that;

250
00:19:52,829 --> 00:20:00,250
models tend to focus on things with more pixels. If you train models with the

251
00:20:00,250 --> 00:20:01,109
right kind of data

252
00:20:01,109 --> 00:20:05,019
they generalize well: these scenes look nothing alike, but they'll actually

253
00:20:05,019 --> 00:20:08,690
both get labeled as "meal" if your training data is representative.

254
00:20:08,690 --> 00:20:14,710
They make acceptable errors, which is kind of nice: you know, it's not a snake, but

255
00:20:14,710 --> 00:20:19,230
you understand why it said that; and I know it's not a dog, but I actually

256
00:20:19,230 --> 00:20:25,190
had to think carefully — the front animal there is a donkey, and I'm

257
00:20:25,190 --> 00:20:27,490
still not entirely sure.

258
00:20:27,490 --> 00:20:37,900
So one of the production uses we've put these kinds of models to is

259
00:20:37,900 --> 00:20:42,850
Google Photos search. So we launched the Google Photos product, and you can search

260
00:20:42,849 --> 00:20:46,539
the photos that you've uploaded without tagging them at all: you just type "ocean"

261
00:20:46,539 --> 00:20:51,639
and all of a sudden all of your ocean photos show up. So for example this user

262
00:20:51,640 --> 00:20:56,870
posted a screenshot publicly, saying hey, these

263
00:20:56,869 --> 00:21:04,879
statues of Buddha showed up for my search. You know, this is a tough one because

264
00:21:04,880 --> 00:21:09,520
it's got a lot of texture compared to most photos, so we were pretty pleased to

265
00:21:09,519 --> 00:21:18,339
retrieve that one. We have a lot of other kinds of more specific

266
00:21:18,339 --> 00:21:21,730
visual tasks. Like, essentially, one of the things we want to do with our Street View

267
00:21:21,730 --> 00:21:25,819
imagery — we have these cars that drive around the world and take pictures of all the roads

268
00:21:25,819 --> 00:21:29,609
and street scenes — is we want to be able to read all the text that we find.

269
00:21:29,609 --> 00:21:34,909
So first you have to find the text. Well, one of the first things you want to

270
00:21:34,910 --> 00:21:39,720
do is find all the addresses, so the maps are more accurate, and eventually you want to read

271
00:21:39,720 --> 00:21:43,829
all the other text. So you can see here: we have a model that does a
272
00:21:43,829 --> 00:21:47,799
pretty good job of predicting, at a pixel level, which pixels contain

273
00:21:47,799 --> 00:21:53,819
text or not, and it does pretty well.

274
00:21:53,819 --> 00:21:58,289
Well, first of all, it finds lots of text. The training data had different kinds of

275
00:21:58,289 --> 00:22:03,019
characters represented in it, so it has no problem recognizing Chinese characters

276
00:22:03,019 --> 00:22:08,569
or English characters — Roman, Latin characters. It does pretty well with

277
00:22:08,569 --> 00:22:12,889
different colors of text, different fonts and sizes; some of them are

278
00:22:12,890 --> 00:22:17,200
very close to the camera, some are very far away. And this is data

279
00:22:17,970 --> 00:22:24,809
from humans who labeled it — just drew polygons around pieces of text and then

280
00:22:24,809 --> 00:22:27,809
transcribed them — and then we have an OCR model that we also trained.

281
00:22:30,880 --> 00:22:34,500
We've been kind of gradually releasing other kinds of products. We just launched

282
00:22:34,500 --> 00:22:39,799
Cloud Vision APIs; you can do lots of things like label images. This is meant

283
00:22:39,799 --> 00:22:44,859
for people who don't necessarily have machine learning expertise but

284
00:22:44,859 --> 00:22:48,349
just kind of want to do cool stuff with images: you can, you know, have

285
00:22:48,349 --> 00:22:54,990
it run the OCR and find text in any image you

286
00:22:54,990 --> 00:22:58,650
upload; you basically give it an image and ask for OCR or label

287
00:22:58,650 --> 00:23:03,820
generation for that image. People have been pretty happy with

288
00:23:03,819 --> 00:23:06,689
that.

289
00:23:06,690 --> 00:23:10,220
Internally, people have been thinking of more creative uses of

290
00:23:10,220 --> 00:23:13,600
computer vision, essentially now that computer vision sort of really actually

291
00:23:13,599 --> 00:23:19,819
works, compared to five years ago. This is something that our geo team, which

292
00:23:19,819 --> 00:23:23,250
processes satellite imagery, put together and released, which is basically

293
00:23:23,250 --> 00:23:28,740
a way of predicting the slope of roofs from multiple satellite views.

294
00:23:28,740 --> 00:23:32,769
You have, you know, new satellite imagery every few months,

295
00:23:32,769 --> 00:23:36,099
so we have multiple views of the same location, and we can predict what the

296
00:23:36,099 --> 00:23:40,109
slope of the roof is given all those different views of the same location, and

297
00:23:40,109 --> 00:23:43,589
how much sun exposure it gets, and then predict, you know, if you were to

298
00:23:43,589 --> 00:23:48,490
install solar panels, how much energy could you generate. Kind of

299
00:23:48,490 --> 00:23:53,930
cool — it's the sort of random thing you can do now that vision

300
00:23:53,930 --> 00:24:03,160
works. OK, so this class has been mostly about vision, so I'm gonna talk

301
00:24:03,160 --> 00:24:08,029
now about other kinds of problems, like language understanding. One of the most

302
00:24:08,029 --> 00:24:16,779
important problems is search, obviously, so we care a lot about search, and in

303
00:24:16,779 --> 00:24:20,700
particular, if I do the query "car parts for sale," I'd like to determine which of

304
00:24:20,700 --> 00:24:25,400
these two documents is more relevant. If you just look at the surface forms of

305
00:24:25,400 --> 00:24:28,019
the words, that first document looks pretty darn relevant —

306
00:24:28,019 --> 00:24:34,609
like, lots of the words occur in it — but actually the second document is much

307
00:24:34,609 --> 00:24:41,189
more relevant, and we'd like to be able to understand that. So how

308
00:24:41,190 --> 00:24:47,269
much have you talked about embedding models? Awesome, so you know about

309
00:24:47,269 --> 00:24:47,879
embeddings

310
00:24:47,880 --> 00:24:54,680
then, so I will go quickly. But basically you want to

311
00:24:54,680 --> 00:24:58,200
represent words or things that are high-dimensional and sparse by

312
00:24:58,200 --> 00:25:03,559
mapping them into a dense, some-hundred-dimensional or thousand-dimensional space, so

313
00:25:03,559 --> 00:25:11,440
that things that have similar

314
00:25:11,440 --> 00:25:15,029
meanings will end up near each other in that space. So for

315
00:25:15,029 --> 00:25:17,769
example, you'd want porpoises and dolphins to be very near each other in the

316
00:25:17,769 --> 00:25:20,099
embedding space, because they're quite similar words and have similar

317
00:25:20,099 --> 00:25:23,099
meanings; they're sort of the same kind of thing.

318
00:25:24,909 --> 00:25:27,420
OK.

319
00:25:27,420 --> 00:25:32,620
And SeaWorld would be kind of nearby, and "camera lens" should be pretty far away.

320
00:25:32,619 --> 00:25:39,069
And you can train embeddings many ways. One is to have it kind of be the first

321
00:25:39,069 --> 00:25:42,519
thing you do when you're feeding words into an LSTM. An even simpler

322
00:25:42,519 --> 00:25:47,859
thing is a technique my former colleague Tomas Mikolov came up with; we

323
00:25:47,859 --> 00:25:51,969
published a paper about it. Essentially it's called the word2vec model, and

324
00:25:51,970 --> 00:25:55,870
essentially you pick a window of words, maybe twenty words wide, you pick the

325
00:25:55,869 --> 00:26:00,119
center word, and then you pick another random word in the window, and try to use the embedding

326
00:26:00,119 --> 00:26:06,419
representation of that center word to predict the other one. You can train that with

327
00:26:06,420 --> 00:26:11,230
backprop: essentially you adjust the weights of a softmax classifier, and then

328
00:26:11,230 --> 00:26:17,190
in turn, through backpropagation, you make little adjustments to the

329
00:26:17,190 --> 00:26:20,830
embedding representation of that center word, so that next time you'll be able to

330
00:26:20,829 --> 00:26:25,919
better predict the word "parts" from "automobile." And it actually works, right? Like,

331
00:26:25,920 --> 00:26:29,930
one of the really nice things about embeddings is, given enough training

332
00:26:29,930 --> 00:26:34,070
data, you get really phenomenal representations of words.
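A toy numpy sketch of one skip-gram-style word2vec step as just described — predict a nearby word from the center word's embedding and backprop into both the classifier and the embedding; the sizes and the plain softmax are simplifications of what a real implementation would use:

~~~python
import numpy as np

rng = np.random.default_rng(0)

V, D = 1000, 128                           # toy vocabulary size, embedding dim
emb = rng.standard_normal((V, D)) * 0.01   # embedding table
W = rng.standard_normal((D, V)) * 0.01     # softmax classifier weights

center, context = 17, 42                   # word ids from one sampled window
h = emb[center]                            # embedding of the center word
logits = h @ W
p = np.exp(logits - logits.max())
p /= p.sum()
loss = -np.log(p[context] + 1e-8)          # want the nearby word to be likely

# Backprop: nudge the classifier and the center word's embedding so the
# context word becomes easier to predict next time.
grad_logits = p.copy()
grad_logits[context] -= 1.0
grad_h = W @ grad_logits
W -= 0.1 * np.outer(h, grad_logits)
emb[center] -= 0.1 * grad_h
~~~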
[00:26:25] One of the really nice things about embeddings is that, given enough training data, you get really phenomenal representations of words. These are the nearest neighbors for three different words or phrases that are vocabulary items in this particular model. "Tiger shark" you can think of as one embedding vector, and its nearest neighbors show it got the sense of sharks. "Car" is interesting: you can see why this is useful for search, because things people often hand-coded into information-retrieval systems, like plurals, stemming, and simple synonym sets, it just learned. It knows "car," "automobile," "pickup truck," "racing car," "passenger car," and "dealership" are all kind of related. It has this nice, smooth concept of "car," rather than only matching the literal characters c-a-r.

[00:27:26] It also turns out that if you train with the word2vec approach, directions in the embedding space are meaningful, so not only is proximity interesting, directions are too. If you look at capital-and-country pairs, you go in roughly the same direction and distance to get from a country to its corresponding capital, or vice versa, for any such pair. You can also see some semblance of other structure when the embeddings are mapped down to two dimensions with principal components analysis: you see interesting structure around verb tenses, regardless of the verb. Which means you can solve analogies like "queen is to king as woman is to man" by simple vector arithmetic: you literally take the embedding vector, add the difference, and you land approximately at the right point.
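The analogy trick really is just vector addition plus a nearest-neighbor lookup. A sketch of the mechanics, where the embeddings below are random stand-ins (with real word2vec vectors the lookup returns "queen"; with these it only demonstrates the arithmetic):

~~~python
import numpy as np

# Hypothetical trained embeddings: random, unit-normalized stand-ins.
rng = np.random.default_rng(1)
vocab = ["king", "queen", "man", "woman", "paris", "france"]
embedding = {w: v / np.linalg.norm(v)
             for w, v in zip(vocab, rng.normal(size=(len(vocab), 50)))}

def analogy(a, b, c):
    """Return the vocab word nearest to vec(b) - vec(a) + vec(c)."""
    target = embedding[b] - embedding[a] + embedding[c]
    target /= np.linalg.norm(target)
    # exclude the query words themselves, as is conventional
    return max((w for w in vocab if w not in (a, b, c)),
               key=lambda w: float(embedding[w] @ target))

print(analogy("man", "king", "woman"))  # "queen" with real vectors
~~~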
[00:28:26] So, in collaboration with the search team, we launched one of the biggest search-ranking changes of the last few years. We called it RankBrain. It's essentially just a deep net that uses embeddings and a bunch of layers to give a score for how relevant a document is for a particular query, and it's the third most important ranking signal out of hundreds.

[00:28:58] Smart Reply was a little collaboration with the Gmail team. Essentially, replying to mail on your phone kind of sucks, because typing is hard, so we wanted a system that can often predict what a good reply would be just from looking at the message. We have a small network that predicts: is this likely to be something I can give a short, terse response to? If it says yes, we activate a much bigger model. This is a message one of my colleagues received, I believe from his brother: we want to invite you to join us for an early Thanksgiving, bring your favorite dish, RSVP by next week. The model predicts "count us in," "we'll be there," or "sorry, won't be able to make it." It's fantastic if you get a lot of email, although your replies will be somewhat terse, which is nice.

[00:30:02] We can also do interesting things like this mobile app, which runs in airplane mode, so it's actually running the models on the phone. You're essentially using the camera image, detecting text, finding where the words are, doing OCR on them, then running them through a translation model and rendering the result. This demo is just cycling through different languages; normally you'd set it to Spanish and it would only show you Spanish. One thing I hadn't realized is that there's actually an interesting font-selection problem in choosing how to show you the output. It's kind of cool if you're traveling somewhere interesting; I'm actually going to Korea soon, so I'm looking forward to using my translate app.

[00:31:04] One of the things we do a bit of work on is reducing inference costs. There's nothing worse than the feeling that, wow, my model is so awesome, but it just sadly drains my phone's battery, or I can't afford the computation to run it in my data center even though I have a lot of machines. There are lots of tricks you can use. The simplest one: inference is generally much more forgiving of lower-precision computation than training is. For inference we usually find we can quantize all the way down to eight bits, or maybe even a little less, with nice quality; you could probably do six bits, but that doesn't help much. Eight bits gives you a nice 4x memory reduction for storing the parameters, and also roughly 4x computational efficiency, because you can use CPU vector instructions to do four 8-bit multiplies in place of one 32-bit one.
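A minimal sketch of the 8-bit idea: linearly quantize a float32 weight matrix to uint8 for storage and compute, and dequantize on the way out. The generic min/max affine scheme below is an assumption; the talk doesn't specify the exact scheme used internally.

~~~python
import numpy as np

def quantize_uint8(w):
    """Affine-quantize a float32 array to uint8 plus (scale, offset)."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 or 1.0          # avoid divide-by-zero on constants
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.astype(np.float32) * scale + lo

w = np.random.randn(256, 256).astype(np.float32)
q, scale, lo = quantize_uint8(w)
w_hat = dequantize(q, scale, lo)
print(q.nbytes / w.nbytes)          # 0.25: the 4x storage reduction
print(np.abs(w - w_hat).max())      # small, bounded quantization error
~~~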
[00:32:08] Now I've got to tell you about a cuter, more exotic way of getting more efficiency out of a mobile phone: a technique called distillation that Geoffrey Hinton, Oriol Vinyals, and I worked on. Suppose you have a really, really giant model for the problem I just described, a fantastic model you're really pleased with, maybe an ensemble of such models, and now you want a smaller, cheaper model with almost the same accuracy. Here's your giant expensive model: you feed an image in and it gives you fantastic predictions, like 0.95 jaguar, I'm pretty sure, and I'm definitely sure it's not a car, maybe 10^-4 for car, and I'm hedging my bets a little, it could be a lion. That's what a really accurate model will tell you. The main idea (unfortunately we later discovered that Rich Caruana had published a similar idea in 2006, in a paper called model compression) is that the ensemble, your giant accurate model, implements an interesting function from input to output. If you forget that there's structure inside and just use the information contained in that function, how can you transfer the knowledge in that really accurate function into a smaller version of the function?

[00:33:30] When you're training a model, typically you feed in an image and give it targets to try to achieve: a one for "jaguar" and zeros for everything else. I'll call that a hard target. That's the ideal your model strives for, and you give it hundreds of thousands or millions of training images and it tries to approximate all those target vectors. In actual fact it doesn't quite do that, because it produces this nice probability distribution over the different classes for each image. So take the giant expensive model; one of the things we can do is soften that distribution a bit, which is what Geoffrey Hinton calls dark knowledge. If you soften it by dividing all the logits by a temperature, maybe five or ten or something, you get a softer representation of the probability distribution, where you say: OK, it's a jaguar, but I'll also hedge a little and call it a bit of a lion, maybe even less of a cow, and still definitely not a car. And that's something you can then use.
[00:34:56] This soft distribution contains a lot more information about the image, about the function being implemented by the large ensemble. The ensemble is trying to hedge its bets and do a really good job of giving you a probability distribution over the classes for that image. So then you train the small model differently: normally you'd train on just the hard targets, but instead you train on some combination of the hard targets plus the soft targets, and the training objective tries to match some function of those two things.

[00:35:32] This works surprisingly well. Here's an experiment we did on a large speech model. We started with a model that classified 58.9 percent of its frames correctly; that's the big accurate model. We used it to provide soft targets for a smaller model, which also got to see the hard targets, and we trained it on only 3% of the data. The new model with soft targets kept almost all of that accuracy: 57%. With just hard targets it drastically overfits, reaching 44.5% accuracy and then going south. So soft targets are a really, really good regularizer. The other thing is that because the soft targets carry so much information compared to a single one-hot label, you train much, much faster: you get to that accuracy in a small fraction of the time. You can use this approach for things like collapsing ensembles into a single model the size of one member, or going from a large model to a smaller one. It's a somewhat under-appreciated technique.
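A sketch of that training objective: soften the big model's logits with a temperature T and mix the soft cross-entropy with the ordinary hard-label cross-entropy. The mixing weight, T value, and example logits below are hypothetical; the T-squared rescaling of the soft term follows the distillation paper.

~~~python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      T=5.0, alpha=0.5):
    """Cross-entropy against teacher soft targets at temperature T,
    mixed with cross-entropy against the one-hot hard label."""
    soft_targets = softmax(teacher_logits, T)        # the "dark knowledge"
    soft_ce = -np.sum(soft_targets * np.log(softmax(student_logits, T) + 1e-12))
    hard_ce = -np.log(softmax(student_logits)[hard_label] + 1e-12)
    # T*T keeps soft-target gradient magnitudes comparable to the hard term.
    return alpha * (T * T) * soft_ce + (1 - alpha) * hard_ce

teacher = np.array([9.0, 4.0, 3.0, -5.0])  # jaguar, lion, cow, car logits
student = np.array([4.0, 2.0, 1.0, -1.0])
print(distillation_loss(student, teacher, hard_label=0))
~~~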
[00:36:45] OK, let's see. One of the things we did when we thought about building TensorFlow was to take a step back from where we were and ask: what do you really want in a research system? You want a lot of different things, and it's hard to balance them all, but one of the things you really care about as a researcher is ease of expression: I want to be able to take any old research idea and try it out.

[00:37:15] [In response to an audience question:] It was considerably smaller; instead of thousand-wide fully connected layers it was like 600 or 500 wide, which is actually a big difference. But check that paper for the details, I've probably misremembered it.

You also want to be able to take your research idea and run it quickly; you want to run it portably, on both data centers and phones; it's nice to be able to reproduce things; and you want to go from a good research idea to a production system without rewriting it in some other system. Those were the main things we were considering when building TensorFlow and open-sourcing it.

[00:38:15] As you're aware, the first notion is flexibility. In the core bits of TensorFlow we have a notion of different devices; it's portable, running on many different operating systems; we have a core graph-execution engine; and on top of that we have different front ends in which you express the computations you're trying to do. There's a C++ front end, which most people don't use in my experience; mainly there's the Python front end, which most of you are probably more familiar with. But nothing prevents people from putting other languages on top; we wanted it to be fairly language-neutral, and there's work going on to add a Go front end and other languages. And you want to be able to take the model and run it on a pretty wide variety of platforms.

[00:39:09] The basic computational model is a graph. I don't know how much you covered this in your overview of TensorFlow; a little bit? OK. The things that flow along the edges of the graph are tensors: arbitrary n-dimensional arrays with a primitive type like float or int. Unlike pure dataflow models, there's actually state in the graph: you have things like "biases," which is a variable, and operations that can update state, so you can go through the whole graph, compute some gradients, and then adjust the biases based on those gradients.
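For flavor, here is roughly what that stateful-graph idea looks like in the graph-era (1.x-style) Python front end. This is a sketch using API names from somewhat later 1.x releases than the version current at the time of the talk, so treat the exact identifiers as assumptions.

~~~python
import tensorflow as tf  # TF 1.x-era graph API

# Build the graph once: state (a variable) plus ops that read and update it.
# tf.device() is the kind of soft placement hint mentioned above.
with tf.device("/gpu:0"):
    x = tf.placeholder(tf.float32, shape=[None, 4])  # input tensor
    b = tf.Variable(tf.zeros([4]), name="biases")    # state in the graph
    y = tf.reduce_mean(tf.square(x + b))             # some computation
    grad_b = tf.gradients(y, b)[0]
    train_op = b.assign_sub(0.1 * grad_b)            # op that updates state

config = tf.ConfigProto(allow_soft_placement=True)  # fall back if no GPU
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(5):                               # ...then run it many times
        sess.run(train_op, feed_dict={x: [[1., 2., 3., 4.]]})
    print(sess.run(b))                               # biases moved toward -x
~~~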
[00:39:45] The graph goes through a series of stages. One important stage is deciding, given a whole bunch of computational devices and the graph, where to run each node's computation. For example, here we might have a CPU in blue and a GPU card in green, and we might want to run the graph so that all of the costly computation happens on the GPU. As an aside, these placement decisions are kind of tricky. We let users provide hints to guide this a bit; the hints are not necessarily hard constraints on the exact device, but might be something like "you should really try to run this on a GPU," or "place it on task seven and I don't care which device." Given the hints, we want to minimize the running time of the graph subject to all kinds of other constraints, like the memory available on each GPU card or on the CPUs. I think it'd be interesting to attack this placement problem with reinforcement learning, because you can actually measure an objective here: if I place this node, this node, and this node in this way, how fast is my graph? That would be a pretty interesting reinforcement learning problem.

[00:41:02] Once we've decided where to place things, we insert send and receive nodes, which encapsulate all the communication in the system. Basically, to move a tensor from one place to another, the send node holds onto the tensor until the receive node checks in and says "I'd really love that data." You do this for all the edges that cross device boundaries, and you have different implementations of send/receive pairs depending on the devices: if the GPUs are on the same machine you can often DMA directly from one GPU's memory to the other's; if they're on different machines you do a cross-machine RPC; or your network might support RDMA, in which case you reach directly into the GPU memory on the other machine and grab the data. You can also define new operations and kernels pretty easily.

[00:42:00] The session interface is essentially how you run the graph. Typically you set up a graph once and then run it many times, which lets the system do a lot of optimization: decisions about how to place computation nodes, perhaps even experiments like "does it make more sense to put this here or there," because it can amortize those decisions over many runs. In the single-process configuration everything runs in one process with simple procedure calls. In a distributed setting there's a client process, a master process, and a bunch of workers that have devices; the client tells the master "I'd like to run this subgraph," and the master says, OK, that means I need to talk to processes one and two to tell them to do stuff. You can feed in data and fetch outputs, which means I might have a more complex graph but only need to run little bits of it, because I only need to run the parts of the computation that the outputs I ask for actually depend on.
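The feed/fetch point is worth a tiny sketch (same 1.x-era API caveat as above): fetching only `z` below means the runtime executes only the slice of the graph `z` depends on, and the unrelated node never runs.

~~~python
import tensorflow as tf  # TF 1.x-era graph API; a sketch

a = tf.placeholder(tf.float32)
b = a * 2.0
c = b + 100.0      # never fetched below, so this node is never executed
z = b * b

with tf.Session() as sess:
    # Fetch z, feed a: only the a -> b -> z slice of the graph runs.
    print(sess.run(z, feed_dict={a: 3.0}))   # 36.0
~~~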
[00:43:05] Based on our history, we focused a lot on being able to scale this in a distributed environment. Actually, one of the biggest gaps when we first open-sourced TensorFlow was that we hadn't quite carved apart an open-sourceable distributed implementation; that was GitHub issue number 23, filed within about a day of our release: hey, where's the distributed version? We did the initial release of that last Thursday, so that's good. It'll get better packaging, but at the moment you hand-configure multiple processes with the names, IP addresses, and ports of the other processes involved; we're going to package that up better in the next couple of weeks. The whole reason to have it is that you want much better turnaround time for experiments. If you're in the mode where your train-and-evaluate iteration is minutes or hours, that's really, really good. If you're in the mode of multiple weeks, that's kind of hopeless; at more than a month you generally won't do the experiment at all, or if you do, by the time training is done you've forgotten why you ran it. So we really emphasize in our group making it possible for people to do experiments as fast as is reasonable.

[00:44:33] The two main things we do are model parallelism and data parallelism; I'll talk about both. You've covered this a little bit already? OK. The best way to decrease training time is to decrease the step time, and one of the really nice properties of most of these models is that there's lots and lots of inherent parallelism. If you think about a convolutional model, there's lots of parallelism in each of the layers, because all the spatial positions are mostly independent, so you can just run them in parallel on different devices. The problem is figuring out how to distribute the computation so that the communication doesn't kill you. One thing that helps is local connectivity: convolutional neural nets have the nice property that a neuron generally looks at, say, a five-by-five patch of data below it and doesn't need anything else, and the neuron next to it has a lot of overlap in the data it needs. So you can have towers with little or no connectivity between them, where every few layers you communicate a little bit but mostly you don't. The AlexNet
paper did that: it essentially had two separate towers that mostly ran independently on two different GPUs and occasionally exchanged some information, and you can get specialized parts of the model that way.

[00:45:59] There are lots of ways to exploit parallelism. When you naively compile matrix-multiply code with gcc or something, it will probably already take advantage of the instruction-level parallelism on an Intel CPU core; across cores you can use thread parallelism. Across devices, communication is often the limit: you have something like a factor of 30 to 40 better bandwidth to a device's own local memory than to another GPU card's memory on the same machine, and across machines it's in general even worse. So it's pretty important to keep as much data local as you can and avoid communicating too much. In model parallelism the basic idea is that you partition the computational model somehow, maybe spatially like this, maybe layer by layer, and then, in this example, the only communication I need is at this boundary: some of the data from partition two is needed as input to partition one, but mostly everything stays local.

[00:47:16] The other technique you can use for speeding up convergence is data parallelism. Here you use many replicas of the same model structure, and they all collaborate to update the parameters held in some shared set of servers that hold the parameter state. Speedups depend a lot on the kind of model; it could be a 10-40x speedup for 50 replicas. Sparse models, with really large embeddings for every vocabulary word known to man, can generally support even more parallelism, because most updates only touch a handful of the embedding entries: a sentence has maybe 10 unique words out of a million, so you can have many hundreds or thousands of replicas doing useful work.

[00:48:03] So the basic idea in data parallelism: you have these different model replicas, and a centralized system that keeps track of the parameters. That may not be a single machine; it may be a lot of machines, because you sometimes need a lot of network bandwidth to keep all the model replicas fed with parameters. In a big setup that might be, say, a hundred and twenty-seven machines, with a few hundred replicas of the model below them. Before every mini-batch, a model replica grabs the parameters; it says, OK, you hundred and twenty-seven machines, give me the parameters. Then it does the computation for the mini-batch and computes the gradient, and rather than applying the gradient itself it sends the gradient back to the parameter servers; the parameter servers update the current parameter values, and before the next step the replica does the same thing again.
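A toy rendition of that asynchronous loop, with Python threads standing in for replicas and a plain dict standing in for the sharded parameter servers; the objective (pull the weights toward the data mean) and all names are hypothetical scaffolding.

~~~python
import threading
import numpy as np

params = {"w": np.zeros(4)}                    # stand-in parameter server
lock = threading.Lock()
data = np.random.default_rng(42).normal(size=(1000, 4))

def replica(seed, steps=200, lr=0.05):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        with lock:
            w = params["w"].copy()             # 1. fetch current parameters
        batch = data[rng.integers(0, len(data), size=32)]
        grad = 2 * (w - batch.mean(axis=0))    # 2. gradient of ||w - mean||^2
        with lock:
            params["w"] -= lr * grad           # 3. send (possibly stale) update

threads = [threading.Thread(target=replica, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(params["w"])   # near the data mean despite stale gradients
~~~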
[00:48:58] This is really network-intensive, depending on your model. What helps here are models that don't have very many parameters relative to their computation (convolutions are really nice in that respect, LSTMs are nice in that respect) because you're essentially reusing every parameter lots of times. You already reuse each parameter however big your batch size is: if your batch size is 128, you reuse the parameter 128 times across the examples in the batch. With a convolutional model you get an additional factor of reuse, maybe 10x or so, across the different spatial positions
in a layer. And in an LSTM, if you unroll a hundred time steps, you reuse each parameter another hundred times just from the unrolling. Models with lots of computation per parameter generally work better in data-parallel environments.

[00:49:57] Now, there's an obvious issue depending on how you do this. One way is completely asynchronous: every model replica sits in a loop, getting the parameters, doing a mini-batch, computing a gradient, and sending it up. If you do that asynchronously, the gradient a replica computes may be completely stale with respect to where the parameters are now: it computed the gradient against this parameter value, but meanwhile ten other replicas have made updates and the parameters have meandered over here, and now you apply a gradient that was meant for over there. This makes optimization theoreticians incredibly uncomfortable (they're already uncomfortable, because these are completely non-convex problems), but the good news is that it works, up to a certain level. It would be really good to understand, on a theoretical basis, the conditions under which this works; in practice it does seem to work pretty well.

[00:50:48] The other thing you can do is be completely synchronous: one driving loop says, OK, everyone go; all the replicas get the parameters and compute gradients; then you wait for the gradients to show up, aggregate them, and apply them once. That effectively just looks like a giant batch: R replicas look like R times each individual replica's batch size. That sometimes works. You get diminishing returns from larger and larger batch sizes, but the more training examples you have, the more tolerant you are of big batches: with a trillion training examples, a batch size of a thousand is OK; with a million training examples, a batch size of a thousand is not so great. And there are even more complicated choices, like hybrids of the two. As I said, our current models reuse parameters a lot, so data parallelism is actually really, really important for almost all of our models; it's how we get to the point of training models in half a day or a day, generally.
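For contrast with the asynchronous sketch earlier, here is the synchronous variant: one driving loop, R replica gradients averaged and applied once, behaving like one R-times-larger batch. Same hypothetical toy objective as before.

~~~python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, size=(10_000, 4))
w = np.zeros(4)
R, batch_size, lr = 4, 32, 0.05

for step in range(200):
    # "Everyone go": each replica computes a gradient on its own mini-batch.
    grads = []
    for _ in range(R):
        xb = data[rng.integers(0, len(data), size=batch_size)]
        grads.append(2 * (w - xb.mean(axis=0)))
    # Wait for all gradients, aggregate, apply once:
    # an effective batch of R * batch_size examples.
    w -= lr * np.mean(grads, axis=0)

print(w)   # near the data mean, roughly [3, 3, 3, 3]
~~~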
[00:52:10] So you see some of the rough kinds of setups we use. Here's an example training curve for an ImageNet model with one GPU, 10 GPUs, and 50 GPUs, and the kind of speedup you get. Sometimes these graphs are deceiving: the difference between 10 and 50 GPUs doesn't look that big, the lines are kind of close to each other, but in actual fact the difference is something like a factor of 4.1. It doesn't look like a 4.1x difference, does it? But it is: the way to read it is to look at where each curve crosses a given accuracy.

[00:52:59] Let me show you some of the slight tweaks you make to TensorFlow models to exploit these kinds of parallelism. We wanted these parallelism notions to be pretty easy to express, and one of the things I like about TensorFlow is that the code maps pretty well to what you might see in a research paper. I won't ask you to read all of this, but it's not too different from what you'd see written in a paper, like a simple LSTM cell. This is the sequence-to-sequence model that Oriol Vinyals, Ilya Sutskever, and Quoc Le published in 2014, where you're essentially trying to take an input sequence and map it to an output sequence. This is a really big area of research; it turns out these models are applicable to lots and lots of kinds of problems, and lots of groups are doing interesting, active work in the area. Here are just some examples of work from the last year and a half from different labs around the world.

[00:54:17] You've already talked about the captioning models: instead of a word sequence you can feed in pixels, run them through a CNN, use that as your initial state, and then generate captions. Pretty amazing. Could we have done that five years ago? I don't think so; not for a while. And since it's a generative model, you can generate different sentences by sampling from the distribution. Both of these are not bad captions; they're not quite as sophisticated as the human ones. One thing to note: if you only train the model a little bit, you'll see that it's really important to train your
model to convergence: that caption's not so bad, but if you train the same model longer, it just gets a lot better. Same thing here: "a train that is sitting on the tracks" — yes, that's true, and the longer-trained caption is better, but you still see the human has a lot more sophistication; they know the train is crossing the tracks near a depot, which is a more subtle thing for the model to pick up on.

[00:55:42] Another kind of cute use: you can use these models to solve all kinds of cool geometric problems. Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly did this work where you start with a bunch of points and try to predict the traveling-salesman tour for them, or the convex hull or Delaunay triangulation of the points, which is kind of cool. It's just a sequence-to-sequence problem: you feed in the sequence of points, and the output is the right set of points for whatever problem you care about.

[00:56:21] OK, so LSTMs. Once you have the LSTM cell code I showed you, you can unroll it in time, say twenty time steps. Say you want four layers per time step instead of one: you make a small change to your code and you've done it. Now that you have four layers of computation, one thing you might want is to run each of those layers on a different GPU, and that's the change you'd make in the TensorFlow code to do it. That lets you have a model like this: here's my sequence, these are the different deep LSTM layers per time step, and after the first little bit I can start getting more and more GPUs involved and essentially pipeline the entire thing. There's a giant softmax at the top that you can also split across GPUs pretty easily. That's model parallelism: we've now got six GPUs in this picture (we actually split that softmax across several GPUs), and every replica would be a set of GPU cards on the same machine, all humming along; and you might use data parallelism in addition to that, training a bunch of those multi-GPU replicas, to train quickly.

[00:57:45] We have this notion of queues, where you can have part of the graph do a bunch of work and stuff the results into a queue, and then another bit of the graph starts by dequeuing things and carries on. One example: you might want to prefetch inputs, do the JPEG decoding to convert them into arrays, maybe do some whitening and cropping and random selection, and then dequeue on a GPU card or something. We also group similar examples: for translation work we bucket by sentence length, so a batch holds examples of roughly the same length, all 13-to-16-word sentences or something, which means we only execute exactly that many unrolled steps rather than some arbitrary maximum sentence length. And for randomization of examples, a shuffling queue just holds a whole bunch of examples and hands you random ones back out.
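A sketch of the bucketing-by-length trick for batching variable-length sentences; the bucket boundaries, batch size, and padding token below are made-up illustrations (note sentences longer than the last boundary are simply dropped here):

~~~python
import random
from collections import defaultdict

def bucketed_batches(sentences, boundaries=(8, 16, 32), batch_size=4, seed=0):
    """Group sentences whose lengths fall in the same bucket, so each
    batch only needs to be unrolled to its bucket's length."""
    buckets = defaultdict(list)
    for s in sentences:
        for b in boundaries:
            if len(s) <= b:
                buckets[b].append(s)
                break                       # longer sentences are dropped
    rng = random.Random(seed)
    for b, group in buckets.items():
        rng.shuffle(group)                  # a shuffling queue, roughly
        for i in range(0, len(group), batch_size):
            # pad only to the bucket length, not the corpus maximum
            yield [s + ["<pad>"] * (b - len(s))
                   for s in group[i:i + batch_size]]

sents = [["w"] * n for n in (3, 5, 7, 12, 13, 14, 15, 30)]
for batch in bucketed_batches(sents):
    print(len(batch), "sentences padded to length", len(batch[0]))
~~~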
[00:58:55] Data parallelism again: we want many replicas of this thing, and you make a modest amount of changes to your code. We're not quite as happy with the amount of change needed here, but this is roughly what you do: there's a supervisor that takes a bunch of settings; you say here's the set of devices and prepare the session, and then each replica runs a local loop. You keep track of how many steps have been applied globally across all the replicas and stop as soon as the cumulative sum is big enough. Asynchronous training looks like that: three separate client threads driving three separate replicas against shared parameters. One of the big changes from DistBelief to TensorFlow is that we don't have a separate parameter-server notion: we have tensors and variables (variables contain tensors), and they're just other parts of the graph that you typically map onto a small set of devices that hold your parameters. It's all unified in the same framework; whether the tensor I'm sending holds parameters or activations or whatever doesn't matter. And this is the synchronous version: one client splits its batch across three replicas, adds the gradients, and applies them.

[01:00:22] Models turn out to be pretty tolerant of reduced precision, so we convert to 16-bit floats for communication. There's actually an IEEE standard for 16-bit floating point now, but most CPUs and GPUs didn't quite support it yet, so we implemented our own sixteen-bit format, which is essentially a 32-bit float with two bytes lopped off the mantissa. You should really do stochastic rounding, but we don't, and it's sort of OK; it just gets converted back to 32 bits on the other side by filling in the missing bits with zeros.
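That lop-off-two-bytes format (essentially what later became known as bfloat16) can be sketched with NumPy bit tricks. Truncation toward zero, no stochastic rounding, matching the "we don't" remark above; a sketch, not the internal implementation.

~~~python
import numpy as np

def to_b16_bits(x):
    """Keep the top 16 bits of each float32: sign, exponent, 7 mantissa bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)        # 2 bytes on the wire

def from_b16_bits(b):
    """Fill the dropped mantissa bytes with zeros, reinterpret as float32."""
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([3.14159, -0.001, 123456.789], dtype=np.float32)
print(from_b16_bits(to_b16_bits(x)))   # close to x: ~2-3 decimal digits kept
~~~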
[01:01:15] So you use model and data parallelism in conjunction, and combined they really let you train models quickly. That's what this is all really about: being able to take a research idea, try it out on a large dataset representative of a problem you care about, figure out whether it worked, and figure out the next set of experiments. It's pretty easy to express in TensorFlow; we're not too happy with the boilerplate for asynchronous data parallelism yet, but in general it's not too bad.

[01:01:44] We open-sourced TensorFlow because we think it will make it easier to share research, and because having lots of people using the system outside Google is a good way to improve it and bring in ideas we don't necessarily have. It also makes it pretty easy to deploy machine learning into real products, because you can go from a research idea to something running on a phone relatively easily. The community of TensorFlow users outside Google is growing, which is nice, and they're doing all kinds of cool things. I picked a few random examples posted on GitHub. Andrej has this ConvNetJS thing that runs in your browser using JavaScript, and one of his demos is a little reinforcement-learning game where the yellow dot learns to eat the green dots and avoid the red dots; someone reimplemented that in TensorFlow and actually added orange dots that are really bad. Someone implemented the really nice style-transfer paper from the University of Tübingen and the Max Planck Institute, which you may have seen: you take a picture and a painting and render the picture in the style of the painting, and you end up with cool stuff like that. There's Keras, a popular higher-level library that makes it easier to express neural nets. Someone implemented the neural captioning model in TensorFlow. And there's an effort underway to translate the documentation into Mandarin.

[01:03:31] Cool, great. The last thing I'll talk about is the Brain Residency program. We started this program as a bit of an experiment this year, so this is more of an FYI for next year, because applications are closed; we're
actually selecting our final candidates this week. The idea is that people spend a year in our group doing deep learning research, and the hope is they'll come out having published a couple of papers on arXiv or submitted to conferences, having learned a lot about doing interesting machine learning research. We'll be looking for people for next year; anyone taking this class is obviously of interest, and we'll reopen applications in the fall, so if you're graduating next year, there's an opportunity there for you. There's a bunch more reading here: start with the white paper (I did a lot of work in it to make the whole set of references clickable) and click your way through to the other papers. OK, so I'm done a bit early. [inaudible]

[01:05:02] [Audience question, inaudible.] Yes, so those kinds of things are actually tricky, and we have a pretty extensive, detailed process for anything touching users' private data. For Smart Reply, essentially every reply it will ever generate is something that has been said by thousands of users: the input to the model at training time is an email, which is typically private, but the only things it will ever suggest are responses that were generated by a sufficient number of unique users, which protects users' privacy. Those are the kinds of things you think about when designing products like this; there's a lot of care and thought going into "we think this will be a great feature, but how can we do it in a way that ensures people's privacy is protected?"

[01:06:52] [Audience question, inaudible.] Not as much as we probably should have explored it; it's just been one of the things on the back burner compared to everything else we've been working on. I do think the notion of specialist models is interesting. I didn't talk about that at all, but essentially we had a sort of arbitrary image classification model, like JFT, which is an internal dataset with something like seventeen thousand classes, and we trained a good general model that could
deal with all those classes. Then we algorithmically found interesting confusable sets of classes, like all the kinds of mushrooms in the world, and we trained specialists on datasets enriched to be primarily mushroom data with occasional random images. We could train fifty such models, each good at different kinds of things, and get pretty significant accuracy increases, and at the time we were able to distill them into a single model pretty well. We haven't really pursued it much since; it turned out the mechanics of training fifty separate models and then distilling them were a bit unwieldy.

[01:08:38] [Audience question, inaudible.] That's worth exploration and further research. As you say, it clearly demonstrates that it's a different objective we're giving the model: we're telling it not just "use this hard label," but also "match this incredibly rich signal," which carries a hundred other pieces of information. So in some sense it's an unfair comparison; you're telling it a lot more about every example. Sometimes it's not so much an optimization issue as a sign that we should be figuring out how to feed richer signals than a single binary label to our models. I think that's an interesting area to pursue. We've also thought about ideas like a big ensemble of models all training collectively and exchanging information in the form of their predictions rather than their parameters, which might be a much cheaper, more network-friendly way of collaboratively training on a really big dataset: each model trains on 1% of the data or something and they swap predictions.

[01:10:39] Yeah, I think all these kinds of ideas are worth pursuing. The captioning work is interesting, but you tend to have many fewer caption labels than images with hard labels like "jaguar," at least labels prepared in a clean way. Actually, I'm aware there are a lot of images with sentences written about them; the trick is identifying which sentence is about which image. For some problems you don't need to retrain very often; speech recognition is a good example, since it's not as if human vocal cords change that often, though the words people say change a little. Word distributions tend to be somewhat non-stationary:
the words everyone collectively says tomorrow are pretty similar to the ones they say today, but subtly different; "Long Island Chocolate Festival" might suddenly become more and more prominent over the next two weeks. You need to be cognizant that you want to capture those kinds of effects, and one way is to train your model online in some manner. It doesn't need to be fully online, where you get an example and immediately update the model; updating every five or ten minutes, or every hour or day, is sufficient for most problems. But it is pretty important for non-stationary problems like ads or search queries, things that change over time like that.

[01:12:33] [Audience question, inaudible.] It's the third most important signal; I can't say more than that, yes.

[01:12:45] Yeah, noise in training datasets happens all the time. Even if you look at the ImageNet examples, occasionally you'll come across a mislabeled one. Actually, I was just in a meeting with some people working on visualization techniques, and one of the things they were visualizing was CIFAR input data: they had this compact presentation of all the CIFAR examples, each mapped onto a few pixels, sixty thousand images on one screen, and you could pick examples out and inspect them. Here's one the model predicted with high confidence but "got wrong": the model said airplane, and you look at the image and it is an airplane; the label just isn't airplane. You understand why it got marked wrong. So you want your dataset as clean as possible, because training on noisy data is generally not as good as training on clean data; but on the other hand, expending too much effort on cleaning is often more effort than it's worth. It makes sense to do some filtering to throw out the obviously bad stuff, and generally more, somewhat noisy data is often better than less, cleaner data.

[01:14:18] [Audience question, inaudible.] It depends on the problem, but usually I'd pick one thing to try, and then if you're unhappy with the result, investigate why. OK, thank you.

diff --git a/captions/En/Lecture1_en.srt
b/captions/En/Lecture1_en.srt new file mode 100644 index 00000000..c836c419 --- /dev/null +++ b/captions/En/Lecture1_en.srt @@ -0,0 +1,3514 @@
+1
+00:00:00,000 --> 00:00:03,899
+There are more seats on the side.
+
+2
+00:00:03,899 --> 00:00:19,868
+People are walking in late.
+So, just to make sure: you're in CS231n,
+
+3
+00:00:19,868 --> 00:00:23,969
+the deep learning neural network class for
+visual recognition.
+
+4
+00:00:23,969 --> 00:00:33,549
+Anybody in the wrong class? OK, good.
+Alright. So, welcome and happy new year, happy first day of the winter quarter.
+
+5
+00:00:33,549 --> 00:00:41,069
+So, this class, CS231n.
+This is the second offering of this class,
+
+6
+00:00:41,070 --> 00:00:48,738
+and we have literally doubled our enrollment,
+from 180 people last time we offered it to
+
+7
+00:00:48,738 --> 00:00:55,939
+about 350 of you signed up.
+Just a couple of words to make us all legally
+
+8
+00:00:55,939 --> 00:01:02,570
+covered: we are video recording this class.
+So, you know, if you're
+
+9
+00:01:02,570 --> 00:01:10,680
+uncomfortable about this, for today just
+go behind that camera or go to the corner where that
+
+10
+00:01:10,680 --> 00:01:18,280
+camera's not gonna turn, but we are going to send
+out forms for you to fill out in terms
+
+11
+00:01:18,280 --> 00:01:25,228
+of allowing video recording.
+So, that's just one bit of housekeeping.
+
+12
+00:01:25,228 --> 00:01:32,200
+So, alright. My name is Fei-Fei Li, a professor
+in the computer science department.
+
+13
+00:01:32,200 --> 00:01:37,960
+So, this class I'm co-teaching with two
+senior graduate students, and one of them
+
+14
+00:01:37,961 --> 00:01:45,839
+is here. He's Andrej Karpathy. Andrej, can you just say hi to everybody?
+I don't think Andrej needs too much
+
+15
+00:01:45,840 --> 00:01:48,659
+introduction. A lot of you probably know his work,
+
+16
+00:01:48,659 --> 00:01:53,960
+follow his blog or his Twitter.
+
+17
+00:01:53,961 --> 00:02:02,509
+Andrej has way more followers than I do.
+So, he's very popular. And also Justin Johnson, who is still
+
+18
+00:02:02,510 --> 00:02:08,200
+traveling internationally but will be
+back in a few days. So, Andrej and Justin
+
+19
+00:02:08,201 --> 00:02:14,509
+will be picking up the bulk of the
+lecture teaching, and today I'll be giving
+
+20
+00:02:14,509 --> 00:02:20,039
+the first lecture. But, as you probably can
+see, I'm expecting a newborn very soon —
+
+21
+00:02:20,039 --> 00:02:28,239
+we're talking weeks — so you'll see more of
+Andrej and Justin at lecture time. We will
+
+22
+00:02:28,239 --> 00:02:34,189
+also introduce a whole team of TAs
+towards the end of this lecture.
+
+23
+00:02:34,189 --> 00:02:38,959
+Again, people who are looking for seats: go
+out of that door and come back. There's
+
+24
+00:02:38,959 --> 00:02:47,039
+a whole bunch of seats on the side.
+So, for this lecture, we're going to
+
+25
+00:02:47,039 --> 00:02:53,519
+give an introduction to the class,
+what kind of problems we work on, and the
+
+26
+00:02:53,519 --> 00:03:03,530
+tools we'll be learning. So, again, welcome
+to CS231n. This is a vision class.
+
+27
+00:03:03,530 --> 00:03:09,140
+It's based on a very specific
+modeling architecture called the neural network,
+
+28
+00:03:09,141 --> 00:03:16,000
+and more specifically, mostly
+on the Convolutional Neural Network,
+
+29
+00:03:16,000 --> 00:03:23,799
+and a lot of you have heard this term, maybe through a
+popular press article or
+
+30
+00:03:23,799 --> 00:03:34,239
+coverage; we tend to call this the deep
+learning network. Vision is one of the fastest growing fields of
+
+31
+00:03:34,239 --> 00:03:40,920
+artificial intelligence.
+In fact, Cisco has estimated that
+
+32
+00:03:40,921 --> 00:03:50,018
+by 2016 — a date at which we
+have already arrived — more than 85% of
+
+33
+00:03:50,019 --> 00:03:56,230
+the data in Internet cyberspace is in the form of pixels,
+
+34
+00:03:56,231 --> 00:04:05,329
+or what they call multimedia.
+So, we basically have entered an age of vision,
+
+35
+00:04:05,330 --> 00:04:12,530
+of images and videos.
+Why is this so? Well, to a large extent it
+
+36
+00:04:12,530 --> 00:04:20,858
+is because of the explosion of both the
+Internet as a carrier of data as well as
+
+37
+00:04:20,858 --> 00:04:25,930
+sensors. We have more sensors than the
+number of people on the Earth these days.
+
+38
+00:04:25,930 --> 00:04:32,000
+Every one of you is carrying some kind
+of smart phone or digital camera, and,
+
+39
+00:04:32,000 --> 00:04:37,879
+you know, cars are running on the
+street with cameras. So, the sensors
+
+40
+00:04:37,879 --> 00:04:46,500
+have really enabled the explosion of
+visual data on the Internet. But visual
+
+41
+00:04:46,500 --> 00:04:55,209
+data, or pixel data, is also the hardest
+data to harness. So, if you have heard my
+
+42
+00:04:55,209 --> 00:05:07,810
+previous talks and some other talks
+by computer vision professors, we call this the dark matter of the Internet.
+
+43
+00:05:07,810 --> 00:05:13,879
+Why is this the dark matter? Just like
+the universe consists of 85% dark
+
+44
+00:05:13,879 --> 00:05:19,409
+matter and dark energy — the matter and
+energy that are very hard to observe;
+
+45
+00:05:19,410 --> 00:05:25,919
+we can only infer them by mathematical models of
+the universe — on the Internet, pixel
+
+46
+00:05:25,920 --> 00:05:30,649
+data are the dark matter: the data we don't know.
+We have a hard time
+
+47
+00:05:30,649 --> 00:05:36,239
+grasping its contents. Here's one very
+simple fact for you to consider:
+
+48
+00:05:36,240 --> 00:05:39,090
+today, on
+
+49
+00:05:39,091 --> 00:05:49,560
+YouTube's servers, every 60 seconds we
+have more than 150 hours of videos uploaded,
+
+50
+00:05:49,560 --> 00:05:54,089
+onto YouTube's servers, for every 60 seconds.
+
+51
+00:05:54,089 --> 00:06:02,739
+Think about the amount of data.
+There is no way that human eyes can sift through
+
+52
+00:06:02,740 --> 00:06:07,829
+this massive amount of data and annotate it,
+
+53
+00:06:07,829 --> 00:06:14,009
+label it, and describe the contents.
+So, think from the
+
+54
+00:06:14,009 --> 00:06:20,980
+perspective of the YouTube team or the Google company.
+If they want to help us
+
+55
+00:06:20,980 --> 00:06:25,640
+to search, index, manage,
+and of course, for their purposes,
+
+56
+00:06:25,641 --> 00:06:31,529
+put advertisements on or otherwise work with
+the content of the data, we're at a loss,
+
+57
+00:06:31,529 --> 00:06:38,919
+because nobody can hand-annotate this.
+The only hope we can do this is through vision
+
+58
+00:06:38,920 --> 00:06:44,640
+technology. 
To be able to label the
+objects, find the scenes, find the frames —
+
+59
+00:06:44,641 --> 00:06:50,349
+you know, locate where in that basketball video
+Kobe Bryant is making that
+
+60
+00:06:50,350 --> 00:06:57,320
+awesome shot. So, these are the
+problems that we are facing today: the
+
+61
+00:06:57,321 --> 00:07:02,860
+massive amount of data and the
+challenges of the dark matter.
+
+62
+00:07:02,860 --> 00:07:07,379
+So, computer vision is a field that
+touches upon many other fields of
+
+63
+00:07:07,379 --> 00:07:12,740
+study. So, I am sure that even sitting here,
+
+64
+00:07:12,740 --> 00:07:18,050
+many of you come from computer science, but
+many of you come from biology or psychology,
+
+65
+00:07:18,050 --> 00:07:24,389
+are specializing in natural language
+processing or graphics or robotics,
+
+66
+00:07:24,389 --> 00:07:30,680
+or, you know, medical imaging and so on.
+So, as a field, computer vision is a
+
+67
+00:07:30,680 --> 00:07:37,329
+truly interdisciplinary field.
+The problems we work on and the models we use
+
+68
+00:07:37,329 --> 00:07:43,849
+touch on engineering, physics, biology,
+psychology, computer science, and mathematics.
+
+69
+00:07:43,850 --> 00:07:51,030
+So, just a little bit of a more personal touch:
+I am the director of the computer vision lab
+
+70
+00:07:51,031 --> 00:07:58,589
+at Stanford. In our lab,
+I work with graduate students and post-docs and even
+
+71
+00:07:58,589 --> 00:08:04,669
+undergraduate students on a number of
+topics most dear to our own research.
+
+72
+00:08:04,670 --> 00:08:10,540
+Some of them, you know,
+Andrej and Justin, come from my lab.
+
+73
+00:08:10,540 --> 00:08:17,780
+A number of TAs come from my lab.
+We work on machine learning, which is
+
+74
+00:08:17,781 --> 00:08:26,109
+a superset of deep learning.
+We work a lot on cognitive science and neuroscience, as well
+
+75
+00:08:26,110 --> 00:08:31,270
+as the intersection with NLP and
+speech. So that's the kind of
+
+76
+00:08:31,269 --> 00:08:40,399
+landscape of computer vision research that my lab works on.
+So, also to put things
+
+77
+00:08:40,399 --> 00:08:45,600
+in a little more perspective: what other
+computer vision classes do we offer
+
+78
+00:08:45,600 --> 00:08:51,050
+here at Stanford through the computer science department?
+Clearly, you're in
+
+79
+00:08:51,049 --> 00:08:59,629
+this class, CS231n.
+Some of you who
+have never taken computer vision,
+
+80
+00:08:59,629 --> 00:09:06,220
+who probably heard of computer vision for the first time,
+should probably have already
+
+81
+00:09:06,220 --> 00:09:14,730
+done CS131. That's an intro class
+we offered the previous quarter.
+
+82
+00:09:14,730 --> 00:09:19,779
+And then next quarter — it normally is
+offered this quarter, but this year it's a
+
+83
+00:09:19,779 --> 00:09:25,069
+little shifted — there is an important
+graduate-level computer vision class
+
+84
+00:09:25,070 --> 00:09:31,840
+called CS231a, offered by Professor
+Silvio Savarese, who works in robotic and
+
+85
+00:09:31,840 --> 00:09:47,230
+3D vision. And a lot of you ask the
+question: do these replace each other?
+CS231n versus
+
+86
+00:09:47,230 --> 00:09:56,639
+CS231a — and the answer is no.
+If you're interested in a broader
+
+87
+00:09:56,639 --> 00:10:03,220
+coverage of tools and topics of computer
+vision, as well as some of the
+
+88
+00:10:03,220 --> 00:10:11,009
+fundamental topics related
+to 3D vision, robotic vision,
+
+89
+00:10:11,009 --> 00:10:17,269
+and visual recognition, you should
+consider taking 231a. That is the
+
+90
+00:10:17,269 --> 00:10:26,039
+more general class. 231n, which we will go
+into starting today, more deeply focuses
+
+91
+00:10:26,039 --> 00:10:33,329
+on a specific angle of both problem and
+model. The model is the neural network and the
+
+92
+00:10:33,330 --> 00:10:38,580
+problem is visual recognition, mostly;
+but of course they have a little bit of
+
+93
+00:10:38,580 --> 00:10:47,990
+overlap, but that's the major difference.
+Next quarter, we also have possibly
+
+94
+00:10:47,990 --> 00:10:55,590
+a couple of advanced seminar-level
+classes, but those are still in
+
+95
+00:10:55,590 --> 00:11:01,649
+formation, so you just have to check the syllabus.
+So, that's the kind of computer
+
+96
+00:11:01,649 --> 00:11:11,409
+vision curriculum we offer this year at
+Stanford. Any questions so far? Yes?
+
+97
+00:11:11,409 --> 00:11:20,879
+131 is not a strict requirement for this class,
+but you should know that if you've
+ +114 +00:13:00,129 --> 00:13:05,360 +They inform each other and you'll see through +the history of deep learning a little + +115 +00:13:05,360 --> 00:13:13,000 +bit that the coalition on your network +architecture come from the need to solve + +116 +00:13:13,000 --> 00:13:15,289 +a vision problem + +117 +00:13:15,289 --> 00:13:23,449 +vision problem helps the the deep learning +algorithm to evolve and I'm back and + +118 +00:13:23,450 --> 00:13:29,350 +forth so is really important to to you +know I want you to finish this course I + +119 +00:13:29,350 --> 00:13:34,300 +feel proud that you're student of +computer vision and of deep learning so you you + +120 +00:13:34,301 --> 00:13:39,528 +have this both tool-set and the +in-depth understanding of how to use the + +121 +00:13:39,528 --> 00:13:46,750 +tool-set to to to to tackle important +problems so it's a brief history but + +122 +00:13:46,750 --> 00:13:54,149 +doesn't mean it's a short history so we're +gonna go all the way back to 200 sorry, 540 + +123 +00:13:54,149 --> 00:14:00,110 +million years ago so why why did I +picked this you know on the scale + +124 +00:14:00,110 --> 00:14:09,240 +of Earth history this is a very +specific range of years. Well, so I don't + +125 +00:14:09,240 --> 00:14:14,049 +know if you have heard of this but this +is a very very curious period of the + +126 +00:14:14,049 --> 00:14:23,539 +Earth's history. Biologists call this the +big bag of evolution. Before 503, 4 + +127 +00:14:23,539 --> 00:14:27,679 +540 million years ago, + +128 +00:14:27,679 --> 00:14:37,989 +The Earth was a very peaceful pot of water. +It's pretty big pot of water. So, we have very simple organisms. + +129 +00:14:37,990 --> 00:14:46,049 +These are like animals that just floats +in the water and the way they eat and hang out + +130 +00:14:46,049 --> 00:14:53,838 +on a daily basis is you know they just float +and some kind of food comes by near + +131 +00:14:53,839 --> 00:15:01,160 +their mouth or whatever, they just open +their mouths grabbed it and we don't + +132 +00:15:01,160 --> 00:15:09,969 +have too many different types of animals, +but something really strange happened around 540 + +133 +00:15:09,970 --> 00:15:18,430 +million suddenly from the fossils we study +there's a huge explosive of species. + +134 +00:15:18,430 --> 00:15:27,729 +Biologists call speciation. It's like suddenly, +for some reason, something hit the Earth that animal + +135 +00:15:27,730 --> 00:15:35,230 +start to diversify and they got really +complex the start to have + +136 +00:15:35,230 --> 00:15:41,039 +predators and preys and they have all +kind of tools to survive. What was + +137 +00:15:41,039 --> 00:15:46,698 +the triggering force of this was a huge +question, because people was saying + +138 +00:15:46,698 --> 00:15:53,269 +you know another said whatever meteoroid hit +the Earth or or you know the environment + +139 +00:15:53,269 --> 00:16:00,198 +change? It turned out one of the most +convincing theory is by this guy called + +140 +00:16:00,198 --> 00:16:03,159 +Andrew Parker. He is a + +141 +00:16:03,159 --> 00:16:09,490 +modern geologist in Australia from Australia. +He he studied a lot of + +142 +00:16:09,490 --> 00:16:19,278 +fossils and he's theory is that it was +the onset of the ice. So, one of the first + +143 +00:16:19,278 --> 00:16:25,688 +trilobite developed an eye, a really +really simple eye. 
It's almost like a
+pinhole camera that just catches light,
+
+144
+00:16:25,688 --> 00:16:30,779
+makes some projection, and
+
+145
+00:16:30,779 --> 00:16:34,750
+registers some information from the
+environment.
+
+146
+00:16:34,750 --> 00:16:41,080
+Suddenly, life is no longer so mellow,
+because once you have that eye, the first
+
+147
+00:16:41,080 --> 00:16:44,889
+thing you can do is go catch
+food. You actually know where food is,
+
+148
+00:16:44,889 --> 00:16:51,809
+not just, like, blindly floating in the
+water. And once you can go catch food,
+
+149
+00:16:51,809 --> 00:16:57,399
+guess what? The food had better develop
+eyes too, to run away from you; otherwise
+
+150
+00:16:57,399 --> 00:17:02,590
+they'll be gone. You know, the
+first animals who had eyes were
+
+151
+00:17:02,590 --> 00:17:11,380
+like at an unlimited buffet — like working at Google —
+just having the best time eating
+
+152
+00:17:11,380 --> 00:17:18,170
+everything they can. But because of this
+onset of the eye, what we
+
+153
+00:17:18,170 --> 00:17:28,400
+realize is that the biological arms
+race began. Every single animal needs
+
+154
+00:17:28,400 --> 00:17:34,170
+to learn to develop things to
+survive, because, you know, you
+
+155
+00:17:34,170 --> 00:17:40,190
+suddenly have prey and predators and
+all this, and the speciation began. So that's
+
+156
+00:17:40,190 --> 00:17:47,870
+when vision began, 540 million years ago. And
+not only did vision begin: vision was one
+
+157
+00:17:47,870 --> 00:17:53,189
+of the major driving forces of the
+speciation, or the Big Bang, of
+
+158
+00:17:53,190 --> 00:17:58,980
+evolution. Alright, so we're not gonna
+follow evolution in too much detail.
+
+159
+00:17:58,980 --> 00:18:08,710
+Another big, important piece of work on the
+engineering of vision happened around
+
+160
+00:18:08,710 --> 00:18:19,220
+the Renaissance, and of course it's
+attributed to this amazing guy, Leonardo da Vinci. Before the
+
+161
+00:18:19,220 --> 00:18:23,740
+Renaissance, you know, throughout human
+civilization, from Asia to Europe to
+
+162
+00:18:23,740 --> 00:18:30,400
+India to the Arabic world, we have seen
+models of cameras. Aristotle
+
+163
+00:18:30,400 --> 00:18:36,360
+proposed the camera through the leaves;
+the Chinese philosopher Mozi proposed
+
+164
+00:18:36,359 --> 00:18:40,939
+the camera through a box with a hole.
+But
+
+165
+00:18:40,940 --> 00:18:47,750
+if you look at the first documentation
+of a really modern-looking camera, it's called the
+
+166
+00:18:47,750 --> 00:18:49,180
+camera obscura,
+
+167
+00:18:49,180 --> 00:18:56,610
+and that is documented by Leonardo da
+Vinci. I'm not gonna get into the details,
+
+168
+00:18:56,609 --> 00:19:07,240
+but, you know, you get the idea
+that there is some kind of lens, or at least a hole, to
+
+169
+00:19:07,240 --> 00:19:12,240
+capture light reflected from the real
+world, and then there is some kind of
+
+170
+00:19:12,240 --> 00:19:20,319
+projection to capture the information of
+the real-world image. So 
+Right now, we're just talking about duplicating + +175 +00:19:46,880 --> 00:19:53,760 +the visual world. so that's one important +work to remember and of course after + +176 +00:19:53,759 --> 00:20:01,299 +camera Obscura that we we we start +to see a whole series of successful, you + +177 +00:20:01,299 --> 00:20:07,539 +know, some film gets developed, you know +like Kodak was one of the first + +178 +00:20:07,539 --> 00:20:12,329 +companies developing commercial cameras +and then we start to have camcorders and + +179 +00:20:12,329 --> 00:20:21,889 +and and all this. Another very important +important piece of work that I want you + +180 +00:20:21,890 --> 00:20:28,050 +to be aware of as vision student is +actually not a engineering work but + +181 +00:20:28,049 --> 00:20:32,710 +science piece of science +work that's starting to ask the question + +182 +00:20:32,710 --> 00:20:38,130 +is how does Vision work in our +biological brain? you know we + +183 +00:20:38,130 --> 00:20:45,760 +we now know that it took 540 million +years of evolution to get to really + +184 +00:20:45,759 --> 00:20:54,579 +fantastic visual system in mammals and humans but +what did evolution do during this time? + +185 +00:20:54,579 --> 00:21:01,759 +what kind of architecture did it develop +from that simple trilobite eye to today + +186 +00:21:01,759 --> 00:21:07,950 +yours and mine? Well, very important +piece of work happened at Harvard like + +187 +00:21:07,950 --> 00:21:12,690 +two at that time two young two very young +ambitious post-doc Hubel and Wiesel. + +188 +00:21:12,690 --> 00:21:21,500 +What they did is that they used +awake but anaesthetized cats and then + +189 +00:21:21,500 --> 00:21:28,529 +there was enough technology to build this +little needle called electrode to push the + +190 +00:21:28,529 --> 00:21:35,129 +electrode through into the the the the +skull is open into the brain of the + +191 +00:21:35,130 --> 00:21:42,180 +cat into an area what we already know +called primary visual cortex. + +192 +00:21:42,180 --> 00:21:49,490 +Primary visual cortex is the area that +nuerons do a lot of things for for visual processing + +193 +00:21:49,490 --> 00:21:54,779 +but before Hubel and Wiesel, +we don't really know what primary visual cortex is doing. + +194 +00:21:54,779 --> 00:22:02,369 +We just know it's one of the earliest stage on the eyes, +of course, but earliest stage for visual processing. + +195 +00:22:02,369 --> 00:22:07,299 +And then there is tons and tons +of neurons working on vision. + +196 +00:22:07,299 --> 00:22:12,419 +And we really ought to know what this is +because that's the beginning of vision + +197 +00:22:12,420 --> 00:22:20,300 +visual process in the brain. +So they they put this electrode into the primary visual cortex + +198 +00:22:20,300 --> 00:22:25,930 +and an interestingly, +this is another interesting fact. + +199 +00:22:25,930 --> 00:22:34,880 +I will drop off my stuff. I will show you. +Primary visual cortex, the first stage, or second depending on where they come from. + +200 +00:22:34,880 --> 00:22:40,910 +I'm being very very rough. +The First stage of your cortical visual processing stage is + +201 +00:22:40,910 --> 00:22:47,180 +in the back of your brain not near your eye. +Okay? It's very interesting because + +202 +00:22:47,180 --> 00:22:51,788 +your olfactory cortical processing is right behind your nose. + +203 +00:22:51,788 --> 00:22:58,519 +Your auditory is right behind your ear. 
+
+
+204
+00:22:58,519 --> 00:23:05,798
+But your primary visual cortex is the furthest from your eyes.
+And another very interesting fact:
+
+205
+00:23:05,798 --> 00:23:11,099
+in fact, not only the primary visual cortex —
+there's a huge area working on vision.
+
+206
+00:23:11,099 --> 00:23:17,888
+Almost 50% of your brain is involved in vision.
+Vision is the hardest and most important
+
+207
+00:23:17,888 --> 00:23:22,608
+sensory, perceptual, cognitive system in the brain.
+I'm not saying anything
+
+208
+00:23:22,608 --> 00:23:29,839
+else isn't useful, clearly, but it
+took nature this long to develop this sensory system,
+
+209
+00:23:29,839 --> 00:23:37,579
+and it takes this much of the brain's real estate to be
+
+210
+00:23:37,579 --> 00:23:43,148
+used for this system. Why?
+Because it's so important and it's so damn hard.
+
+211
+00:23:43,148 --> 00:23:50,959
+That's why we need to use this much space.
+OK, back to Hubel and Wiesel. They were really ambitious.
+
+212
+00:23:50,960 --> 00:23:56,028
+They wanted to know what the primary visual cortex is doing,
+because this is the beginning of our
+
+213
+00:23:56,028 --> 00:24:02,878
+knowledge for the deep learning neural network.
+So, they put the cats in this room
+
+214
+00:24:02,878 --> 00:24:07,709
+and they were recording neural activity.
+When I say recording neural activity,
+
+215
+00:24:07,710 --> 00:24:11,659
+they're basically trying to see,
+you know: if I put the
+
+216
+00:24:11,659 --> 00:24:18,059
+electrode here,
+do the neurons fire when the cat sees something?
+
+217
+00:24:18,059 --> 00:24:25,308
+So, for example, if they show the cat something,
+their idea is:
+
+218
+00:24:25,308 --> 00:24:30,519
+if I show this kind of fish — you know,
+apparently at that time cats ate fish rather than these treats —
+
+219
+00:24:30,519 --> 00:24:42,019
+then the cat's neurons would, you know,
+be happy and start sending spikes.
+
+220
+00:24:42,019 --> 00:24:48,128
+And the funny thing about this story of scientific discovery is that
+
+221
+00:24:48,128 --> 00:24:52,449
+scientific discovery takes both luck and care and thoughtfulness.
+
+222
+00:24:52,450 --> 00:24:58,740
+They were showing this cat fish, mouse, flower, whatever.
+It just didn't work. The cat's neurons in the primary
+
+223
+00:24:58,740 --> 00:25:02,839
+visual cortex were silent; there was no spiking,
+
+224
+00:25:02,839 --> 00:25:09,079
+or very little spiking, and they were really frustrated.
+But the good news is that
+
+225
+00:25:09,079 --> 00:25:14,509
+there were no computers at that time, so
+when they showed the cats
+
+226
+00:25:14,509 --> 00:25:21,740
+a stimulus, they had to use a slide
+projector. So they put in a slide of a fish
+
+227
+00:25:21,740 --> 00:25:26,799
+and then waited for the neuron to spike.
+If the neuron doesn't spike, they take
+
+228
+00:25:26,799 --> 00:25:29,960
+the slide out and put in another slide.
+
+229
+00:25:29,960 --> 00:25:38,630
+Then they noticed that every time they changed the slide —
+like this, you know, this square-ish film;
+
+230
+00:25:38,630 --> 00:25:46,890
+I don't remember if they used glass or film, whatever —
+the neuron spikes. That's weird, you know: the
+
+231
+00:25:46,890 --> 00:25:51,940
+actual mouse and fish and flower didn't
+drive the neuron, or excite the neuron, but
+
+232
+00:25:51,940 --> 00:25:59,759
+the movement of taking the slide
+out or putting a slide in did excite the neuron. 
+
+233
+00:25:59,759 --> 00:26:03,140
+It could have been careless to think, oh,
+finally they're changing to a new
+
+234
+00:26:03,140 --> 00:26:13,410
+object for me. But it turned
+out there is an edge created by the slide
+
+235
+00:26:13,410 --> 00:26:18,240
+they're changing, right? The slide
+is a square, rectangular plate,
+
+236
+00:26:18,240 --> 00:26:28,120
+and that moving edge drove, or excited, the neurons.
+So they really chased after that observation.
+
+237
+00:26:28,120 --> 00:26:34,859
+You know, if they were too frustrated or too careless,
+they would have missed that,
+
+238
+00:26:34,859 --> 00:26:41,359
+but they were not. They really chased
+after it and realized that neurons in the
+
+239
+00:26:41,359 --> 00:26:48,279
+primary visual cortex are organized in columns,
+and every column of neurons
+
+240
+00:26:48,279 --> 00:27:01,309
+likes to see a specific orientation of the stimulus:
+simple oriented bars, rather
+
+241
+00:27:01,309 --> 00:27:02,980
+than a fish or a mouse.
+
+242
+00:27:02,980 --> 00:27:07,519
+You know, I'm making this a little bit of a simple story,
+because there are still neurons in the
+
+243
+00:27:07,519 --> 00:27:10,940
+primary visual cortex we don't know what they like.
+They don't like simple
+
+244
+00:27:10,940 --> 00:27:17,570
+oriented bars. But by and large, Hubel and Wiesel
+found that the beginning of
+
+245
+00:27:17,570 --> 00:27:23,779
+visual processing is not a holistic fish or mouse.
+The beginning of visual
+
+246
+00:27:23,779 --> 00:27:29,178
+processing is the simple structures of the world:
+
+247
+00:27:29,179 --> 00:27:40,890
+edges, oriented edges. And this has very deep
+implications for both neurophysiology and neuroscience, as well as
+
+248
+00:27:40,890 --> 00:27:47,870
+engineering and modeling. Later, when we
+visualize our deep neural network features,
+
+249
+00:27:47,870 --> 00:27:57,069
+we'll see that simple edge-like structure
+emerging from our model.
+
+250
+00:27:57,069 --> 00:28:03,298
+Even though the discovery was in the late
+fifties and early sixties, they won the
+
+251
+00:28:03,298 --> 00:28:12,039
+Nobel Prize in Medicine for this work in 1981.
+So that was another very important
+
+252
+00:28:12,039 --> 00:28:25,928
+piece of work related to vision and visual processing.
+So, when did computer vision begin?
+
+253
+00:28:25,929 --> 00:28:35,620
+That's another interesting piece of history.
+The precursor of computer vision as a modern field was
+
+254
+00:28:35,620 --> 00:28:42,779
+this particular dissertation by Larry Roberts in 1963.
+It's called Block World.
+
+255
+00:28:42,779 --> 00:28:49,889
+Just as Hubel and Wiesel were discovering
+that the visual world in our
+
+256
+00:28:49,890 --> 00:29:00,380
+brain is organized by simple edge-like
+structures, Larry Roberts, as an early computer science PhD
+
+257
+00:29:00,380 --> 00:29:06,350
+student, was trying to extract these edge-like structures
+
+258
+00:29:06,349 --> 00:29:08,980
+in images, as a piece of engineering work.
+
+259
+00:29:08,980 --> 00:29:16,210
+In this particular case, his goal is that,
+
+260
+00:29:16,210 --> 00:29:22,210
+you know, you and I as humans can
+recognize a block no matter how it's turned, right?
+
+261
+00:29:22,210 --> 00:29:28,009
+We know it's the same block.
+These two are the same block, even though
+
+262
+00:29:28,009 --> 00:29:33,019
+the lighting changed and the orientation changed. 
+And his conjecture
+is that, just like Hubel and Wiesel told us,
+
+263
+00:29:33,019 --> 00:29:40,720
+it's the edges that define the structure —
+the edges define the shape and they don't change,
+
+264
+00:29:40,720 --> 00:29:46,419
+unlike all these internal things.
+So Larry Roberts wrote a PhD dissertation to
+
+265
+00:29:46,419 --> 00:29:53,290
+just extract these edges. You know, if
+you work as a PhD student in computer
+
+266
+00:29:53,289 --> 00:29:59,250
+vision today, this is
+like undergraduate computer vision; it wouldn't
+
+267
+00:29:59,250 --> 00:30:03,990
+be a PhD thesis now. But that was
+the first precursor computer vision PhD thesis.
+
+268
+00:30:03,990 --> 00:30:10,210
+And, interestingly, Larry Roberts
+gave up his work in computer vision afterwards
+
+269
+00:30:10,210 --> 00:30:18,819
+and went to DARPA.
+He was one of the inventors of the Internet. He didn't do too badly by
+
+270
+00:30:18,819 --> 00:30:27,189
+giving up computer vision.
+But we always like to say that the birthday of computer
+
+271
+00:30:27,190 --> 00:30:34,490
+vision as a modern field is in the
+summer of 1966. In the summer of 1966,
+
+272
+00:30:34,490 --> 00:30:43,960
+the MIT Artificial Intelligence Lab was established.
+Before that, actually — here's one
+
+273
+00:30:43,960 --> 00:30:49,548
+piece of history you should feel proud of
+as a Stanford student — there were two
+
+274
+00:30:49,548 --> 00:30:55,819
+pioneering artificial intelligence labs
+established in the world in the early
+
+275
+00:30:55,819 --> 00:31:02,579
+1960s: one by Marvin Minsky at MIT, one
+by John McCarthy at Stanford.
+
+276
+00:31:02,579 --> 00:31:10,329
+At Stanford, the artificial intelligence lab was
+established before the computer science department,
+
+277
+00:31:10,329 --> 00:31:15,369
+and Professor John McCarthy,
+who founded the AI Lab, is the one who is
+
+278
+00:31:15,369 --> 00:31:21,479
+responsible for
+
+279
+00:31:21,480 --> 00:31:22,490
+the term artificial intelligence.
+
+280
+00:31:22,490 --> 00:31:26,450
+So that's a proud little bit of Stanford history.
+
+281
+00:31:26,450 --> 00:31:31,720
+But anyway, we have to give MIT the credit
+for starting the field of computer vision,
+
+282
+00:31:31,720 --> 00:31:41,380
+because in the summer of 1966,
+a professor at the MIT AI Lab decided it was time to solve vision.
+
+283
+00:31:41,380 --> 00:31:46,630
+You know, AI was established.
+We were starting to understand, you know, first-order logic and all this —
+
+284
+00:31:46,630 --> 00:31:55,010
+I think theorem provers were probably invented around that time — but anyway,
+
+285
+00:31:55,009 --> 00:32:01,109
+vision is so easy. You open your eyes,
+you see the world. How hard can this be?
+
+286
+00:32:01,109 --> 00:32:04,109
+Let's solve it in one summer.
+Especially since MIT students are smart, right?
+
+287
+00:32:04,109 --> 00:32:18,729
+So, "the summer vision project is an attempt
+to use our summer workers effectively
+in the construction of a significant part of a visual system."
+
+288
+00:32:18,730 --> 00:32:24,329
+This was the proposal from that summer,
+and maybe they didn't use their summer workers effectively,
+
+289
+00:32:24,329 --> 00:32:30,490
+but in any case, computer vision was not solved in that summer.
+
+290
+00:32:30,490 --> 00:32:35,740
+Since then, computer vision has become one of the fastest
+growing fields of AI. 
+
+291
+00:32:35,740 --> 00:32:43,679
+If you go to today's premier computer
+vision conferences, called CVPR and ICCV,
+
+292
+00:32:43,679 --> 00:32:52,160
+we have like 2000 to 2500 researchers
+worldwide attending these conferences. And
+
+293
+00:32:52,160 --> 00:33:00,620
+a very practical note for students: if you
+are a good computer vision / machine
+
+294
+00:33:00,620 --> 00:33:05,369
+learning student, you will not worry
+about jobs in Silicon Valley or
+
+295
+00:33:05,369 --> 00:33:11,569
+anywhere else. So it's actually one of
+the most exciting fields. But that was the
+
+296
+00:33:11,569 --> 00:33:19,210
+birthday of computer vision, which means
+this year is the fiftieth anniversary of
+
+297
+00:33:19,210 --> 00:33:25,829
+computer vision. That's a very exciting
+year in computer vision, and we have come
+
+298
+00:33:25,829 --> 00:33:28,529
+a long, long way.
+
+299
+00:33:28,529 --> 00:33:31,660
+OK, so continuing on the history of computer vision:
+
+300
+00:33:31,660 --> 00:33:38,169
+this is a person to remember, David Marr.
+He was also at MIT at that time,
+
+301
+00:33:38,169 --> 00:33:50,240
+working with a number of very influential
+computer vision scientists like Shimon Ullman and Tommy Poggio.
+Marr himself died
+
+302
+00:33:50,240 --> 00:33:58,808
+young, and he wrote a very influential
+book called "Vision". It's a very thin book.
+
+303
+00:33:58,808 --> 00:34:08,148
+In David Marr's thinking about vision, he took a
+lot of insights from neuroscience.
+
+304
+00:34:08,148 --> 00:34:14,868
+We have already said that
+Hubel and Wiesel gave us the concept of simple structure:
+
+305
+00:34:14,869 --> 00:34:16,539
+vision starts with simple structure.
+
+306
+00:34:16,539 --> 00:34:23,259
+It didn't start with a holistic fish or a holistic mouse.
+
+307
+00:34:23,260 --> 00:34:28,679
+David Marr gave us the next important insight,
+and these two insights together are the
+
+308
+00:34:28,679 --> 00:34:35,740
+beginning of the deep learning architecture:
+vision is hierarchical.
+
+309
+00:34:35,740 --> 00:34:44,029
+You know, Hubel and Wiesel said, OK, we start simple,
+but Hubel and Wiesel didn't say we end simple.
+The visual world is extremely complex.
+
+310
+00:34:44,030 --> 00:34:49,540
+In fact, say I take a picture, a regular picture, today with my iPhone.
+
+311
+00:34:49,540 --> 00:34:58,309
+I don't know my iPhone's resolution;
+let's suppose it's like 10 megapixels.
+
+312
+00:34:58,309 --> 00:35:05,059
+The number of potential combinations of pixels to form
+a picture is bigger than the total
+
+313
+00:35:05,059 --> 00:35:11,429
+number of atoms in the universe.
+That's how complex vision can be.
+
+314
+00:35:11,429 --> 00:35:18,539
+It's really, really complex.
+So, Hubel and Wiesel told us to start simple.
+David Marr told
+
+315
+00:35:18,539 --> 00:35:25,130
+us to build a hierarchical model. Of course,
+David Marr didn't tell us to build
+
+316
+00:35:25,130 --> 00:35:29,400
+the convolutional neural network, which we'll
+cover for the rest of the quarter,
+
+317
+00:35:29,400 --> 00:35:36,990
+but his idea is this: to represent or to think
+about an image, we think about it in
+
+318
+00:35:36,989 --> 00:35:42,129
+several layers. First, he thinks
+we should think about the edge image,
+
+319
+00:35:42,130 --> 00:35:49,110
+which clearly took its
+inspiration from Hubel and Wiesel, and
+
+320
+00:35:49,110 --> 00:35:52,579
+he personally called this the Primal Sketch. 
+
+321
+00:35:52,579 --> 00:35:55,730
+You know, the name is self-explanatory.
+
+322
+00:35:55,730 --> 00:36:02,400
+And then you think about 2.5D.
+This is where you
+
+323
+00:36:02,400 --> 00:36:08,829
+start to reconcile your 2D image
+with the 3D world. You recognize there are
+
+324
+00:36:08,829 --> 00:36:15,679
+layers, right? I look at you right now;
+I don't think half of you only have
+
+325
+00:36:15,679 --> 00:36:17,239
+a head and a neck,
+
+326
+00:36:17,239 --> 00:36:22,799
+even though that's all I see.
+I know you're occluded by the row in
+
+327
+00:36:22,800 --> 00:36:29,680
+front of you, and this is the fundamental challenge of vision:
+we have an ill-posed problem to solve.
+
+328
+00:36:29,679 --> 00:36:38,118
+Nature had that ill-posed problem to solve too,
+because the world is 3D but the imagery is 2D.
+
+329
+00:36:38,119 --> 00:36:45,210
+Nature solved that; the first hardware
+trick was just, instead of one
+
+330
+00:36:45,210 --> 00:36:49,389
+eye, to use two, and then there's a whole bunch of
+software tricks to merge the
+
+331
+00:36:49,389 --> 00:36:53,868
+information from the two eyes and all this. So,
+the same thing with computer vision: we
+
+332
+00:36:53,869 --> 00:36:59,280
+have to solve that 2D-to-3D
+problem too, and eventually we have to
+
+333
+00:36:59,280 --> 00:37:03,180
+put everything together so that we
+actually have a good 3D model of the
+
+334
+00:37:03,179 --> 00:37:08,629
+world. Why do we have to have a 3D model
+of the world? Because we have to survive,
+
+335
+00:37:08,630 --> 00:37:15,309
+navigate, and manipulate the world. When I
+shake your hand, I really need to know
+
+336
+00:37:15,309 --> 00:37:16,509
+how to, you know,
+
+337
+00:37:16,510 --> 00:37:22,320
+extend my hand and grab your hand in
+the right way. That is 3D modeling of
+
+338
+00:37:22,320 --> 00:37:26,000
+the world; otherwise I won't be able to
+grab your hand in the right way. When I
+
+339
+00:37:26,000 --> 00:37:34,219
+pick up a mug, the same thing. So
+that's David Marr's
+
+340
+00:37:34,219 --> 00:37:39,899
+architecture for vision. It's a
+high-level, abstract architecture. It
+
+341
+00:37:39,900 --> 00:37:45,490
+doesn't really inform us exactly what
+kind of mathematical modeling we should use,
+
+342
+00:37:45,489 --> 00:37:51,439
+it doesn't inform us of the learning
+procedure, and it doesn't really give us the
+
+343
+00:37:51,440 --> 00:37:55,599
+inference procedure, which we will
+get to through the deep learning
+
+344
+00:37:55,599 --> 00:38:02,759
+network architecture. But that's
+the high-level view, and it's
+
+345
+00:38:02,760 --> 00:38:06,250
+an important concept to learn
+
+346
+00:38:06,250 --> 00:38:08,619
+in vision, and we call this the
+
+347
+00:38:08,619 --> 00:38:16,859
+representation. Really important work. And
+here is a little bit of history to
+
+348
+00:38:16,860 --> 00:38:25,180
+just show you that as soon as Marr laid out
+this important way of thinking, the
+
+349
+00:38:25,179 --> 00:38:31,879
+first wave of visual recognition
+algorithms went after the 3D model,
+
+350
+00:38:31,880 --> 00:38:38,280
+because that's the goal, right? Like, no
+matter how you represent the stages, the
+
+351
+00:38:38,280 --> 00:38:45,519
+goal here is to reconstruct and recognize the
+object, and this is really sensible,
+
+352
+00:38:45,519 --> 00:38:52,380
+because that's what we do when we go out into the world.
+So, both of these two works
+
+353
+00:38:52,380 --> 00:38:58,829
+come from Palo Alto, one of them from
+SRI. Tom 
Binford was
+
+354
+00:38:58,829 --> 00:39:00,440
+a professor at Stanford.
+
+355
+00:39:00,440 --> 00:39:05,760
+He and his student Rodney
+Brooks proposed one of the first
+
+356
+00:39:05,760 --> 00:39:10,430
+so-called generalized cylinder models.
+I'm not gonna get into the details, but
+
+357
+00:39:10,429 --> 00:39:17,129
+the idea is that the world is composed
+of simple shapes like
+
+358
+00:39:17,130 --> 00:39:23,150
+cylinders and blocks, and any real-world
+object is just a combination of these
+
+359
+00:39:23,150 --> 00:39:28,340
+simple shapes seen at a particular
+viewing angle. That was a very
+
+360
+00:39:28,340 --> 00:39:37,970
+influential visual recognition model in
+the seventies. Brooks went on to become the
+
+361
+00:39:37,969 --> 00:39:47,239
+director of the MIT AI Lab, and he was also a
+founding member of the iRobot company, the one that makes the Roomba,
+
+362
+00:39:47,239 --> 00:39:51,379
+and all this. So he continued to do very
+influential
+
+363
+00:39:51,380 --> 00:39:56,930
+work. And another interesting model,
+coming from a local
+
+364
+00:39:56,929 --> 00:40:05,009
+research institute — SRI, which I think is
+across the street on El Camino — is this
+
+365
+00:40:05,010 --> 00:40:15,260
+pictorial structure model. It has less of a
+3D flavor but more of a probabilistic
+
+366
+00:40:15,260 --> 00:40:21,570
+flavor. The idea is that objects are made of
+still-simple parts:
+
+367
+00:40:21,570 --> 00:40:28,059
+like, a person's head is made of eyes and
+nose and mouth, and the parts are
+
+368
+00:40:28,059 --> 00:40:34,679
+connected by springs, allowing for some
+deformation. It's getting at a sense of: OK, we
+
+369
+00:40:34,679 --> 00:40:40,069
+recognize the world, but not every one of you
+has exactly the same eyes, and the
+
+370
+00:40:40,070 --> 00:40:45,150
+distance between the eyes allows for
+some kind of variability. So this
+
+371
+00:40:45,150 --> 00:40:50,450
+concept of variability started to get
+introduced in models like this. And
+
+372
+00:40:50,449 --> 00:40:56,309
+using models like this — you know, the
+reason I want to show you this is to
+
+373
+00:40:56,309 --> 00:41:02,710
+see how simple the world was at this time —
+this was one of the most influential
+
+374
+00:41:02,710 --> 00:41:09,670
+models in the eighties for recognizing
+real-world objects, and the entire
+
+375
+00:41:09,670 --> 00:41:18,900
+representation of the real world is these seemingly
+simple elements, using the edges and simple
+
+376
+00:41:18,900 --> 00:41:26,010
+shapes formed by edges to recognize
+objects like this razor and other stuff.
+
+377
+00:41:26,010 --> 00:41:33,980
+So that's kind of
+the ancient world of computer vision:
+
+378
+00:41:33,980 --> 00:41:39,699
+we were mostly seeing black-and-white
+or even synthetic images. Starting in the
+
+379
+00:41:39,699 --> 00:41:46,529
+nineties, we finally started moving to,
+like, color images of the real world, and it
+
+380
+00:41:46,530 --> 00:41:55,210
+was a big change. Again, a very
+influential work here is not
+
+381
+00:41:55,210 --> 00:42:01,150
+particularly about recognizing an object;
+it's about how to, like, carve out an
+
+382
+00:42:01,150 --> 00:42:08,990
+image into sensible parts. Right? If
+you enter this room, there's no way your
+
+383
+00:42:08,989 --> 00:42:15,559
+visual system tells you, oh my god, I
+see so many pixels. You group
+
+384
+00:42:15,559 --> 00:42:22,259
+things: you see heads,
+chairs, a stage, a platform, pieces
+
+385
+00:42:22,260 --> 00:42:26,640
+of furniture, and all this. This is
+called 
perceptual grouping. Perceptual
+
+386
+00:42:26,639 --> 00:42:28,309
+grouping is one of the
+
+387
+00:42:28,309 --> 00:42:34,779
+most important problems in vision,
+biological or artificial. If we don't
+
+388
+00:42:34,780 --> 00:42:39,420
+know how to solve the perceptual
+grouping problem, we'll have a
+
+389
+00:42:39,420 --> 00:42:46,690
+really hard time deeply understanding
+the visual world. And you will learn, towards
+
+390
+00:42:46,690 --> 00:42:53,450
+the end of this course, that a
+problem as fundamental as this is still not
+
+391
+00:42:53,449 --> 00:42:57,859
+solved in computer vision. Even though we
+have made a lot of progress, before
+
+392
+00:42:57,860 --> 00:43:04,390
+deep learning and after deep learning, we're still
+grasping for the final solution to a problem
+
+393
+00:43:04,389 --> 00:43:10,650
+like this. So this is again why I
+want to give you this introduction:
+
+394
+00:43:10,650 --> 00:43:16,950
+for you to be aware of the deep problems
+in vision, and also the current
+
+395
+00:43:16,949 --> 00:43:22,730
+challenges in vision.
+We cannot solve all the problems, despite
+
+396
+00:43:22,730 --> 00:43:29,079
+whatever the news says; you know, like, we're
+far from developing Terminators who can
+
+397
+00:43:29,079 --> 00:43:34,860
+do everything. So this piece of work is
+called Normalized Cut; it is one of
+
+398
+00:43:34,860 --> 00:43:42,390
+the first computer vision works that
+takes real-world images and tries to
+
+399
+00:43:42,389 --> 00:43:52,420
+solve this problem, by Jitendra Malik, a senior computer
+vision researcher, now a professor at
+
+400
+00:43:52,420 --> 00:43:56,000
+Berkeley and also a Stanford graduate.
+
+401
+00:43:56,000 --> 00:44:01,989
+The results are not great — I will not
+cover segmentation much in this class —
+
+402
+00:44:01,989 --> 00:44:08,459
+and from there, you see, we are making
+progress, but this is the beginning of
+
+403
+00:44:08,460 --> 00:44:15,510
+that. Another very crucial work that I
+want to bring up and pay
+
+404
+00:44:15,510 --> 00:44:22,410
+tribute to, even though we're
+not covering it in the rest of the
+
+405
+00:44:22,409 --> 00:44:26,679
+course: I think, as a vision
+student, it's pretty important for you to be
+
+406
+00:44:26,679 --> 00:44:31,199
+aware of this, because it not only
+introduces an important problem we want
+
+407
+00:44:31,199 --> 00:44:36,730
+to solve, it also gives you a perspective
+on the development of the field. This
+
+408
+00:44:36,730 --> 00:44:40,480
+work is called the Viola-Jones face detector.
+
+409
+00:44:40,480 --> 00:44:46,030
+It's very dear to my heart, because as a
+fresh graduate student
+
+410
+00:44:46,030 --> 00:44:51,650
+at Caltech, it's one of the first
+papers I read as a graduate student when
+
+411
+00:44:51,650 --> 00:44:56,150
+I entered the lab, and I didn't know
+anything about it. My adviser said this is an
+
+412
+00:44:56,150 --> 00:45:02,090
+amazing piece of work, and, you know,
+we were all trying to understand it. By
+
+413
+00:45:02,090 --> 00:45:08,690
+the time I graduated from Caltech, this
+very work was transferred to the first
+
+414
+00:45:08,690 --> 00:45:16,510
+smart digital camera, by Fujifilm in 2006:
+the first digital camera that has a
+
+415
+00:45:16,510 --> 00:45:22,390
+face detector. So from a
+technology transfer point of view, it was
+
+416
+00:45:22,389 --> 00:45:28,789
+extremely fast, and it was one of the
+first successful high-level visual
+
+417
+00:45:28,789 --> 00:45:35,849
+recognition 
algorithms being used
+by a consumer product. So this work just
+
+418
+00:45:35,849 --> 00:45:41,059
+learns to detect faces — faces in the
+wild. No longer, you know,
+
+419
+00:45:41,059 --> 00:45:47,920
+simulations or very contrived settings;
+these are any pictures. And even though
+
+420
+00:45:47,920 --> 00:45:53,329
+it didn't use a deep learning network, it
+has a lot of the deep learning flavor:
+
+421
+00:45:53,329 --> 00:46:01,179
+the features were learned. The algorithm
+learns to find features, simple features
+
+422
+00:46:01,179 --> 00:46:06,919
+like these black-and-white filter
+features, that can give us the best
+
+423
+00:46:06,920 --> 00:46:14,639
+localization of faces. So this is a very
+influential piece of work. It's also one
+
+424
+00:46:14,639 --> 00:46:24,679
+of the first computer vision works that
+was deployed and could run in real
+
+425
+00:46:24,679 --> 00:46:31,019
+time. Before that, computer vision algorithms
+were very slow. The paper actually is
+
+426
+00:46:31,019 --> 00:46:36,699
+called real-time face detection. It
+ran on Pentium chips — I don't know if
+
+427
+00:46:36,699 --> 00:46:41,409
+anybody remembers that kind of chip, but
+it was not a fast chip — but nevertheless
+
+428
+00:46:41,409 --> 00:46:48,569
+it ran in real time. That was another very
+important piece of work. And also, one more
+
+429
+00:46:48,570 --> 00:46:53,380
+thing to point out: around this time, this
+is not the only work,
+
+430
+00:46:53,380 --> 00:46:59,170
+but it is a really good
+representation of how, around this time, the focus of
+
+431
+00:46:59,170 --> 00:47:06,250
+computer vision was shifting. Remember
+that David Marr's
+
+432
+00:47:06,250 --> 00:47:14,699
+early framework was trying to model the
+3D shape of the object; now we're
+
+433
+00:47:14,699 --> 00:47:23,439
+shifting to recognizing what the object
+is, caring a little bit less about whether we can really
+
+434
+00:47:23,440 --> 00:47:27,400
+reconstruct these faces or not. There's
+a whole branch of computer vision and
+
+435
+00:47:27,400 --> 00:47:34,200
+graphics that continues to work on that,
+but a big part of computer vision
+
+436
+00:47:34,199 --> 00:47:38,730
+at this time, around the turn of the
+century, is focusing on recognition.
+
+437
+00:47:38,730 --> 00:47:47,539
+That brings computer vision to
+today, when the most important parts of
+
+438
+00:47:47,539 --> 00:47:55,480
+computer vision work are focused on these
+cognitive questions, like recognition and
+
+439
+00:47:55,480 --> 00:47:57,369
+AI questions.
+
+440
+00:47:57,369 --> 00:48:06,150
+Another very important piece of work
+started to focus on features. Around
+
+441
+00:48:06,150 --> 00:48:12,950
+the time of face recognition, people
+started to realize it's really, really hard
+
+442
+00:48:12,949 --> 00:48:19,829
+to recognize an object by describing the
+whole thing. Like I just said, you know, I
+
+443
+00:48:19,829 --> 00:48:25,960
+see you guys heavily occluded: I
+don't see the rest of your torso, I
+
+444
+00:48:25,960 --> 00:48:31,690
+really don't see any of your legs unless you're
+in the first row, but I recognize you and
+
+445
+00:48:31,690 --> 00:48:39,230
+identify you as an object. So some people
+started to realize that maybe it isn't
+
+446
+00:48:39,230 --> 00:48:44,240
+really a global shape we have to go
+after in order to recognize an object;
+
+447
+00:48:44,239 --> 00:48:50,319
+maybe it's the features. If we recognize
+the important features of an object, we can
+
+448
+00:48:50,320 --> 00:48:53,090
+go a long way. And it makes a 
lot of sense.
+
+449
+00:48:53,090 --> 00:48:57,930
+Think about evolution: if you are out
+hunting, you don't need to recognize the
+
+450
+00:48:57,929 --> 00:49:03,909
+tiger's full body and shape to decide you
+need to run away. You know, a few
+
+451
+00:49:03,909 --> 00:49:06,588
+patches of the fur of the tiger
+through the
+
+452
+00:49:06,588 --> 00:49:12,679
+leaves probably alarm you enough.
+So we need this kind of quick
+
+453
+00:49:12,679 --> 00:49:16,429
+decision-making, because vision is
+really quick,
+
+454
+00:49:16,429 --> 00:49:22,308
+and a lot of this happens on important
+features. So this work, called SIFT, by
+
+455
+00:49:22,309 --> 00:49:28,539
+David Lowe — again, you see that name again —
+is about learning important
+
+456
+00:49:28,539 --> 00:49:34,009
+features on an object. And once you learn
+these important features, just a few of
+
+457
+00:49:34,009 --> 00:49:38,400
+them on the object, you can actually
+recognize this object from a totally
+
+458
+00:49:38,400 --> 00:49:45,548
+different angle or in totally
+cluttered scenes. So up until deep learning's
+
+459
+00:49:45,548 --> 00:49:54,880
+resurrection in 2010 or 2012, for
+about 10 years, the entire field of
+
+460
+00:49:54,880 --> 00:50:00,229
+computer vision was focusing on using
+these features to build models to
+
+461
+00:50:00,228 --> 00:50:05,538
+recognize objects and scenes, and we've
+done a great job; we've gone a long way.
+
+462
+00:50:05,539 --> 00:50:12,609
+One of the reasons the deep learning
+network became more convincing to
+
+463
+00:50:12,608 --> 00:50:17,690
+a lot of people is, we will see that the
+features that a deep learning network
+
+464
+00:50:17,690 --> 00:50:22,880
+learns are very similar to these
+features engineered by brilliant
+
+465
+00:50:22,880 --> 00:50:30,229
+engineers. So they kind of confirm
+each other. You know, we needed people
+
+466
+00:50:30,228 --> 00:50:34,929
+like David Lowe to first tell us these features
+work, and then we started to develop better
+
+467
+00:50:34,929 --> 00:50:38,978
+mathematical models to learn these
+features by themselves; but they confirmed
+
+468
+00:50:38,978 --> 00:50:46,210
+each other. So the historical, you know,
+importance of this work should not be
+
+469
+00:50:46,210 --> 00:50:52,028
+diminished. This work is one of the
+
+470
+00:50:52,028 --> 00:50:57,858
+intellectual foundations for us to
+realize how critical or how useful
+
+471
+00:50:57,858 --> 00:51:07,018
+these deep learning features are once
+we learn them. Let me just briefly say: because
+
+472
+00:51:07,018 --> 00:51:12,379
+of the features that David Lowe and many
+other researchers gave us, we can use
+
+473
+00:51:12,380 --> 00:51:18,239
+them to learn scene recognition. And
+around that time, the machine learning
+
+474
+00:51:18,239 --> 00:51:24,719
+tools we used were mostly either graphical
+models or support vector machines, and
+
+475
+00:51:24,719 --> 00:51:29,479
+this is one influential work on using
+support vector machines and kernel
+
+476
+00:51:29,478 --> 00:51:43,358
+models to recognize scenes. But I'll be
+brief here. And the last pre-deep-learning model
+
+477
+00:51:43,358 --> 00:51:50,578
+is this feature-based model
+called the deformable part model, where we
+
+478
+00:51:50,579 --> 00:51:57,420
+learn parts of an object, like parts of a
+person, and we learn how they configure
+
+479
+00:51:57,420 --> 00:52:08,519
+with each other in space, using a
+support vector machine kind of model to 
+
+480
+00:52:08,518 --> 00:52:16,179
+recognize objects like humans and
+bottles. Around this time — that's 2009,
+
+481
+00:52:16,179 --> 00:52:21,419
+2010 — the field of computer vision was
+mature enough that we were working on these
+
+482
+00:52:21,420 --> 00:52:25,659
+important and hard problems:
+recognizing pedestrians and
+
+483
+00:52:25,659 --> 00:52:30,828
+recognizing cars. They're no longer
+contrived problems. Something else was
+
+484
+00:52:30,829 --> 00:52:37,219
+needed: benchmarks. Partly because, as a
+field advances, if we don't have
+
+485
+00:52:37,219 --> 00:52:44,039
+good benchmarks, then everybody tests on their own set
+of images and it's really hard to
+
+486
+00:52:44,039 --> 00:52:50,369
+set a global standard. So one of the most
+important benchmarks is called the PASCAL
+
+487
+00:52:50,369 --> 00:52:57,608
+VOC object recognition benchmark.
+It's a European effort, where
+
+488
+00:52:57,608 --> 00:53:04,190
+researchers put together tens of
+thousands of images from 20 classes of
+
+489
+00:53:04,190 --> 00:53:13,019
+objects, and these are one example per
+object, like cats, dogs, cows —
+
+490
+00:53:13,018 --> 00:53:17,808
+cats, dogs, cows, airplanes, bottles,
+
+491
+00:53:17,809 --> 00:53:20,048
+horses, trains,
+
+492
+00:53:20,048 --> 00:53:27,268
+and all this. And then,
+annually, our computer vision researchers
+
+493
+00:53:27,268 --> 00:53:34,948
+and labs came to compete on the object
+recognition tasks of the PASCAL object
+
+494
+00:53:34,949 --> 00:53:41,188
+recognition challenge. And over the
+past, you know, through the years, the
+
+495
+00:53:41,188 --> 00:53:47,949
+performance just kept increasing,
+and that was when we started to feel
+
+496
+00:53:47,949 --> 00:53:52,929
+excited about the progress of the field
+at that time.
+
+497
+00:53:52,929 --> 00:53:59,729
+Here's a story a little
+closer to us: my lab and my
+
+498
+00:53:59,728 --> 00:54:05,718
+students were thinking, you know, the real
+world is not about 20 objects; the real
+
+499
+00:54:05,719 --> 00:54:12,489
+world is a little more than 20 objects. So,
+following the work of the PASCAL Visual
+
+500
+00:54:12,489 --> 00:54:18,239
+Object recognition Challenge, we put
+together this massive project,
+
+501
+00:54:18,239 --> 00:54:23,889
+ImageNet. Some of you may have heard of
+ImageNet; in this class you will be
+
+502
+00:54:23,889 --> 00:54:30,098
+using a tiny portion of ImageNet
+in some of your assignments. ImageNet
+
+503
+00:54:30,099 --> 00:54:36,759
+is a dataset of 15 million
+images, all cleaned by hand and
+
+504
+00:54:36,759 --> 00:54:47,000
+annotated into over 20,000 object classes. The
+students who cleaned it, over
+
+505
+00:54:47,000 --> 00:54:54,469
+various years of my lab's work through the
+crowdsourcing platform of Amazon
+
+506
+00:54:54,469 --> 00:54:59,969
+Mechanical Turk, also suffered from, you
+know, putting together this platform.
+
+507
+00:54:59,969 --> 00:55:08,599
+But it's a very exciting dataset. We
+started to put together
+
+508
+00:55:08,599 --> 00:55:15,900
+annual competitions called the ImageNet
+challenge for object recognition. And,
+
+509
+00:55:15,900 --> 00:55:22,440
+for example, the standard competition of
+image classification by ImageNet is a
+
+510
+00:55:22,440 --> 00:55:28,710
+thousand object classes over almost 1.5
+million images, and algorithms compete on
+
+511
+00:55:28,710 --> 00:55:34,220
+the performance. So actually, I just heard
+somebody 
+511
+00:55:28,710 --> 00:55:34,220
+the performance. So actually I just heard
+that somebody
+
+512
+00:55:34,219 --> 00:55:38,589
+on social media was referring to the ImageNet
+challenge as the Olympics of computer vision. I was very
+
+513
+00:55:38,590 --> 00:55:40,240
+flattered.
+
+514
+00:55:40,239 --> 00:55:55,649
+Now, bringing us closer to the history-making
+moment: the ImageNet challenge started in 2010;
+
+515
+00:55:55,650 --> 00:56:00,369
+that's actually around the time our
+PASCAL colleagues
+
+516
+00:56:00,369 --> 00:56:05,309
+told us they were going to start phasing out
+their 20-object challenge, so we phased
+
+517
+00:56:05,309 --> 00:56:12,039
+in the thousand-object ImageNet
+challenge. The y-axis is the error rate,
+
+518
+00:56:12,039 --> 00:56:18,199
+and we started with a very significant
+error rate, and of course
+
+519
+00:56:18,199 --> 00:56:28,029
+every year it decreased, but there is one
+particular year where it really decreased: it
+
+520
+00:56:28,030 --> 00:56:38,960
+was cut almost in half. That was 2012. 2012
+is the year that the winning architecture
+
+521
+00:56:38,960 --> 00:56:45,769
+of the ImageNet challenge was a
+convolutional neural network. I will talk
+
+522
+00:56:45,769 --> 00:56:53,250
+about it. It was not invented in 2012,
+despite how all the news coverage makes it feel
+
+523
+00:56:53,250 --> 00:56:58,190
+like it's the newest thing around the
+block. It's not; it was invented back in
+
+524
+00:56:58,190 --> 00:56:59,349
+the seventies and eighties,
+
+525
+00:56:59,349 --> 00:57:05,279
+but with a convergence of things we'll
+talk about, the convolutional neural
+
+526
+00:57:05,280 --> 00:57:10,519
+network showed its massive power as a
+high-capacity, end-to-end trainable
+
+527
+00:57:10,519 --> 00:57:18,219
+architecture, and won the ImageNet
+challenge by a huge margin, and that was
+
+528
+00:57:18,219 --> 00:57:24,829
+quite a historical moment. From a
+mathematical point of view it wasn't
+
+529
+00:57:24,829 --> 00:57:30,079
+that new, but from an engineering and
+solving-real-world-problems point of view this
+
+530
+00:57:30,079 --> 00:57:35,090
+was a historical moment. That piece of
+work was covered, you know, numerous
+
+531
+00:57:35,090 --> 00:57:42,400
+times, and all this is the onset,
+the beginning, of the deep learning
+
+532
+00:57:42,400 --> 00:57:48,869
+revolution, if you want to call it that, and
+this is the premise of this class. So at this
+
+533
+00:57:48,869 --> 00:57:54,609
+point I'm going to switch. So we went
+through a brief history of computer
+
+534
+00:57:54,610 --> 00:57:59,539
+vision, covering 540 million years.
+
+535
+00:57:59,539 --> 00:58:05,869
+Now an overview of this class. Is there
+any other question?
+
+536
+00:58:05,869 --> 00:58:13,969
+Alright. So, even though it was kind of
+overwhelming, we talked a lot
+
+537
+00:58:13,969 --> 00:58:20,559
+about the many different tasks in computer
+vision. CS231n is going to focus
+
+538
+00:58:20,559 --> 00:58:27,849
+on the visual recognition problem, by and
+large, and especially through most of the
+
+539
+00:58:27,849 --> 00:58:29,509
+foundational lectures,
+
+540
+00:58:29,510 --> 00:58:35,750
+on image classification. From now on, almost
+everything we talk about is going to be
+
+541
+00:58:35,750 --> 00:58:41,480
+based on that ImageNet classification
+setup. We will be getting to other
+
+542
+00:58:41,480 --> 00:58:47,900
+visual recognition scenarios, but the
+image classification problem is the main
+
+543
+00:58:47,900 --> 00:58:52,780
+problem we will focus on in this class,
+which means please keep in mind that
+
+544
+00:58:52,780 --> 00:58:56,600
+visual recognition is not just image
+classification, right? There is 3D
+
+545
+00:58:56,599 --> 00:59:01,339
+modeling, there is grouping and
+segmentation, and all this, but that's
+
+546
+00:59:01,340 --> 00:59:06,250
+what we'll focus on. And I don't
+need to convince you that even
+
+547
+00:59:06,250 --> 00:59:11,000
+application-wise, image classification is
+an extremely useful problem,
+
+548
+00:59:11,000 --> 00:59:17,929
+from the point of view of big
+commercial Internet companies to
+
+549
+00:59:17,929 --> 00:59:22,449
+startup ideas: you know, you want to
+recognize objects, you want to recognize
+
+550
+00:59:22,449 --> 00:59:29,119
+food, do online shopping, mobile shopping,
+you want to sort your photo albums. So image
+
+551
+00:59:29,119 --> 00:59:35,710
+classification is, or can be, a
+bread-and-butter task for many, many
+
+552
+00:59:35,710 --> 00:59:44,650
+important problems. There are problems
+that are related to classification, and
+
+553
+00:59:44,650 --> 00:59:49,329
+today I don't expect you to understand
+the differences, but I want you to hear
+
+554
+00:59:49,329 --> 00:59:55,659
+that this class will make sure you
+learn to understand the nuances and
+
+555
+00:59:55,659 --> 01:00:01,879
+the details of the different flavors of
+visual recognition: what is image
+
+556
+01:00:01,880 --> 01:00:07,700
+classification, what's object detection,
+what's image captioning. These have
+
+557
+01:00:07,699 --> 01:00:14,529
+different flavors. For example, while
+image classification may
+
+558
+01:00:14,530 --> 01:00:19,740
+focus on the whole big image, object
+detection will tell you where things
+
+559
+01:00:19,739 --> 01:00:23,579
+exactly are, like where the car is, the
+pedestrian,
+
+560
+01:00:23,579 --> 01:00:30,159
+and what the relationships
+between objects are, and so on.
+
+561
+01:00:30,159 --> 01:00:35,529
+So there are nuances and details that
+you will be learning about in this class.
+
+562
+01:00:35,530 --> 01:00:43,840
+And I already said CNN, or convolutional
+neural network, is one type of deep learning
+
+563
+01:00:43,840 --> 01:00:50,910
+architecture, but it's the overwhelmingly
+successful deep learning architecture, and
+
+564
+01:00:50,909 --> 01:00:54,909
+this is the architecture we will be
+focusing on. And just to go back to the
+
+565
+01:00:54,909 --> 01:01:02,849
+ImageNet challenge: so I said the
+historical year is 2012. This is the year
+
+566
+01:01:02,849 --> 01:01:14,349
+Alex Krizhevsky and Geoff Hinton proposed
+this convolutional, I think it's a seven-
+
+567
+01:01:14,349 --> 01:01:20,500
+layer, convolutional neural network to win
+the ImageNet challenge. The model before
+
+568
+01:01:20,500 --> 01:01:22,318
+this year was
+
+569
+01:01:22,318 --> 01:01:30,548
+a SIFT-features-plus-support-vector-
+machine architecture. It's still hierarchical,
+
+570
+01:01:30,548 --> 01:01:38,449
+but it doesn't have that flavor of
+end-to-end learning. Fast forward to 2015:
+
+571
+01:01:38,449 --> 01:01:43,798
+the winning architecture is still a
+convolutional neural network. It's a
+
+572
+01:01:43,798 --> 01:01:56,599
+hundred-and-fifty-two-layer network by
+Microsoft Research Asia researchers, and it's
+
+573
+01:01:56,599 --> 01:02:03,048
+called, for a clear reason, the Residual
+Net. I'm not so sure we will cover
+
+574
+01:02:03,048 --> 01:02:09,369
+that; definitely don't expect to know
+what every single layer does. Actually,
+
+575
+01:02:09,369 --> 01:02:17,269
+they repeat themselves a lot. But every
+year since 2012 the winning architecture
+
+576
+01:02:17,268 --> 01:02:23,548
+of the ImageNet challenge has been a deep-
+learning-based architecture. So like I
+
+577
+01:02:23,548 --> 01:02:32,369
+said, I also want you to respect history:
+this was not invented overnight. There are a lot
+
+578
+01:02:32,369 --> 01:02:37,979
+of influential players today, but you
+know, there are a lot of people who built
+
+579
+01:02:37,978 --> 01:02:41,879
+the foundation. I actually don't have the
+slides, but one important thing to remember
+
+580
+01:02:41,880 --> 01:02:50,910
+is Kunihiko Fukushima. Fukushima
+was a Japanese scientist who built a
+
+581
+01:02:50,909 --> 01:02:58,798
+model called the Neocognitron, and that
+was the beginning of the neural network
+
+582
+01:02:58,798 --> 01:03:04,318
+architecture. And Yann LeCun is also a
+very influential person, and really
+
+583
+01:03:04,318 --> 01:03:10,248
+the groundbreaking work, in my
+opinion, of Yann LeCun was published in
+
+584
+01:03:10,248 --> 01:03:16,348
+the nineteen nineties. So that's when
+mathematicians, with whom Geoff Hinton,
+
+585
+01:03:16,349 --> 01:03:22,479
+Yann LeCun's adviser, was involved,
+worked out the backpropagation learning
+
+586
+01:03:22,478 --> 01:03:28,088
+strategy, which, if that doesn't mean
+anything to you yet, Andrej will tell you in a couple
+
+587
+01:03:28,088 --> 01:03:34,528
+of weeks. But the mathematical
+model was roughed out in the eighties and
+
+588
+01:03:34,528 --> 01:03:34,920
+the
+
+589
+01:03:34,920 --> 01:03:40,869
+nineties, and at this point Yann LeCun was
+working for Bell Labs at AT&T, which was an
+
+590
+01:03:40,869 --> 01:03:47,160
+amazing place at that time; there's no
+Bell Labs today anymore. They were
+
+591
+01:03:47,159 --> 01:03:50,949
+working on really ambitious projects, and
+he needed to recognize digits,
+
+592
+01:03:50,949 --> 01:03:57,019
+because eventually that product was
+shipped to banks and the US Post
+
+593
+01:03:57,019 --> 01:04:03,380
+Office to recognize digits and checks,
+and he constructed this convolutional
+
+594
+01:04:03,380 --> 01:04:08,068
+neural network. And this is where,
+inspired by Hubel and Wiesel, he
+
+595
+01:04:08,068 --> 01:04:14,500
+starts by looking at simple edge-like
+structures in an image. It's not like the
+
+596
+01:04:14,500 --> 01:04:20,099
+whole letter eight; it really starts with
+edges, and then layer by layer it
+
+597
+01:04:20,099 --> 01:04:25,539
+filters these edges, pools them together,
+filters, pools. And then you feed this
+
+598
+01:04:25,539 --> 01:04:36,230
+architecture... in 2012, Alex Krizhevsky and
+Geoff Hinton used almost exactly the same
+
+599
+01:04:36,230 --> 01:04:40,900
+architecture to participate in the
+
+600
+01:04:40,900 --> 01:04:47,900
+ImageNet challenge. There are a few
+changes, but that became the winning
+
+601
+01:04:47,900 --> 01:04:54,920
+architecture. So we'll tell you more
+about the detailed changes later. The
+
+602
+01:04:54,920 --> 01:05:02,380
+capacity of the model did grow a little bit,
+because Moore's law helped us; there's
+
+603
+01:05:02,380 --> 01:05:08,220
+also the nonlinearity function, which
+changed a little bit of its shape,
+
+604
+01:05:08,219 --> 01:05:14,828
+from a sigmoid to a rectified linear
+shape, but whatever, there's a couple of
+
+605
+01:05:14,829 --> 01:05:19,130
+small changes, but really, by and large,
+nothing had changed
+
+606
+01:05:19,130 --> 01:05:26,490
+mathematically. But important things did
+change, and they grew the deep learning
+
+607
+01:05:26,489 --> 01:05:35,379
+architecture into its renaissance.
+One is, like I said, Moore's law and
+
+608
+01:05:35,380 --> 01:05:41,180
+hardware. Hardware made a huge difference,
+because these are extremely high-
+
+609
+01:05:41,179 --> 01:05:44,669
+capacity models. Yann LeCun's was
+
+610
+01:05:44,670 --> 01:05:50,720
+painfully slow because of the
+bottleneck of computation: he couldn't
+
+611
+01:05:50,719 --> 01:05:55,209
+make this model too big, and if you cannot
+make it big you cannot fully
+
+612
+01:05:55,210 --> 01:06:00,670
+realize its potential. From a machine
+learning standpoint there's overfitting and
+
+613
+01:06:00,670 --> 01:06:07,780
+all these problems. But now we have
+much faster, bigger
+
+614
+01:06:07,780 --> 01:06:16,410
+transistors and microchips, and GPUs from
+NVIDIA made a huge difference in deep
+
+615
+01:06:16,409 --> 01:06:22,358
+learning history, in that we can now
+train these models in a reasonable amount
+
+616
+01:06:22,358 --> 01:06:27,358
+of time even if they're huge. And the other
+thing I think we do need to take our hat off
+
+617
+01:06:27,358 --> 01:06:37,159
+to is data, the availability of data,
+the big data. Data itself is just, you
+
+618
+01:06:37,159 --> 01:06:41,078
+know, it doesn't mean anything if you
+don't know how to use it, but in this
+
+619
+01:06:41,079 --> 01:06:45,869
+deep learning architecture data became
+the driving force for the high-capacity
+
+620
+01:06:45,869 --> 01:06:52,390
+model, to enable the end-to-end training
+and to help avoid overfitting when
+
+621
+01:06:52,389 --> 01:06:57,608
+you have enough data. So if you look
+at the number of pixels that
+
+622
+01:06:57,608 --> 01:07:05,639
+machine learning people had in 2012 versus
+what Yann LeCun had in 1998, it's a huge
+
+623
+01:07:05,639 --> 01:07:06,469
+difference:
+
+624
+01:07:06,469 --> 01:07:14,469
+orders of magnitude. So that
+is the focus of 231n,
+
+625
+01:07:14,469 --> 01:07:21,098
+but it's also important, one last
+time, that I'm driving home this idea
+
+626
+01:07:21,099 --> 01:07:27,048
+that visual intelligence does go beyond
+object recognition. I don't want any of
+
+627
+01:07:27,048 --> 01:07:31,039
+you coming out of this course thinking
+we've done everything, you know, we've
+
+628
+01:07:31,039 --> 01:07:38,889
+conquered the entire space of
+visual recognition. It's not true; there
+
+629
+01:07:38,889 --> 01:07:44,460
+are still a lot of cool problems to
+solve. For example, labeling an
+
+630
+01:07:44,460 --> 01:07:51,650
+entire scene with perceptual grouping, so
+I know where every single pixel belongs
+
+631
+01:07:51,650 --> 01:07:52,329
+to:
+
+632
+01:07:52,329 --> 01:07:56,900
+that's still an ongoing problem. Combining
+
+633
+01:07:56,900 --> 01:08:02,740
+recognition with 3D: there's
+a lot of excitement happening at the
+
+634
+01:08:02,739 --> 01:08:09,349
+intersection of vision and robotics; this
+is definitely one such area.
+
+635
+01:08:09,349 --> 01:08:15,039
+And then anything to do with motion
+and video, and this is another
+
+636
+01:08:15,039 --> 01:08:33,289
+big open area of research. Beyond just
+naming a scene, you actually want to
+
+637
+01:08:33,289 --> 01:08:35,689
+deeply understand the picture:
+
+638
+01:08:35,689 --> 01:08:39,489
+what people are doing, what the
+relationships between objects are, and
+
+639
+01:08:39,489 --> 01:08:45,029
+what the relations between objects and
+people are. This is an ongoing project
+
+640
+01:08:45,029 --> 01:08:49,759
+called Visual Genome in my lab that a
+number of my students are
+
+641
+01:08:49,760 --> 01:08:55,739
+involved in, and this goes far beyond the
+image classification we talked about.
+
+642
+01:08:55,739 --> 01:09:03,639
+And what is one of our holy grails? Well,
+one of the holy grails of the community is
+
+643
+01:09:03,640 --> 01:09:09,260
+to be able to tell the story of a scene.
+So think about you as a human:
+
+644
+01:09:09,260 --> 01:09:11,180
+you open your eyes,
+
+645
+01:09:11,180 --> 01:09:17,840
+and the moment you open your eyes you're
+able to describe what you see. In fact, in
+
+646
+01:09:17,840 --> 01:09:24,940
+psychology experiments we find that even
+if you show people this picture for only
+
+647
+01:09:24,939 --> 01:09:30,659
+five hundred milliseconds, that's
+literally half of a second, people can
+
+648
+01:09:30,659 --> 01:09:36,769
+write essays about it. We paid them
+$10 an hour, so they didn't
+
+649
+01:09:36,770 --> 01:09:42,410
+write that much; it wasn't that long. But
+you know, I figure if we paid more money they
+
+650
+01:09:42,409 --> 01:09:47,970
+probably could write longer essays. But the
+point is that our visual system is
+
+651
+01:09:47,970 --> 01:09:54,390
+extremely powerful: we can tell stories.
+And a dream of this, this is actually
+
+652
+01:09:54,390 --> 01:10:02,560
+Andrej's dissertation, is that we give
+a computer one picture and
+
+653
+01:10:02,560 --> 01:10:03,960
+out comes a
+
+654
+01:10:03,960 --> 01:10:09,159
+description like this. And we are getting
+there; you'll see that you give the
+
+655
+01:10:09,159 --> 01:10:15,149
+computer one picture and it gives you one
+sentence, or you give the computer one picture
+
+656
+01:10:15,149 --> 01:10:20,319
+and it gives a few short sentences. We're
+not there yet, but that's one of the holy
+
+657
+01:10:20,319 --> 01:10:26,250
+grails. And the other holy grail,
+continuing this, I think is
+
+658
+01:10:26,250 --> 01:10:33,659
+summarized really well by Andrej's blog
+post: you know, like this, right? There
+
+659
+01:10:33,659 --> 01:10:42,300
+is so much nuance in this picture
+that you get to enjoy. Not only do
+
+660
+01:10:42,300 --> 01:10:47,890
+you recognize the global scene: it would be
+very boring if all a computer can tell you is
+
+661
+01:10:47,890 --> 01:10:53,650
+that it's a room, a scale,
+
+662
+01:10:53,649 --> 01:10:58,238
+whatever, people in a locker room, that's
+it. You know, here you recognize who they
+
+663
+01:10:58,238 --> 01:11:00,569
+are, you recognize the trick
+
+664
+01:11:00,569 --> 01:11:06,009
+Obama is playing, you recognize the kind of
+interaction, you recognize the humor. You
+
+665
+01:11:06,010 --> 01:11:11,250
+recognize... there's just so much nuance; this
+is what our visual world is about. We
+
+666
+01:11:11,250 --> 01:11:18,719
+use our ability of visual understanding
+not only to survive, navigate, and
+
+667
+01:11:18,719 --> 01:11:26,000
+manipulate, but we use it to socialize, to
+entertain, to understand the world, and
+
+668
+01:11:26,000 --> 01:11:32,929
+this is where vision, and all the big
+goals of vision, is heading. And I
+
+669
+01:11:32,929 --> 01:11:39,630
+don't need to convince you that computer
+vision technology will make our world a
+
+670
+01:11:39,630 --> 01:11:46,550
+better place, despite some scary talk
+out there. You know, even
+
+671
+01:11:46,550 --> 01:11:51,029
+today, in industry as well as the
+research world, we're using computer
+
+672
+01:11:51,029 --> 01:11:58,349
+vision to build better robots, to save
+lives, to do deep exploration, and so on.
+
+673
+01:11:58,350 --> 01:12:02,860
+OK, so I have, like, what,
+35 minutes left?
+
+674
+01:12:02,859 --> 01:12:10,839
+Great timing. Let me introduce the team. Andrej
+and Justin are the co-instructors with
+
+675
+01:12:10,840 --> 01:12:16,989
+me. TAs, please stand up
+to say hi.
+
+676
+01:12:16,989 --> 01:12:22,639
+Can you, like, say your name quickly and
+what you do? You don't have to give
+
+677
+01:12:22,640 --> 01:12:49,180
+a speech, but yes.
+
+678
+01:12:49,180 --> 01:13:42,240
+[TA introductions and course logistics;
+largely inaudible on the recording]
+
+679
+01:13:42,239 --> 01:14:04,739
+It's a personal issue, but again,
+I'm going on leave for a
+
+680
+01:14:04,739 --> 01:14:09,939
+few weeks starting the end of January.
+So please, if you decide you just
+
+681
+01:14:09,939 --> 01:14:15,379
+want to send email, don't send it only to me;
+the TAs and instructors will take
+
+682
+01:14:15,380 --> 01:14:20,770
+care of it. I'm unlikely to reply
+promptly, sorry about that:
+
+683
+01:14:20,770 --> 01:14:25,420
+priorities.
+
+684
+01:14:25,420 --> 01:14:34,739
+About our philosophy, and we're not
+getting into the details: we really want
+
+685
+01:14:34,738 --> 01:14:39,448
+this to be a very hands-on class.
+I give a lot of credit to
+
+686
+01:14:39,448 --> 01:14:46,419
+Justin and Andrej; they are extremely good
+at walking through these hands-on
+
+687
+01:14:46,420 --> 01:14:51,840
+details with you, so that when you come
+out of this class you not only have a
+
+688
+01:14:51,840 --> 01:14:57,719
+high-level understanding, but you
+have a really good ability to build
+
+689
+01:14:57,719 --> 01:15:02,010
+your own deep learning code. We want you
+to be exposed to state-of-the-art
+
+690
+01:15:02,010 --> 01:15:08,730
+material; you're going to be learning things
+that are as fresh as 2015, and it'll
+
+691
+01:15:08,729 --> 01:15:11,859
+be fun: you get to do things like this,
+
+692
+01:15:11,859 --> 01:15:18,960
+not all the time, but like turning a
+picture into a Van Gogh, or this weird
+
+693
+01:15:18,960 --> 01:15:27,489
+thing. It'll be a fun class. In addition
+to all the important things you
+
+694
+01:15:27,488 --> 01:15:33,589
+learn, we do have grading policies; these
+are all on our website, and I won't
+
+695
+01:15:33,590 --> 01:15:44,929
+reiterate them. Again, to be very clear:
+you are grown-ups, and we treat you like
+
+696
+01:15:44,929 --> 01:15:51,989
+grown-ups. We do not take excuses at the
+end of the course like "my professor wants
+
+697
+01:15:51,988 --> 01:15:56,359
+me to go to this conference and I have to
+have, like, three more late days." No:
+
+698
+01:15:56,359 --> 01:16:03,630
+you are responsible for budgeting your total
+late days. You have seven late days; you can use
+
+699
+01:16:03,630 --> 01:16:11,079
+them in whatever way you want. Beyond
+those, you have to take a penalty,
+
+700
+01:16:11,079 --> 01:16:18,069
+unless it's a really, really exceptional
+medical or family emergency;
+
+701
+01:16:18,069 --> 01:16:21,799
+talk to us on an individual basis
+then. But anything else,
+
+702
+01:16:21,800 --> 01:16:29,539
+a conference deadline or, finally, you
+know, a missing cat or whatever:
+
+703
+01:16:29,539 --> 01:16:37,850
+we budgeted that into the seven days. The
+other thing is the honor code. This is one
+
+704
+01:16:37,850 --> 01:16:43,190
+thing I have to say with a really
+straight face: you are at such a privileged
+
+705
+01:16:43,189 --> 01:16:50,710
+institution, you are grown-ups, and I want
+you to be responsible for the honor
+
+706
+01:16:50,710 --> 01:16:55,239
+code. Every single Stanford student
+taking this class should know the honor
+
+707
+01:16:55,239 --> 01:16:58,619
+code; if you don't, there's no excuse:
+you should go back and read it.
+
+708
+01:16:58,619 --> 01:17:04,840
+We take collaboration extremely seriously.
+I almost hate to say that, statistically,
+
+709
+01:17:04,840 --> 01:17:10,380
+given a class this big, we will have a
+few cases, but I want you to be an
+
+710
+01:17:10,380 --> 01:17:16,210
+exceptional class: even with a size this
+big, we do not want to see anything that
+
+711
+01:17:16,210 --> 01:17:22,399
+infringes on the academic honor code, so read
+the collaboration policies, because
+
+712
+01:17:22,399 --> 01:17:31,960
+this is really about respecting yourself.
+I think with all these prereqs you can
+
+713
+01:17:31,960 --> 01:17:38,149
+read them on your own. With anything else
+I might want to say... is there any burning
+
+714
+01:17:38,149 --> 01:17:47,569
+question that you feel is worth asking? Yes?
+
+715
+01:17:47,569 --> 01:18:06,689
+OK.
+
diff --git a/captions/En/Lecture2_en.srt b/captions/En/Lecture2_en.srt
new file mode 100644
index 00000000..52b86a53
--- /dev/null
+++ b/captions/En/Lecture2_en.srt
@@ -0,0 +1,3644 @@
+1
+00:00:00,000 --> 00:00:03,750
+And we're recording. OK, great.
+Just to remind you again:
+
+2
+00:00:03,750 --> 00:00:08,160
+we are recording the classes, so if you're
+uncomfortable speaking on camera,
+
+3
+00:00:08,160 --> 00:00:15,929
+you're not in the picture, but your voice might
+be on the recording. OK, great. As you can
+
+4
+00:00:15,929 --> 00:00:19,589
+see, also, the screen is wider than it
+should be and I'm not sure how to fix it,
+
+5
+00:00:19,589 --> 00:00:21,300
+so we'll have to live with it.
+
+6
+00:00:21,300 --> 00:00:25,269
+Luckily your visual cortex is very good
+and very invariant to stretching, so
+
+7
+00:00:25,268 --> 00:00:26,118
+this is not a problem.
+
+8
+00:00:26,118 --> 00:00:32,259
+OK, so let's start with some administrative
+things before we dive into the class. The
+
+9
+00:00:32,259 --> 00:00:36,100
+first assignment will come out tonight
+or early tomorrow. It is due January
+
+10
+00:00:36,100 --> 00:00:41,289
+20th, so you have exactly two weeks. You will be
+writing a nearest neighbor classifier, a linear classifier,
+
+11
+00:00:41,289 --> 00:00:44,159
+and a small two-layer neural network;
+you'll be writing the entirety of the
+
+12
+00:00:44,159 --> 00:00:47,979
+backpropagation algorithm for the two-layer
+neural network. We'll cover all that
+
+13
+00:00:47,979 --> 00:00:54,459
+material in the next two weeks. And a warning,
+by the way: there are some assignments from last
+
+14
+00:00:54,460 --> 00:00:57,350
+year up as well, and we're changing the
+assignments, so please do not
+
+15
+00:00:57,350 --> 00:01:02,890
+complete the 2015 assignments; that's
+something to be aware of. For your
+
+16
+00:01:02,890 --> 00:01:07,109
+computation you will be using Python
+and numpy, and we will also be offering
+
+17
+00:01:07,109 --> 00:01:11,030
+terminal.com, which is basically
+virtual machines in the
+
+18
+00:01:11,030 --> 00:01:13,939
+cloud that you can use if you don't have
+a very good laptop, and so on.
+
+19
+00:01:13,938 --> 00:01:17,250
+I'll go into detail about it, but I'd
+just like to point out that for the first
+
+20
+00:01:17,250 --> 00:01:21,090
+assignment we assume that you'll be
+relatively familiar with Python: you'll
+
+21
+00:01:21,090 --> 00:01:24,859
+be writing these optimized numpy
+expressions where you are manipulating
+
+22
+00:01:24,859 --> 00:01:28,438
+these matrices and vectors in very
+efficient forms.
+
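As a concrete taste of the kind of optimized numpy expression just mentioned, here is a minimal illustrative sketch; the array sizes are made up, and this is not the assignment code:

~~~python
import numpy as np

Xtr = np.random.randn(5000, 3072)  # hypothetical training set, one flattened image per row
x = np.random.randn(3072)          # one test image, flattened

# Slow: an explicit Python loop over every training row.
dists_loop = np.array([np.sum(np.abs(Xtr[i] - x)) for i in range(Xtr.shape[0])])

# Fast: one broadcasted expression does the same work in compiled code.
dists_vec = np.sum(np.abs(Xtr - x), axis=1)

assert np.allclose(dists_loop, dists_vec)
~~~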
+23
+00:01:28,438 --> 00:01:31,908
+So for example, if you're seeing code like this
+and it doesn't mean anything to you, then please have a look
+
+24
+00:01:31,909 --> 00:01:35,880
+at our Python tutorial that is up on the
+website as well. It's written by Justin
+
+25
+00:01:35,879 --> 00:01:39,489
+and is very good, so go through that
+and familiarize yourself with the
+
+26
+00:01:39,489 --> 00:01:42,328
+notation, because you'll be
+writing a lot of code that looks like
+
+27
+00:01:42,328 --> 00:01:47,048
+this, where we're doing all these optimized
+operations so they're fast enough to run
+
+28
+00:01:47,049 --> 00:01:51,610
+on the CPU. Now in terms of terminal,
+basically what this amounts to is that
+
+29
+00:01:51,609 --> 00:01:54,599
+we'll give you a link to the assignment,
+you'll go to a web page and you'll see
+
+30
+00:01:54,599 --> 00:01:58,309
+something like this. This is a virtual
+machine in the cloud that has been set
+
+31
+00:01:58,310 --> 00:02:01,420
+up with all the dependencies of the
+assignment. They're all installed already,
+
+32
+00:02:01,420 --> 00:02:05,618
+the data is already there, and so you
+click on Launch Machine and this will
+
+33
+00:02:05,618 --> 00:02:09,580
+basically bring you to something like
+this. This is running in your browser, and
+
+34
+00:02:09,580 --> 00:02:13,060
+this is basically a thin layer on top of
+an AWS
+
+35
+00:02:13,060 --> 00:02:17,209
+machine, a UI layer. Here you have an
+IPython notebook and a little
+
+36
+00:02:17,209 --> 00:02:20,739
+terminal, and you can look around; this is
+just like a machine in the cloud, and
+
+37
+00:02:20,739 --> 00:02:24,310
+they have some CPU offerings and they also
+have some GPU machines that you can
+
+38
+00:02:24,310 --> 00:02:25,539
+use, and so on.
+
+39
+00:02:25,539 --> 00:02:29,090
+Normally you have to pay for terminal, but
+we will be distributing credits to you, so
+
+40
+00:02:29,090 --> 00:02:33,709
+you just write to a specific TA, which we'll
+decide in a bit: you email the TA and
+
+41
+00:02:33,709 --> 00:02:36,950
+ask for money, we'll send you money, and we
+keep track of how much money we've sent to
+
+42
+00:02:36,949 --> 00:02:40,799
+all the people, so you have to be
+responsible with the funds. So this is
+
+43
+00:02:40,800 --> 00:02:55,689
+also an option for you to use if you like.
+OK, any details... you can read
+
+44
+00:02:55,689 --> 00:02:57,680
+about it if you like; it's not required
+for the assignment,
+
+45
+00:02:57,680 --> 00:03:03,879
+but you can probably figure it out. OK,
+that's it; let's get to the
+
+46
+00:03:03,879 --> 00:03:07,870
+lecture now. Today we'll be talking about
+image classification, and specifically we'll
+
+47
+00:03:07,870 --> 00:03:13,219
+start off on linear classifiers. So when
+we talk about classification, the basic
+
+48
+00:03:13,219 --> 00:03:17,560
+task is that we have some number of
+categories, say dog, cat, truck, plane, and so
+
+49
+00:03:17,560 --> 00:03:20,799
+on; we get to decide what these are, and
+then the classifier has to take an image,
+
+50
+00:03:20,799 --> 00:03:24,950
+which is a giant grid of numbers, and
+transform it into one of these
+
+51
+00:03:24,949 --> 00:03:29,169
+labels; we have to bin it into one of
+the categories. We'll spend
+
+52
+00:03:29,169 --> 00:03:32,548
+most of our time talking about this problem
+specifically, but if you'd like to do any
+
+53
+00:03:32,549 --> 00:03:36,349
+other task in computer vision, such as
+detection, image captioning, or segmentation,
+
+54
+00:03:36,349 --> 00:03:40,108
+or whatever else, you'll find that once you
+know about classification and how
+
+55
+00:03:40,109 --> 00:03:43,569
+that's done, everything else is just a tiny
+delta built on top of it, so you'll be in
+
+56
+00:03:43,568 --> 00:03:47,060
+a great position to do any of the other
+tasks. So it's really good for conceptual
+
+57
+00:03:47,060 --> 00:03:50,840
+understanding, and we'll work through it
+as a specific example to simplify
+
+58
+00:03:50,840 --> 00:03:54,819
+things in the beginning. Now, why is this
+problem hard? Just to give an idea, the
+
+59
+00:03:54,818 --> 00:03:58,518
+problem is what we refer to as the
+semantic gap: this image here is a giant
+
+60
+00:03:58,519 --> 00:04:01,739
+grid of numbers. The way images are
+represented in the computer is that this
+
+61
+00:04:01,739 --> 00:04:06,299
+is basically, say, roughly a 300 by a
+hundred by three, so a three-
+
+62
+00:04:06,299 --> 00:04:09,620
+dimensional array, and the three is from the
+three color channels, red, green and blue,
+
+63
+00:04:09,620 --> 00:04:13,590
+and so when you zoom in on a part of
+that image it's basically a giant grid of
+
+64
+00:04:13,590 --> 00:04:18,728
+numbers between 0 and 255. So that's what
+we have to work with. These numbers
+
+65
+00:04:18,728 --> 00:04:21,370
+indicate the amount of brightness in
+all the three color channels at every
+
+66
+00:04:21,370 --> 00:04:25,569
+single position in the image, and so the
+reason that image classification is
+
+67
+00:04:25,569 --> 00:04:26,269
+difficult:
+
+68
+00:04:26,269 --> 00:04:29,519
+when you think about what we have to
+work with, it's like millions of
+
+69
+00:04:29,519 --> 00:04:33,899
+numbers of that form, and having to
+classify things like cats, it quickly
+
+70
+00:04:33,899 --> 00:04:38,339
+becomes apparent how complex the
+task is. So for example, the camera can be
+
+71
+00:04:38,339 --> 00:04:42,689
+rotated around this cat, and it can be
+zoomed in, and we can shift the
+
+72
+00:04:42,689 --> 00:04:46,769
+focal properties, and the axes of the
+camera can be different, and think about
+
+73
+00:04:46,769 --> 00:04:49,769
+what happens to the brightness values
+in this grid as you actually do all
+
+74
+00:04:49,769 --> 00:04:52,779
+these transformations with a camera: they will
+completely shift, all the patterns are
+
+75
+00:04:52,779 --> 00:04:56,559
+changing, and we have to be robust to all
+of this. There are also many other
+
+76
+00:04:56,560 --> 00:05:00,709
+challenges, for example challenges
+of illumination. Here we have a
+
+77
+00:05:00,709 --> 00:05:07,728
+white cat; we actually have two of them,
+but you can see that one cat is
+
+78
+00:05:07,728 --> 00:05:11,098
+clearly illuminated quite a bit and the
+other is not, but you can still recognize
+
+79
+00:05:11,098 --> 00:05:14,750
+two cats. And so think about, again, the
+brightness values on the level of the
+
+80
+00:05:14,750 --> 00:05:18,329
+grid and what happens to them as you
+change all the different things, and all
+
+81
+00:05:18,329 --> 00:05:21,279
+the possible lighting schemes that we
+can have in the world: we have to be robust
+
+82
+00:05:21,279 --> 00:05:28,179
+to all that. There are issues of
+deformation; many classes have lots of strange
+
+83
+00:05:28,180 --> 00:05:33,668
+arrangements of these objects we'd like
+to recognize, so cats can come in
+
+84
+00:05:33,668 --> 00:05:37,468
+very different poses.
+
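To make the "giant grid of numbers" from a few cues back concrete, here is a small sketch; the 300 by 100 by 3 shape follows the example above, and the random array is just a stand-in for a real photo:

~~~python
import numpy as np

# An image is a 3D array: height x width x 3 color channels (red, green, blue),
# with brightness values stored as integers between 0 and 255.
img = np.random.randint(0, 256, size=(300, 100, 3), dtype=np.uint8)
print(img.shape, img.dtype)  # (300, 100, 3) uint8

# A global illumination change shifts every single one of those numbers,
# which is part of why raw pixel values are so brittle as features.
darker = np.clip(img.astype(np.int32) - 50, 0, 255).astype(np.uint8)
~~~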
+85
+00:05:37,468 --> 00:05:41,449
+When I create the slides they're quite dry,
+there's a lot of math and science; this is
+
+86
+00:05:41,449 --> 00:05:45,939
+the only time I get to have fun. Anyway: somehow
+everything has to be robust to all of these deformations.
+
+87
+00:05:45,939 --> 00:05:50,189
+You can still recognize the cat in all of
+these images despite these problems.
+
+88
+00:05:50,189 --> 00:05:54,240
+There is occlusion: sometimes we might not see
+the full cat, but you still recognize that it's a
+
+89
+00:05:54,240 --> 00:06:00,340
+cat: the cat behind a water bottle, and
+there's also a cat there inside a couch,
+
+90
+00:06:00,339 --> 00:06:06,068
+even though you're seeing just tiny
+pieces of this class. Basically, there are
+
+91
+00:06:06,069 --> 00:06:10,500
+problems of background clutter, so things
+can blend into the environment; we have
+
+92
+00:06:10,500 --> 00:06:15,300
+to be robust to that. And there's also
+intra-class variation, so for cats
+
+93
+00:06:15,300 --> 00:06:19,728
+there's a huge number of cat
+species, and so they can look different
+
+94
+00:06:19,728 --> 00:06:23,240
+ways, and we have to be robust to all of
+that. So I'd just like you to appreciate the
+
+95
+00:06:23,240 --> 00:06:26,718
+complexity of the task: any one of these
+considered independently is difficult,
+
+96
+00:06:26,718 --> 00:06:31,908
+but when you consider the cross product
+of all these different things and have
+
+97
+00:06:31,908 --> 00:06:35,769
+to work across all of that, it's actually
+quite amazing that anything works at all.
+
+98
+00:06:35,769 --> 00:06:39,539
+In fact, not only does it work, but it
+works really, really well: almost human
+
+99
+00:06:39,540 --> 00:06:43,740
+accuracy on categories like this, and we
+can do that in a few dozen milliseconds
+
+100
+00:06:43,740 --> 00:06:49,040
+with the current technology. And so that's
+what you'll learn about in this class.
+
+101
+00:06:49,040 --> 00:06:54,390
+What does a classifier look like? Basically
+we're taking this 3D array and we'd like
+
+102
+00:06:54,389 --> 00:06:57,539
+to produce a class label, and what I'd
+like you to notice is that there is no
+
+103
+00:06:57,540 --> 00:07:01,569
+obvious way of actually encoding, writing
+out, one of these classifiers, right?
+
+104
+00:07:01,569 --> 00:07:04,790
+There's no simple algorithm. Say you're
+taking an algorithms class early in a
+
+105
+00:07:04,790 --> 00:07:08,379
+computer science curriculum and you're writing
+bubble sort or you're writing something
+
+106
+00:07:08,379 --> 00:07:11,939
+else to do some particular task: you can
+intuit all the possible steps and you
+
+107
+00:07:11,939 --> 00:07:15,300
+can enumerate them, list them, and
+play with it and analyze it. But here
+
+108
+00:07:15,300 --> 00:07:18,530
+there's no algorithm for detecting a cat
+under all these variations, or it's
+
+109
+00:07:18,529 --> 00:07:21,509
+extremely difficult to think about how
+you would actually write that up: what is the
+
+110
+00:07:21,509 --> 00:07:26,039
+sequence of operations you would do on an
+arbitrary image to detect a cat? That's
+
+111
+00:07:26,040 --> 00:07:28,629
+not to say that people haven't tried,
+especially early in computer vision; there
+
+112
+00:07:28,629 --> 00:07:32,719
+were these explicit approaches, as I
+like to call them, where you think:
+
+113
+00:07:32,720 --> 00:07:37,240
+OK, a cat has, say, these little ear
+pieces, so we'll look for little ear
+
+114
+00:07:37,240 --> 00:07:40,910
+pieces. So what we'll do is we'll detect
+all the edges, trace out the edges, we'll
+
+115
+00:07:40,910 --> 00:07:45,380
+classify the different types of edges and
+their junctions, we'll create, you know,
+
+116
+00:07:45,379 --> 00:07:48,350
+libraries of these and we'll try to find
+their arrangements, and if we ever see
+
+117
+00:07:48,350 --> 00:07:52,150
+anything like that, we'll detect the cat;
+if we see a particular texture at some
+
+118
+00:07:52,149 --> 00:07:55,899
+particular frequencies, we'll detect the
+cat. So you can come up with some rules,
+
+119
+00:07:55,899 --> 00:07:59,870
+but the problem is that once I tell you,
+OK, I'd like to now recognize a
+
+120
+00:07:59,870 --> 00:08:03,569
+boat or a person, you have to go back to
+the drawing board and be like: OK,
+
+121
+00:08:03,569 --> 00:08:06,719
+what makes a boat exactly? Where are the
+original rules? Right, it's a completely un-
+
+122
+00:08:06,720 --> 00:08:11,590
+scalable approach to recognition. So the
+approach we will be taking in this class,
+
+123
+00:08:11,589 --> 00:08:16,699
+the approach that works much better, is the
+data-driven approach that we like in the
+
+124
+00:08:16,699 --> 00:08:20,170
+framework of machine learning. And just
+to point out, in the early days they
+
+125
+00:08:20,170 --> 00:08:23,840
+did not have the luxury of
+using data, because at that
+
+126
+00:08:23,839 --> 00:08:27,060
+point in time you'd be taking grayscale
+images of very low resolution
+
+127
+00:08:27,060 --> 00:08:30,250
+and trying to recognize things in them;
+it's obviously not going to work.
+
+128
+00:08:30,250 --> 00:08:33,769
+But with the availability of the Internet, a
+huge amount of data, I can search, for
+
+129
+00:08:33,769 --> 00:08:38,460
+example, for "cat" on Google and I get lots
+of cats everywhere, and we know that
+
+130
+00:08:38,460 --> 00:08:42,840
+these are cats based on the surrounding
+text in the web pages, so there's a lot
+
+131
+00:08:42,840 --> 00:08:46,060
+of data. So the way this now looks
+is that we have a training phase
+
+132
+00:08:46,059 --> 00:08:49,079
+where you give me lots of training
+examples of cats,
+
+133
+00:08:49,080 --> 00:08:52,900
+and you tell me that they're cats, and you
+give me lots of examples of any
+
+134
+00:08:52,899 --> 00:08:54,230
+other category you're interested in.
+
+135
+00:08:54,230 --> 00:08:59,920
+I go away and I train a model, a
+classifier, and I can then use that
+
+136
+00:08:59,919 --> 00:09:04,250
+model to actually classify data. So when
+I'm given a new image I can look at
+
+137
+00:09:04,250 --> 00:09:07,500
+my training data and I can do something
+with it based on just pattern
+
+138
+00:09:07,500 --> 00:09:13,759
+matching and statistics or so on. As a
+simple example that works within this
+
+139
+00:09:13,759 --> 00:09:17,279
+framework, consider the nearest neighbor
+classifier. The way the nearest neighbor
+
+140
+00:09:17,279 --> 00:09:20,939
+classifier works is that, effectively,
+we're given this training set, and what we'll
+
+141
+00:09:20,940 --> 00:09:23,970
+do at training time is just remember
+all the training data; we have all the
+
+142
+00:09:23,970 --> 00:09:27,820
+training data just sitting there, and I
+remember it. Now when you give me a test
+
+143
+00:09:27,820 --> 00:09:32,060
+image, what we'll do is we'll compare the
+test image to every single one of the
+
+144
+00:09:32,059 --> 00:09:36,729
+images we saw in the training data, and we'll
+just transfer the label over. So I'll
+
+145
+00:09:36,730 --> 00:09:41,149
+just look through all the images.
+As I go through
+
+146
+00:09:41,149 --> 00:09:43,740
+this I'd like to be as complete as
+possible, so we'll work with the specific
+
+147
+00:09:43,740 --> 00:09:47,740
+case of something called the CIFAR-10
+dataset. This dataset has 10
+
+148
+00:09:47,740 --> 00:09:53,129
+labels, there are 50,000 training
+images that you have access to, and then
+
+149
+00:09:53,129 --> 00:09:57,159
+there's a test set of 10,000 images
+where we're going to evaluate how well
+
+150
+00:09:57,159 --> 00:10:00,669
+the classifier is working. These images
+are quite tiny; it's a
+
+151
+00:10:00,669 --> 00:10:05,009
+dataset of 32 by 32 little thumbnail
+images. So the way the nearest neighbor
+
+152
+00:10:05,009 --> 00:10:07,809
+classifier would work is we take all
+this training data that's given to us,
+
+153
+00:10:07,809 --> 00:10:12,589
+fifty thousand images, and suppose
+we have these ten different examples
+
+154
+00:10:12,590 --> 00:10:15,920
+here, our test images, along the first
+column here. What we'll do is we'll look
+
+155
+00:10:15,919 --> 00:10:19,909
+up the nearest neighbors in the training
+set, the things that are most similar to
+
+156
+00:10:19,909 --> 00:10:24,139
+every one of those, independently. So there
+you see a ranked list of images
+
+157
+00:10:24,139 --> 00:10:30,220
+from the training data that are
+most similar to every one
+
+158
+00:10:30,220 --> 00:10:32,700
+of those test images over there. So in
+the first row we see that there's a
+
+159
+00:10:32,700 --> 00:10:36,230
+truck, I think, as the test image, and
+there are quite a few images that look
+
+160
+00:10:36,230 --> 00:10:40,490
+similar to it; we'll see exactly how we
+define similarity in a bit, but you can
+
+161
+00:10:40,490 --> 00:10:44,269
+see that the first retrieved result is in
+fact a horse, not a truck, and that's
+
+162
+00:10:44,269 --> 00:10:48,289
+because of just the arrangement of the
+blue sky that threw it off. So you can
+
+163
+00:10:48,289 --> 00:10:52,480
+see that this will probably not work
+very well. How do we define the distance
+
+164
+00:10:52,480 --> 00:10:55,470
+measure, how do we actually do the
+comparison? There are several ways. One of
+
+165
+00:10:55,470 --> 00:10:59,940
+the simplest ways might be the Manhattan
+distance (I use the L1 and Manhattan
+
+166
+00:10:59,940 --> 00:11:01,180
+distance
+
+167
+00:11:01,179 --> 00:11:04,429
+terms interchangeably). Simply, what it
+does is: you have a test image you're
+
+168
+00:11:04,429 --> 00:11:07,639
+interested in classifying, and consider
+one single training image that we want
+
+169
+00:11:07,639 --> 00:11:11,919
+to compare this image to. What we'll
+do is we'll element-wise compare all
+
+170
+00:11:11,919 --> 00:11:15,959
+the pixel values, so we'll form the
+absolute value differences, and then we
+
+171
+00:11:15,960 --> 00:11:20,040
+just add all that up. So we just look
+at every single position, subtract
+
+172
+00:11:20,039 --> 00:11:24,139
+them off, see what the differences are
+at every spatial position, add it
+
+173
+00:11:24,139 --> 00:11:30,169
+all up, and that's our similarity. So
+these two images are 456 apart, and
+
+174
+00:11:30,169 --> 00:11:33,809
+we'd get a zero if we had identical
+images. Here, just to show you code
+
+175
+00:11:33,809 --> 00:11:36,959
+specifically, the way this would look:
+this is a full implementation of a
+
+176
+00:11:36,960 --> 00:11:42,930
+nearest neighbor classifier, with the actual
+bodies of the two methods filled in.
+
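A minimal sketch in the spirit of the implementation being described, to make the structure easy to follow; this is an illustrative reconstruction, not the exact slide code. Training just memorizes the data, and prediction labels each test image by its L1-closest training image:

~~~python
import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # X is N x D, one flattened image per row; y holds the N labels.
        # "Training" is nothing more than remembering the data.
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        num_test = X.shape[0]
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
        for i in range(num_test):
            # One vectorized line: L1 distance to every training image.
            distances = np.sum(np.abs(self.Xtr - X[i]), axis=1)
            min_index = np.argmin(distances)  # index of the nearest training image
            Ypred[i] = self.ytr[min_index]    # transfer its label over
        return Ypred
~~~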
+177
+00:11:42,929 --> 00:11:46,799
+What we do here at training
+time is we're given this
+
+178
+00:11:46,799 --> 00:11:52,709
+dataset X and y, where y usually denotes the
+labels. So given images and labels, all
+
+179
+00:11:52,710 --> 00:11:56,530
+we do is just assign them to the class
+instance; we just remember the
+
+180
+00:11:56,529 --> 00:12:01,439
+data, nothing is being done. At predict
+time, though, what we're doing is
+
+181
+00:12:01,440 --> 00:12:06,080
+we're getting a new test set of images X,
+and I'm not going to go through the full
+
+182
+00:12:06,080 --> 00:12:09,320
+details, but you can see there's a for
+loop over every single test image;
+
+183
+00:12:09,320 --> 00:12:13,020
+independently, we're getting the
+distances to every single training image,
+
+184
+00:12:13,019 --> 00:12:18,360
+and notice that that's only a single
+line of vectorized Python code. So in
+
+185
+00:12:18,360 --> 00:12:21,750
+a single line of code we're comparing
+that test image to every single training
+
+186
+00:12:21,750 --> 00:12:26,370
+image in the database, computing the
+distance from the previous slide, and so
+
+187
+00:12:26,370 --> 00:12:30,720
+on. So that's very concise code: we didn't
+have to expand all those for loops that
+
+188
+00:12:30,720 --> 00:12:35,860
+are involved in processing these images. And
+then we compute the instance that is
+
+189
+00:12:35,860 --> 00:12:40,659
+closest, so we're getting the min index,
+the index of the training image that has
+
+190
+00:12:40,659 --> 00:12:45,719
+the lowest distance, and then we're just
+predicting, for this image, the label of
+
+191
+00:12:45,720 --> 00:12:51,210
+whatever was closest. So here's a question for
+you: in terms of the nearest neighbor classifier,
+
+192
+00:12:51,210 --> 00:12:56,639
+how does its speed depend on the
+training data size? What happens as we
+
+193
+00:12:56,639 --> 00:13:02,779
+scale up the training data?
+It gets slower.
+
+194
+00:13:02,779 --> 00:13:07,789
+Yes, it's actually really
+slow, right? Because I just have
+
+195
+00:13:07,789 --> 00:13:12,129
+to compare to every single training sample
+independently, so it slows down linearly.
+
+196
+00:13:12,129 --> 00:13:16,370
+And actually, as we go through the
+class, you'll see that this is backwards,
+
+197
+00:13:16,370 --> 00:13:19,590
+because what we really care about in
+the most practical applications is
+
+198
+00:13:19,590 --> 00:13:23,330
+the test-time performance of these
+classifiers. That means that we want the
+
+199
+00:13:23,330 --> 00:13:27,240
+classifier to be very efficient at test
+time, and so there's a tradeoff between
+
+200
+00:13:27,240 --> 00:13:30,419
+how much compute we put into the train
+method and how much we put into the
+
+201
+00:13:30,419 --> 00:13:35,240
+predict method. Nearest neighbor is instant to
+train, but then it's expensive at test time, and as we'll
+
+202
+00:13:35,240 --> 00:13:38,570
+see soon, convnets actually flip this
+completely the other way around:
+
+203
+00:13:38,570 --> 00:13:41,510
+we'll see that we do a huge amount of
+compute at train time when we're training
+
+204
+00:13:41,509 --> 00:13:45,409
+a convolutional network, but test-time
+performance will be super efficient; in fact it will
+
+205
+00:13:45,409 --> 00:13:49,589
+be a constant amount of compute for every
+single test image, a constant
+
+206
+00:13:49,590 --> 00:13:53,149
+amount of computation no matter if you
+have millions, billions, or trillions of
+
+207
+00:13:53,149 --> 00:13:57,669
+training images. I'd like to have a
+trillion trillion trillion training images:
+
+208
+00:13:57,669 --> 00:14:01,579
+no matter how large your training set, we'll
+do a constant amount of computation to
+
+209
+00:14:01,580 --> 00:14:05,250
+classify any single test sample. So
+that's very nice, practically speaking.
+
+210
+00:14:05,250 --> 00:14:10,370
+Now, I'll just point out that
+there are ways of speeding up nearest neighbor
+
+211
+00:14:10,370 --> 00:14:13,669
+classifiers: there are these approximate
+nearest neighbor methods. FLANN is an
+
+212
+00:14:13,669 --> 00:14:16,879
+example library that people use in
+practice that allows you to speed up
+
+213
+00:14:16,879 --> 00:14:22,909
+this process of nearest-neighbor
+matching, but that's just a side note. OK,
+
+214
+00:14:22,909 --> 00:14:27,490
+so let's go back to the design of the
+classifier. We saw that we've defined
+
+215
+00:14:27,490 --> 00:14:32,200
+this distance, and I arbitrarily chose
+to show you the Manhattan distance, which
+
+216
+00:14:32,200 --> 00:14:35,720
+compares the absolute values of the
+differences. There are in fact many ways you can
+
+217
+00:14:35,720 --> 00:14:38,879
+formulate a distance metric, and so
+there are many different choices of
+
+218
+00:14:38,879 --> 00:14:42,700
+exactly how we do this comparison.
+Another choice that people
+
+219
+00:14:42,700 --> 00:14:46,000
+like to use in practice is what we call
+the Euclidean, or L2, distance, which
+
+220
+00:14:46,000 --> 00:14:49,850
+instead sums up the squares
+of these differences
+
+221
+00:14:49,850 --> 00:14:55,690
+between images. And so this choice...
+
+222
+00:14:55,690 --> 00:15:02,730
+(a question from the back)
+
+223
+00:15:02,730 --> 00:15:07,850
+OK, so this choice of how exactly to
+compute the distance is a discrete choice
+
+224
+00:15:07,850 --> 00:15:11,769
+that we have control over; it's something
+we call a hyperparameter. It's not really
+
+225
+00:15:11,769 --> 00:15:14,990
+obvious how you set it; it's a hyper-
+parameter we have to decide later on
+
+226
+00:15:14,990 --> 00:15:19,120
+exactly how to set. Another sort
+of hyperparameter we'll talk
+
+227
+00:15:19,120 --> 00:15:22,828
+about in the context of this classifier is
+when we generalize nearest neighbor to
+
+228
+00:15:22,828 --> 00:15:26,159
+what we call a k-nearest-neighbor
+classifier. So in a k-nearest-neighbor
+
+229
+00:15:26,159 --> 00:15:29,328
+classifier, instead of retrieving for
+every test image the single nearest
+
+230
+00:15:29,328 --> 00:15:33,958
+training example, we will in fact retrieve
+several examples and we'll have them do a
+
+231
+00:15:33,958 --> 00:15:37,069
+majority vote over the closest ones to
+actually classify every test instance.
+
+232
+00:15:37,070 --> 00:15:41,829
+So say for a 5-nearest-neighbor classifier we
+would be retrieving the five most similar images in the
+
+233
+00:15:41,828 --> 00:15:45,528
+training data and doing a majority vote
+over their labels. Here's a simple
+
+234
+00:15:45,528 --> 00:15:48,970
+two-dimensional dataset to illustrate
+the point. Here we have a three-class
+
+235
+00:15:48,970 --> 00:15:53,430
+dataset in 2D, and here I am drawing
+what we call the decision regions of a
+
+236
+00:15:53,429 --> 00:15:57,429
+nearest neighbor classifier. What this
+refers to is: we've trained on the points
+
+237
+00:15:57,429 --> 00:16:02,838
+over there, and we're coloring the entire
+2D plane by what class this nearest
+
+238
+00:16:02,839 --> 00:16:05,430
+neighbor classifier would assign
+to every single point.
+
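A sketch of the k-nearest-neighbor vote just introduced, assuming integer class labels and the L1 distance; `k` is exactly the hyperparameter under discussion:

~~~python
import numpy as np

def knn_predict_one(Xtr, ytr, x, k=5):
    # L1 distances from the test image x to every training image.
    distances = np.sum(np.abs(Xtr - x), axis=1)
    nearest = np.argsort(distances)[:k]  # indices of the k closest training images
    # Majority vote over the labels of the k nearest neighbors
    # (assumes labels are non-negative integers).
    counts = np.bincount(ytr[nearest])
    return np.argmax(counts)
~~~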
+239
+00:16:05,429 --> 00:16:08,698
+Suppose you had a test example somewhere
+here: I'm just saying that this would
+
+240
+00:16:08,698 --> 00:16:12,549
+have been classified as the blue class based
+on the nearest neighbor. You can also
+
+241
+00:16:12,549 --> 00:16:16,708
+note that here there's a
+green point inside the blue cluster, and
+
+242
+00:16:16,708 --> 00:16:19,708
+it has its own little region of class
+where it would have classified a lot of
+
+243
+00:16:19,708 --> 00:16:23,750
+test points around it as green, because
+if anything fell there, then that
+
+244
+00:16:23,750 --> 00:16:27,879
+green point is the nearest neighbor. Now
+when you move to higher numbers for k,
+
+245
+00:16:27,879 --> 00:16:30,809
+such as a 5-nearest-neighbor classifier,
+what you find is that the boundaries
+
+246
+00:16:30,809 --> 00:16:36,619
+start to smooth out. It's a kind of nice
+effect, where even though there's just one
+
+247
+00:16:36,620 --> 00:16:37,339
+point,
+
+248
+00:16:37,339 --> 00:16:41,550
+kind of randomly, as noise, an outlier
+in the blue cluster, it's actually not
+
+249
+00:16:41,549 --> 00:16:44,539
+influencing the predictions too much,
+because we always retrieve five
+
+250
+00:16:44,539 --> 00:16:49,679
+nearest neighbors, and so they get to
+overwhelm the green point. So in practice
+
+251
+00:16:49,679 --> 00:16:53,088
+you'll find that usually k-nearest-neighbor
+classifiers offer better
+
+252
+00:16:53,089 --> 00:16:58,180
+performance at test time. But again, the
+choice of k is a hyperparameter,
+
+253
+00:16:58,179 --> 00:17:03,088
+right? So I'll come back to this in a bit.
+Just to show you an example of what this looks
+
+254
+00:17:03,089 --> 00:17:06,169
+like: here I'm retrieving the ten most
+similar examples, they're ranked by their
+
+255
+00:17:06,169 --> 00:17:08,939
+distance, and I would do a
+majority vote over these training
+
+256
+00:17:08,939 --> 00:17:13,089
+examples here to classify every
+test example here.
+
+257
+00:17:13,088 --> 00:17:20,649
+OK, so let's do a bit of questions here.
+Consider: what is the accuracy of
+
+258
+00:17:20,650 --> 00:17:24,259
+the nearest neighbor classifier on the
+training data, when we're using Euclidean
+
+259
+00:17:24,259 --> 00:17:29,700
+distance? So suppose our test set is
+exactly the training data, and we're
+
+260
+00:17:29,700 --> 00:17:32,580
+trying to find the accuracy; in other
+words, how often would we get
+
+261
+00:17:32,579 --> 00:17:34,750
+the correct answer?
+
+262
+00:17:34,750 --> 00:17:44,808
+A hundred percent, good. OK... yeah,
+that's correct: we'll always find
+
+263
+00:17:44,808 --> 00:17:48,450
+a training example exactly on top of that
+test image, which has zero distance, and
+
+264
+00:17:48,450 --> 00:17:52,870
+then its label will be transferred over.
+What if we're using the Manhattan
+
+265
+00:17:52,869 --> 00:18:00,949
+distance?
+
+266
+00:18:00,950 --> 00:18:04,680
+The Manhattan distance doesn't use a sum of
+squares, it uses absolute values of
+
+267
+00:18:04,680 --> 00:18:12,110
+differences, but it's the same
+idea: the answer would still be a
+
+268
+00:18:12,109 --> 00:18:14,169
+hundred percent. OK, keep paying
+
+269
+00:18:14,170 --> 00:18:18,820
+attention: what is the accuracy of a k-nearest-
+neighbor classifier on the training data,
+
+270
+00:18:18,819 --> 00:18:25,339
+say with k equal to 5? Is it a hundred
+percent? Not necessarily, because
+
+271
+00:18:25,339 --> 00:18:29,230
+basically the points around you could
+overwhelm you, even when your closest
+
+272
+00:18:29,230 --> 00:18:35,269
+example is actually of the correct class.
+
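For reference, the two distance choices compared in the questions above, written out as a minimal sketch; note that the square root in L2 is monotonic, so dropping it would not change which neighbor is closest:

~~~python
import numpy as np

def l1_distances(Xtr, x):
    # Manhattan / L1: sum of absolute pixel differences.
    return np.sum(np.abs(Xtr - x), axis=1)

def l2_distances(Xtr, x):
    # Euclidean / L2: square root of the sum of squared differences.
    return np.sqrt(np.sum((Xtr - x) ** 2, axis=1))
~~~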
+273
+00:18:35,269 --> 00:18:39,740
+OK, so we've discussed two choices of hyper-
+parameters: the distance metric, and k. We're not sure how
+
+274
+00:18:39,740 --> 00:18:45,160
+to set them: should k be 1, 3, 10, and so on?
+We're not exactly sure how to set these;
+
+275
+00:18:45,160 --> 00:18:48,750
+in fact they're problem dependent. You'll
+find that you can't find a consistently
+
+276
+00:18:48,750 --> 00:18:52,250
+best choice for these hyperparameters; in
+some applications one k might look
+
+277
+00:18:52,250 --> 00:18:56,930
+better than in other applications. So we're
+not really sure how to set this. So
+
+278
+00:18:56,930 --> 00:19:00,799
+here's an idea: we basically have to try
+out lots of different hyperparameters. So what I'm
+
+279
+00:19:00,799 --> 00:19:05,649
+going to do is take my training
+data and then try out lots
+
+280
+00:19:05,650 --> 00:19:11,550
+of different parameters: I might just go and
+try out k equals 1, 2, 3, 4, 5, 6, up to 100, and I
+
+281
+00:19:11,549 --> 00:19:14,529
+try all the different metrics, and
+whatever works best, that's what I'll
+
+282
+00:19:14,529 --> 00:19:26,670
+take. So: will that work very well?
+Why is it not a good idea? OK,
+
+283
+00:19:26,670 --> 00:19:36,170
+so basically, yes: the test
+data is your proxy for the
+
+284
+00:19:36,170 --> 00:19:40,039
+generalization of your algorithm. You
+should not touch the test data; in
+
+285
+00:19:40,039 --> 00:19:43,509
+fact you should forget that you ever
+had test data. So whenever you're given
+
+286
+00:19:43,509 --> 00:19:46,079
+a dataset, always set aside the
+test data and pretend you don't have it.
+
+287
+00:19:46,079 --> 00:19:50,129
+It's telling you how well your algorithm
+generalizes to unseen data points, and that
+
+288
+00:19:50,130 --> 00:19:52,730
+is important because you're trying to
+develop your algorithm and then you're
+
+289
+00:19:52,730 --> 00:19:56,120
+hoping to eventually deploy it in some
+setting, and you'd like an understanding of
+
+290
+00:19:56,119 --> 00:20:01,159
+exactly how well you should expect it to
+work in practice, right? And so you'll see
+
+291
+00:20:01,160 --> 00:20:03,830
+that, for example, sometimes you can
+perform very well on training data but
+
+292
+00:20:03,829 --> 00:20:05,579
+not generalize very well to test data when
+
+293
+00:20:05,579 --> 00:20:08,659
+you're overfitting. Most of this is
+covered by CS229, the requirement for
+
+294
+00:20:08,660 --> 00:20:11,750
+this class, so you should be quite
+familiar with these ideas to some
+
+295
+00:20:11,750 --> 00:20:16,519
+extent; this is more of a
+review for you. But basically, this test
+
+296
+00:20:16,519 --> 00:20:20,940
+data is used very sparingly; forget that
+you have it. Instead, what we do is we
+
+297
+00:20:20,940 --> 00:20:25,930
+separate our training data into what we
+call folds. So say we use
+
+298
+00:20:25,930 --> 00:20:29,900
+five-fold validation: we use twenty
+percent of the training data as an
+
+299
+00:20:29,900 --> 00:20:35,120
+imagined test set, the validation data, and
+then we only train on part of the data and we test our
+
+300
+00:20:35,119 --> 00:20:39,279
+choices of hyperparameters on this
+validation set. So I'm going to
+
+301
+00:20:39,279 --> 00:20:42,569
+train on my four folds and try out
+different ks and all the different
+
+302
+00:20:42,569 --> 00:20:45,329
+metrics and whatever else; if you're
+using approximate nearest neighbor there are yet
+
+303
+00:20:45,329 --> 00:20:48,750
+many other choices. You try them out and see
+what works best on that validation data.
+
+304
+00:20:48,750 --> 00:20:51,859
+If you're feeling uncomfortable because
+you have very few training data points,
+
+305
+00:20:51,859 --> 00:20:54,939
+people also sometimes use
+cross-validation, where you actually
+
+306
+00:20:54,940 --> 00:20:58,640
+iterate the choice of your
+validation fold across these choices:
+
+307
+00:20:58,640 --> 00:21:03,840
+so I'll first use folds 1 to 4 for my
+training and try out on fold 5, and then I
+
+308
+00:21:03,839 --> 00:21:07,519
+cycle the choice of the validation
+fold across all the five choices, and I
+
+309
+00:21:07,519 --> 00:21:11,789
+look at what works best across all the
+possible choices of my validation fold, and
+
+310
+00:21:11,789 --> 00:21:14,839
+then I just take whatever works best
+across all the possible scenarios.
+
+311
+00:21:14,839 --> 00:21:19,039
+That's five-fold cross-
+validation. In practice, the
+
+312
+00:21:19,039 --> 00:21:21,769
+way this would look when we cross-
+validate for k for a nearest neighbor
+
+313
+00:21:21,769 --> 00:21:26,049
+classifier is: we are trying out
+different values of k, and this is our
+
+314
+00:21:26,049 --> 00:21:31,690
+performance across five choices of the
+fold. So you can see that for every
+
+315
+00:21:31,690 --> 00:21:35,759
+single k we have five data points
+there, and then this is the accuracy, so
+
+316
+00:21:35,759 --> 00:21:40,240
+high is good, and I'm plotting a line
+through the mean, and also showing bars for
+
+317
+00:21:40,240 --> 00:21:44,190
+the standard deviations. So what we see here
+is that the performance goes up
+
+318
+00:21:44,190 --> 00:21:49,240
+across these folds as you increase k, but at
+some point it starts to decay. So for this
+
+319
+00:21:49,240 --> 00:21:53,460
+particular dataset it seems that k equal
+to 7 is the best choice. So that's what
+
+320
+00:21:53,460 --> 00:21:58,440
+I'll do for all my hyperparameters, also
+for the metric and so on: I do my
+
+321
+00:21:58,440 --> 00:22:03,650
+cross-validation, I pick the best hyperparameters,
+I fix them, evaluate a single time on the test
+
+322
+00:22:03,650 --> 00:22:07,800
+set, and whatever number I get, that's
+what I report as the accuracy of a
+
+323
+00:22:07,799 --> 00:22:11,490
+k-nearest-neighbor classifier on this dataset.
+That's what goes into a paper, that's
+
+324
+00:22:11,490 --> 00:22:15,539
+what goes into your final report, as
+the final generalization result of
+
+325
+00:22:15,539 --> 00:22:16,519
+what you've done.
+
+326
+00:22:16,519 --> 00:22:36,048
+Any questions about this? ... Basically it's
+about the statistics of the distribution
+
+327
+00:22:36,048 --> 00:22:42,378
+of these data points in your label
+space, and so sometimes it's hard to
+
+328
+00:22:42,378 --> 00:22:47,769
+say, but in this picture
+you see roughly what's happening as you
+
+329
+00:22:47,769 --> 00:22:52,209
+go to larger k, and it just
+depends on how clean your data
+
+330
+00:22:52,209 --> 00:22:55,129
+is; that's really what it comes down
+to: how
+
+331
+00:22:55,128 --> 00:23:01,569
+noisy is it, or how specific is it. I know
+that's a very hand-wavy answer, but that's
+
+332
+00:23:01,569 --> 00:23:04,769
+roughly what it comes down to. So
+different datasets will have different
+
+333
+00:23:04,769 --> 00:23:27,230
+cleanliness, right?
+
+334
+00:23:27,230 --> 00:23:31,769
+Because...
+
+335
+00:23:31,769 --> 00:23:37,308
+different datasets will require
+different choices; you need to
+
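A sketch of the five-fold procedure just described, reusing the hypothetical `knn_predict_one` helper sketched earlier; for each candidate k, the validation accuracy is averaged over the five held-out folds:

~~~python
import numpy as np

def cross_validate_k(Xtr, ytr, k_choices, num_folds=5):
    X_folds = np.array_split(Xtr, num_folds)
    y_folds = np.array_split(ytr, num_folds)
    results = {}
    for k in k_choices:
        accs = []
        for f in range(num_folds):
            # Fold f is held out for validation; the rest is training data.
            X_val, y_val = X_folds[f], y_folds[f]
            X_train = np.concatenate(X_folds[:f] + X_folds[f+1:])
            y_train = np.concatenate(y_folds[:f] + y_folds[f+1:])
            preds = np.array([knn_predict_one(X_train, y_train, x, k) for x in X_val])
            accs.append(np.mean(preds == y_val))
        results[k] = np.mean(accs)  # mean validation accuracy for this k
    return results  # pick the k with the highest mean accuracy
~~~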
+
+336
+00:23:37,308 --> 00:23:40,629
+You need to see what works best by actually
+trying out different algorithms:
+
+337
+00:23:40,630 --> 00:23:43,580
+you're not sure what's going to work
+best on your data, so the choice of
+
+338
+00:23:43,579 --> 00:23:47,699
+algorithm is also kind of like a hyperparameter;
+you're just not sure what works.
+
+339
+00:23:47,700 --> 00:23:52,019
+Different approaches will give different
+
+340
+00:23:52,019 --> 00:23:55,190
+generalization boundaries, they look
+different, and some datasets have
+
+341
+00:23:55,190 --> 00:23:58,330
+different structure than others; some things
+work better than others,
+
+342
+00:23:58,329 --> 00:24:05,298
+so you just try it out. OK. I'd just like to
+point out that K-nearest-neighbor
+
+343
+00:24:05,298 --> 00:24:09,389
+is something basically no one uses; I'm only
+going through this to get you used
+
+344
+00:24:09,390 --> 00:24:12,480
+to this approach of how this
+works, with training and test splits and so
+
+345
+00:24:12,480 --> 00:24:13,450
+on.
+
+346
+00:24:13,450 --> 00:24:17,610
+The reason this is never used is because,
+first of all, it's very inefficient, but,
+
+347
+00:24:17,609 --> 00:24:21,139
+second of all, these are distance metrics on
+raw images, which are very high-dimensional
+
+348
+00:24:21,140 --> 00:24:28,179
+objects, and they act in very unnatural and
+unintuitive ways. What I've done here is taken an
+
+349
+00:24:28,179 --> 00:24:32,370
+original image and I've changed it in three
+different ways, but all these three
+
+350
+00:24:32,369 --> 00:24:37,168
+different images here actually have the
+exact same distance to this one in an L-
+
+351
+00:24:37,169 --> 00:24:42,100
+2, Euclidean, sense. So just think about
+this: this one here is slightly shifted to the
+
+352
+00:24:42,099 --> 00:24:46,359
+left, it's shifted slightly, and the distances
+here come out completely different, because
+
+353
+00:24:46,359 --> 00:24:49,329
+the pixels are not matching up exactly,
+and that's introducing all these
+
+354
+00:24:49,329 --> 00:24:53,109
+errors into your distance. This one
+is slightly darkened, so you get a small
+
+355
+00:24:53,109 --> 00:24:57,629
+delta across all spatial locations, and
+this one is untouched, zero distance errors,
+
+356
+00:24:57,630 --> 00:25:01,650
+everywhere except in those
+positions over there, where we've taken
+
+357
+00:25:01,650 --> 00:25:05,900
+out critical pieces of the image, and
+yet the nearest neighbor classifier
+
+358
+00:25:05,900 --> 00:25:08,030
+will not really be able to tell the
+difference between these settings,
+
+359
+00:25:08,029 --> 00:25:11,230
+because it's based on these distances
+that don't really work very well in this
+
+360
+00:25:11,230 --> 00:25:16,009
+case. So very unintuitive things happen
+when you try to throw distances on very
+
+361
+00:25:16,009 --> 00:25:21,349
+high-dimensional objects; that's partly
+why we don't use this. So, in summary so far:
+
+362
+00:25:21,349 --> 00:25:26,230
+we're looking at image classification;
+we saw the specific case of the
+
+363
+00:25:26,230 --> 00:25:29,679
+nearest neighbor classifier, and the idea of
+
+364
+00:25:29,679 --> 00:25:33,110
+having different splits of your data, and
+we have these hyperparameters that
+
+365
+00:25:33,109 --> 00:25:37,240
+we'll need to pick, and we use cross-
+validation for this. Usually, most of the
+
+366
+00:25:37,240 --> 00:25:39,909
+time, people don't actually do the entire
+cross-validation; they just have a single
+
+367
+00:25:39,909 --> 00:25:40,519
+validation split.
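The claim that L2 distances behave unintuitively on raw pixels is easy to poke at with numpy. The image below is a random stand-in rather than the lecture's actual examples, but it reproduces the failure mode: a one-pixel shift, visually almost identical, can produce a larger L2 distance than blanking out an entire region, which is a drastic semantic change.

~~~python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32))                 # stand-in grayscale image

shifted = np.roll(img, 1, axis=1)          # shift the whole image by one pixel
darkened = np.clip(img - 0.1, 0.0, 1.0)    # uniformly darken a little
occluded = img.copy()
occluded[12:20, 12:20] = 0.0               # blank out a central patch

l2 = lambda a, b: np.sqrt(((a - b) ** 2).sum())
for name, mod in [("shifted", shifted), ("darkened", darkened), ("occluded", occluded)]:
    print(f"{name:8s} L2 distance: {l2(img, mod):.2f}")
~~~

On this random image the shift yields the largest distance of the three even though it changes the content the least, which is the lecturer's point about distances on high-dimensional objects.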
+
+368
+00:25:40,519 --> 00:25:43,778
+They try out, on that validation set,
+whatever works best in terms of the hyper-
+
+369
+00:25:43,778 --> 00:25:47,999
+parameters, and once you get the best
+hyperparameters, you evaluate a single
+
+370
+00:25:47,999 --> 00:25:54,569
+time on the test set. So, I'm going to go into
+linear classification, but any questions at
+
+371
+00:25:54,569 --> 00:26:04,229
+this point? ... OK, great. We're going to
+look at linear classification. This is the
+
+372
+00:26:04,229 --> 00:26:07,649
+point where we're starting to work
+towards convolutional networks. It'll be a
+
+373
+00:26:07,648 --> 00:26:11,148
+series of lectures on linear
+classification that will build up to an
+
+374
+00:26:11,148 --> 00:26:15,888
+entire convolutional network analyzing an
+image. I'd just like to say that I motivated
+
+375
+00:26:15,888 --> 00:26:20,178
+the class yesterday from a task-specific
+view: this class is a computer vision class,
+
+376
+00:26:20,179 --> 00:26:25,489
+interested in giving machines sight. Another
+way to motivate this class is
+
+377
+00:26:25,489 --> 00:26:29,409
+from a model-based point of view, in the
+sense that we're teaching you guys
+
+378
+00:26:29,409 --> 00:26:34,339
+about neural networks. These are wonderful
+algorithms
+
+379
+00:26:34,338 --> 00:26:38,178
+that you can apply to many different
+domains, not just vision in particular. Over
+
+380
+00:26:38,179 --> 00:26:42,469
+the last few years we saw that neural
+networks can not only see, which is what
+
+381
+00:26:42,469 --> 00:26:46,479
+you'll learn a lot about in this class, but
+they can also hear: they're used quite a bit in
+
+382
+00:26:46,479 --> 00:26:50,828
+speech recognition now, so when you talk
+to your phone, those are neural networks. They can
+
+383
+00:26:50,828 --> 00:26:56,678
+also do machine translation: here you
+are feeding a neural network a set of
+
+384
+00:26:56,679 --> 00:27:00,700
+words, one by one, in English, and the
+neural network produces the translation
+
+385
+00:27:00,700 --> 00:27:05,328
+in French or whatever other target
+language you have. They can also perform control, so
+
+386
+00:27:05,328 --> 00:27:09,308
+we've seen neural network applications in
+robotic manipulation,
+
+387
+00:27:09,308 --> 00:27:14,209
+and in playing Atari games, where a network
+learned how to play the games just by seeing
+
+388
+00:27:14,209 --> 00:27:18,089
+the raw pixels of the screen. So they
+seem to be very successful in a
+
+389
+00:27:18,088 --> 00:27:23,878
+variety of domains, and even more than that
+here and there, and we're uncertain exactly
+
+390
+00:27:23,878 --> 00:27:27,988
+where this will take us. And I'd
+like to also say that we're exploring
+
+391
+00:27:27,989 --> 00:27:31,749
+ways for networks to think; maybe that's
+very hand-wavy and just wishful thinking,
+
+392
+00:27:31,749 --> 00:27:35,700
+but there are some hints that maybe they
+can do that as well.
+
+393
+00:27:35,700 --> 00:27:39,479
+Neural networks are very nice because
+they're just fun, modular things to play
+
+394
+00:27:39,479 --> 00:27:42,450
+with. When I think about working with
+neural networks, this picture kind of
+
+395
+00:27:42,450 --> 00:27:46,548
+comes to mind for me: here we have a
+neural networks practitioner, and she's
+
+396
+00:27:46,548 --> 00:27:51,519
+building what looks to be a roughly 10-
+layer network at this point.
+
+397
+00:27:51,519 --> 00:27:55,269
+It's very fun. Really, the best way to
+think about playing with networks is
+
+398
+00:27:55,269 --> 00:27:58,619
+like Lego blocks: you'll see that we're
+building these little function pieces
+
+399
+00:27:58,619 --> 00:28:02,579
+that look like Lego blocks, that we can stack
+together to create entire architectures, and they
+
+400
+00:28:02,579 --> 00:28:06,309
+very easily talk to each other. So we
+can just create these modules, stack them
+
+401
+00:28:06,309 --> 00:28:11,519
+together, and play with this
+very easily. One work that I think
+
+402
+00:28:11,519 --> 00:28:16,039
+exemplifies this is my own work on image
+captioning from roughly a year ago. So
+
+403
+00:28:16,039 --> 00:28:20,289
+here the task was to take an image,
+and you're trying to get a network to
+
+404
+00:28:20,289 --> 00:28:23,639
+produce a sentence description of the
+image. So, for example, in the top left, these
+
+405
+00:28:23,640 --> 00:28:27,810
+are test-set results; it would say that this
+is a man in a black shirt playing guitar,
+
+406
+00:28:27,809 --> 00:28:32,480
+or a construction worker in an orange safety
+vest working on the road, and so on. So
+
+407
+00:28:32,480 --> 00:28:36,670
+it can look at the image and create
+this description for every single image,
+
+408
+00:28:36,670 --> 00:28:41,100
+and when you go into the details of this
+model, the way this works is: we're taking
+
+409
+00:28:41,099 --> 00:28:45,079
+a convolutional neural network, which
+we know. So there are two modules here in
+
+410
+00:28:45,079 --> 00:28:49,480
+this system diagram for the image captioning
+model: we take a convolutional neural
+
+411
+00:28:49,480 --> 00:28:52,880
+network, which we know can see, and we're
+taking a recurrent neural network, which
+
+412
+00:28:52,880 --> 00:28:56,150
+we know is very good at modeling
+sequences, in this case sequences of
+
+413
+00:28:56,150 --> 00:28:59,720
+words that will be describing the image,
+and then, just as if we were playing with
+
+414
+00:28:59,720 --> 00:29:02,930
+LEGOs, we take those two pieces and we
+stick them together. That corresponds to
+
+415
+00:29:02,930 --> 00:29:06,560
+this arrow here, in between the two
+modules, and these networks learn to
+
+416
+00:29:06,559 --> 00:29:10,639
+talk to each other, and in the process of
+trying to describe the images, the
+
+417
+00:29:10,640 --> 00:29:13,110
+gradients will be flowing through the
+convolutional network, so the
+
+418
+00:29:13,109 --> 00:29:16,689
+vision system will be adjusting itself to
+better see the images in order to
+
+419
+00:29:16,690 --> 00:29:20,200
+describe them at the end, and so this
+whole system will work together as one.
+
+420
+00:29:20,200 --> 00:29:24,920
+So we'll be working towards this model;
+we'll actually cover it in this class. You'll
+
+421
+00:29:24,920 --> 00:29:28,279
+have a full understanding of exactly
+both this part and this part by about
+
+422
+00:29:28,279 --> 00:29:31,849
+halfway through the course, roughly, and
+you'll see how that image captioning model
+
+423
+00:29:31,849 --> 00:29:34,909
+works. But that's just a motivation for
+what we're really building up to, and
+
+424
+00:29:34,910 --> 00:29:40,290
+they're really nice models to work
+with. OK, but for now, back to CIFAR-10 and
+
+425
+00:29:40,289 --> 00:29:43,159
+linear classification.
+
+426
+00:29:43,160 --> 00:29:47,930
+Just to remind you: we're working with this
+dataset, CIFAR-10, with just ten labels, and we're
+
+427
+00:29:47,930 --> 00:29:50,960
+going to approach linear classification
+from what we call a parametric approach.
+
+428
+00:29:50,960 --> 00:29:55,079
+If you remember, what we just discussed
+is an instance of what we call a
+
+429
+00:29:55,079 --> 00:29:57,439
+nonparametric approach: there are no
+parameters that we're going to be
+
+430
+00:29:57,440 --> 00:30:02,430
+optimizing over. This distinction will
+become clearer in a moment. It's also
+
+431
+00:30:02,430 --> 00:30:04,240
+apparent that what we're doing here is we're
+
+432
+00:30:04,240 --> 00:30:09,089
+thinking about constructing a function
+that takes an image and produces the
+
+433
+00:30:09,089 --> 00:30:12,769
+scores for the classes, right? This is what we
+want to do: we want to take any image
+
+434
+00:30:12,769 --> 00:30:17,109
+and we'd like to figure out which one of
+the ten classes it is. So we'd like to write
+
+435
+00:30:17,109 --> 00:30:21,169
+down the function, an expression, that
+takes an image and gives you those ten
+
+436
+00:30:21,170 --> 00:30:24,529
+numbers. But the expression is not only a
+function of that image: critically,
+
+437
+00:30:24,529 --> 00:30:28,339
+it'll also be a function of these
+parameters that are called W, sometimes
+
+438
+00:30:28,339 --> 00:30:33,189
+also called the weights. So really it's a
+function that goes from 3072 numbers,
+
+439
+00:30:33,190 --> 00:30:37,308
+which make up this image, to 10 numbers.
+That's what we're doing: we're defining a
+
+440
+00:30:37,308 --> 00:30:42,049
+function, and we'll go through several
+choices of this function. In the
+
+441
+00:30:42,049 --> 00:30:45,589
+first case we'll look at linear functions,
+then we'll extend that to neural networks,
+
+442
+00:30:45,589 --> 00:30:49,579
+and then we'll extend that to get
+convolutional networks. But intuitively, what
+
+443
+00:30:49,579 --> 00:30:53,379
+we're building up to is: what we'd
+like is, when we put this image through
+
+444
+00:30:53,380 --> 00:30:57,690
+our function, we'd like the 10 numbers
+that correspond to the scores of the 10
+
+445
+00:30:57,690 --> 00:31:01,150
+classes, we'd like the number that
+corresponds to the cat class to be high
+
+446
+00:31:01,150 --> 00:31:06,330
+and all the other numbers to be low.
+We don't have a choice over X:
+
+447
+00:31:06,329 --> 00:31:11,428
+X is our image; that's given. We have a
+choice over W; we will be free to set
+
+448
+00:31:11,429 --> 00:31:15,179
+it to whatever we want, and we'll
+want to set it so that this function
+
+449
+00:31:15,179 --> 00:31:19,050
+gives us the correct answers for every
+single image in our training data. That's
+
+450
+00:31:19,049 --> 00:31:23,230
+roughly the approach we're building
+towards. Suppose that we use the simplest,
+
+451
+00:31:23,230 --> 00:31:29,789
+the simplest form, just a linear
+classifier here. So X is our image; in
+
+452
+00:31:29,789 --> 00:31:34,200
+this case what I'm doing is I'm taking this
+array, this image that makes up the cat,
+
+453
+00:31:34,200 --> 00:31:38,750
+and I'm stretching out all the
+pixels in that image into a giant column
+
+454
+00:31:38,750 --> 00:31:46,920
+vector. So here X is a column vector
+of 3072 numbers, and so if you know your
+
+455
+00:31:46,920 --> 00:31:52,100
+matrix-vector operations, which you
+should, that's a prerequisite for this
+
+456
+00:31:52,099 --> 00:31:55,149
+class, then this is just a matrix
+multiplication, which you should be familiar
+
+457
+00:31:55,150 --> 00:32:00,100
+with. Basically we're taking X, which
+is a 3072-dimensional column vector, we're
+
+458
+00:32:00,099 --> 00:32:03,569
+trying to get 10 numbers out, and it's a linear
+function, so you can go backwards
+
+459
+00:32:03,569 --> 00:32:08,399
+and figure out the dimensions of this W:
+they're basically 10 by 3072, so there are
+
+460
+00:32:08,400 --> 00:32:14,370
+30,720 numbers that go into W,
+and that's what we have control over.
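A minimal sketch of that bookkeeping, with random numbers standing in for a real CIFAR-10 image and for trained weights:

~~~python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))             # stand-in for a 32x32x3 CIFAR-10 image
x = image.reshape(3072)                     # stretch the pixels into a column of 3072 numbers
W = 0.01 * rng.standard_normal((10, 3072))  # 10 x 3072 = 30,720 weights, one row per class

scores = W @ x                              # f(x, W) = Wx: ten class scores
print(scores.shape)                         # (10,)
print(int(scores.argmax()))                 # index of the highest-scoring class
~~~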
+
+461
+00:32:14,369 --> 00:32:16,658
+That's what we have to tweak, to find
+what works.
+
+462
+00:32:16,659 --> 00:32:21,710
+So those are the parameters. One thing
+I'm leaving out in this particular case is that
+
+463
+00:32:21,710 --> 00:32:26,919
+there's also sometimes an appended
++b, so you have a bias. These biases are
+
+464
+00:32:26,919 --> 00:32:31,999
+again 10 more parameters, and we have
+to find those as well. So usually in a
+
+465
+00:32:31,999 --> 00:32:36,098
+linear classifier you have a W and a b, and we have to
+find exactly what works best, and this
+
+466
+00:32:36,098 --> 00:32:39,950
+b is not a function of the image;
+they're just independent weights on
+
+467
+00:32:39,950 --> 00:32:44,989
+how likely any one of those classes might
+be. To go back to your question: if you
+
+468
+00:32:44,989 --> 00:32:50,239
+have a very unbalanced dataset, so
+maybe you have mostly cats but some dogs,
+
+469
+00:32:50,239 --> 00:32:54,710
+or something like that, then you might
+expect that the bias for the
+
+470
+00:32:54,710 --> 00:32:58,200
+cat class might be slightly higher,
+because by default the classifier wants
+
+471
+00:32:58,200 --> 00:33:04,009
+to predict the cat class, unless something
+in the
+
+472
+00:33:04,009 --> 00:33:08,069
+image convinces it otherwise. To make
+this more concrete, I'd just like to
+
+473
+00:33:08,069 --> 00:33:11,398
+break it down, but of course I can't
+visualize it very explicitly with 3072
+
+474
+00:33:11,398 --> 00:33:17,459
+numbers. So imagine that our input image
+has only four pixels, and imagine those pixels are
+
+475
+00:33:17,460 --> 00:33:21,419
+also stretched out into the column X, and
+imagine that we have three classes, the
+
+476
+00:33:21,419 --> 00:33:27,109
+red, green, and blue classes, or a cat,
+a dog, and a ship class. So in this case W will
+
+477
+00:33:27,108 --> 00:33:30,868
+be only a three-by-four matrix, and what
+we're doing here is we're trying to
+
+478
+00:33:30,868 --> 00:33:36,398
+compute the scores as this matrix times X, so
+there's a matrix multiplication going on here
+
+479
+00:33:36,398 --> 00:33:40,608
+to give us the output of f, which is
+the scores: we get the three scores for the
+
+480
+00:33:40,608 --> 00:33:45,348
+three different classes. So this is a
+random setting of W, just random weights
+
+481
+00:33:45,348 --> 00:33:50,739
+here, and we'll get some scores. In
+particular, you can see that this
+
+482
+00:33:50,739 --> 00:33:55,639
+setting of W is not very good, right?
+Because with this setting of W, the cat
+
+483
+00:33:55,638 --> 00:34:00,449
+score of -96.8 is much less than any of the
+other classes', right? So this was not
+
+484
+00:34:00,450 --> 00:34:04,720
+correctly classified for this training
+image, so that's not a very good
+
+485
+00:34:04,720 --> 00:34:07,220
+classifier. So we want to use a
+different W;
+
+486
+00:34:07,220 --> 00:34:10,250
+we want a different W so that that
+score comes out higher than the other
+
+487
+00:34:10,250 --> 00:34:14,409
+ones, but we have to do that consistently
+across the entire training set of examples.
+
+488
+00:34:14,409 --> 00:34:20,389
+One thing to notice here as well
+is that, basically, W,
+
+489
+00:34:20,389 --> 00:34:25,700
+this function, is in parallel
+evaluating all the ten classifiers,
+
+490
+00:34:25,699 --> 00:34:28,230
+but really there are ten independent
+classifiers
+
+491
+00:34:28,230 --> 00:34:32,210
+to some extent here, and every one of
+these classifiers, say the cat
classifier, is just the first row of W here,
+right? The first row and the first
+
+493
+00:34:36,918 --> 00:34:41,789
+bias give you the cat score, and the dog
+classifier is the second row of W, and the
+
+494
+00:34:41,789 --> 00:34:46,840
+ship classifier the third row of W. So the W matrix
+has all these different classifiers
+
+495
+00:34:46,840 --> 00:34:50,889
+stacked in rows, and they're all being
+
+496
+00:34:50,889 --> 00:34:56,269
+dot-producted with the image to
+give you these scores. So here's a
+
+497
+00:34:56,269 --> 00:35:02,599
+question for you: what does a linear
+classifier do, in English? We saw the
+
+498
+00:35:02,599 --> 00:35:07,589
+functional form, it's doing this
+funny operation there, but how do we interpret, in English, what this
+
+499
+00:35:07,590 --> 00:35:28,640
+is doing?
+
+500
+00:35:28,639 --> 00:35:39,048
+[A student answers:] X being a high-dimensional data point,
+W is really putting planes through
+
+501
+00:35:39,048 --> 00:35:43,038
+the space. We'll come back to that
+interpretation of it. But, either way, can
+
+502
+00:35:43,039 --> 00:35:59,420
+we think about it this way, where every
+single one of these rows of W
+
+503
+00:35:59,420 --> 00:36:03,630
+effectively is like a template that
+we're matching up with the image, and the
+
+504
+00:36:03,630 --> 00:36:08,608
+dot product is really a way of, like,
+matching up, seeing what aligns.
+
+505
+00:36:08,608 --> 00:36:17,960
+What other ways?
+
+506
+00:36:17,960 --> 00:36:42,088
+[Another answer mentions spatial] positions, because what we can do is,
+at some of the spatial positions, if
+
+507
+00:36:42,088 --> 00:36:44,838
+we have zero weights, then the classifier
+
+508
+00:36:44,838 --> 00:36:50,329
+doesn't care what's in that part of the image;
+so with zero weights for this part here, nothing
+
+509
+00:36:50,329 --> 00:36:53,389
+affects it, but for some other parts of the
+image, if you have positive or negative
+
+510
+00:36:53,389 --> 00:36:58,118
+weights, something's going to happen there
+and contribute to the score. In other
+
+511
+00:36:58,119 --> 00:37:23,200
+words, it's a way of mapping image space to a label
+space.
+
+512
+00:37:23,199 --> 00:37:33,009
+[Question about color channels.] So this image is really
+a three-dimensional array, where we have
+
+513
+00:37:33,010 --> 00:37:37,369
+all these channels: you just stretch it
+all out, you stretch it out in
+
+514
+00:37:37,369 --> 00:37:41,849
+whatever way you like; say you stack the
+red, green, and blue portions side by side.
+
+515
+00:37:41,849 --> 00:37:46,030
+You stretch it out in whatever way
+you like, but in a consistent way across
+
+516
+00:37:46,030 --> 00:37:49,930
+all the images: you figure out a way to
+serialize it, in which way you want to read
+
+517
+00:37:49,929 --> 00:37:55,779
+off the pixels, to form the column.
+
+518
+00:37:55,780 --> 00:38:05,060
+OK, OK. So let's say we have a four-pixel
+grayscale image, which is a terrible
+
+519
+00:38:05,059 --> 00:38:09,420
+example, you might think. I don't want to
+confuse people, especially because
+
+520
+00:38:09,420 --> 00:38:12,539
+someone pointed out to me later, after I
+made this figure, that red, green, and blue
+
+521
+00:38:12,539 --> 00:38:15,150
+are usually color channels, but here red,
+green, and blue correspond to the classes.
+
+522
+00:38:15,150 --> 00:38:21,380
+This is a complete screw-up on my part,
+so I apologize: they're not color channels, just
+
+523
+00:38:21,380 --> 00:38:33,769
+three differently colored classes. Sorry
+about that. OK.
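The three-class, four-pixel toy example can be written out directly. The numbers below loosely follow the lecture's figure and should be treated as illustrative; what matters is the shapes: a 3x4 W, a 4-vector x, and a 3-vector b of biases.

~~~python
import numpy as np

x = np.array([56.0, 231.0, 24.0, 2.0])     # four stretched-out pixel values
W = np.array([[ 0.2, -0.5,  0.1,  2.0],    # cat row
              [ 1.5,  1.3,  2.1,  0.0],    # dog row
              [ 0.0,  0.25, 0.2, -0.3]])   # ship row
b = np.array([1.1, 3.2, -1.2])             # one bias per class

scores = W @ x + b                         # three class scores
print(dict(zip(["cat", "dog", "ship"], scores.round(2))))
~~~

With these weights the cat score comes out strongly negative and the dog score much higher: a bad W for a cat image, exactly the situation described above.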
+
+524
+00:38:33,769 --> 00:38:47,309
+[Student:] If images are different sizes, how exactly
+do we make them all be a single-sized column vector?
+
+525
+00:38:47,309 --> 00:38:52,369
+The answer is: you always, always resize
+images to be basically the same size. We
+
+526
+00:38:52,369 --> 00:38:56,190
+can't easily deal with images of different
+sizes; we might go into
+
+527
+00:38:56,190 --> 00:38:59,789
+that later, but the simplest thing to
+think of is: we just resize every single
+
+528
+00:38:59,789 --> 00:39:04,460
+image to the exact same size. That's the simplest
+thing, because we want to ensure that all
+
+529
+00:39:04,460 --> 00:39:08,470
+of them are kind of comparable, made of the
+same stuff, so that we can make these
+
+530
+00:39:08,469 --> 00:39:12,049
+columns and we can analyze these color
+patterns that are aligned in the space.
+
+531
+00:39:12,050 --> 00:39:18,380
+In fact, state-of-the-art classifiers, the
+way they actually work, is they
+
+532
+00:39:18,380 --> 00:39:21,650
+only work on square images. So if you have a
+very long image, these methods will
+
+533
+00:39:21,650 --> 00:39:25,480
+actually work worse, because many of them,
+what they do is they squash it. That's what
+
+534
+00:39:25,480 --> 00:39:30,789
+they do, and it still works fairly well. So if you have a
+very long image, like a panorama, and you try to
+
+535
+00:39:30,789 --> 00:39:34,059
+put it through somewhere, like some online
+service, chances are it might work worse,
+
+536
+00:39:34,059 --> 00:39:36,679
+because they'll probably want to put it
+through a convnet, and they will make it a
+
+537
+00:39:36,679 --> 00:39:41,129
+square, because these convnets always
+work on squares. You can make them work
+
+538
+00:39:41,130 --> 00:39:45,490
+on anything, but that's just what happens
+in practice, usually. Any other questions?
+
+539
+00:39:45,489 --> 00:39:58,199
+[A question] on interpreting the W, the classifier. Yeah,
+so each image goes through this.
+
+540
+00:39:58,199 --> 00:40:04,109
+Would anyone else like to interpret this?
+So another way to actually put it, one
+
+541
+00:40:04,110 --> 00:40:07,150
+way that I didn't hear, but it's also a
+nice way of looking at it, is that
+
+542
+00:40:07,150 --> 00:40:12,769
+basically every single score is just a
+weighted sum of all the pixel values in
+
+543
+00:40:12,769 --> 00:40:16,489
+the image, and these weights, we get to
+choose them eventually, but it's just a
+
+544
+00:40:16,489 --> 00:40:20,559
+giant weighted sum. Really, all it's
+doing is it's counting up colors, right?
+
+545
+00:40:20,559 --> 00:40:25,779
+It's counting up colors at different
+spatial positions. So one way, one way
+
+546
+00:40:25,780 --> 00:40:29,500
+that was brought up in terms of how we
+can interpret this W classifier concretely,
+
+547
+00:40:29,500 --> 00:40:33,170
+is that it's kind of, a bit, like a
+template-matching thing. So here's what
+
+548
+00:40:33,170 --> 00:40:37,059
+I've done: I trained a classifier, and I
+haven't shown you how to do that yet, but I
+
+549
+00:40:37,059 --> 00:40:41,920
+trained my weight matrix; we'll come
+back to that in a second. I'm taking out every
+
+550
+00:40:41,920 --> 00:40:45,010
+single one of those rows that we've
+learned, every single classifier, and I'm
+
+551
+00:40:45,010 --> 00:40:46,599
+reshaping it back into an image
+
+552
+00:40:46,599 --> 00:40:51,809
+so that I can visualize it. So I'm taking
+what is originally just a giant blob of 3072
+
+553
+00:40:51,809 --> 00:40:55,650
+numbers, and I reshape it back into the image,
+to undo the distortion I've done, and
+
+554
+00:40:55,650 --> 00:40:59,660
+then I have all these templates that I can
+look at.
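A sketch of that reshaping step; here W is random, standing in for a trained CIFAR-10 weight matrix:

~~~python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 3072))        # pretend these rows were learned

templates = W.reshape(10, 32, 32, 3)       # undo the stretching: one 32x32x3 "image" per class
# rescale each template to 0..255 so it can be viewed as an ordinary image
lo = templates.min(axis=(1, 2, 3), keepdims=True)
hi = templates.max(axis=(1, 2, 3), keepdims=True)
viewable = (255 * (templates - lo) / (hi - lo)).astype(np.uint8)
print(viewable.shape)                      # (10, 32, 32, 3)
~~~

For a genuinely trained W, saving these ten arrays as images produces the blurry class templates discussed next.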
+
+555
+00:40:59,659 --> 00:41:04,659
+For example, the plane template you see here is like a blue
+blob. The reason you see a blue blob is that if you
+
+556
+00:41:04,659 --> 00:41:08,278
+looked at the color channels of this
+plane template, you'd see that in the
+
+557
+00:41:08,278 --> 00:41:11,440
+blue channel you have lots of positive
+weights, because those positive weights,
+
+558
+00:41:11,440 --> 00:41:15,479
+if they see blue values, they interact
+with those and they get a
+
+559
+00:41:15,478 --> 00:41:19,338
+contribution to the score. So this plane
+classifier is really just counting up the
+
+560
+00:41:19,338 --> 00:41:23,159
+amount of blue stuff in the image across
+all these spatial positions, and if you
+
+561
+00:41:23,159 --> 00:41:26,368
+look at the red and green channels of
+the plane classifier, you might find
+
+562
+00:41:26,369 --> 00:41:30,499
+zero values, or even negative values,
+right? That's the plane classifier.
+
+563
+00:41:30,498 --> 00:41:35,098
+Likewise for all these other classes; say
+the frog, you can almost see the template
+
+564
+00:41:35,099 --> 00:41:38,900
+of a frog there, right? It's looking for
+some green stuff, the green stuff has
+
+565
+00:41:38,900 --> 00:41:42,849
+positive weights in here, and then we see
+some brownish things on the sides,
+
+566
+00:41:42,849 --> 00:41:49,599
+so if that gets put over an image and
+dot-producted, it will get a high score. One
+
+567
+00:41:49,599 --> 00:41:51,430
+thing to note here is: look at this
+
+568
+00:41:51,429 --> 00:41:56,588
+car classifier; that's not a very,
+like, nice template of a car. Also, here,
+
+569
+00:41:56,588 --> 00:42:01,679
+the horse looks a bit weird. What's up
+with that? Why is the car looking weird, and
+
+570
+00:42:01,679 --> 00:42:11,048
+the horse looking weird? [Answer] Yeah, yeah,
+basically that's what's going on in the
+
+571
+00:42:11,048 --> 00:42:14,998
+data: the horses are sometimes facing left,
+sometimes right, and this classifier
+
+572
+00:42:14,998 --> 00:42:19,028
+really is not a very powerful classifier,
+and it has to combine the two modes, it has
+
+573
+00:42:19,028 --> 00:42:22,179
+to do both things at the same time,
+ending up with this two-headed horse in
+
+574
+00:42:22,179 --> 00:42:25,879
+there. And you can, in fact, say, just
+from this result, that there are probably more
+
+575
+00:42:25,880 --> 00:42:30,599
+left-facing horses in CIFAR-10 than
+right-facing ones, because that side is stronger. The same goes
+
+576
+00:42:30,599 --> 00:42:35,219
+for the car, right? We can have a car at, like, 45
+degrees, to the left or right, or front-on,
+
+577
+00:42:35,219 --> 00:42:40,588
+and this classifier here is the optimal
+way of mixing across, like, merging all
+
+578
+00:42:40,588 --> 00:42:43,608
+those modes into a single template,
+because that's what we're forcing it to do.
+
+579
+00:42:43,608 --> 00:42:46,900
+That's what we're actually doing. Now,
+neural networks don't have this
+
+580
+00:42:46,900 --> 00:42:50,239
+downside: in principle they can
+actually have a template for
+
+581
+00:42:50,239 --> 00:42:53,338
+this car and that car, and they can combine
+across them, giving them more power
+
+582
+00:42:53,338 --> 00:42:56,478
+to actually carry out this
+classification more properly. But for now,
+
+583
+00:42:56,478 --> 00:42:57,808
+we are constrained by this.
+
+584
+00:42:57,809 --> 00:43:08,239
+[Question]
+
+585
+00:43:08,239 --> 00:43:18,389
+Yes, so at train time we would not
+be taking just exactly the image; we'll be
+
+586
+00:43:18,389 --> 00:43:21,349
+jittering them, stretching them, skewing
+them, and we'll be putting all of that in.
+
+587
+00:43:21,349 --> 00:43:25,979
+That's going to become a huge part of
+getting this to work very well. So yes, we will
+
+588
+00:43:25,978 --> 00:43:30,038
+be doing a huge amount of that stuff: for
+every training example we're going
+
+589
+00:43:30,039 --> 00:43:33,469
+to hallucinate many other training
+examples through shifts, rotations, and
+
+590
+00:43:33,469 --> 00:43:47,009
+skews, and that works much better.
+[Question:] How would these templates change if you took the average
+
+591
+00:43:47,009 --> 00:43:56,969
+image per class? So you want to explicitly set a
+template, and the way you set the
+
+592
+00:43:56,969 --> 00:44:01,068
+template is you average across all the
+images, and that becomes your template.
+
+593
+00:44:01,068 --> 00:44:13,918
+Yeah, compared to the classifier this
+finds, that would do something similar, I would guess.
+
+594
+00:44:13,918 --> 00:44:18,489
+It would work worse, because the
+classifier, when you look at what it
+
+595
+00:44:18,489 --> 00:44:22,028
+formally optimizes for, I
+don't think it would have a minimum at
+
+596
+00:44:22,028 --> 00:44:26,179
+what you described, at just the mean of the
+images; but that would be, like, an intuitively
+
+597
+00:44:26,179 --> 00:44:30,079
+decent heuristic, to perhaps set
+the weights at initialization to
+
+598
+00:44:30,079 --> 00:44:34,239
+something related to it.
+
+599
+00:44:34,239 --> 00:44:40,349
+Yeah, we might be going into that;
+I'll be able to return to this. There are several,
+
+600
+00:44:40,349 --> 00:44:43,980
+several things.
+
+601
+00:44:43,980 --> 00:45:06,650
+[Question about the car template's] different colors: it's red, which is saying
+that there are probably more red cars in
+
+602
+00:45:06,650 --> 00:45:11,750
+the dataset, and it may not work for you;
+in fact, yellow cars might be poorly
+
+603
+00:45:11,750 --> 00:45:16,909
+scored. So this thing just does not have the
+capacity to do all of that, which is why
+
+604
+00:45:16,909 --> 00:45:19,989
+neural networks are powerful enough: they can capture all
+these different modes correctly. And so
+
+605
+00:45:19,989 --> 00:45:23,689
+this will just go after the numbers:
+there are more red cars, so that's where it
+
+606
+00:45:23,690 --> 00:45:28,389
+will go. If this was grayscale, I'm not
+sure if that would work better. We'll
+
+607
+00:45:28,389 --> 00:45:40,368
+come back to that, actually. [Question] You might
+expect, as I mentioned, for imbalanced
+
+608
+00:45:40,369 --> 00:45:42,190
+datasets, what you might expect,
+
+609
+00:45:42,190 --> 00:45:49,150
+well, not exactly what you might expect: with lots
+of cats, the cat bias would be
+
+610
+00:45:49,150 --> 00:45:53,750
+higher, because this
+classifier is just used to large numbers,
+
+611
+00:45:53,750 --> 00:45:57,980
+based on the loss; but we have to go into the
+loss function to see exactly how that
+
+612
+00:45:57,980 --> 00:46:01,929
+will play out, so it's hard to say right
+now.
+
+613
+00:46:01,929 --> 00:46:05,960
+Another interpretation of the classifier,
+that someone else also pointed out, and that
+
+614
+00:46:05,960 --> 00:46:09,869
+I'd like to point out, is: you can think
+of these images as very high-dimensional
+
+615
+00:46:09,869 --> 00:46:17,619
+points in a 3072-dimensional space, right? In
+this 3072-dimensional pixel space, every image
+
+616
+00:46:17,619 --> 00:46:22,130
+is a point, and these linear classifiers
+are describing gradients across
+
+617
+00:46:22,130 --> 00:46:25,070
+this three-thousand-something-
+dimensional space: these scores are a
+
+618
+00:46:25,070 --> 00:46:28,580
+gradient, going from negative to positive along
+some linear direction across the space.
+
+619
+00:46:28,579 --> 00:46:33,670
+And so, for example, here for the car classifier,
+I'm taking the first row of W, which is
+
+620
+00:46:33,670 --> 00:46:37,750
+the car class, and the line here is
+indicating the zero level set of the
+
+621
+00:46:37,750 --> 00:46:42,739
+classifier; in other words, along that
+line the car classifier has a zero score.
+
+622
+00:46:42,739 --> 00:46:46,849
+So the car classifier there has score zero, and
+then the arrows are indicating the
+
+623
+00:46:46,849 --> 00:46:51,730
+direction along which it will color the
+space with more and more
+
+624
+00:46:51,730 --> 00:46:56,400
+car-ness score. Similarly, we have three
+different classifiers in this example;
+
+625
+00:46:56,400 --> 00:46:59,900
+they also correspond to these
+gradients, each with its particular level set, and
+
+626
+00:46:59,900 --> 00:47:05,650
+basically all these points,
+they are there in the space, and
+
+627
+00:47:05,650 --> 00:47:08,970
+these linear classifiers, if we initialize them
+randomly, this car classifier would
+
+628
+00:47:08,969 --> 00:47:11,969
+have its level set at random, and then
+you'll see, when we actually do the
+
+629
+00:47:11,969 --> 00:47:16,449
+optimization, as we optimize, this will
+start to shift and turn, and it will rotate to
+
+630
+00:47:16,449 --> 00:47:20,239
+isolate the car class, and it's, like,
+truly fun to watch these classifiers
+
+631
+00:47:20,239 --> 00:47:25,038
+train, because it will rotate, it will snap
+onto the car class region, and it will
+
+632
+00:47:25,039 --> 00:47:28,528
+try to, like, separate out all the cars
+from all the other points; it's
+
+633
+00:47:28,528 --> 00:47:33,289
+really amusing to watch. So that's
+another way of interpreting it. OK,
+
+634
+00:47:33,289 --> 00:47:37,130
+here's a question for you: given all
+these interpretations of how
+
+635
+00:47:37,130 --> 00:47:43,028
+this classifier works, what would
+you expect to work really, really not
+
+636
+00:47:43,028 --> 00:47:51,909
+well with a linear classifier?
+
+637
+00:47:51,909 --> 00:48:05,230
+[Answer: concentric circles, one class inside, the other
+class around it.] Exactly. I see, so what you're
+
+638
+00:48:05,230 --> 00:48:10,349
+describing is, in this
+interpretation of the space, your images
+
+639
+00:48:10,349 --> 00:48:15,630
+in one class would be in a blob, and then
+your other class is, like, around it. So I'm
+
+640
+00:48:15,630 --> 00:48:19,880
+not sure exactly what that would look
+like in actual pixel space, but yes,
+
+641
+00:48:19,880 --> 00:48:22,869
+you're right: in that case a linear classifier
+will not be able to separate those out.
+
+642
+00:48:22,869 --> 00:48:26,920
+But what about in terms of, like, what
+would the images look like? You would
+
+643
+00:48:26,920 --> 00:48:31,079
+look at this dataset of images and clearly
+say that a linear classifier will probably
+
+644
+00:48:31,079 --> 00:49:02,380
+not do very well here. Yeah?
+
+645
+00:49:02,380 --> 00:49:39,210
+[Answer:] take a trained classifier and do a
+negative of it, a negative image of that
+
+646
+00:49:39,210 --> 00:49:42,699
+class. You'd still see the edges and
+you'd say, OK, that's an airplane,
+
+647
+00:49:42,699 --> 00:49:45,710
+obviously, by the shape, but to the linear
+classifier all the colors would be
+
+648
+00:49:45,710 --> 00:49:49,760
+exactly wrong, and so the classifier would
+hate that airplane.
+
+649
+00:49:49,760 --> 00:50:02,330
+[Another] example:
+
+650
+00:50:02,329 --> 00:50:12,630
+dogs, dogs in one class on the left, and dogs
+on the right, and you'd think that would be
+
+651
+00:50:12,630 --> 00:50:27,090
+a problem, right?
+
+652
+00:50:27,090 --> 00:50:32,829
+[A student suggests a] white background or something; would
+that be a problem? It wouldn't be a problem,
+
+653
+00:50:32,829 --> 00:50:37,059
+it wouldn't be a problem.
+
+654
+00:50:37,059 --> 00:50:52,570
+[Question about a] transformation.
+
+655
+00:50:52,570 --> 00:50:56,789
+You're saying that maybe a more difficult
+thing would be if your dogs that are
+
+656
+00:50:56,789 --> 00:51:00,309
+positioned in some way belong to one class. Why
+wouldn't it be a problem? If you actually
+
+657
+00:51:00,309 --> 00:51:04,279
+do have something in the center and something
+on the right, it doesn't actually have an
+
+658
+00:51:04,280 --> 00:51:08,840
+understanding of "spatial", and that
+would actually be fine; it would be
+
+659
+00:51:08,840 --> 00:51:15,769
+relatively easy, because you would have
+positive weights in the middle.
+
+660
+00:51:15,769 --> 00:51:25,219
+OK.
+
+661
+00:51:25,219 --> 00:51:34,348
+Yes, so this is really, really what it's
+doing here, really what this is doing, is
+
+662
+00:51:34,349 --> 00:51:38,619
+it's counting up, counting up colors at
+spatial positions. Anything that messes
+
+663
+00:51:38,619 --> 00:51:41,800
+with that will be really hard. Actually,
+to go back to your point: if you had a
+
+664
+00:51:41,800 --> 00:51:44,300
+grayscale dataset, by the way, that would
+work
+
+665
+00:51:44,300 --> 00:51:48,070
+not very well with linear classifiers. It would
+probably not work: if you take CIFAR-
+
+666
+00:51:48,070 --> 00:51:53,250
+10 and you make it grayscale, then doing
+the exact same classification on grayscale
+
+667
+00:51:53,250 --> 00:51:56,059
+images would probably work really
+terribly, because you can't pick up on
+
+668
+00:51:56,059 --> 00:52:00,739
+the colors; you have to pick up on these
+textures and fine details now, and you
+
+669
+00:52:00,739 --> 00:52:03,848
+just can't localize them, because they
+could be at various positions; they don't
+
+670
+00:52:03,849 --> 00:52:08,400
+consistently come up across the dataset. It
+would be kind of a disaster.
+
+671
+00:52:08,400 --> 00:52:11,660
+Another example would be different
+textures: if, say, all of your
+
+672
+00:52:11,659 --> 00:52:16,989
+textures are blue, but these textures could be
+of different types, then this can't really
+
+673
+00:52:16,989 --> 00:52:20,799
+distinguish, say, two different types
+if they can appear spatially invariant.
+
+674
+00:52:20,800 --> 00:52:29,740
+That would be terrible, terrible. OK, so
+just to remind you where I think we are:
+
+675
+00:52:29,739 --> 00:52:35,269
+we've defined this function, so with a
+specific choice of W we're looking at
+
+676
+00:52:35,269 --> 00:52:38,588
+some test images, we're getting some
+scores out, and, just looking at where
+
+677
+00:52:38,588 --> 00:52:43,070
+we're headed now, with some setting of
+W we're getting some scores for all these
+
+678
+00:52:43,070 --> 00:52:47,470
+images. So, for example, with this
+setting of W, on this image we're seeing
+
+679
+00:52:47,469 --> 00:52:51,319
+that the cat score is 2.9, but there are
+some classes that got a higher score,
+
+680
+00:52:51,320 --> 00:52:54,588
+like dog, so that's not very good, right?
+But some classes have negative scores,
+
+681
+00:52:54,588 --> 00:52:59,909
+which is good, for this image. So this is
+kind of a medium result for these weights
+
+682
+00:52:59,909 --> 00:53:04,199
+on this image. Here we see that the
+car class, the correct one there, has the
+
+683
+00:53:04,199 --> 00:53:08,439
+highest score, which is very good, right? So
+this setting of W worked well on this image.
+
+684
+00:53:08,440 --> 00:53:14,940
+Here we see that the frog class got a very low
+score, so this W did terribly on that one. So where we're
+
+685
+00:53:14,940 --> 00:53:19,990
+headed now is: we're going to define what
+we call a loss function, and this loss
+
+686
+00:53:19,989 --> 00:53:23,899
+function will quantify this intuition of
+what we consider good or bad. Right now
+
+687
+00:53:23,900 --> 00:53:26,440
+we're just eyeballing these numbers and
+saying what's good and what's bad;
+
+688
+00:53:26,440 --> 00:53:29,490
+we'd like to actually write down the
+mathematical expression that tells us
+
+689
+00:53:29,489 --> 00:53:35,949
+exactly, like: this setting of W, across
+our training data, is 12.5-bad, or 1220-whatever-
+
+690
+00:53:35,949 --> 00:53:40,469
+bad, or 110-bad. Because then, once we have
+it defined specifically, we're going to be
+
+691
+00:53:40,469 --> 00:53:44,318
+looking for the W that minimizes the loss, and
+it will be set up in such a way that
+
+692
+00:53:44,318 --> 00:53:48,500
+when you have a very low loss, like, say,
+even zero, then you're
+
+693
+00:53:48,500 --> 00:53:53,760
+correctly classifying all your images,
+but if you have a very high loss, then
+
+694
+00:53:53,760 --> 00:53:56,970
+everything is messed up and W is not good
+at all. So we're going to define a loss
+
+695
+00:53:56,969 --> 00:54:01,059
+function and then look for different W's
+that actually do very well across all of
+
+696
+00:54:01,059 --> 00:54:03,469
+it. So that's roughly what's coming up:
+
+697
+00:54:03,469 --> 00:54:09,108
+we'll define a loss function, which is
+a way to quantify how bad a W is
+
+698
+00:54:09,108 --> 00:54:13,328
+on our dataset. The loss function is a
+function of your entire training set and
+
+699
+00:54:13,329 --> 00:54:19,900
+your weights; we don't have control over
+the training set, but we have control over the weights. Then
+
+700
+00:54:19,900 --> 00:54:22,960
+we're going to look at the process of
+optimization: how to efficiently find the
+
+701
+00:54:22,960 --> 00:54:27,420
+set of weights W that works across all
+of the images and gives us a very low
+
+702
+00:54:27,420 --> 00:54:30,940
+loss. And then, eventually, what we'll do
+is we'll go back and look at this
+
+703
+00:54:30,940 --> 00:54:34,250
+expression, the classifier that we saw,
+and we're going to start meddling with the
+
+704
+00:54:34,250 --> 00:54:38,260
+function. So we're going to extend it: it
+will not be that simple linear
+
+705
+00:54:38,260 --> 00:54:41,349
+expression; we're going to make it slightly
+more complex, and we'll get a neural network,
+
+706
+00:54:41,349 --> 00:54:44,630
+and then we'll make it slightly more complex
+again, and we'll get a convolutional network.
+
+707
+00:54:44,630 --> 00:54:48,789
+But otherwise the entire framework will
+stay unchanged: all the time we'll be
+
+708
+00:54:48,789 --> 00:54:52,389
+computing these scores; the functional
+form will be changing, but we'll always be
+
+709
+00:54:52,389 --> 00:54:56,909
+mapping to some scores through some
+function, and we'll make it more elaborate
+
+710
+00:54:56,909 --> 00:55:01,179
+over time. And then we're defining some
+loss function, and we're looking at which
+
+711
+00:55:01,179 --> 00:55:04,449
+weights, which parameters, give us a very
+low loss. And that's the setup we'll be
+
+712
+00:55:04,449 --> 00:55:09,710
+working with going forward. So next class
+we'll look into loss functions, and then
+
+713
+00:55:09,710 --> 00:55:13,730
+we'll go on to neural networks and convnets.
+So I guess this is my last slide,
+
+714
+00:55:13,730 --> 00:55:23,920
+so I can take up any last questions.
+
+715
+00:55:23,920 --> 00:55:36,068
+Sorry, sorry, sorry, I didn't hear.
+
+716
+00:55:36,068 --> 00:55:41,969
+[The question is about] the process of optimization, whether
+in optimization settings you can sometimes
+
+717
+00:55:41,969 --> 00:55:45,429
+use these alternative approaches.
+Basically, the way this will work, we'll
+
+718
+00:55:45,429 --> 00:55:49,598
+see, is we'll always start off with a
+random W, so that will give us some loss,
+
+719
+00:55:49,599 --> 00:55:53,249
+and then we don't have a process for
+finding, right away, the best set of
+
+720
+00:55:53,248 --> 00:55:57,509
+weights; but what we do have a process for is
+iteratively, slightly, improving the
+
+721
+00:55:57,509 --> 00:56:01,309
+weights. So what we'll see is: we look at the
+loss function, and we'll find the gradient
+
+722
+00:56:01,309 --> 00:56:06,380
+in weight space, and we'll march down. So what we
+do know how to do is how to slightly
+
+723
+00:56:06,380 --> 00:56:09,890
+improve a set of weights. We don't know
+how to solve the problem of just finding the
+
+724
+00:56:09,889 --> 00:56:12,858
+best W right away; we don't
+know how to do that, because, especially
+
+725
+00:56:12,858 --> 00:56:17,108
+when these functions are very complex,
+like, say, an entire convnet, it's a huge landscape;
+
+726
+00:56:17,108 --> 00:56:31,038
+it's just a very intractable problem.
+Does that answer your question? I'm not sure how,
+
+727
+00:56:31,039 --> 00:56:40,170
+how do we deal with the color problem? So, OK,
+so here we saw that the linear
+
+728
+00:56:40,170 --> 00:56:44,809
+classifier for car was this red template
+for a car. A neural network, basically
+
+729
+00:56:44,809 --> 00:56:47,619
+what it will do is, you can
+look at it as stacking linear
+
+730
+00:56:47,619 --> 00:56:50,818
+classifiers, to some degree. So what it
+will end up doing is it will have all
+
+731
+00:56:50,818 --> 00:56:55,748
+these little templates, really, for red
+cars, cars going this way or
+
+732
+00:56:55,748 --> 00:56:58,248
+that way or that way; there will be a neuron
+assigned, through the training, to every one of
+
+733
+00:56:58,248 --> 00:57:01,399
+these different modes, and then they will
+be combined, across them, on the second
+
+734
+00:57:01,400 --> 00:57:04,739
+layer. So basically you have these neurons
+looking for different types of cars,
+
+735
+00:57:04,739 --> 00:57:08,588
+and then the next neuron will be just like:
+OK, I'll just take a weighted sum of what you guys
+
+736
+00:57:08,588 --> 00:57:13,548
+are doing, an OR operation over you, and then
+we can detect cars in all of their modes,
+
+737
+00:57:13,548 --> 00:57:17,498
+in all of their positions. Does that make sense?
+That's roughly how it works.
+
diff --git a/captions/En/Lecture3_en.srt b/captions/En/Lecture3_en.srt
new file mode 100644
index 00000000..f93fe7e0
--- /dev/null
+++ b/captions/En/Lecture3_en.srt
@@ -0,0 +1,4442 @@
+1
+00:00:00,000 --> 00:00:05,400
+so before we get into some of the
+material today on loss functions and
+
+2
+00:00:05,400 --> 00:00:09,429
+optimization I wanted to go over some
+administrative things first
+
+3
+00:00:09,429 --> 00:00:12,859
+just as a reminder the first assignment
+is due next Wednesday so you have
+
+4
+00:00:12,859 --> 00:00:18,100
+roughly nine days left and just as a
+warning Monday is a holiday so there will
+
+5
+00:00:18,100 --> 00:00:23,050
+be no class and no office hours so plan out
+your time accordingly to make sure that
+
+6
+00:00:23,050 --> 00:00:25,920
+you can complete the assignment in time
+of course you also have some late
+
+7
+00:00:25,920 --> 00:00:29,960
+days that you can use and allocate among
+your assignments as you see fit
+
+8
+00:00:29,960 --> 00:00:35,149
+ok so before
diving into the material first
+i'd like to remind you where we are
+
+9
+00:00:35,149 --> 00:00:39,100
+currently. last time we looked at this
+problem of visual recognition,
+
+10
+00:00:39,100 --> 00:00:42,950
+specifically at image classification, and
+we talked about the fact that this
+
+11
+00:00:42,950 --> 00:00:45,780
+is actually a very difficult problem
+right so you just consider the cross
+
+12
+00:00:45,780 --> 00:00:50,829
+product of all the possible variations
+that we have to be robust to when we
+
+13
+00:00:50,829 --> 00:00:54,198
+recognize any of these categories such
+as cat it just seems like such an
+
+14
+00:00:54,198 --> 00:00:58,049
+intractable and impossible problem and yet not
+only do we know how to solve these
+
+15
+00:00:58,049 --> 00:01:02,108
+problems now but we can solve this
+problem for thousands of categories and
+
+16
+00:01:02,109 --> 00:01:05,859
+the state of the art methods work almost
+at human accuracy or even slightly
+
+17
+00:01:05,859 --> 00:01:11,829
+surpassing it on some of those classes
+and it also runs nearly in real-time
+
+18
+00:01:11,829 --> 00:01:16,539
+on your phone and so basically all
+of this also happened in the last three
+
+19
+00:01:16,540 --> 00:01:19,790
+years and also you'll be experts by the
+end of the class on all of this
+
+20
+00:01:19,790 --> 00:01:23,609
+technology so it's really cool and
+exciting. OK so that's the problem of
+
+21
+00:01:23,609 --> 00:01:27,140
+image classification. we talked
+specifically about the data-driven
+
+22
+00:01:27,140 --> 00:01:30,450
+approach the fact that we can't just
+explicitly hardcode these classifiers so
+
+23
+00:01:30,450 --> 00:01:34,100
+we have to actually train them from
+data and so we looked at the idea of
+
+24
+00:01:34,099 --> 00:01:37,188
+having the training data and
+having the validation splits where we
+
+25
+00:01:37,188 --> 00:01:41,408
+just test out our hyperparameters and a test
+set that you don't touch too much we
+
+26
+00:01:41,409 --> 00:01:44,810
+looked specifically at the example of the
+nearest neighbor classifier and also
+
+27
+00:01:44,810 --> 00:01:48,618
+at K nearest neighbor classifiers and
+I talked about the CIFAR-10 dataset
+
+28
+00:01:48,618 --> 00:01:52,938
+which is our toy dataset that we'll play
+with during this class then I introduced
+
+29
+00:01:52,938 --> 00:01:58,438
+the idea of this approach that I termed the
+parametric approach which is really that
+
+30
+00:01:58,438 --> 00:02:03,639
+we're writing a function F from the image
+directly to the raw 10 scores if you have 10 classes
+
+31
+00:02:03,640 --> 00:02:07,618
+And this parametric form we took to be linear first.
+
+32
+00:02:07,618 --> 00:02:11,520
+So we just have F=Wx, and we talked about the
+interpretations of this linear classifier
+
+33
+00:02:11,520 --> 00:02:12,850
+the fact that you can
+
+34
+00:02:12,849 --> 00:02:16,039
+interpret it as matching templates or
+that you can interpret it as these
+
+35
+00:02:16,039 --> 00:02:18,449
+images being points in a very
+high-dimensional space and our linear classifier
+
+36
+00:02:18,449 --> 00:02:23,560
+kind of going in and coloring this space by class scores
+
+37
+00:02:23,560 --> 00:02:28,740
+so to speak.
and so by the end of the class
+we got to this picture where we suppose
+
+38
+00:02:28,740 --> 00:02:32,240
+we have a training
+dataset with just three images here
+
+39
+00:02:32,240 --> 00:02:36,530
+along the columns and we have some classes say
+10 classes in CIFAR-10
+
+40
+00:02:36,530 --> 00:02:40,740
+basically this function f() is assigning scores
+for every single one of these images
+
+41
+00:02:40,740 --> 00:02:44,510
+with some particular setting of weights
+which we have chosen randomly here
+
+42
+00:02:44,509 --> 00:02:47,939
+we get some scores out and so some of these
+results are good and some of them are bad
+
+43
+00:02:47,939 --> 00:02:51,419
+so if you inspect these scores, for example,
+in the first image you can see that
+
+44
+00:02:51,419 --> 00:02:55,509
+the correct class which is cat got a
+score of 2.9 and that's kind of in the
+
+45
+00:02:55,509 --> 00:03:00,060
+middle so some classes here received
+a higher score which is not very good
+
+46
+00:03:00,060 --> 00:03:03,289
+some classes received a much lower score
+which is good for that particular image
+
+47
+00:03:03,289 --> 00:03:09,019
+the car was very well classified because the
+class score of car was much higher than all of the other ones
+
+48
+00:03:09,020 --> 00:03:12,980
+And the frog was not very well classified at all,
+right?
+
+49
+00:03:12,979 --> 00:03:18,199
+So we have this notion that
+different weights work better or
+
+50
+00:03:18,199 --> 00:03:21,389
+worse on different images and of course
+we're trying to find weights that
+
+51
+00:03:21,389 --> 00:03:26,209
+give us scores that are consistent with
+all the ground truth labels in the data.
+
+52
+00:03:26,210 --> 00:03:30,490
+And so what we're going to do
+now is, so far, I believe, we've only
+
+53
+00:03:30,490 --> 00:03:33,590
+said things like "this is good and
+that's not so good" and so on but we have
+
+54
+00:03:33,590 --> 00:03:34,900
+to actually
+
+55
+00:03:34,900 --> 00:03:38,710
+quantify this notion we have to
+say that this particular set of weights
+
+56
+00:03:38,710 --> 00:03:44,189
+W is, say, like 12-bad or 1.5-bad or whatever
+and then once we have this loss function
+
+57
+00:03:44,189 --> 00:03:47,710
+we're going to minimize it so we're
+going to find the W that gets us the lowest
+
+58
+00:03:47,710 --> 00:03:50,830
+loss and we're going to look into that
+today we're going to look specifically
+
+59
+00:03:50,830 --> 00:03:55,830
+into how we can define a loss function
+that measures this unhappiness and then
+
+60
+00:03:55,830 --> 00:04:00,030
+we're actually going to look at two
+different cases an SVM cost and a softmax
+
+61
+00:04:00,030 --> 00:04:04,840
+cost and then we're going to look into
+the process of optimization which is how do
+
+62
+00:04:04,840 --> 00:04:08,000
+you start off with these random weights
+and how do you actually find a very very
+
+63
+00:04:08,000 --> 00:04:13,110
+good setting of weights efficiently so
+I'm going to downsize this example so that
+
+64
+00:04:13,110 --> 00:04:16,620
+we have a nice working example to work
+with suppose we only had three classes
+
+65
+00:04:16,620 --> 00:04:18,030
+instead of, you know,
+
+66
+00:04:18,029 --> 00:04:22,009
+tens or thousands and we have these
+three images and these are our scores
+
+67
+00:04:22,009 --> 00:04:23,360
+for some setting of W
+
+68
+00:04:23,360 --> 00:04:27,949
+we're going to now try to write down
+exactly our unhappiness with this result
+
+69
+00:04:27,949 --> 00:04:32,680
+the first loss we're going to look
+into is termed the multi-class SVM loss
+
+70
+00:04:32,680 --> 00:04:36,629
+this is a generalization of the binary
+support vector machine that you may have
+
+71
+00:04:36,629 --> 00:04:42,379
+seen elsewhere, I think
+CS229 covers it as well, and so the setup here is
+
+72
+00:04:42,379 --> 00:04:47,710
+that we have this score function right so S is a
+vector of all the class scores these are our
+
+73
+00:04:47,709 --> 00:04:50,948
+score vectors and there's a specific form
+here:
+
+74
+00:04:50,949 --> 00:04:55,348
+the loss equals this sum and I'm going to
+interpret this loss now for you so that
+
+75
+00:04:55,348 --> 00:04:59,978
+we're going to see through a
+specific example why this expression
+
+76
+00:04:59,978 --> 00:05:06,158
+is what it is. effectively what the SVM loss is
+saying is that it's a sum across all
+
+77
+00:05:06,158 --> 00:05:11,399
+the incorrect examples, so a sum
+across all the incorrect
+
+78
+00:05:11,399 --> 00:05:17,209
+classes so for every single example we
+have this loss and it's summing across
+
+79
+00:05:17,209 --> 00:05:20,769
+all the incorrect classes and it's
+comparing the score that the correct class
+
+80
+00:05:20,769 --> 00:05:25,209
+received to the score that the
+incorrect class received, S_j minus S_{y_i},
+
+81
+00:05:25,209 --> 00:05:31,269
+y_i being the correct label, plus one,
+and then it takes the max of that and zero so what's
+
+82
+00:05:31,269 --> 00:05:35,838
+going on here is we're comparing the
+difference in these scores and this
+
+83
+00:05:35,838 --> 00:05:40,338
+particular loss is saying that not only
+do I want the correct score to be higher
+
+84
+00:05:40,338 --> 00:05:43,918
+than the incorrect score but there's
+actually a safety margin that we're putting
+
+85
+00:05:43,918 --> 00:05:46,079
+in, we're using a safety margin of
+
+86
+00:05:46,079 --> 00:05:53,198
+exactly one and we're going to go into
+why one makes sense to use as opposed to
+
+87
+00:05:53,199 --> 00:05:56,900
+some other hyperparameter that we'd have to
+choose there. intuitively, and you can
+
+88
+00:05:56,899 --> 00:06:00,508
+look into the notes for a much more rigorous
+derivation of exactly why that one
+
+89
+00:06:00,509 --> 00:06:04,278
+doesn't matter, the way to
+think about it is these scores are kind of
+
+90
+00:06:04,278 --> 00:06:08,500
+scale-free because I can scale my W I can
+make it larger or smaller and you're
+
+91
+00:06:08,500 --> 00:06:12,490
+going to get larger or smaller scores so
+really there's this free parameter of
+
+92
+00:06:12,490 --> 00:06:16,550
+the scores and how large or small they
+can be that is tied to how large the
+
+93
+00:06:16,550 --> 00:06:19,930
+weights are in magnitude and so these
+scores are kind of arbitrary so using
+
+94
+00:06:19,930 --> 00:06:25,269
+one is just an arbitrary choice to some
+extent ok so let's see specifically how
+
+95
+00:06:25,269 --> 00:06:29,128
+this expression works with a concrete
+example so here I am going to evaluate
+
+96
+00:06:29,129 --> 00:06:33,899
+the loss for the first example so here
+we're computing, plugging in the
+
+97
+00:06:33,899 --> 00:06:35,949
+scores, so we see that we're comparing
+
+98
+00:06:35,949 --> 00:06:40,829
+the score we got for car, 5.1, minus 3.2,
+which is the score of the correct class cat, and
+
+99
+00:06:40,829 --> 00:06:45,219
+then adding our safety margin of one and
+taking the max with 0, and really what
+
+100
+00:06:45,220 --> 00:06:48,770
+it's doing is it's clamping
+values at 0, right?
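Written out in numpy, the per-example loss being stepped through here is only a few lines. The scores below are the ones from the lecture's running example (the cat and car image columns):

~~~python
import numpy as np

def svm_loss(scores, y, margin=1.0):
    """Multiclass SVM loss: sum over j != y of max(0, s_j - s_y + margin)."""
    d = np.maximum(0.0, scores - scores[y] + margin)
    d[y] = 0.0                      # skip the correct class (j != y_i)
    return d.sum()

print(svm_loss(np.array([3.2, 5.1, -1.7]), y=0))  # cat image -> 2.9
print(svm_loss(np.array([1.3, 4.9, 2.0]), y=1))   # car image -> 0.0
~~~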
+
+101
+00:06:48,769 --> 00:06:53,759
+so if we get a negative result we're going to
+clamp it at 0. and if you look at the second class, the
+
+102
+00:06:53,759 --> 00:06:55,089
+incorrect class frog:
+
+103
+00:06:55,089 --> 00:06:59,699
+-1.7, subtract 3.2, add the safety
+margin, and we're going to get -3.9,
+
+104
+00:06:59,699 --> 00:07:03,629
+and then when you work this through you
+get a loss of 2.9
+
+105
+00:07:03,629 --> 00:07:07,209
+intuitively what you can see here, the
+way this worked out, is: the
+
+106
+00:07:07,209 --> 00:07:12,930
+cat score is 3.2 so according to the SVM
+loss what we would like, ideally, is
+
+107
+00:07:12,930 --> 00:07:16,100
+that the scores for all the other classes are
+at most
+
+108
+00:07:16,100 --> 00:07:21,370
+2.2 but the car class actually had a much
+higher score than that and
+
+109
+00:07:21,370 --> 00:07:24,620
+this difference between what we would have
+liked, which is 2.2, and what actually
+
+110
+00:07:24,620 --> 00:07:30,939
+happened, which is 5.1, is exactly this
+difference of 2.9, which is how bad of an
+
+111
+00:07:30,939 --> 00:07:36,129
+outcome this was. and in the other
+case, the frog case, you can see the frog
+
+112
+00:07:36,129 --> 00:07:40,139
+score was quite a bit lower than 2.2,
+and so the way that works out in the
+
+113
+00:07:40,139 --> 00:07:43,289
+math is that you end up getting a
+negative number when you compare these
+
+114
+00:07:43,290 --> 00:07:48,110
+scores, and then the max with zero gives a zero loss
+contribution for that particular part,
+
+115
+00:07:48,110 --> 00:07:54,439
+and you end up with a loss of 2.9. OK so
+that's the loss for this first image. for
+
+116
+00:07:54,439 --> 00:07:57,050
+the second image we're going to again do
+the same thing
+
+117
+00:07:57,050 --> 00:08:01,689
+plug in the numbers: we're comparing the
+cat score to the car score so we get
+
+118
+00:08:01,689 --> 00:08:07,329
+1.3 minus 4.9 plus the safety
+margin, and the same for the other class,
+
+119
+00:08:07,329 --> 00:08:11,659
+so when you plug it in you actually end up
+with a loss of zero, a loss of 0.
+
+120
+00:08:11,660 --> 00:08:17,280
+intuitively that's because the car score
+here, it is true that the car score is
+
+121
+00:08:17,279 --> 00:08:22,479
+higher than all the other scores for
+that image by at least one, right? that's
+
+122
+00:08:22,480 --> 00:08:27,490
+why we got a zero loss there: the
+constraint was satisfied, and the sum of
+
+123
+00:08:27,490 --> 00:08:31,310
+the losses is zero. and in the third case we end up
+with a very bad loss because of course
+
+124
+00:08:31,310 --> 00:08:34,470
+the frog class received a very low score
+but the other classes received quite
+
+125
+00:08:34,470 --> 00:08:39,349
+high scores so this adds up to an
+unhappiness of 10.9 and now if we
+
+126
+00:08:39,349 --> 00:08:42,520
+actually want to combine all of this
+into a single loss function we're going
+
+127
+00:08:42,519 --> 00:08:45,929
+to do the relatively intuitive
+transformation here: we just take the
+
+128
+00:08:45,929 --> 00:08:48,049
+average across all the losses we obtain
+
+129
+00:08:48,049 --> 00:08:51,458
+over the training set and so we would say
+that the loss at the end when you
+
+130
+00:08:51,458 --> 00:08:56,369
+average these numbers is 4.6 so this
+particular setting of W on this training
+
+131
+00:08:56,370 --> 00:09:01,320
+data gives us some scores which we plug
+into the loss function and we've given
+
+132
+00:09:01,320 --> 00:09:06,170
+an
+133
+00:09:06,169 --> 00:09:08,939
+OK, so now I'm going to ask you a series of
+questions, to test your understanding of how this works.
+
+134
+00:09:08,940 --> 00:09:12,390
+I'll get to your questions in a bit; let me
+first pose a couple of my own questions.
+
+135
+00:09:12,389 --> 00:09:20,230
+First of all: that sum over there, which
+is a sum over the incorrect
+
+136
+00:09:20,230 --> 00:09:25,560
+classes j. What if instead it were a sum over
+all the classes, not just
+
+137
+00:09:25,559 --> 00:09:29,799
+the incorrect ones? That is, what if we allowed
+j to equal y_i? Why am I actually
+
+138
+00:09:29,799 --> 00:09:39,149
+adding that small constraint in the
+sum there? [student answer] Yes, so in fact what would
+
+139
+00:09:39,149 --> 00:09:43,139
+happen, the reason we demand j not equal
+to y_i, is: if we allowed j to equal y_i,
+
+140
+00:09:43,139 --> 00:09:46,539
+then the score of y_i would cancel with itself,
+
+141
+00:09:46,539 --> 00:09:49,828
+you'd end up with max(0, 1), and really what
+you're doing is you're adding a constant
+
+142
+00:09:49,828 --> 00:09:53,549
+of one. So if that sum went over all the
+scores, then really you'd just be
+
+143
+00:09:53,549 --> 00:09:59,250
+inflating the loss by a constant of one;
+that's why it's there. Second: what if
+
+144
+00:09:59,250 --> 00:10:03,940
+we used a mean instead of a sum, right?
+So I'm summing over all these
+
+145
+00:10:03,940 --> 00:10:10,500
+constraints; what if I used a mean, just
+like I'm using a mean to average
+
+146
+00:10:10,500 --> 00:10:13,389
+over the losses for all the examples?
+What if I used the mean over these scores?
+
+147
+00:10:13,389 --> 00:10:28,000
+[student answer about the number of
+classes] So you're right in that the
+
+148
+00:10:28,000 --> 00:10:33,870
+absolute value of the loss will be lower,
+
+149
+00:10:33,870 --> 00:10:37,879
+lower by a constant factor. Why? If we
+
+150
+00:10:37,879 --> 00:10:52,689
+did actually take an average here, we would be
+averaging over the number of classes
+
+151
+00:10:52,690 --> 00:10:56,220
+here, but there's a constant number of
+classes, say three in this specific
+
+152
+00:10:56,220 --> 00:10:56,889
+example,
+
+153
+00:10:56,889 --> 00:11:01,000
+so it amounts to putting a constant of one-third
+in front of the loss, and since we're
+
+154
+00:11:01,000 --> 00:11:04,450
+always minimizing in the end, that would make
+the loss lower, just as you pointed out, but
+
+155
+00:11:04,450 --> 00:11:07,820
+in the end what we're always interested
+in is that we're going to minimize over W
+
+156
+00:11:07,820 --> 00:11:12,470
+that loss. So if you're shifting your
+loss by one, or if you're scaling it by
+
+157
+00:11:12,470 --> 00:11:15,350
+a constant, it actually doesn't change
+the solutions; you're still going to
+
+158
+00:11:15,350 --> 00:11:19,420
+end up at the same optimal W. So these
+choices are basically free
+
+159
+00:11:19,419 --> 00:11:23,169
+parameters and don't matter, so for
+convenience I'm adding j not equal to y_i,
+
+160
+00:11:23,169 --> 00:11:26,299
+and I'm not actually taking the mean,
+although it would be the same thing, and the
+
+161
+00:11:26,299 --> 00:11:33,329
+same goes for whether we average
+or sum across the examples. OK,
+
+162
+00:11:33,330 --> 00:11:38,410
+next question: what if we instead used,
+not the formulation up there, but a
+
+163
+00:11:38,409 --> 00:11:42,669
+very similar-looking formulation with an
+additional square at the end?
+
+164
+00:11:42,669 --> 00:11:47,809
+So we're taking
+the difference between the scores, plus
+the margin of one, and then
+
+165
+00:11:47,809 --> 00:11:54,509
+we're squaring that. Do we obtain the same
+or a different loss? Do you think we
+
+166
+00:11:54,509 --> 00:11:57,710
+obtain the same or a different loss, in the
+sense that if you were to optimize and
+
+167
+00:11:57,710 --> 00:12:05,759
+find the best W, would we get the same
+result or not?
+
+168
+00:12:05,759 --> 00:12:20,340
+[student answer] Yes, we in fact get a different
+loss. It's not as obvious to see, but one way
+
+169
+00:12:20,340 --> 00:12:26,639
+to see it is that we're not just
+scaling the loss
+
+170
+00:12:26,639 --> 00:12:30,710
+up or down by a constant, or shifting it by a
+constant; we're actually changing the
+
+171
+00:12:30,710 --> 00:12:35,580
+differences, we are changing the
+trade-offs nonlinearly, in terms of how
+
+172
+00:12:35,580 --> 00:12:38,920
+the SVM, the support vector machine, is going
+to go and trade off the different
+
+173
+00:12:38,919 --> 00:12:43,519
+score margins on different examples.
+It's not obvious to see, but basically
+
+174
+00:12:43,519 --> 00:12:46,829
+it's a different loss; what I want to
+illustrate is that not all changes to this loss
+
+175
+00:12:46,830 --> 00:12:53,320
+are completely free. And the second
+formulation here is in fact something we call the
+
+176
+00:12:53,320 --> 00:12:57,530
+squared hinge loss, as opposed to the one on
+top, which we call the hinge loss, and you can
+
+177
+00:12:57,529 --> 00:13:01,480
+use either; it's kind of a hyperparameter,
+which one to use. Most often you see the
+
+178
+00:13:01,480 --> 00:13:04,750
+first formulation; that's what we use
+most of the time, but sometimes you can
+
+179
+00:13:04,750 --> 00:13:07,950
+see datasets where the squared hinge loss
+works better, so that's something you can
+
+180
+00:13:07,950 --> 00:13:12,550
+play with; it's really a hyperparameter, but
+the first one is used most often.
+
+181
+00:13:12,549 --> 00:13:18,919
+Let's also think about the scale of this
+loss: what are the min and max possible loss
+
+182
+00:13:18,919 --> 00:13:23,149
+that you can achieve with the
+multi-class SVM on your entire dataset?
+
+183
+00:13:23,149 --> 00:13:26,759
+What is the smallest value?
+
+184
+00:13:26,759 --> 00:13:35,029
+Zero, good. What is the highest value?
+Right: scores could be arbitrarily
+
+185
+00:13:35,029 --> 00:13:39,870
+terrible, so if the score you assign to
+the correct class is very, very small,
+
+186
+00:13:39,870 --> 00:13:45,230
+then your loss is going to go to
+infinity. And one more question, which
+
+187
+00:13:45,230 --> 00:13:49,480
+becomes kind of important when we start
+doing optimization: usually, when we
+
+188
+00:13:49,480 --> 00:13:53,200
+actually optimize these loss functions,
+we start with an initialization W
+
+189
+00:13:53,200 --> 00:13:56,430
+of very small weights, so what ends
+up happening is that the scores at the
+
+190
+00:13:56,429 --> 00:14:00,819
+very beginning of optimization are
+roughly near zero; all of them are
+
+191
+00:14:00,820 --> 00:14:05,650
+small numbers near zero. So what is the
+loss when all the scores are near zero, in this
+
+192
+00:14:05,649 --> 00:14:12,329
+particular case? That's right: the number of
+classes minus one. If all the scores are
+
+193
+00:14:12,330 --> 00:14:16,639
+zero, then we get this particular loss
+I put down here, and by taking the average
+
+194
+00:14:16,639 --> 00:14:21,269
+this way, we would have achieved a
+loss of two.
+
+195
+00:14:21,269 --> 00:14:24,429
+So this number itself is not hugely
+important.
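A quick check of that number, reusing the hypothetical `L_i` sketch from above: with near-zero scores at initialization, every incorrect class contributes exactly the margin.

~~~python
scores = np.zeros(3)     # roughly what a tiny random W produces at init
print(L_i(scores, y=0))  # 2.0 = (num_classes - 1) * margin
~~~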
+196
+00:14:24,429 --> 00:14:28,399
+What it is useful for is sanity checks:
+when you're starting the optimization
+with very small numbers in W, you print out
+
+197
+00:14:28,399 --> 00:14:31,389
+your first loss as you begin,
+and you want to make sure that
+
+198
+00:14:31,389 --> 00:14:34,279
+you understand the functional
+form, so that you can think through
+
+199
+00:14:34,279 --> 00:14:38,929
+whether the number you get
+makes sense. So if I'm seeing two in this case,
+
+200
+00:14:38,929 --> 00:14:42,799
+then I'm happy that the loss is probably
+implemented correctly; not a hundred percent
+
+201
+00:14:42,799 --> 00:14:46,990
+sure, but at least there's nothing
+obviously wrong with it right away. So it's
+
+202
+00:14:46,990 --> 00:14:51,730
+interesting to think about these. I'm
+going to go more into this loss in a tiny
+
+203
+00:14:51,730 --> 00:14:55,950
+bit, but are there questions about
+the slide right now?
+
+204
+00:14:55,950 --> 00:15:10,870
+[inaudible audience question]
+
+205
+00:15:10,870 --> 00:15:15,029
+Right: it's more efficient to not have this
+constraint that j is not y_i, because it
+
+206
+00:15:15,029 --> 00:15:19,049
+makes it more difficult to do these
+easy vectorized implementations
+
+207
+00:15:19,049 --> 00:15:23,799
+of the loss. That actually predicts
+my next slide to some
+
+208
+00:15:23,799 --> 00:15:27,459
+degree. So let me show you here some numpy
+code for how we would write out
+
+209
+00:15:27,460 --> 00:15:33,290
+this loss function. Here
+we're evaluating L_i, and
+
+210
+00:15:33,289 --> 00:15:37,759
+we're given a single example: x is a
+single column vector, y is
+
+211
+00:15:37,759 --> 00:15:42,279
+an integer specifying the label, and W is
+our weight matrix. So what we do is we
+
+212
+00:15:42,279 --> 00:15:45,799
+evaluate the scores, which is just W
+times x; then we compute these
+
+213
+00:15:45,799 --> 00:15:50,179
+margins, which are the differences between
+the scores we obtained and the correct
+
+214
+00:15:50,179 --> 00:15:55,569
+score, plus one; these are numbers between 0
+and whatever; and then you see this
+
+215
+00:15:55,570 --> 00:16:03,360
+line: margins at y equals 0. Why
+is it there?
+
+216
+00:16:03,360 --> 00:16:07,320
+Yeah, exactly: I'm doing this efficient
+vectorized implementation, which
+
+217
+00:16:07,320 --> 00:16:11,209
+goes to your point, and then I want to
+erase that margin there, because I'm
+
+218
+00:16:11,208 --> 00:16:15,569
+summing over all the margins, and the entry at
+y currently holds a one, and I don't want it to inflate my
+
+219
+00:16:15,570 --> 00:16:18,360
+loss, so I set it to 0.
+
+220
+00:16:18,360 --> 00:16:27,269
+Yes, I suppose you could subtract one at
+the end as well, so we could optimize this if we
+
+221
+00:16:27,269 --> 00:16:31,200
+wanted, but we're not going to think about
+it too much; if you do it in
+
+222
+00:16:31,200 --> 00:16:35,050
+your assignment, that's very welcome: for
+extra efficiency you can skip some
+
+223
+00:16:35,049 --> 00:16:40,859
+of those margin computations. So we got the loss.
+Going back to the slide: any more questions
+
+224
+00:16:40,860 --> 00:16:45,320
+about this formulation? By the way,
+this formulation, if you
+
+225
+00:16:45,320 --> 00:16:49,430
+actually write it down for
+just two classes, you'll see that it
+
+226
+00:16:49,429 --> 00:16:57,229
+reduces to the binary support vector
+machine loss.
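The half-vectorized per-example loss just described might look roughly like this; x is assumed to be a D-dimensional column vector and W a C-by-D weight matrix.

~~~python
def L_i_vectorized(x, y, W):
    scores = W.dot(x)                                # all class scores at once
    margins = np.maximum(0, scores - scores[y] + 1)  # hinge on every class
    margins[y] = 0  # erase the correct-class entry, which would otherwise add 1
    return np.sum(margins)
~~~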
+
+227
+00:16:57,230 --> 00:17:00,190
+OK. We'll see a different loss
+function soon, and then we're going to
+
+228
+00:17:00,190 --> 00:17:05,400
+look at comparisons of them as well,
+but for now: at this point what we have is this
+
+229
+00:17:05,400 --> 00:17:08,699
+mapping to scores, and then we have
+this loss function, which I've now
+
+230
+00:17:08,699 --> 00:17:11,870
+written out in its full form, where we
+have these differences between the
+
+231
+00:17:11,869 --> 00:17:18,178
+scores plus 1, summed over the incorrect classes,
+and then the average across all examples.
+
+232
+00:17:18,179 --> 00:17:21,309
+So that's the loss function right
+now. I'd like to convince you that
+
+233
+00:17:21,308 --> 00:17:25,149
+there's actually a bug with this loss
+function; in other words, if I were to
+
+234
+00:17:25,150 --> 00:17:31,798
+use this loss, by itself, in practice, I
+might get some not very nice properties.
+
+235
+00:17:31,798 --> 00:17:36,589
+If this were the only thing I was
+using, something would be off, and it's not
+
+236
+00:17:36,589 --> 00:17:39,709
+completely obvious to see exactly what
+the issue is, so I'll give you guys a
+
+237
+00:17:39,710 --> 00:17:43,620
+hint. In particular, suppose that we found
+a W
+
+238
+00:17:43,619 --> 00:17:55,058
+getting zero loss on something. Now the
+question is: is this W unique? Or,
+
+239
+00:17:55,058 --> 00:18:00,329
+phrased another way: can you give me a W
+that would be different but would also
+
+240
+00:18:00,329 --> 00:18:04,210
+definitely achieve zero loss?
+
+241
+00:18:04,210 --> 00:18:12,410
+That's right. So you're saying we can
+scale it by some constant, and in
+
+242
+00:18:12,410 --> 00:18:20,009
+particular, for all the margins to stay
+satisfied, you probably want that constant
+
+243
+00:18:20,009 --> 00:18:24,259
+to be greater than one, right? So
+basically what I can do is I can take
+
+244
+00:18:24,259 --> 00:18:28,119
+my weights and make them larger and
+larger, and all I would be doing is
+
+245
+00:18:28,119 --> 00:18:31,639
+making the score differences
+larger and larger as I scale up W,
+
+246
+00:18:31,640 --> 00:18:35,890
+because of the linear form here. So
+basically this is not a very desirable
+
+247
+00:18:35,890 --> 00:18:40,370
+property, because we have an entire
+subspace of W that is optimal, and all of
+
+248
+00:18:40,369 --> 00:18:44,319
+those W's are, according to this loss function,
+completely the same, but intuitively
+
+249
+00:18:44,319 --> 00:18:48,019
+that's not a property we want the loss
+to have. And just to see this in
+
+250
+00:18:48,019 --> 00:18:51,920
+numbers, to convince yourself that this
+is the case: I'm taking the example
+
+251
+00:18:51,920 --> 00:18:58,480
+where we achieved zero loss before, and
+suppose I scale W by two. I mean,
+
+252
+00:18:58,480 --> 00:19:02,360
+this is very simple math going on here,
+but basically I would be inflating all
+
+253
+00:19:02,359 --> 00:19:07,000
+my scores by two times, and so their
+differences also become larger. So
+
+254
+00:19:07,000 --> 00:19:11,019
+if all your score differences inside the
+maxes were already negative, then
+
+255
+00:19:11,019 --> 00:19:14,389
+they're going to become more and more
+negative, and so you end up with larger
+
+256
+00:19:14,390 --> 00:19:18,040
+and larger negative values inside the
+maxes, and they just stay zero all the time.
+
+257
+00:19:18,039 --> 00:19:32,159
+[student] And the scale factor would have to be
+larger than 1, because a smaller one shrinks the margins.
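To see the scaling point numerically, a small check using the hypothetical `L_i` from above, on made-up scores where the margins are already met:

~~~python
scores = np.array([3.8, 1.2, -0.5])  # margins met: both differences are below -1
print(L_i(scores, y=0))              # 0.0
print(L_i(2 * scores, y=0))          # still 0.0: doubling W only widens the gaps
~~~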
+258
+00:19:32,160 --> 00:19:56,940
+[another question, about the bias term]
+Yes, I left it out for simplicity, but basically
+the scores are Wx plus b, so
+
+259
+00:19:56,940 --> 00:19:58,309
+you're right,
+
+260
+00:19:58,309 --> 00:20:06,589
+I'm forgetting the bias and just showing W by
+itself. OK, so the way to fix this: intuitively,
+
+261
+00:20:06,589 --> 00:20:10,250
+we have this entire subspace of W's that all
+work the same according to this
+
+262
+00:20:10,250 --> 00:20:13,269
+loss function, and what we'd like to do
+is to have a preference over
+
+263
+00:20:13,269 --> 00:20:17,170
+some W's over others, just based on
+intrinsic properties: what do we
+
+264
+00:20:17,170 --> 00:20:21,430
+desire W to look like, forgetting the data;
+what are nice things to
+
+265
+00:20:21,430 --> 00:20:26,110
+have? And so this introduces the notion
+of regularization, which we're going to
+
+266
+00:20:26,109 --> 00:20:29,319
+be appending to our loss function. So we
+have an additional term there, which is
+
+267
+00:20:29,319 --> 00:20:33,309
+lambda times a regularization function
+of W, and the regularization function
+
+268
+00:20:33,309 --> 00:20:37,500
+measures the niceness of your W. OK, so
+we don't only want to fit the data,
+
+269
+00:20:37,500 --> 00:20:43,279
+we also want W to be nice, and we're
+going to see some ways of framing
+
+270
+00:20:43,279 --> 00:20:47,549
+exactly why that makes sense. Intuitively,
+regularization is a way of
+
+271
+00:20:47,549 --> 00:20:52,509
+trading off your training
+loss against your generalization
+
+272
+00:20:52,509 --> 00:20:56,589
+loss on a test set. So intuitively,
+regularization is a set of techniques where
+
+273
+00:20:56,589 --> 00:21:00,899
+we're adding objectives to the loss
+which will be fighting with this guy: so
+
+274
+00:21:00,900 --> 00:21:04,560
+this guy just wants to fit your training
+data, and that guy wants W to look some
+
+275
+00:21:04,559 --> 00:21:07,879
+particular way, and so they're fighting
+each other, sometimes, in your objective,
+
+276
+00:21:07,880 --> 00:21:11,730
+because we want to simultaneously
+achieve both of them. But it turns out
+
+277
+00:21:11,730 --> 00:21:14,470
+that adding these regularization
+techniques, even if it makes your
+
+278
+00:21:14,470 --> 00:21:18,319
+training error worse, so we're not
+correctly classifying some examples, what you
+
+279
+00:21:18,319 --> 00:21:21,599
+notice is that the test set performance
+comes out better, and we'll see an
+
+280
+00:21:21,599 --> 00:21:26,089
+example of why that might be
+on the next slide.
+
+281
+00:21:26,089 --> 00:21:29,109
+For now I just want to
+point out that the most
+
+282
+00:21:29,109 --> 00:21:33,019
+common form of regularization is what
+we call L2 regularization, or weight
+
+283
+00:21:33,019 --> 00:21:37,539
+decay, and really what we're doing is,
+supposing W in this case is a 2D matrix,
+
+284
+00:21:37,539 --> 00:21:42,230
+I have a sum over k and l, the rows
+and columns, and it really is just the
+
+285
+00:21:42,230 --> 00:21:44,230
+element-wise W squared,
+
+286
+00:21:44,230 --> 00:21:48,019
+and we're just putting all of that into the
+loss, OK? So this particular
+
+287
+00:21:48,019 --> 00:21:55,069
+regularization likes W's to be 0: when W is
+all zeros, the regularization is happy, but
+
+288
+00:21:55,069 --> 00:21:58,649
+of course you can't have that, because then
+you can't classify, so these two will
+
+289
+00:21:58,650 --> 00:22:03,140
+fight each other.
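Putting the pieces together, the full objective, mean data loss plus the L2 penalty, might be sketched like this; `lam` stands in for lambda, `X` is assumed to hold examples as columns, and `L_i_vectorized` is the earlier sketch.

~~~python
def full_loss(X, y, W, lam):
    data_loss = np.mean([L_i_vectorized(X[:, i], y[i], W)
                         for i in range(X.shape[1])])
    reg_loss = lam * np.sum(W * W)  # L2 regularization ("weight decay")
    return data_loss + reg_loss
~~~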
+290
+00:22:03,140 --> 00:22:08,570
+There are different forms of regularization,
+with different pros and cons; we'll go into some
+
+291
+00:22:08,569 --> 00:22:12,548
+of them much later in the class. I'd just
+like to say that L2 regularization is the
+
+292
+00:22:12,548 --> 00:22:17,569
+most common form, and that's what you'll
+use quite often in this class as well. Now I'd like
+
+293
+00:22:17,569 --> 00:22:20,529
+to convince you that this is a
+reasonable thing to want out of a W,
+
+294
+00:22:20,529 --> 00:22:25,779
+that its weights are small. So consider
+this very simple cooked-up example to
+
+295
+00:22:25,779 --> 00:22:30,149
+get the intuition. Suppose we're in
+four-dimensional
+
+296
+00:22:30,150 --> 00:22:32,370
+space, where we're doing this
+classification, and we have an input
+
+297
+00:22:32,369 --> 00:22:36,139
+vector of just all ones, x, and now
+suppose we have these two candidate
+
+298
+00:22:36,140 --> 00:22:37,880
+weight matrices, or weight
+
+299
+00:22:37,880 --> 00:22:44,780
+vectors, I suppose, right now: one of
+them is (1, 0, 0, 0) and the other is 0.25
+
+300
+00:22:44,779 --> 00:22:49,200
+everywhere. Since we have linear score
+functions, you'll see that their effects
+
+301
+00:22:49,200 --> 00:22:55,080
+are the same: basically we're evaluating the
+score as Wx, and the dot product with
+
+302
+00:22:55,079 --> 00:22:59,109
+x is identical for both of these; the
+scores are the same with both. But the
+
+303
+00:22:59,109 --> 00:23:03,469
+regularization will strictly favor one
+of these over the other. Which one will
+
+304
+00:23:03,470 --> 00:23:07,720
+the regularization cost favor, even
+though their effects are the same? Which
+
+305
+00:23:07,720 --> 00:23:13,548
+one is better in terms of regularization?
+The second one, right. And so the
+
+306
+00:23:13,548 --> 00:23:15,740
+regularization would tell you that even
+though they're achieving the same
+
+307
+00:23:15,740 --> 00:23:19,109
+effect in terms of the data loss,
+down the road we actually
+
+308
+00:23:19,109 --> 00:23:22,629
+significantly prefer the second one.
+What's better about the second one?
+
+309
+00:23:22,630 --> 00:23:27,340
+Why is it a good idea to have it?
+
+310
+00:23:27,339 --> 00:23:38,230
+[student answer] That's correct. So, the
+interpretation I like the most is: it
+
+311
+00:23:38,230 --> 00:23:43,549
+takes into account the largest number of
+things in your x vector. So what
+
+312
+00:23:43,549 --> 00:23:47,859
+this L2 regularization wants to do is to
+spread out your W's as much as possible,
+
+313
+00:23:47,859 --> 00:23:51,169
+so that you're taking into account all
+the input features, all the pixels;
+
+314
+00:23:51,170 --> 00:23:55,900
+it wants to use as many of the
+dimensions as it can, if it's
+
+315
+00:23:55,900 --> 00:23:57,600
+achieving the same effect,
+
+316
+00:23:57,599 --> 00:24:01,439
+intuitively speaking, and that's
+better than just focusing on one
+
+317
+00:24:01,440 --> 00:24:06,990
+dimension. It's just something
+that often works better in practice;
+
+318
+00:24:06,990 --> 00:24:11,880
+it's basically just the way datasets are
+arranged, and the property that they
+
+319
+00:24:11,880 --> 00:24:17,230
+usually have. Any questions
+about the regularization idea?
+
+320
+00:24:17,230 --> 00:24:22,130
+Is everyone happy with it? So basically our
+losses will always have this form, where
+
+321
+00:24:22,130 --> 00:24:25,350
+we have a data loss, and we also have a
+regularization loss.
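The cooked-up example in numbers: both weight vectors produce the same score on the all-ones input, but the L2 cost strictly prefers the diffuse one.

~~~python
x  = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])
w2 = np.array([0.25, 0.25, 0.25, 0.25])
print(w1 @ x, w2 @ x)                # 1.0 and 1.0 -> identical data loss
print(np.sum(w1**2), np.sum(w2**2))  # 1.0 vs 0.25 -> L2 favors the diffuse w2
~~~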
+322
+00:24:25,349 --> 00:24:29,529
+It's a very common thing to have in practice.
+OK, I'm now going to go into a second
+
+323
+00:24:29,529 --> 00:24:34,629
+classifier, the softmax classifier, and we'll
+see some differences between the
+
+324
+00:24:34,630 --> 00:24:38,070
+support vector machine and this softmax
+classifier. In practice these are kind of
+
+325
+00:24:38,069 --> 00:24:41,369
+the two choices that you have, either an
+SVM or a softmax; they're the two most
+
+326
+00:24:41,369 --> 00:24:47,629
+commonly used linear classifiers. Often
+you'll see that the softmax is preferred, and
+
+327
+00:24:47,630 --> 00:24:51,480
+I'm not exactly sure why, because usually
+they end up working about the same. I'd
+
+328
+00:24:51,480 --> 00:24:54,420
+also like to mention that this is
+sometimes called multinomial logistic
+
+329
+00:24:54,420 --> 00:24:57,019
+regression, so if you're familiar with
+logistic regression, this is just the
+
+330
+00:24:57,019 --> 00:25:00,190
+generalization of it to multiple
+dimensions, or in this case multiple
+
+331
+00:25:00,190 --> 00:25:12,009
+classes. Was there a
+question over there?
+
+332
+00:25:12,009 --> 00:25:32,150
+[inaudible question: why do we want
+diffuse weights?] If we'd like to pick between them in some way, I
+
+333
+00:25:32,150 --> 00:25:36,820
+think what we're going for is that preferring
+low W's is a reasonable way to pick
+
+334
+00:25:36,819 --> 00:25:42,700
+among them, and it will favor
+diffuse W's, like in this case here. And
+
+335
+00:25:42,700 --> 00:25:47,900
+one of the intuitive ways in which I can
+try to pitch why this is a good idea is
+
+336
+00:25:47,900 --> 00:25:54,290
+that with diffuse weights: see, this
+w1 is completely ignoring your inputs
+
+337
+00:25:54,289 --> 00:25:58,220
+two, three and four, but w2 is using all
+of the inputs, because its weights
+
+338
+00:25:58,220 --> 00:26:04,480
+are diffuse, and intuitively this just
+ends up usually working better at test
+
+339
+00:26:04,480 --> 00:26:10,150
+time, because more evidence is being
+accumulated into your decisions, instead
+
+340
+00:26:10,150 --> 00:26:21,470
+of just one single piece of evidence,
+one single feature. That's right,
+
+341
+00:26:21,470 --> 00:26:28,140
+that's right. So the idea
+here is that these two, w1 and w2, are
+
+342
+00:26:28,140 --> 00:26:32,630
+achieving the same effect on the data
+loss, suppose; so the data loss
+
+343
+00:26:32,630 --> 00:26:35,650
+doesn't care between the two, but the
+regularization expresses a preference
+
+344
+00:26:35,650 --> 00:26:39,169
+between them, and since we have one objective
+and we're going to end up optimizing
+
+345
+00:26:39,169 --> 00:26:42,240
+this loss function, we're going to
+find the W that simultaneously
+
+346
+00:26:42,240 --> 00:26:46,659
+accomplishes both. And so we end up
+with a W that not only classifies correctly
+
+347
+00:26:46,659 --> 00:26:50,360
+but also has the added property that
+we actually wanted: we want it
+
+348
+00:26:50,359 --> 00:27:05,668
+to be as diffuse as possible. [question:
+could it also be different, say L1?] L1 has some nice
+
+349
+00:27:05,669 --> 00:27:09,240
+properties which I don't want to go into
+right now; we might cover it later. L1
+
+350
+00:27:09,240 --> 00:27:16,579
+has some properties, like sparsity-inducing
+properties, where if you end up
+
+351
+00:27:16,579 --> 00:27:20,240
+having L1 in your objectives, you'll
+find that lots of W's will end up being
+
+352
+00:27:20,240 --> 00:27:25,329
+exactly zero, for reasons that we might
+go into
+later. That acts almost like
+
+353
+00:27:25,329 --> 00:27:30,629
+feature selection, and so L1 is
+another alternative that we might
+
+354
+00:27:30,630 --> 00:27:45,760
+go into a bit more later.
+[inaudible audience question]
+
+355
+00:27:45,759 --> 00:27:54,220
+The question is whether it may be a good thing
+that we're ignoring features and just using
+
+356
+00:27:54,220 --> 00:28:02,960
+one of them. Yeah, there are many technical
+reasons why regularization is a good idea; I
+
+357
+00:28:02,960 --> 00:28:09,090
+wanted to give you just the basic
+intuition, so maybe, but I think
+
+358
+00:28:09,089 --> 00:28:59,740
+that's a fair point. [extended discussion]
+If I had a good reason, I might want to be ignoring some
+
+359
+00:28:59,740 --> 00:29:25,980
+features and looking only at others. [discussion
+continues] There's learning theory, and you saw some of that in CS229, and
+
+360
+00:29:25,980 --> 00:29:29,710
+there are some results in those areas on
+why regularization is a good idea,
+
+361
+00:29:29,710 --> 00:29:33,650
+and I don't think I'm going to go
+into that; it's also beyond the
+
+362
+00:29:33,650 --> 00:29:37,610
+scope of this class. For this class: just
+adding regularization will make your
+
+363
+00:29:37,609 --> 00:29:44,139
+test error better. OK, I'm now going to go into
+the softmax classifier, which is just a
+
+364
+00:29:44,140 --> 00:29:49,309
+generalization of logistic regression to
+multiple classes. The way this will work is: this
+
+365
+00:29:49,308 --> 00:29:53,049
+is just a different functional form for
+how the loss is specified on top of these
+
+366
+00:29:53,049 --> 00:29:58,539
+scores. In particular, there's an
+interpretation that this classifier puts on
+
+367
+00:29:58,539 --> 00:30:02,170
+top of the scores: these are not just
+some arbitrary scores where we want
+
+368
+00:30:02,170 --> 00:30:05,769
+margins to be met; we have a specific
+interpretation that is maybe more
+
+369
+00:30:05,769 --> 00:30:10,549
+principled, from a probabilistic
+point of view, where we actually
+
+370
+00:30:10,549 --> 00:30:14,490
+interpret the scores not just as things
+that need to meet margins: these are
+
+371
+00:30:14,490 --> 00:30:17,880
+actually the unnormalized log
+probabilities that are assigned to the
+
+372
+00:30:17,880 --> 00:30:23,140
+different classes. OK, so we're going to
+go into exactly what this means in a bit:
+
+373
+00:30:23,140 --> 00:30:28,880
+these are unnormalized log probabilities
+of the classes y given the image. In
+
+374
+00:30:28,880 --> 00:30:34,490
+other words, we are assuming that the
+scores are unnormalized log probabilities; then the
+
+375
+00:30:34,490 --> 00:30:38,799
+way to get probabilities of the
+classes is that we take these
+
+376
+00:30:38,799 --> 00:30:39,690
+scores,
+
+377
+00:30:39,690 --> 00:30:45,029
+exponentiate all of them to get the
+unnormalized probabilities, and we normalize
+
+378
+00:30:45,029 --> 00:30:48,849
+them to get the normalized
+probabilities: we divide by the sum
+
+379
+00:30:48,849 --> 00:30:54,209
+over all the exponentiated scores, and
+that's how we actually get this
+
+380
+00:30:54,210 --> 00:30:58,240
+expression for the probability of a class
+given the image. And so this function
+
+381
+00:30:58,240 --> 00:31:02,880
+here is called the softmax function: you
+take e to the
+
+382
+00:31:02,880 --> 00:31:07,840
+element we're currently interested in,
+divided by the sum over all exponentiated
+
+383
+00:31:07,839 --> 00:31:11,918
+scores. The way this will work, basically,
+is: if we're in this probabilistic
+
+384
+00:31:11,919 --> 00:31:13,040
+framework,
+then we're really saying
+
+385
+00:31:13,039 --> 00:31:16,869
+that we're deciding this is the probability
+of the different classes, and that
+
+386
+00:31:16,869 --> 00:31:19,619
+makes sense in terms of what you
+really want to do in this setting:
+
+387
+00:31:19,619 --> 00:31:23,809
+there's a probability over the different classes,
+one of them is correct, so we want to
+
+388
+00:31:23,809 --> 00:31:25,429
+maximize the log likelihood
+
+389
+00:31:25,430 --> 00:31:32,900
+of the true class; so we want to
+maximize the log likelihood of the true
+
+390
+00:31:32,900 --> 00:31:38,140
+class, and since we're writing a loss
+function, we want to minimize the
+
+391
+00:31:38,140 --> 00:31:42,980
+negative log likelihood of the true
+class, OK? So you end up with a series of
+
+392
+00:31:42,980 --> 00:31:46,599
+expressions here: really, our loss function
+says we want the log likelihood of
+
+393
+00:31:46,599 --> 00:31:51,169
+the correct class to be high, so the negative
+of it should be low, and the
+
+394
+00:31:51,170 --> 00:31:54,820
+log likelihood is that softmax expression of
+the scores. Let's look at a specific example
+
+395
+00:31:54,819 --> 00:32:00,599
+to make this more concrete. Here I've
+actually written out that expression, so
+
+396
+00:32:00,599 --> 00:32:04,839
+this is the loss: negative log of that
+expression. Let's look at how this
+
+397
+00:32:04,839 --> 00:32:07,859
+expression works, and I think it'll give
+you a better intuition for exactly what
+
+398
+00:32:07,859 --> 00:32:12,009
+it's doing and what it's computing. So
+suppose here we have these scores
+
+399
+00:32:12,009 --> 00:32:16,379
+that came out of our neural network, or
+from our linear classifier, and these are
+
+400
+00:32:16,380 --> 00:32:19,780
+the unnormalized log probabilities. As I
+mentioned, we want to exponentiate them
+
+401
+00:32:19,779 --> 00:32:22,879
+first, because under this interpretation
+that gives us the unnormalized
+
+402
+00:32:22,880 --> 00:32:28,150
+probabilities, and these always have to
+sum to one, so we have to divide by the sum of
+
+403
+00:32:28,150 --> 00:32:33,310
+all of these: we add up these guys and
+we divide, to actually get probabilities
+
+404
+00:32:33,309 --> 00:32:37,609
+under this interpretation. We've carried
+out this set of transformations, and what
+
+405
+00:32:37,609 --> 00:32:41,219
+this is saying is that under this
+interpretation, the probability assigned
+
+406
+00:32:41,220 --> 00:32:47,029
+to this image being a cat is 13%, car
+is 87%, and frog is very unlikely, 0%.
+
+407
+00:32:47,029 --> 00:32:51,399
+These are the probabilities. Now, normally
+in this setting you want to
+
+408
+00:32:51,400 --> 00:32:54,960
+maximize the log probability, because it
+turns out that maximizing the raw
+
+409
+00:32:54,960 --> 00:32:58,049
+probability is not as nice mathematically,
+so almost always you see
+
+410
+00:32:58,049 --> 00:33:03,460
+people maximizing log probabilities; and as a
+loss, we minimize the negative log probability.
+
+411
+00:33:03,460 --> 00:33:08,850
+So the correct class here is cat, which
+is only getting a 13 percent chance
+
+412
+00:33:08,849 --> 00:33:14,679
+under this interpretation, so the negative
+log of 0.13 gets us 0.89, and so
+
+413
+00:33:14,680 --> 00:33:21,180
+that's the final loss that we would
+achieve for this example under
+
+414
+00:33:21,180 --> 00:33:25,529
+this interpretation of the
+classifier: 0.89.
+
+415
+00:33:25,529 --> 00:33:32,869
+Let's now go over some questions
+related to this loss function.
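The same transformation in numpy, using the slide's scores. Shifting by the max is a standard numeric-stability trick that doesn't change the probabilities; note that with the natural log the loss comes out to about 2.04 (the 0.89 quoted above is the base-10 log of 0.13).

~~~python
scores = np.array([3.2, 5.1, -1.7])               # unnormalized log probabilities
shifted = scores - scores.max()                    # numeric stability
probs = np.exp(shifted) / np.sum(np.exp(shifted))  # ~[0.13, 0.87, 0.00]
loss = -np.log(probs[0])                           # correct class cat is index 0
~~~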
+416
+00:33:32,869 --> 00:33:34,219
+They should help you see exactly how this works.
+
+417
+00:33:34,220 --> 00:33:38,519
+First: what are the min and max possible
+loss with this loss function?
+
+418
+00:33:38,519 --> 00:33:44,460
+For the softmax loss, what is the smallest
+value and the highest value? Think about
+
+419
+00:33:44,460 --> 00:33:49,809
+this: what is the smallest value we can
+achieve? Zero, good; and how would that happen?
+
+420
+00:33:49,809 --> 00:33:57,220
+If your correct class is
+getting a probability of one, then we
+
+421
+00:33:57,220 --> 00:34:02,890
+have a one being plugged into the log, and
+we're getting negative log of 1, which is 0. And the
+
+422
+00:34:02,890 --> 00:34:09,030
+highest possible loss? So just as with the SVM,
+we get the same 0 as the minimum
+
+423
+00:34:09,030 --> 00:34:14,250
+and infinity as the maximum: an infinite loss
+would be achieved if you end up giving
+
+424
+00:34:14,250 --> 00:34:18,769
+your cat, the correct class, a very tiny
+probability; then log of 0 gives you negative
+
+425
+00:34:18,769 --> 00:34:24,679
+infinity, and the negative of that is
+infinity. So yeah, the same bounds as
+
+426
+00:34:24,679 --> 00:34:28,159
+the SVM. And also this question:
+
+427
+00:34:28,159 --> 00:34:33,440
+normally, when we initialize W with
+roughly small weights, we end up with
+
+428
+00:34:33,440 --> 00:34:37,550
+all these scores being nearly zero; what
+ends up being the loss in this case,
+
+429
+00:34:37,550 --> 00:34:40,419
+at the beginning of your
+optimization? What do you expect to see
+
+430
+00:34:40,418 --> 00:34:47,000
+as your first loss?
+
+431
+00:34:47,000 --> 00:34:59,449
+[student answer] One over the number of classes:
+you get all zeros here, so you get all
+
+432
+00:34:59,449 --> 00:35:04,139
+ones up there, and so the probability is one over
+the number of classes, and then it gets
+
+433
+00:35:04,139 --> 00:35:07,599
+plugged into the negative log, and that's your
+first loss. So actually, for myself, whenever
+
+434
+00:35:07,599 --> 00:35:11,569
+I start an optimization, I sometimes take
+note of my number of classes, and I
+
+435
+00:35:11,570 --> 00:35:14,970
+evaluate negative log of one over the number
+of classes, and that tells me what
+
+436
+00:35:14,969 --> 00:35:18,429
+first loss to expect at the beginning. So when
+I start the optimization I make
+
+437
+00:35:18,429 --> 00:35:21,159
+sure that I'm getting roughly that;
+otherwise I know something may be
+
+438
+00:35:21,159 --> 00:35:24,399
+slightly off. I expect to get something
+on that order.
+
+439
+00:35:24,400 --> 00:35:28,630
+Moreover, as I'm optimizing, I expect that
+to go from there toward 0, and if I'm seeing
+
+440
+00:35:28,630 --> 00:35:31,039
+negative numbers, then I know from the
+functional form that something very
+
+441
+00:35:31,039 --> 00:35:32,590
+strange is going on:
+
+442
+00:35:32,590 --> 00:35:37,070
+you never expect to get negative
+numbers out of this softmax loss.
+
+443
+00:35:37,070 --> 00:35:40,630
+I'll show you one more slide, then take
+some questions, just to reiterate the
+
+444
+00:35:40,630 --> 00:35:44,599
+difference between the two. Really, what
+they look like is: we have the score
+
+445
+00:35:44,599 --> 00:35:48,909
+function, which through W gives our
+score vector, and now the difference
+
+446
+00:35:48,909 --> 00:35:54,420
+is just in how they interpret the
+scores coming out of this function.
+
+447
+00:35:54,420 --> 00:35:58,500
+The SVM puts no extra interpretation
+on the scores; we just want
+
+448
+00:35:58,500 --> 00:36:02,710
+the correct class score to be some
+margin above the incorrect scores.
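The initialization sanity check described a moment ago, as one line of code: with near-zero scores the predicted distribution is uniform.

~~~python
num_classes = 10                   # e.g. CIFAR-10
print(-np.log(1.0 / num_classes))  # ~2.3: the first loss you should expect
~~~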
+449
+00:36:02,710 --> 00:36:07,240
+The softmax instead interprets them as
+unnormalized log probabilities, and then,
+
+450
+00:36:07,239 --> 00:36:10,569
+in that framework, we first want to get the
+probabilities, and then we want to
+
+451
+00:36:10,570 --> 00:36:14,450
+maximize the probability of the correct
+class, or the log of it, and that ends up
+
+452
+00:36:14,449 --> 00:36:19,250
+giving us the softmax loss function.
+So they start off the same way, but
+
+453
+00:36:19,250 --> 00:36:22,780
+they just happen to get to
+different results. We'll go into
+
+454
+00:36:22,780 --> 00:36:31,150
+exactly what the differences are in a
+bit. Are there questions? [question about speed]
+
+455
+00:36:31,150 --> 00:36:41,579
+These linear classifiers are near-instantaneous
+to evaluate; most of the
+
+456
+00:36:41,579 --> 00:36:45,949
+work is done in the convolutions, and so
+you'll see that the classifier, and
+
+457
+00:36:45,949 --> 00:36:51,629
+especially the loss, costs roughly the same
+either way. Of course, softmax involves some exps and
+
+458
+00:36:51,630 --> 00:36:56,200
+so on, so those operations are slightly
+more expensive perhaps, but usually that
+
+459
+00:36:56,199 --> 00:36:57,439
+completely washes away
+
+460
+00:36:57,440 --> 00:36:59,320
+compared to everything else you're
+worried about, which is all the
+
+461
+00:36:59,320 --> 00:37:15,260
+convolutions over the image.
+[inaudible audience question]
+
+462
+00:37:15,260 --> 00:37:32,600
+[question continues, about probability
+versus log probability]
+
+463
+00:37:32,599 --> 00:37:42,210
+It's the exact same problem: maximizing the
+probability and maximizing the log probability
+
+464
+00:37:42,210 --> 00:37:46,119
+give you the identical result, but in
+terms of the math, everything comes out
+
+465
+00:37:46,119 --> 00:37:49,279
+much nicer-looking when you actually
+put a log there; it's the exact
+
+466
+00:37:49,280 --> 00:37:51,310
+same optimization problem.
+
+467
+00:37:51,309 --> 00:37:56,539
+OK, let's get some interpretations of
+these two and exactly how they differ,
+
+468
+00:37:56,539 --> 00:38:01,230
+softmax versus SVM. I'm trying to give you an
+idea about one property that is actually
+
+469
+00:38:01,230 --> 00:38:03,559
+quite different between the two,
+
+470
+00:38:03,559 --> 00:38:08,059
+these two different functional forms.
+Suppose that we have three examples;
+
+471
+00:38:08,059 --> 00:38:12,710
+suppose there are three classes and
+three different examples,
+
+472
+00:38:12,710 --> 00:38:15,980
+and these are the scores for those
+examples. For every one of these examples,
+
+473
+00:38:15,980 --> 00:38:19,659
+the first class here is the correct
+class, so 10 is the correct class score,
+
+474
+00:38:19,659 --> 00:38:24,509
+and the other scores are the other entries,
+in the first, second or third column.
+
+475
+00:38:24,510 --> 00:38:30,970
+Now just think about what these
+losses tell you about how desirable these
+
+476
+00:38:30,969 --> 00:38:36,480
+outcomes are, in terms of that W. In
+particular, one way to think about it, for
+
+477
+00:38:36,480 --> 00:38:39,530
+example: suppose I take this data point,
+the third one, with 10 and the very negative scores,
+
+478
+00:38:39,530 --> 00:38:44,700
+and suppose I jiggle it, move
+it around a bit in my input
+
+479
+00:38:44,699 --> 00:38:58,159
+space. What happens to the losses
+as I do that?
+
+480
+00:38:58,159 --> 00:39:03,339
+Do they increase or decrease as I wiggle
+it around? Do they both change?
+
+481
+00:39:03,340 --> 00:39:10,050
+For the third data point, for
+example: the SVM loss remains the same.
+
+482
+00:39:10,050 --> 00:39:13,740
+Correct. And why is that? It's because the
+margin was met by a huge amount, so
+
+483
+00:39:13,739 --> 00:39:17,659
+there's just added robustness: when I
+take the data point and shake it around, the
+
+484
+00:39:17,659 --> 00:39:22,379
+SVM is already very happy, because the
+margins were met: we desire a
+
+485
+00:39:22,380 --> 00:39:27,809
+margin of one, and here we have a margin of
+well over a hundred. There's a huge margin, and the
+
+486
+00:39:27,809 --> 00:39:32,299
+SVM doesn't express a preference over
+these examples where the scores come
+
+487
+00:39:32,300 --> 00:39:37,010
+out very negative: it has no additional
+preference between them being negative
+
+488
+00:39:37,010 --> 00:39:43,890
+200 or negative 200,000; the SVM
+won't care. But the softmax,
+
+489
+00:39:43,889 --> 00:39:46,659
+you will always get an improvement,
+that's right: the softmax
+
+490
+00:39:46,659 --> 00:39:49,480
+function expresses a preference: for these
+to be negative
+
+491
+00:39:49,480 --> 00:39:53,590
+200, or 500, or a thousand, each of them
+will give you a better loss.
+
+492
+00:39:53,590 --> 00:39:58,530
+But the SVM at this point doesn't care.
+I don't know if
+
+493
+00:39:58,530 --> 00:40:03,320
+there's a clearer way to put the distinction:
+the SVM has this added robustness; it wants
+
+494
+00:40:03,320 --> 00:40:07,120
+the margin to be met, but beyond that it
+doesn't micromanage your scores, whereas
+
+495
+00:40:07,119 --> 00:40:11,400
+softmax will always want these scores to
+be, you know, everything here, nothing
+
+496
+00:40:11,400 --> 00:40:15,300
+there, and so that's one very
+clear difference between the two.
+
+497
+00:40:15,300 --> 00:40:20,548
+There was a question?
+
+498
+00:40:20,548 --> 00:40:28,568
+Yes, the margin of one: I mentioned very
+briefly that that's not a hyperparameter;
+
+499
+00:40:28,568 --> 00:40:34,528
+you can fix it to be one, and the reason for
+that is that these scores, the kind of
+
+500
+00:40:34,528 --> 00:40:40,048
+absolute values of these scores, don't
+really matter, because with my
+
+501
+00:40:40,048 --> 00:40:45,088
+W I can make them larger or smaller and
+achieve differently sized scores, and
+
+502
+00:40:45,088 --> 00:40:49,759
+so one turns out to work fine. In the
+notes I have a longer derivation that goes
+
+503
+00:40:49,759 --> 00:40:54,699
+into the details of exactly why one is safe to
+choose, so refer to that; I don't want to
+
+504
+00:40:54,699 --> 00:41:03,239
+spend time on it here. [question: what if it were,
+like, 20?] If you wanted 20 there, that wouldn't be trouble: you
+
+505
+00:41:03,239 --> 00:41:07,358
+can use any positive number, and that
+would give you a valid SVM; if it was 0,
+
+506
+00:41:07,358 --> 00:41:14,328
+that would look different.
+
+507
+00:41:14,329 --> 00:41:18,259
+For example, there's one property this added
+constant gives you when you actually
+
+508
+00:41:18,259 --> 00:41:21,920
+go through the mathematical analysis
+of the SVM, as in CS229.
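To make the robustness discussion above concrete, a small comparison using the hypothetical `L_i` from earlier plus a softmax-loss sketch; the three score vectors (correct class first) are illustrative assumptions in the spirit of the slide's examples, not the exact slide values.

~~~python
def softmax_loss(scores, y):
    shifted = scores - scores.max()
    probs = np.exp(shifted) / np.sum(np.exp(shifted))
    return -np.log(probs[y])

for s in ([10., -2., 3.], [10., 9., 9.], [10., -100., -100.]):
    s = np.array(s)
    print(L_i(s, y=0), softmax_loss(s, y=0))
# SVM: 0.0 for each (margins met); softmax: always positive,
# and smallest for [10, -100, -100], so it keeps expressing a preference
~~~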
+509
+00:41:21,920 --> 00:41:26,269
+You'll see that it achieves this max-margin
+property, where the SVM is attaining
+
+510
+00:41:26,268 --> 00:41:29,698
+the best margin, when you actually
+have that plus-
+
+511
+00:41:29,699 --> 00:41:33,539
+one constant there combined with the L2
+regularization on the weights: it's very
+
+512
+00:41:33,539 --> 00:41:38,499
+small weights that meet a specific margin,
+and that gives you this very nice max-
+
+513
+00:41:38,498 --> 00:41:42,259
+margin property, which I won't really
+go into in this lecture right
+
+514
+00:41:42,259 --> 00:41:46,818
+now. But you basically do want a positive
+number there; otherwise things would
+
+515
+00:41:46,818 --> 00:41:51,480
+break.
+
+516
+00:41:51,480 --> 00:42:14,780
+[inaudible question about interpreting the scores]
+The outputs are just real numbers, and we're
+kind of free to get these scores out,
+
+517
+00:42:14,780 --> 00:42:18,200
+and it's up to us to endow them with an
+interpretation. We can have
+
+518
+00:42:18,199 --> 00:42:21,669
+different losses; in this specific case I
+showed you the multi-class SVM, and there are
+
+519
+00:42:21,670 --> 00:42:25,180
+multiple versions of a multi-class SVM;
+you can fiddle around with exactly the
+
+520
+00:42:25,179 --> 00:42:30,750
+loss expression. And one of the
+interpretations we can put on these
+
+521
+00:42:30,750 --> 00:42:34,510
+scores is that they're unnormalized
+log probabilities: they can't be
+
+522
+00:42:34,510 --> 00:42:37,590
+normalized as they come out, because we'd have
+to normalize explicitly, since there's no
+
+523
+00:42:37,590 --> 00:42:42,180
+constraint that the output of your
+function will be normalized, and they
+
+524
+00:42:42,179 --> 00:42:45,579
+can't be probabilities directly, because the
+output is just real numbers
+
+525
+00:42:45,579 --> 00:42:51,309
+that can be positive or negative. So we
+interpret them as log probabilities,
+
+526
+00:42:51,309 --> 00:42:52,699
+and done;
+
+527
+00:42:52,699 --> 00:42:58,329
+that's what this framework requires. A very
+hand-wavy kind of explanation, but I think
+
+528
+00:42:58,329 --> 00:43:05,889
+you've got it.
+
+529
+00:43:05,889 --> 00:43:57,139
+[extended audience question about whether the
+two losses end up roughly equivalent]
+
+530
+00:43:57,139 --> 00:44:05,690
+What you're saying, looking at this one
+here, is: if I jiggle this around,
+
+531
+00:44:05,690 --> 00:44:09,460
+nothing's changing. I think the
+difference is: the loss would definitely
+
+532
+00:44:09,460 --> 00:44:12,800
+change for softmax; even though it wouldn't
+change a lot, it would definitely
+
+533
+00:44:12,800 --> 00:44:16,660
+change, so the softmax does express a
+preference, whereas the SVM loss, I
+guess, is identically zero there;
+
+534
+00:44:16,659 --> 00:44:27,339
+it wouldn't budge. So they do differ in
+that preference. But in practice, basically,
+
+535
+00:44:27,340 --> 00:44:32,720
+this distinction, the intuition I'm
+trying to give you, is that the SVM has a
+
+536
+00:44:32,719 --> 00:44:38,469
+very local part of the space it's
+classifying that it cares about, and
+
+537
+00:44:38,469 --> 00:44:40,279
+beyond it,
+
+538
+00:44:40,280 --> 00:44:43,700
+it's indifferent, while the softmax kind of
+takes in the full data cloud:
+
+539
+00:44:43,699 --> 00:44:48,129
+it cares about all the points in
+your data cloud. It's not just, you
+
+540
+00:44:48,130 --> 00:44:50,590
+know, a small class here that
+you're trying to separate out from
+
+541
+00:44:50,590 --> 00:44:51,410
+everything else:
+
+542
+00:44:51,409 --> 00:44:55,659
+a softmax will be concerned with the full
+data cloud in setting its plane, and the
+
+543
+00:44:55,659 --> 00:44:59,059
+SVM just wants to separate out that tiny
+piece from the immediate part of the
+
+544
+00:44:59,059 --> 00:45:04,219
+data cloud around it. In practice, when
+you actually run these, they give
+
+545
+00:45:04,219 --> 00:45:09,569
+nearly identical results almost always.
+So really, I'm not trying
+
+546
+00:45:09,570 --> 00:45:12,640
+to pitch one or the other; I'm just
+trying to give you the notion that
+
+547
+00:45:12,639 --> 00:45:16,809
+you're in charge of the loss function:
+you get some scores out, and you can
+
+548
+00:45:16,809 --> 00:45:19,199
+write down nearly any mathematical
+expression,
+
+549
+00:45:19,199 --> 00:45:23,279
+as long as it's differentiable, for what you
+want your scores to be like, and there are
+
+550
+00:45:23,280 --> 00:45:26,619
+different ways of actually formulating
+this. These are the two examples that you'll
+
+551
+00:45:26,619 --> 00:45:30,579
+commonly see in practice, but in principle
+we can put down any loss for what you
+
+552
+00:45:30,579 --> 00:45:34,619
+want your scores to be, and that's a very
+nice picture, because we can optimize
+
+553
+00:45:34,619 --> 00:45:46,700
+over it. Let me show you an interactive
+web demo at this point.
+
+554
+00:45:46,699 --> 00:45:54,289
+Alright, so this is an interactive
+demo on the class page; you can
+
+555
+00:45:54,289 --> 00:45:58,409
+find it at this URL. I wrote it last year,
+and I have to show it to all of you guys
+
+556
+00:45:58,409 --> 00:46:04,279
+to justify spending one day on
+developing it, OK? I put it up last
+
+557
+00:46:04,280 --> 00:46:12,440
+year and not too many people looked at it,
+and it cost me one day of my life. So what we have
+
+558
+00:46:12,440 --> 00:46:18,000
+here is a two-dimensional problem with
+three classes, and I'm showing here three
+
+559
+00:46:18,000 --> 00:46:22,139
+classes, each with three examples, over
+here in two dimensions, and I'm showing
+
+560
+00:46:22,139 --> 00:46:24,969
+the three classifiers via their level
+sets; for example, the red
+
+561
+00:46:24,969 --> 00:46:29,659
+classifier has a score of 0 along this
+line, and then I'm showing the arrows
+
+562
+00:46:29,659 --> 00:46:35,509
+along which the scores increase. Right,
+here's our W matrix; as you recall, the
+
+563
+00:46:35,510 --> 00:46:38,609
+rows of that W matrix are the
+different classifiers, so we have the
+
+564
+00:46:38,608 --> 00:46:42,289
+blue classifier, the green classifier and
+the red classifier, and we have
+
+565
+00:46:42,289 --> 00:46:47,349
+the weights for both the x and y
+components, and also
+the bias. And then here we have
+
+566
+00:46:47,349 --> 00:46:50,609
+the dataset: we have the x and y
+coordinates of all the data points, their
+
+567
+00:46:50,608 --> 00:46:55,779
+correct labels, and the scores, as well as
+the loss achieved by all those data
+
+568
+00:46:55,780 --> 00:46:59,769
+points right now, with this setting of W.
+And you can see that I'm taking the
+
+569
+00:46:59,769 --> 00:47:04,568
+mean over all the losses: so right now our
+data loss is 2.77, the regularization loss for
+
+570
+00:47:04,568 --> 00:47:08,509
+this W is 3.5, and the total loss is 6.27.
+
+571
+00:47:08,510 --> 00:47:14,810
+And so basically you can fiddle around
+with this: as I change my W, you can
+
+572
+00:47:14,809 --> 00:47:19,328
+see, here I'm making one of the W entries
+bigger, and you can see what that does;
+
+573
+00:47:19,329 --> 00:47:25,940
+and similarly for the biases: you can see the
+bias basically shifts these hyperplanes.
+
+574
+00:47:25,940 --> 00:47:32,639
+OK, and then what we can do, and this
+is kind of a
+
+575
+00:47:32,639 --> 00:47:35,848
+preview of what's going to happen: we're
+getting the loss here, and we're
+
+576
+00:47:35,849 --> 00:47:38,829
+going to do backpropagation, which gives
+us the gradient for how we want
+
+577
+00:47:38,829 --> 00:47:44,359
+to adjust these W's in order to make the
+loss smaller. And so what we're going to do
+
+578
+00:47:44,358 --> 00:47:48,838
+is this repeated process where we start
+off with some W, but now I can improve,
+
+579
+00:47:48,838 --> 00:47:54,460
+I can improve this set of W's. So when I do
+a parameter update, it's actually
+
+580
+00:47:54,460 --> 00:47:57,568
+using these gradients, which are shown
+here on the right, and it's actually
+
+581
+00:47:57,568 --> 00:47:59,900
+making a tiny change to everything
+
+582
+00:47:59,900 --> 00:48:03,088
+according to this gradient. So as I do a
+
+583
+00:48:03,088 --> 00:48:07,699
+parameter update, you can see that the loss
+here is decreasing, especially the total
+
+584
+00:48:07,699 --> 00:48:11,338
+loss here; the loss just keeps getting
+better and better as I do parameter updates.
+
+585
+00:48:11,338 --> 00:48:16,639
+So this is the process of optimization
+that we're going to go into in a bit. I
+
+586
+00:48:16,639 --> 00:48:20,989
+can also start a repeated update, and
+then basically we keep improving this W
+
+587
+00:48:20,989 --> 00:48:24,808
+over and over: our loss started
+off at roughly three or so, and
+
+588
+00:48:24,809 --> 00:48:29,579
+now the mean loss over the data is around
+0.1, and we're correctly
+
+589
+00:48:29,579 --> 00:48:39,068
+classifying all these points here. I can
+also randomize W, which just
+
+590
+00:48:39,068 --> 00:48:41,980
+kind of knocks it off, and then it always
+converges back to a good point
+
+591
+00:48:41,980 --> 00:48:47,650
+through the process of optimization. And you
+can play here with the regularization as
+
+592
+00:48:47,650 --> 00:48:51,730
+well, and you have different forms of the
+loss: the one I've shown you so far
+
+593
+00:48:51,730 --> 00:48:55,990
+is the standard multi-class SVM
+formulation.
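What the demo's repeated "parameter update" is doing, as a rough sketch: `compute_gradient` is a hypothetical stand-in (backpropagation, covered shortly, is what actually produces dW), and 0.1 is an arbitrary step size like the demo's slider.

~~~python
step_size = 0.1                               # hypothetical learning rate
for step in range(100):
    dW = compute_gradient(full_loss, X, y, W) # hypothetical helper; see below
    W += -step_size * dW                      # nudge every weight downhill
~~~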
+594
+00:48:55,989 --> 00:49:01,098
+There are a few more SVM formulations, and
+there's also softmax here: you'll see that when
+
+595
+00:49:01,099 --> 00:49:06,670
+I switch to the softmax loss, the losses look
+different, but the solutions end up being
+
+596
+00:49:06,670 --> 00:49:10,700
+roughly the same; when I switch back to the SVM,
+the hyperplanes move around a tiny bit, but really it's
+
+597
+00:49:10,699 --> 00:49:21,558
+mostly the same. And this here is
+the step size: this is how big the steps are
+
+598
+00:49:21,559 --> 00:49:25,650
+that we make when we get the gradient
+for how to improve things. At the start
+
+599
+00:49:25,650 --> 00:49:29,119
+we're making very big steps; you can see the
+lines are jiggling, trying to
+
+600
+00:49:29,119 --> 00:49:32,309
+separate out these data points, and then
+over time, as we're doing the
+
+601
+00:49:32,309 --> 00:49:36,430
+optimization, we're going to decrease our
+update size, and this thing will just
+
+602
+00:49:36,429 --> 00:49:43,298
+slowly converge to the parameters that we
+want in the end. And so you can play
+
+603
+00:49:43,298 --> 00:49:47,170
+with this, and you can see how the scores
+move around and what the losses are. And if I
+
+604
+00:49:47,170 --> 00:49:53,358
+stop the repeated updates, you can also drag
+these points, but I think on the Mac it
+
+605
+00:49:53,358 --> 00:49:58,598
+doesn't work: when I try to drag this
+point, it disappears. So, good,
+
+606
+00:49:58,599 --> 00:50:02,479
+it works on a desktop; I need to go
+in and figure out exactly what happened
+
+607
+00:50:02,478 --> 00:50:14,480
+there. But you can play with this.
+
+608
+00:50:14,480 --> 00:50:30,840
+[question] What we have is the mean loss over the
+data, plus the regularization. This is one other diagram
+
+609
+00:50:30,840 --> 00:50:35,240
+to show you what this looks like; I
+don't think it's a very good diagram,
+
+610
+00:50:35,239 --> 00:50:38,858
+and there's something confusing about it
+that I can't remember from last year, but
+
+611
+00:50:38,858 --> 00:50:45,269
+basically you have this data, x and y,
+your images and your labels, and there's W,
+
+612
+00:50:45,269 --> 00:50:49,719
+which gives the scores, and the scores give
+the loss, and the regularization loss is
+
+613
+00:50:49,719 --> 00:50:54,939
+only a function of the weights, not of the
+data. And what we want to do now
+
+614
+00:50:54,940 --> 00:50:58,608
+is: we don't have control over the dataset,
+right, that's given to us; we have
+
+615
+00:50:58,608 --> 00:51:04,130
+control over that W, and as we change
+W the loss will be different. For any
+
+616
+00:51:04,130 --> 00:51:08,340
+W you give me, I can compute the loss, and
+that loss is linked to how well we're
+
+617
+00:51:08,340 --> 00:51:12,730
+classifying all of our examples. So a
+low loss means we're classifying
+
+618
+00:51:12,730 --> 00:51:15,880
+them very well on the training data,
+and then we're crossing our fingers that it
+
+619
+00:51:15,880 --> 00:51:20,809
+also works on some test data that we
+haven't seen. So here's one strategy for
+
+620
+00:51:20,809 --> 00:51:26,139
+optimization: it's a random search.
+Because we can evaluate the loss for any
+
+621
+00:51:26,139 --> 00:51:30,500
+arbitrary W, what I can afford to do,
+and I won't go through
+
+622
+00:51:30,500 --> 00:51:34,480
+this in full detail, is: I randomly
+sample W's, I check their
+
+623
+00:51:34,480 --> 00:51:37,460
+loss, and I just keep track of the W
+that works best.
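The random-search strategy as a sketch, under assumptions: `X_train` (3073 x N CIFAR-10 columns, with a bias row folded in) and `y_train` are hypothetical names, and `full_loss` is the earlier sketch.

~~~python
bestloss = float('inf')
for trial in range(1000):
    W = np.random.randn(10, 3073) * 0.0001  # a random parameter guess
    loss = full_loss(X_train, y_train, W, lam=0.1)
    if loss < bestloss:                      # keep the best W seen so far
        bestloss, bestW = loss, W
~~~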
+624
+00:51:37,460 --> 00:51:43,090
+OK, so that's an amazing process of
+optimization: guess and check. And it
+
+625
+00:51:43,090 --> 00:51:46,760
+turns out, if you do this, I think I tried
+a thousand times: if you do this a
+
+626
+00:51:46,760 --> 00:51:50,970
+thousand times and take the best W found
+at random, and you run it on your
+
+627
+00:51:50,969 --> 00:51:56,108
+CIFAR-10 test data, you end up
+with about 15.5 percent accuracy, and
+
+628
+00:51:56,108 --> 00:52:01,150
+since there are ten classes, the
+baseline is 10 percent chance
+
+629
+00:52:01,150 --> 00:52:06,559
+performance. So at 15.5 there's some signal,
+notably. And state of the art
+
+630
+00:52:06,559 --> 00:52:10,219
+is about ninety-five, which is a convnet,
+so we have some gap to close over
+
+631
+00:52:10,219 --> 00:52:10,980
+the next
+
+632
+00:52:10,980 --> 00:52:17,670
+two weeks or so. So don't use
+this, even though it's on the slides. One
+
+633
+00:52:17,670 --> 00:52:21,659
+interpretation of exactly what this
+process of optimization looks like is
+
+634
+00:52:21,659 --> 00:52:25,399
+that we have this loss landscape,
+and this loss landscape is in a high-
+
+635
+00:52:25,400 --> 00:52:32,619
+dimensional W space. So say we're here
+in 3D, and your loss is the height; then
+
+636
+00:52:32,619 --> 00:52:38,369
+you only have two W's in this case, and
+you're here, and you're blindfolded: you
+
+637
+00:52:38,369 --> 00:52:42,269
+can't see where the valleys are, but you're
+trying to find low loss while
+
+638
+00:52:42,269 --> 00:52:45,699
+blindfolded, and you have an altitude
+meter, so you can tell what your
+
+639
+00:52:45,699 --> 00:52:49,029
+loss is at any single point, and you're
+trying to get to the bottom of the
+
+640
+00:52:49,030 --> 00:52:55,430
+valley. And so that's really the
+process of optimization, and what I've
+
+641
+00:52:55,429 --> 00:52:59,399
+shown you so far is random
+optimization, where you teleport
+
+642
+00:52:59,400 --> 00:53:03,309
+around and you just check your altitude;
+so, not the best idea. What we're
+
+643
+00:53:03,309 --> 00:53:06,940
+going to do instead is use what I'll
+refer to as the gradient, or
+
+644
+00:53:06,940 --> 00:53:12,800
+really, we're just computing the slope
+along every single direction. So I'm
+
+645
+00:53:12,800 --> 00:53:17,990
+going to compute the slope, and then I'm
+going to go downhill, OK? So we're
+
+646
+00:53:17,989 --> 00:53:21,289
+following the slope. I'm not going to go
+into too much detail on this, but
+
+647
+00:53:21,289 --> 00:53:24,779
+basically there's an expression for the
+gradient, which is defined like that:
+
+648
+00:53:24,780 --> 00:53:31,859
+it's the calculus-101 definition of the
+derivative, and in multiple dimensions, if
+
+649
+00:53:31,858 --> 00:53:35,409
+you have a vector of derivatives,
+that's referred to as the gradient;
+
+650
+00:53:35,409 --> 00:53:39,589
+since we have multiple dimensions,
+multiple W's, we have a gradient vector.
+
+651
+00:53:39,590 --> 00:53:45,660
+OK, so this is the expression, and in fact
+we can numerically evaluate this
+
+652
+00:53:45,659 --> 00:53:48,769
+expression. Before I go into the analytic
+version, let me show you what it would look
+
+653
+00:53:48,769 --> 00:53:54,190
+like to evaluate the gradient at some W.
+Suppose we have some current W, and we're
+
+654
+00:53:54,190 --> 00:53:58,500
+getting some loss; what we want to do is
+get an idea of the slope at this point.
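A sketch of the finite-difference routine that the next walkthrough steps through one dimension at a time; this is modeled on the idea just described, not verbatim course code, and `f` would be something like `lambda W: full_loss(X, y, W, lam)` from the earlier sketches.

~~~python
def eval_numerical_gradient(f, W, h=1e-5):
    fx = f(W)                      # loss at the current W (the "altitude")
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = W[i]
        W[i] = old + h             # take a small step in one dimension
        grad[i] = (f(W) - fx) / h  # slope along that dimension
        W[i] = old                 # step back
        it.iternext()
    return grad
~~~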
+651
+00:53:39,590 --> 00:53:45,660
+Okay, so this is the expression, and in fact
+we can numerically evaluate this
+
+652
+00:53:45,659 --> 00:53:48,769
+expression. Before I go into the analytic
+gradient, I'll show you what it would look
+
+653
+00:53:48,769 --> 00:53:54,190
+like to evaluate the gradient at some W.
+Suppose we have some current W and we're
+
+654
+00:53:54,190 --> 00:53:58,500
+getting some loss, okay. What we want to do
+now is we want to get an idea about the slope
+
+655
+00:53:58,500 --> 00:54:03,239
+at this point, so we're going to
+basically look at this formula and we're
+
+656
+00:54:03,239 --> 00:54:07,329
+just going to evaluate it. So I'm going to
+go in the first dimension, and I'm going
+
+657
+00:54:07,329 --> 00:54:11,840
+to... really what this is telling you to
+do is evaluate f of x plus h, your altitude
+
+658
+00:54:11,840 --> 00:54:15,590
+at x plus h, subtract f of x from it, and divide
+by h.
+
+659
+00:54:15,590 --> 00:54:19,800
+What that corresponds to is me being on
+this landscape, taking a small step in
+
+660
+00:54:19,800 --> 00:54:23,130
+some direction, and looking whether or
+not my foot went up or down,
+
+661
+00:54:23,130 --> 00:54:27,340
+right, that's what the gradient is
+telling me. So I took a small step and
+
+662
+00:54:27,340 --> 00:54:32,150
+the loss there is 1.25, and I can use
+that formula, with a finite difference
+
+663
+00:54:32,150 --> 00:54:36,230
+approximation where we use a small h, to
+actually derive that the gradient here
+
+664
+00:54:36,230 --> 00:54:41,199
+is negative 2.5, the slope is downwards: I
+took a step and the loss has decreased, so
+
+665
+00:54:41,199 --> 00:54:45,480
+the slope is downwards in terms of the
+loss function, so negative 2.5 in that
+
+666
+00:54:45,480 --> 00:54:49,369
+particular dimension. So I can do this
+for every single dimension independently,
+
+667
+00:54:49,369 --> 00:54:53,210
+right, so I go into the second dimension,
+I add a small amount, so I step in a
+
+668
+00:54:53,210 --> 00:54:56,869
+different direction, I look at what
+happened to the loss, I use that formula,
+
+669
+00:54:56,869 --> 00:55:00,969
+and it's telling me that the gradient, the
+slope, is 2.6. I can do that in the third
+
+670
+00:55:00,969 --> 00:55:06,429
+dimension and I get the gradient. Okay, so
+what I'm referring to here is basically
+
+671
+00:55:06,429 --> 00:55:11,149
+evaluating the numerical gradient,
+which is using this finite difference
+
+672
+00:55:11,150 --> 00:55:14,539
+approximation, where for every single
+dimension independently I can take a
+
+673
+00:55:14,539 --> 00:55:18,500
+small step, look at the loss, and that tells me
+the slope, is it going upwards or
+
+674
+00:55:18,500 --> 00:55:23,829
+downwards, for every single one of these
+parameters. And so this is the numerical
+
+675
+00:55:23,829 --> 00:55:28,500
+gradient. The way this would look
+is this Python function here; it looks ugly
+
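A sketch of what such a function looks like (this mirrors the course notes' `eval_numerical_gradient` helper; `f` is assumed to be any function mapping a weight array to a scalar loss):

~~~python
import numpy as np

def eval_numerical_gradient(f, W, h=1e-5):
    # finite differences: for each dimension independently, take a small
    # step h and measure how much the loss moved
    grad = np.zeros_like(W)
    fW = f(W)                          # loss at the current point
    it = np.nditer(W, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = W[ix]
        W[ix] = old + h                # step in this one dimension
        grad[ix] = (f(W) - fW) / h     # the slope along that dimension
        W[ix] = old                    # restore before moving on
        it.iternext()
    return grad
~~~

In practice this same helper doubles as the numerical side of a gradient check: evaluate it once and compare its entries against your analytic gradient (the relative error should be tiny).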
+676
+00:55:28,500 --> 00:55:32,630
+because it turns out it's slightly
+tricky to iterate over all the W's, but
+
+677
+00:55:32,630 --> 00:55:36,780
+basically we're just adding h,
+comparing to f of x, and dividing by
+
+678
+00:55:36,780 --> 00:55:41,200
+h, and we're getting the gradient. Now the
+problem with this is, if you want to use
+
+679
+00:55:41,199 --> 00:55:44,960
+the numerical gradient, then of course
+we have to do this for every single
+
+680
+00:55:44,960 --> 00:55:47,949
+dimension, to get a sense of what the
+gradient is in every single dimension,
+
+681
+00:55:47,949 --> 00:55:53,079
+and right, when you have a convnet you
+have hundreds of millions of parameters,
+
+682
+00:55:53,079 --> 00:55:58,139
+right, so we can't afford to actually
+check the loss in hundreds of millions
+
+683
+00:55:58,139 --> 00:56:02,920
+of dimensions before we do a single step.
+So this approach, where we would try to
+
+684
+00:56:02,920 --> 00:56:06,869
+evaluate the gradient numerically, is
+approximate, because we're using a finite
+
+685
+00:56:06,869 --> 00:56:11,119
+difference approximation, and second it's also
+extremely slow, because I need to do
+
+686
+00:56:11,119 --> 00:56:15,460
+a million checks on the loss function on
+the convnet before I know what the
+
+687
+00:56:15,460 --> 00:56:20,519
+gradient is and can take a parameter
+update. So: very slow, approximate. It turns
+
+688
+00:56:20,519 --> 00:56:26,730
+out that this is also silly, right,
+because the loss is a function of W as
+
+689
+00:56:26,730 --> 00:56:29,800
+we've written it out, and really what
+we want is, we want the gradient of the
+
+690
+00:56:29,800 --> 00:56:33,220
+loss with respect to W, and luckily we can
+just write that down,
+
+691
+00:56:33,219 --> 00:56:42,598
+thanks to these guys. Does anyone know who
+those guys are? ... That's right. Do you
+
+692
+00:56:42,599 --> 00:56:49,400
+know which is which? They kind of just,
+they look remarkably similar, but basically,
+
+693
+00:56:49,400 --> 00:56:54,289
+something like this, the two inventors of
+calculus. There's actually controversy
+
+694
+00:56:54,289 --> 00:56:59,429
+over who really invented calculus, and
+these guys hated each other over it, but
+
+695
+00:56:59,429 --> 00:57:03,799
+basically calculus is this powerful
+hammer, and so what we can do is, instead
+
+696
+00:57:03,800 --> 00:57:06,440
+of doing the silly thing where we're
+evaluating the numerical gradient, we can
+
+697
+00:57:06,440 --> 00:57:10,230
+actually use calculus and we can write
+down an expression for what the gradient
+
+698
+00:57:10,230 --> 00:57:14,880
+is of the loss function in weight
+space. So basically, instead of fumbling
+
+699
+00:57:14,880 --> 00:57:18,289
+around and doing this "is it going up or
+is it going down" by checking the loss, I
+
+700
+00:57:18,289 --> 00:57:22,509
+just have an expression where I take the
+gradient of this and I can simply
+
+701
+00:57:22,510 --> 00:57:26,500
+evaluate what the entire gradient is. That's
+the only way that you can actually run
+
+702
+00:57:26,500 --> 00:57:30,159
+this in practice, right, we just write an
+expression for the gradient and we can
+
+703
+00:57:30,159 --> 00:57:35,149
+do the updates and so on. So in summary:
+basically the numerical gradient is approximate and
+
+704
+00:57:35,150 --> 00:57:39,800
+slow, but very easy to write, because
+you're just doing this very simple
+
+705
+00:57:39,800 --> 00:57:44,190
+process, for any arbitrary loss function
+I can get the gradient vector; the analytic
+
+706
+00:57:44,190 --> 00:57:47,659
+gradient, which is when you actually do
+calculus, is exact, no finite
+
+707
+00:57:47,659 --> 00:57:52,210
+difference approximations, and it's very fast, but it's
+error-prone, because you actually have to
+
+708
+00:57:52,210 --> 00:57:57,300
+do math, right. So in practice what you
+see is, we always use the analytic gradient:
+
+709
+00:57:57,300 --> 00:58:01,380
+we do calculus, we figure out what the
+gradient should be, but then you always
+
+710
+00:58:01,380 --> 00:58:04,789
+check your implementation using a
+numerical gradient check, as it's referred
+
+711
+00:58:04,789 --> 00:58:10,480
+to. So I will derive what the gradient of the
+loss function should be, I write an
+
+712
+00:58:10,480 --> 00:58:15,500
+expression for the gradient, I evaluate it
+in my code, so I get the analytic
+
+713
+00:58:15,500 --> 00:58:18,769
+gradient, and then I also evaluate the
+numerical gradient on the side, and that
+
+714
+00:58:18,769 --> 00:58:22,280
+takes a while, but you make sure you
+evaluate the two of them, and you make
+
+715
+00:58:22,280 --> 00:58:25,890
+sure that those two are the same, and
+then we say that you passed the gradient
+
+716
+00:58:25,889 --> 00:58:29,500
+check. Okay, so that's what you see in
+practice: whenever you try to develop a
+
+717
+00:58:29,500 --> 00:58:32,519
+new module for a neural network, you
+write out what the loss is, you write the
+
+718
+00:58:32,519 --> 00:58:35,759
+backward pass to compute the
+gradient, and then you have to make sure
+
+719
+00:58:35,760 --> 00:58:40,250
+to gradient check it, just to make sure
+that your calculus is correct. And then I
+
+720
+00:58:40,250 --> 00:58:43,980
+already referred to this process of
+optimization, which we saw nicely in the
+
+721
+00:58:43,980 --> 00:58:45,838
+web demo, where we have this
+
+722
+00:58:45,838 --> 00:58:49,548
+loop where we optimize, where we simply
+evaluate the gradient on your loss
+
+723
+00:58:49,548 --> 00:58:53,759
+function, and then knowing the gradient
+we can perform a parameter update, where we
+
+724
+00:58:53,759 --> 00:58:58,509
+change the W by a tiny amount; in particular,
+we want to update with the negative
+
+725
+00:58:58,509 --> 00:59:04,509
+step-size times the gradient. The
+negative is there because the gradient
+
+726
+00:59:04,509 --> 00:59:07,478
+tells you the direction of the greatest
+increase, it tells you which way the
+
+727
+00:59:07,478 --> 00:59:10,848
+loss is increasing, and we want to minimize
+it, which is where the negative is
+
+728
+00:59:10,849 --> 00:59:14,298
+coming from: we want to go in the negative
+gradient direction. The step size here is a
+
+729
+00:59:14,298 --> 00:59:17,818
+hyperparameter that will cause you a huge
+amount of headaches. The step size, or
+
+730
+00:59:17,818 --> 00:59:23,298
+learning rate, this is the most critical
+parameter to basically worry about.
+
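The loop and update just described, in sketch form (paraphrasing the slide's pseudocode; `evaluate_gradient`, `loss_fun`, `data` and `weights` are placeholders):

~~~python
# vanilla gradient descent: step along the NEGATIVE gradient, since the
# gradient points in the direction of greatest increase of the loss
step_size = 1e-3                 # the learning-rate hyperparameter
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += -step_size * weights_grad   # perform parameter update
~~~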
+731
+00:59:23,298 --> 00:59:27,778
+Really there's two that you have to
+worry about the most: the step size, or
+
+732
+00:59:27,778 --> 00:59:31,539
+learning rate, and there's the
+regularization strength lambda that
+
+733
+00:59:31,539 --> 00:59:35,180
+we saw already. Those two parameters are
+really the two largest headaches, and
+
+734
+00:59:35,179 --> 00:59:45,219
+that's usually what we cross-validate over.
+Was there a question? ... No, it's nothing that
+
+735
+00:59:45,219 --> 00:59:50,849
+great, it's just the gradient, and it tells you the
+slope in every single direction, and then
+
+736
+00:59:50,849 --> 00:59:56,109
+we just take a step, we step along it. So the
+process of optimization in the weight
+
+737
+00:59:56,108 --> 01:00:00,768
+space is: you're somewhere in your W, you
+get your gradient, and you march some amount
+
+738
+01:00:00,768 --> 01:00:05,228
+in the direction of the gradient, but you
+don't know how much, so that's the step
+
+739
+01:00:05,228 --> 01:00:08,449
+size. And you saw that when I increased
+the step size in the demo, things were
+
+740
+01:00:08,449 --> 01:00:11,248
+jittering around quite a lot,
+right, there was a lot of energy in the
+
+741
+01:00:11,248 --> 01:00:15,449
+system; that's because I was taking huge
+jumps all over the space. And so here the
+
+742
+01:00:15,449 --> 01:00:19,578
+loss function is minimal at the blue
+part there, and it's high in the red parts,
+
+743
+01:00:19,579 --> 01:00:23,920
+so we want to get to the minimum part of the
+basin. This is actually what the loss
+
+744
+01:00:23,920 --> 01:00:28,579
+function looks like for an SVM or
+a regression, these are convex problems, so
+
+745
+01:00:28,579 --> 01:00:31,729
+it's really just a bowl and we're trying
+to get to the bottom of it, but this bowl
+
+746
+01:00:31,728 --> 01:00:35,009
+is like 30,000-dimensional, so that's why it
+takes a while.
+
+747
+01:00:35,010 --> 01:00:39,640
+Okay, so we take a step and we evaluate
+the gradient, and repeat this over and
+
+748
+01:00:39,639 --> 01:00:44,980
+over. In practice there's this additional
+part I wanted to mention, where we don't
+
+749
+01:00:44,980 --> 01:00:49,860
+actually evaluate the loss for the
+entire training data; in fact, all we do is
+
+750
+01:00:49,860 --> 01:00:53,370
+we only use what's called mini-batch
+gradient descent, where we have this
+
+751
+01:00:53,369 --> 01:00:58,670
+entire dataset but we sample batches
+from it. So we sample, like say,
+
+752
+01:00:58,670 --> 01:01:02,300
+thirty-two examples out of my training
+data, I evaluate the loss and the gradient
+
+753
+01:01:02,300 --> 01:01:05,940
+on this batch of 32, then I do my
+parameter update, and I keep doing this
+
+754
+01:01:05,940 --> 01:01:09,619
+over and over again. And so what
+ends up happening is, if you only sample
+
+755
+01:01:09,619 --> 01:01:14,699
+very few data points from the training data,
+then your estimate of the gradient, of
+
+756
+01:01:14,699 --> 01:01:18,109
+course, over the entire training set is
+kind of noisy, because you're only
+
+757
+01:01:18,110 --> 01:01:21,970
+estimating based on a small subset of
+your data, but it allows me to step more,
+
+758
+01:01:21,969 --> 01:01:25,689
+so you can do more steps with an
+approximate gradient, or you can do a few
+
+759
+01:01:25,690 --> 01:01:30,179
+steps with the exact gradient, and in practice
+what ends up working better is using mini-
+
+760
+01:01:30,179 --> 01:01:35,049
+batch, and it's much more efficient of
+course, and it's impractical to actually
+
+761
+01:01:35,050 --> 01:01:41,550
+do full batch gradient descent. So common
+mini-batch sizes are 32, 64, 128, 256; this is not
+
+762
+01:01:41,550 --> 01:01:45,940
+usually a hyperparameter we worry about too
+much, it's usually settled based on whatever
+
+763
+01:01:45,940 --> 01:01:49,380
+fits on your GPU. We're going to be
+talking about GPUs in a bit, but they
+
+764
+01:01:49,380 --> 01:01:53,030
+have a finite amount of memory, say about
+like 6 gigabytes if you have a pretty good
+
+765
+01:01:53,030 --> 01:01:58,030
+GPU, and you usually choose a batch size such
+that a small mini-batch of examples fits in
+
+766
+01:01:58,030 --> 01:02:01,150
+your memory. So that's usually how it's
+determined, and it's not a parameter that
+
+767
+01:02:01,150 --> 01:02:09,570
+actually matters a lot in an optimization
+sense.
+
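Mini-batch gradient descent as just described, with the same placeholders as the earlier sketch:

~~~python
while True:
    batch = sample_training_data(data, 256)  # e.g. 32 / 64 / 128 / 256 examples
    weights_grad = evaluate_gradient(loss_fun, batch, weights)  # noisy estimate
    weights += -step_size * weights_grad     # but far more steps per epoch
~~~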
+768
+01:02:09,570 --> 01:02:14,789
+And we're going to get to momentum in a
+bit, but if you want to use momentum, then
+
+769
+01:02:14,789 --> 01:02:18,969
+this is just fine: we always do mini-batch
+gradient descent, and momentum is very common
+
+770
+01:02:18,969 --> 01:02:23,799
+to do. So just to give you an idea of
+what this will look like in practice: if
+
+771
+01:02:23,800 --> 01:02:28,510
+I'm running the optimization over time and
+I'm looking at the loss evaluated on just
+
+772
+01:02:28,510 --> 01:02:32,700
+a small mini-batch of data, you can
+see that basically my loss goes down
+
+773
+01:02:32,699 --> 01:02:37,309
+over time on these mini-batches from the
+training data, so as I'm optimizing I'm
+
+774
+01:02:37,309 --> 01:02:42,119
+going downhill. Now of course, if I was
+doing full batch gradient descent, so this
+
+775
+01:02:42,119 --> 01:02:44,839
+was not just a mini-batch sampled from the
+data, you wouldn't expect as much noise,
+
+776
+01:02:44,840 --> 01:02:48,550
+you'd just expect this to be a line that
+just goes down. But because we use mini-
+
+777
+01:02:48,550 --> 01:02:51,730
+batch, you get this noise in there,
+because some mini-batches are better
+
+778
+01:02:51,730 --> 01:03:01,980
+than others, but over time they all
+go down. Was there a question?
+
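The curve just described, the loss on each mini-batch plotted over iterations, is what you would get from logging like this (a sketch with the same placeholders as above; matplotlib assumed):

~~~python
import matplotlib.pyplot as plt

losses = []
for it in range(10000):
    batch = sample_training_data(data, 32)   # placeholder, as before
    losses.append(loss_fun(batch, weights))  # per-mini-batch loss: noisy
    weights += -step_size * evaluate_gradient(loss_fun, batch, weights)
plt.plot(losses)                             # trends down, with jitter
plt.xlabel('iteration'); plt.ylabel('mini-batch loss')
plt.show()
~~~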
+779
+01:03:01,980 --> 01:03:07,539
+Yes, so you're wondering about the shape
+of this loss function, you're used to
+
+780
+01:03:07,539 --> 01:03:11,420
+maybe seeing more rapid improvement
+quicker. These loss functions come in
+
+781
+01:03:11,420 --> 01:03:17,079
+different shapes and sizes, so it really
+depends; it's not necessarily the case
+
+782
+01:03:17,079 --> 01:03:21,940
+that the loss function must look very sharp
+in the beginning, although sometimes they
+
+783
+01:03:21,940 --> 01:03:25,929
+do. They have different shapes; for
+example, it also depends on your
+
+784
+01:03:25,929 --> 01:03:29,618
+initialization. If I'm careful with my
+initialization I would expect less of a
+
+785
+01:03:29,619 --> 01:03:34,990
+jump, but if I initialize very
+incorrectly, then you would expect that
+
+786
+01:03:34,989 --> 01:03:38,649
+that's going to be fixed very early on
+in the optimization. We're going to get
+
+787
+01:03:38,650 --> 01:03:43,309
+to some of those parts I think much
+later. I also want to show you a lot of
+
+788
+01:03:43,309 --> 01:03:49,710
+the effects of learning rate on your
+loss function, and again, the learning
+
+789
+01:03:49,710 --> 01:03:53,820
+rate is the step size. Basically with very
+high learning rates, or step sizes, you
+
+790
+01:03:53,820 --> 01:03:59,240
+start thrashing around in your W space, and
+so you don't converge, or you even explode. If you
+
+791
+01:03:59,239 --> 01:04:02,618
+have a very low learning rate, then
+you're barely doing any updates, and so
+
+792
+01:04:02,619 --> 01:04:07,869
+it takes a very long time to actually
+converge. And if you have a high learning
+
+793
+01:04:07,869 --> 01:04:11,150
+rate, sometimes you can basically get
+kind of stuck in a bad position of the
+
+794
+01:04:11,150 --> 01:04:14,950
+loss. So with these loss functions you kind of
+need to get down to the minimum, and if
+
+795
+01:04:14,949 --> 01:04:17,929
+you have too much energy and you're
+stepping too quickly, then you don't,
+
+796
+01:04:17,929 --> 01:04:21,679
+you don't allow your problem to kind of
+settle in on the smaller local minima of
+
+797
+01:04:21,679 --> 01:04:25,480
+your objective. In general, when you talk
+about neural networks and optimization
+
+798
+01:04:25,480 --> 01:04:28,320
+you'll see a lot of hand waving, because
+that's the only way we communicate about
+
+799
+01:04:28,320 --> 01:04:32,350
+these loss landscapes, so just
+imagine like a big basin of loss, and
+
+800
+01:04:32,349 --> 01:04:36,069
+there are these, like, smaller pockets of
+smaller loss, and so if you're thrashing
+
+801
+01:04:36,070 --> 01:04:39,480
+around, then you can't settle in on the
+smaller loss parts and converge
+
+802
+01:04:39,480 --> 01:04:43,730
+there. So that's why the learning rate is so
+important, and you need to find the correct
+
+803
+01:04:43,730 --> 01:04:47,150
+learning rate, which will cause a lot of
+headaches. And what people do most of the
+
+804
+01:04:47,150 --> 01:04:49,970
+time is, you start off with a
+high learning rate so you get some benefit,
+
+805
+01:04:49,969 --> 01:04:55,319
+and then you decay it over time: you start off
+with it high and then you decay the learning
+
+806
+01:04:55,320 --> 01:05:00,780
+rate over time as we're settling in on
+a good solution. And I also want to
+
+807
+01:05:00,780 --> 01:05:03,550
+point out, we're going to go into this in much
+more detail, but the way I'm doing the
+
+808
+01:05:03,550 --> 01:05:07,890
+update here, which is how to use the
+gradient to actually modify your W,
+
+809
+01:05:07,889 --> 01:05:12,789
+that's called an update formula, a parameter update,
+and there are many different forms of doing
+
+810
+01:05:12,789 --> 01:05:14,869
+it. This is the simplest one, which is
+
+811
+01:05:14,869 --> 01:05:20,299
+just SGD, the simplest stochastic gradient descent,
+but there are many formulas, such as
+
+812
+01:05:20,300 --> 01:05:23,740
+momentum, that was already mentioned. In
+momentum, you basically imagine, as you're
+
+813
+01:05:23,739 --> 01:05:27,949
+doing this optimization, you imagine
+keeping track of this velocity. So as
+
+814
+01:05:27,949 --> 01:05:31,389
+I'm stepping, I'm also keeping track of my
+velocity, so if I keep seeing a positive
+
+815
+01:05:31,389 --> 01:05:35,519
+gradient in some direction, I will accumulate
+velocity in that direction, so I will
+
+816
+01:05:35,519 --> 01:05:39,550
+sort of go faster in that direction,
+and so there are several formulas we'll
+
+817
+01:05:39,550 --> 01:05:46,100
+look at shortly in the class, but RMSProp
+and Adam are commonly used.
+
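The momentum update just mentioned, sketched (this matches the common formulation; `mu` is the momentum hyperparameter, typically around 0.9, and the other names are the same placeholders as above):

~~~python
import numpy as np

v = np.zeros_like(weights)   # velocity, carried across steps
mu = 0.9
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    v = mu * v - step_size * weights_grad  # accumulate velocity where the
    weights += v                           # gradient direction is consistent
~~~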
+818
+01:05:46,099 --> 01:05:50,569
+So just to show you what these look like,
+these different choices and what they might do
+
+819
+01:05:50,570 --> 01:05:56,760
+on your loss function, this is a figure
+from Alec. So here we have a loss
+
+820
+01:05:56,760 --> 01:06:02,390
+function and these are the level curves,
+and we start off the optimization over there,
+
+821
+01:06:02,389 --> 01:06:06,920
+and we're trying to get to the basin, and
+different update formulas will give you
+
+822
+01:06:06,920 --> 01:06:10,670
+better or worse convergence in different
+problems. So you can see, for example, this
+
+823
+01:06:10,670 --> 01:06:15,369
+momentum in green, it built up momentum
+as it went down, and then it overshot and
+
+824
+01:06:15,369 --> 01:06:19,259
+then it kind of went back, came back. And
+SGD takes forever to converge; the red,
+
+825
+01:06:19,260 --> 01:06:23,370
+that's what I presented to you so far,
+SGD, takes forever to converge. And there are
+
+826
+01:06:23,369 --> 01:06:27,489
+different ways of actually performing
+this parameter update that are more or less
+
+827
+01:06:27,489 --> 01:06:35,259
+efficient in the optimization; we'll see much
+more of this.
+
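Of the other update formulas named above, RMSProp is representative: it scales each dimension's step by a running estimate of that dimension's squared gradient. A sketch of the common formulation (not taken from the slide; same placeholders as before):

~~~python
import numpy as np

cache = np.zeros_like(weights)
decay_rate, eps = 0.99, 1e-8
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    cache = decay_rate * cache + (1 - decay_rate) * weights_grad ** 2
    weights += -step_size * weights_grad / (np.sqrt(cache) + eps)
~~~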
+828
+01:06:35,260 --> 01:06:39,950
+I also wanted to mention, at this point,
+on this last slide, I want to explain, obviously,
+
+829
+01:06:39,949 --> 01:06:43,049
+linear classification: we know how to
+set up the problem, we know the
+
+830
+01:06:43,050 --> 01:06:47,070
+different loss functions, we know how to
+optimize them, so we can kind of, at
+
+831
+01:06:47,070 --> 01:06:51,050
+this point, do classification. I wanted to mention this,
+I want to give you a sense of what
+
+832
+01:06:51,050 --> 01:06:53,710
+computer vision looked like before
+convnets came about, so that you have a
+
+833
+01:06:53,710 --> 01:06:57,920
+bit of historical perspective, because we
+used linear classifiers all the time,
+
+834
+01:06:57,920 --> 01:07:01,019
+but of course you don't usually run your
+classifier on the raw original image,
+
+835
+01:07:01,019 --> 01:07:06,759
+because, as you might believe,
+there are all kinds of problems with it: like, you
+
+836
+01:07:06,760 --> 01:07:10,250
+have to cover all the modes and so on. So
+what people used to do is, they used to
+
+837
+01:07:10,250 --> 01:07:14,380
+compute all these different feature
+types over images, and then you compute
+
+838
+01:07:14,380 --> 01:07:17,160
+different descriptors and different
+feature types, and you get these
+
+839
+01:07:17,159 --> 01:07:22,049
+statistical summaries of what the image
+looks like, what the frequencies are like,
+
+840
+01:07:22,050 --> 01:07:26,160
+and so on, and then we concatenated all
+those into large vectors, and then we put
+
+841
+01:07:26,159 --> 01:07:27,710
+those into linear classifiers.
+
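That pre-2012 pipeline in miniature: extract several hand-designed feature types per image, concatenate them into one long vector, and train a linear classifier on those vectors. In this sketch `hog_feature` is a hypothetical placeholder, and a sketch of `color_histogram` follows a bit further down:

~~~python
import numpy as np

feats = [np.concatenate([color_histogram(im), hog_feature(im)])
         for im in images]   # images: any iterable of image arrays
X = np.stack(feats)          # N x D matrix of hand-crafted features
# ...then fit, say, a linear SVM on X; the classifier never sees raw pixels
~~~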
+842
+01:07:27,710 --> 01:07:32,050
+So, different feature types, all of them
+concatenated, and then that went into a
+
+843
+01:07:32,050 --> 01:07:35,369
+linear classifier; that was usually the
+pipeline. So just to give you an idea of
+
+844
+01:07:35,369 --> 01:07:39,088
+really what these features were like: one
+very simple feature type you might
+
+845
+01:07:39,088 --> 01:07:43,269
+imagine is just a color histogram. So I
+go over all the pixels in the image and
+
+846
+01:07:43,269 --> 01:07:47,449
+I bin them, and count how many pixels
+there are of different colors, depending
+
+847
+01:07:47,449 --> 01:07:50,750
+on the hue of the color. As you can
+imagine, this is kind of like one
+
+848
+01:07:50,750 --> 01:07:54,250
+statistical summary of what's in the
+image, it's just the number of colors in each
+
+849
+01:07:54,250 --> 01:07:57,400
+bin. So this would become one of my
+features, that I would eventually con-
+
+850
+01:07:57,400 --> 01:08:03,440
+catenate with many different feature
+types, and it kind of, intuitively, the
+
+851
+01:08:03,440 --> 01:08:06,530
+classifier, if you think about it, the
+linear classifier can use these features
+
+852
+01:08:06,530 --> 01:08:09,690
+to actually perform the classification,
+because the linear classifier can like
+
+853
+01:08:09,690 --> 01:08:14,320
+or dislike seeing lots of different
+colors in the image, with positive or
+
+854
+01:08:14,320 --> 01:08:17,930
+negative weights. Very common features
+also include things like what we call
+
+855
+01:08:17,930 --> 01:08:22,440
+SIFT and HOG features; basically these are where
+you go in local neighborhoods in the
+
+856
+01:08:22,439 --> 01:08:26,539
+image and you look at whether or not
+there are lots of different orientations,
+
+857
+01:08:26,539 --> 01:08:30,588
+so are there lots of horizontal or
+vertical edges; we make up histograms
+
+858
+01:08:30,588 --> 01:08:35,850
+over that, and so you end up with
+just a summary of what kinds of edges
+
+859
+01:08:35,850 --> 01:08:40,338
+are where in the image, and you can
+concatenate all those together. There were
+
+860
+01:08:40,338 --> 01:08:45,250
+lots of different feature types proposed
+over the years, just lots of
+
+861
+01:08:45,250 --> 01:08:50,359
+text on lots of different ways of
+measuring what kinds of things are there
+
+862
+01:08:50,359 --> 01:08:54,850
+in the image, and statistics of them. And
+then we had these pipelines called bag-
+
+863
+01:08:54,850 --> 01:08:59,660
+of-words pipelines, where you look at
+different points in your image, and
+
+864
+01:08:59,659 --> 01:09:04,250
+you describe a little local patch with
+something that you come up with, like
+
+865
+01:09:04,250 --> 01:09:08,329
+looking at the frequencies or looking
+at the colors or whatever, and then we
+
+866
+01:09:08,329 --> 01:09:12,269
+came up with these dictionaries of, okay,
+here's the stuff we're seeing in images,
+
+867
+01:09:12,270 --> 01:09:16,250
+like there's lots of high-frequency stuff
+or low-frequency stuff in blue, and so
+
+868
+01:09:16,250 --> 01:09:16,699
+on.
+
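One concrete version of the color histogram described earlier (a minimal sketch, assuming an HSV image whose hue channel is scaled to [0, 1)):

~~~python
import numpy as np

def color_histogram(img_hsv, nbins=10):
    # bin every pixel's hue into one of nbins color bins and count them;
    # the normalized counts are one statistical summary of the image
    hues = img_hsv[:, :, 0].ravel()
    hist, _ = np.histogram(hues, bins=nbins, range=(0.0, 1.0))
    return hist / float(hues.size)
~~~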
+869
+01:09:16,699 --> 01:09:21,338
+You end up with the centroids, using
+k-means, of what kind of stuff is seen
+
+870
+01:09:21,338 --> 01:09:25,818
+in images, and then we express every
+single image as statistics over how much
+
+871
+01:09:25,819 --> 01:09:29,660
+of each thing we see in the image. So for
+example, this image has lots of
+
+872
+01:09:29,659 --> 01:09:33,949
+high-frequency green stuff, so you might
+see some feature vector that basically
+
+873
+01:09:33,949 --> 01:09:38,568
+will have a high value in high-
+frequency green, and then what we did is,
+
+874
+01:09:38,569 --> 01:09:40,760
+we basically took these feature vectors,
+
+875
+01:09:40,760 --> 01:09:45,210
+concatenated them, and put a linear classifier
+on them. So really the context for what
+
+876
+01:09:45,210 --> 01:09:49,090
+we're doing is as follows: what it looked
+like, mostly, in computer vision before
+
+877
+01:09:49,090 --> 01:09:52,840
+roughly 2012, is that you take your
+image and you have a step of feature
+
+878
+01:09:52,840 --> 01:09:57,409
+extraction, where we decided what are
+important things to know about an
+
+879
+01:09:57,409 --> 01:10:01,859
+image, different frequencies, different
+tints, and we decided on what are
+
+880
+01:10:01,859 --> 01:10:05,109
+interesting features, and you'd see people
+take like 10 different feature types in
+
+881
+01:10:05,109 --> 01:10:09,369
+every paper and just concatenate all of
+it, so that you end up with one giant
+
+882
+01:10:09,369 --> 01:10:12,640
+feature vector over your image, and then
+you put a linear classifier on top of it,
+
+883
+01:10:12,640 --> 01:10:15,920
+just like we saw right now, and so you'd
+train, say, a linear SVM on
+
+884
+01:10:15,920 --> 01:10:20,109
+top of all these feature types. And what
+we're replacing it with since then, we found
+
+885
+01:10:20,109 --> 01:10:24,869
+that it works much better if you start with
+the raw image and you train the whole
+
+886
+01:10:24,869 --> 01:10:28,979
+thing; you're not designing some part of
+it in isolation, of what you think is a
+
+887
+01:10:28,979 --> 01:10:33,479
+good idea or not, we come up with an
+architecture that can simulate a lot of
+
+888
+01:10:33,479 --> 01:10:38,189
+different features, so to speak, and since
+everything is just a single function, we
+
+889
+01:10:38,189 --> 01:10:41,879
+don't just train on top of
+the features, we can actually train
+
+890
+01:10:41,880 --> 01:10:45,400
+all the way down to the pixels, and we
+can train our feature extractors
+
+891
+01:10:45,399 --> 01:10:49,989
+effectively. So that was the big innovation
+in how you approach this problem: we
+
+892
+01:10:49,989 --> 01:10:53,300
+try to eliminate a lot of hand-
+engineered components, we're trying to have
+
+893
+01:10:53,300 --> 01:10:56,779
+a single differentiable blob, so that we
+can fully train the whole thing,
+
+894
+01:10:56,779 --> 01:11:01,550
+starting at the raw pixels. That's where,
+historically, this is coming from, and
+
+895
+01:11:01,550 --> 01:11:06,760
+what we will be doing. And so next class
+we'll be looking specifically at this
+
+896
+01:11:06,760 --> 01:11:10,520
+problem of, we need to compute analytic
+gradients, and so we're going to go into
+
+897
+01:11:10,520 --> 01:11:14,860
+backpropagation, which is an efficient
+way of computing analytic gradients, and
+
+898
+01:11:14,859 --> 01:11:18,839
+so that's backprop, and you're going to
+become good at it, and then we're going
+
+899
+01:11:18,840 --> 01:11:20,039
+to go slightly into networks
+
diff --git a/captions/En/Lecture4_en.srt b/captions/En/Lecture4_en.srt
new file mode 100644
index 00000000..b72523f4
--- /dev/null
+++ b/captions/En/Lecture4_en.srt
@@ -0,0 +1,5369 @@
+1
+00:00:02,740 --> 00:00:07,000
+Okay, so let me dive into some
+administrative
+
+2
+00:00:09,900 --> 00:00:14,900
+points first. So again, recall that
+assignment 1 is due next Wednesday.
+
+3
+00:00:14,900 --> 00:00:19,050
+You have about 150 hours left,
+and I use hours because there's a more
+
+4
+00:00:19,050 --> 00:00:23,320
+imminent sense of doom and remember that
+a third of those hours you'll be
+
+5
+00:00:23,320 --> 00:00:29,278
+unconscious, so you don't have that much
+time. It's really running out. And you
+
+6
+00:00:29,278 --> 00:00:31,768
+know you might think that you have late
+days and so on but these assignments just get
+
+7
+00:00:31,768 --> 00:00:38,640
+harder over time so you want to save
+those and so on, so start now. Let's see. So
+
+8
+00:00:38,640 --> 00:00:43,109
+there's no office hours or anything like
+that on Monday. I'll hold make-up office
+
+9
+00:00:43,109 --> 00:00:45,839
+hours on Wednesday because I want you
+guys to be able to talk to me,
+
+10
+00:00:45,840 --> 00:00:49,260
+especially about the projects and so on, so I'll be
+moving my office hours from Monday to
+
+11
+00:00:49,260 --> 00:00:52,820
+Wednesday. Usually I had my office hours
+at 6PM. Instead I'll have them at 5PM
+
+12
+00:00:52,820 --> 00:00:59,909
+and usually it's in Gates 260 but now
+it'll be in Gates 259, so minus 1 on both and yeah
+
+13
+00:00:59,909 --> 00:01:03,429
+and also to note, when you're going to be
+studying for the midterm that's coming up in
+
+14
+00:01:03,429 --> 00:01:04,170
+a few weeks
+
+15
+00:01:04,170 --> 00:01:07,109
+make sure you go through the lecture
+notes as well which are really part of
+
+16
+00:01:07,109 --> 00:01:09,819
+this class and I kind of pick and choose
+some of the things that I think are most
+
+17
+00:01:09,819 --> 00:01:13,579
+valuable to present in a lecture but
+there's quite a bit of, you know, more material
+
+18
+00:01:13,579 --> 00:01:16,548
+to be aware of that might pop up in the
+midterm, even though I'm covering some of
+
+19
+00:01:16,549 --> 00:01:19,610
+the most important stuff usually in the lecture,
+so do read through those lecture
+
+20
+00:01:19,610 --> 00:01:25,618
+notes. They're complementary to the lectures.
+And so the material for the midterm will be
+
+21
+00:01:25,618 --> 00:01:32,269
+drawn from both the lectures and the notes. Okay.
+So having said all that, we're going to
+
+22
+00:01:32,269 --> 00:01:36,769
+dive into the material. So where we are
+right now, just as a reminder, we have
+
+23
+00:01:36,769 --> 00:01:39,989
+the score function, we looked at several
+loss functions such as the SVM loss
+
+24
+00:01:39,989 --> 00:01:44,359
+function last time, and we looked at the
+full loss that you achieve for any
+
+25
+00:01:44,359 --> 00:01:49,379
+particular set of weights, over your
+training data, and this loss is made up of
+
+26
+00:01:49,379 --> 00:01:53,509
+two components. There's a data loss and
+a regularization loss, right. And really what we want to do
+
+27
+00:01:53,509 --> 00:01:57,200
+is we want to derive the gradient
+expression of the loss function with respect to the
+
+28
+00:01:57,200 --> 00:02:01,118
+weights and we want to do this so that
+we can actually perform the optimization
+
+29
+00:02:01,118 --> 00:02:07,069
+process. And in the optimization process we're doing
+gradient descent, where we iterate evaluating
+
+30
+00:02:07,069 --> 00:02:11,030
+the gradient on your weights, doing a
+parameter update and just repeating this
+
+31
+00:02:11,030 --> 00:02:14,259
+over and over again, so that we're
+converging to
+
+32
+00:02:14,259 --> 00:02:17,929
+the low points of that loss function and
+when we arrive at a low loss, that's
+
+33
+00:02:17,930 --> 00:02:20,799
+equivalent to making good predictions
+over our training data in terms of the
+
+34
+00:02:20,799 --> 00:02:25,030
+scores that come out. Now we also saw
+that there are two ways to evaluate the
+
+35
+00:02:25,030 --> 00:02:29,019
+gradient. There's the numerical gradient,
+and this is very easy to write but it's
+
+36
+00:02:29,019 --> 00:02:32,840
+extremely slow to evaluate, and there's the
+analytic gradient, which is, which you
+
+37
+00:02:32,840 --> 00:02:36,658
+obtain by using calculus and we'll be
+going into that in this lecture quite a
+
+38
+00:02:36,658 --> 00:02:41,318
+bit more and so it's fast, exact, which is
+great, but it's not, you can get it wrong
+
+39
+00:02:41,318 --> 00:02:45,969
+sometimes, and so we always do what we call
+a gradient check, where we write all
+
+40
+00:02:45,969 --> 00:02:48,639
+the expressions to compute the analytic
+gradients, and then we double check its
+
+41
+00:02:48,639 --> 00:02:51,828
+correctness with the numerical gradient, and
+so I'm not sure if you're going to see
+
+42
+00:02:51,829 --> 00:02:59,250
+that, you're going to see that definitely in the
+assignments. Okay, so, now you might be
+
+43
+00:02:59,250 --> 00:03:04,378
+tempted to, when you see this setup, we
+just want to derive the gradient of the
+
+44
+00:03:04,378 --> 00:03:08,459
+loss function with respect to the weights. You
+might be tempted to just, you know, write
+
+45
+00:03:08,459 --> 00:03:11,709
+out the full loss and just start to take
+the gradients as in your calculus
+
+46
+00:03:11,709 --> 00:03:16,120
+class, but the point I'd like to make is that you
+should think much more of this in terms
+
+47
+00:03:16,120 --> 00:03:22,480
+of computational graphs, instead of just
+taking, thinking of one giant expression
+
+48
+00:03:22,480 --> 00:03:25,369
+that you're going to derive with
+pen and paper the expression for the
+
+49
+00:03:25,370 --> 00:03:27,549
+gradient and the reason for that
+
+50
+00:03:27,549 --> 00:03:31,689
+so here we are thinking about these
+values flow, flowing through a
+
+51
+00:03:31,689 --> 00:03:35,509
+computational graph where you have these
+operations, the circles, and they're
+
+52
+00:03:35,509 --> 00:03:38,979
+basically little function pieces that
+transform your inputs all the way to the
+
+53
+00:03:38,979 --> 00:03:43,018
+loss function at the end, so we start off
+with our data and our parameters as
+
+54
+00:03:43,019 --> 00:03:46,079
+inputs. They feed through this
+computational graph, which is just all
+
+55
+00:03:46,079 --> 00:03:49,790
+these series of functions along the way,
+and at the end, we get a single number
+
+56
+00:03:49,790 --> 00:03:53,590
+which is the loss. And the reason that
+I'd like you to think about it this way is
+
+57
+00:03:53,590 --> 00:03:57,069
+that, these expressions right now look
+very small and you might be able to
+
+58
+00:03:57,070 --> 00:04:00,339
+derive these gradients, but these
+expressions, in computational graphs, are
+
+59
+00:04:00,340 --> 00:04:04,250
+about to get very big and, so for example,
+convolutional neural networks will have
+
+60
+00:04:04,250 --> 00:04:08,829
+maybe hundreds or dozens of operations,
+so we'll have all these images
+
+61
+00:04:08,829 --> 00:04:12,939
+flowing through a pretty big computational
+graph to get our loss and so it becomes
+
+62
+00:04:12,939 --> 00:04:16,858
+impractical to just write out these
+expressions, and convolutional networks are
+
+63
+00:04:16,858 --> 00:04:19,370
+not even the worst of it.
Once you +actually start to, for example, do + +64 +00:04:19,370 --> 00:04:23,509 +something called a Neural Turing Machine, +which is a paper from DeepMind, where + +65 +00:04:23,509 --> 00:04:26,329 +this is basically differentiable +Turing machine + +66 +00:04:26,329 --> 00:04:30,128 +so the whole thing is differentiable, the +whole procedure that the computer is + +67 +00:04:30,129 --> 00:04:33,590 +performing on a tape is made smooth +and is differentiable computer basically + +68 +00:04:33,591 --> 00:04:39,519 +and the computational graph of this is huge, +and not only is this, this is not it + +69 +00:04:39,519 --> 00:04:42,478 +because what you end up doing and we're +going to recurrent neural networks in a + +70 +00:04:42,478 --> 00:04:45,848 +bit, but what you end up doing is you end +up unrolling this graph, so think about + +71 +00:04:45,848 --> 00:04:51,658 +this graph copied many hundreds of time +steps and so you end up with this giant + +72 +00:04:51,658 --> 00:04:56,379 +monster of hundreds of thousands of +nodes and little computational units and + +73 +00:04:56,379 --> 00:04:59,819 +so it's impossible to write out, you know, +here's the loss for the Neural Turing + +74 +00:04:59,819 --> 00:05:03,650 +Machine. It's just impossible, it would +take like billions of pages, and so we + +75 +00:05:03,651 --> 00:05:07,068 +have to think about this more in terms +of data structures of little functions + +76 +00:05:07,069 --> 00:05:11,710 +transforming intermediate variables to +guess the loss at the very end. Okay. So we're going + +77 +00:05:11,711 --> 00:05:14,318 +to be looking specifically at +computational graphs and how we can derive + +78 +00:05:14,319 --> 00:05:20,560 +the gradient on the inputs with respect +to the loss function at the very end. Okay. + +79 +00:05:20,560 --> 00:05:25,569 +So let's start off simple and concrete. So let's +consider a very small computational graph where + +80 +00:05:25,569 --> 00:05:29,778 +we have scalars as inputs to this graph, x, y and +z, and they take on these specific values + +81 +00:05:29,778 --> 00:05:35,069 +in this example of -2, 5 and -4, +and we have this very small graph + +82 +00:05:35,069 --> 00:05:38,669 +or circuit, you'll hear me refer to these +interchangeably either as a graph or + +83 +00:05:38,670 --> 00:05:43,038 +a circuit, so we have this graph that at +the end gives us this output -12. + +84 +00:05:43,038 --> 00:05:47,288 +Okay. So here what I've done is I've +already pre-filled what we'll call the + +85 +00:05:47,288 --> 00:05:51,120 +forward pass of this graph, where I set +the inputs and then I compute the outputs + +86 +00:05:51,120 --> 00:05:56,288 +And now what we'd like to do is, we'd like to +derive the gradients of the expression on + +87 +00:05:56,288 --> 00:06:01,250 +the inputs, and, so what we'll do now, is, +I'll introduced this intermediate variable + +88 +00:06:01,250 --> 00:06:07,050 +q after the plus gate, so there's a plus gate +and times gate, as I'll refer to them, and + +89 +00:06:07,050 --> 00:06:10,800 +this plus gate is computing this output +q, and so q is this intermediate as + +90 +00:06:10,800 --> 00:06:14,788 +a result of x plus y, and then f is a +multiplication of q and z. And what I've written + +91 +00:06:14,788 --> 00:06:19,360 +out here is, basically, what we want is the +gradients, the derivatives, df/dx, df/dy, + +92 +00:06:19,360 --> 00:06:25,598 +df/dz. 
And I've written out +the intermediate, these little gradients + +93 +00:06:25,598 --> 00:06:30,120 +for every one of these two expressions +separately, so now we've performed forward + +94 +00:06:30,120 --> 00:06:33,490 +pass going from left to right, and what +we'll do now is we'll derive the backward + +95 +00:06:33,490 --> 00:06:35,699 +pass, we'll go from the back + +96 +00:06:35,699 --> 00:06:39,300 +to the front, computing gradients of all +the intermediates in our circuit until + +97 +00:06:39,300 --> 00:06:43,509 +at the very end, we're going to build up to +get the gradients on the inputs, and so we + +98 +00:06:43,509 --> 00:06:47,680 +start off at the very right and, as a +base case sort of this recursive + +99 +00:06:47,680 --> 00:06:52,670 +procedure, we're considering the gradient +of f with respective to f, so this is just the + +100 +00:06:52,670 --> 00:06:56,020 +identity function, so what is the +derivative of it, + +101 +00:06:56,021 --> 00:06:57,240 +identity mapping? + +102 +00:06:59,000 --> 00:07:06,240 +What is the gradient of df by df? It's one, right? +So the identity has a gradient of one. + +103 +00:07:06,240 --> 00:07:10,329 +So that's our base case. We start off +with a one, and now we're going to go + +104 +00:07:10,329 --> 00:07:18,519 +backwards through this graph. So, we want +the gradient of f with respect to z. + +105 +00:07:18,519 --> 00:07:21,089 +So what is that in this computational graph? + +106 +00:07:24,019 --> 00:07:27,089 +Okay, it's q, so we have that written out right + +107 +00:07:27,089 --> 00:07:32,879 +here and what is q in this particular +example? It's 3, right? So the gradient + +108 +00:07:32,879 --> 00:07:36,279 +on z, according to this will, become +just 3. So I'm going to be writing the gradients + +109 +00:07:36,279 --> 00:07:42,309 +under the lines in red and the values +are in green above the lines. So with the + +110 +00:07:42,310 --> 00:07:48,420 +gradient on the, in the front is 1, and +now the gradient on z is 3, and what red 3 is telling + +111 +00:07:48,420 --> 00:07:52,009 +you really intuitively, keep in mind the +interpretation of a gradient, is what + +112 +00:07:52,009 --> 00:07:58,459 +that's saying is that the influence of +z on the final value is positive and + +113 +00:07:58,459 --> 00:08:02,859 +with, sort of a force of 3. So if I +increment z by a small amount h + +114 +00:08:02,860 --> 00:08:07,759 +then the output of the circuit will +react by increasing, because it's a + +115 +00:08:07,759 --> 00:08:13,009 +positive 3, will increase by 3h, so +small change will result in a positive + +116 +00:08:13,010 --> 00:08:18,560 +change in the output. Now the +gradient on q in this case will be + +117 +00:08:21,009 --> 00:08:30,860 +So df/dq is z. What is z? -4. Okay? +So we get a gradient of -4 on that path + +118 +00:08:30,860 --> 00:08:34,599 +of the circuit, and what that's saying is +that if q were to increase, then the output + +119 +00:08:34,599 --> 00:08:39,740 +of the circuit will decrease, okay, by, if +you increase by h, the output of the circuit + +120 +00:08:39,740 --> 00:08:44,789 +will decrease by 4h. That's the +slope, is -4. Okay, now we're going + +121 +00:08:44,789 --> 00:08:48,480 +to continue this recursive process through this +plus gate and this is where things get + +122 +00:08:48,480 --> 00:08:49,039 +slightly interesting + +123 +00:08:49,039 --> 00:08:54,328 +I suppose. 
So we'd like to compute the
+gradient of f with respect to y,
+and so the gradient on y in
+this particular graph will become
+
+125
+00:09:03,909 --> 00:09:07,179
+Let's just guess and then we'll
+see how this gets derived properly.
+
+126
+00:09:12,209 --> 00:09:15,208
+So I hear some murmurs of the right answer.
+It will be -4. So let's see how.
+
+127
+00:09:15,209 --> 00:09:17,800
+So there are many ways to derive it at this point
+
+128
+00:09:17,801 --> 00:09:21,000
+because the expression is very small and you can
+kind of, glance at it, but the way I'd like to
+
+129
+00:09:21,001 --> 00:09:23,979
+think about this is by applying chain rule, okay.
+
+130
+00:09:23,980 --> 00:09:27,709
+So the chain rule says that if you would
+like to derive the gradient of f on y
+
+131
+00:09:27,710 --> 00:09:33,208
+then it's equal to df/dq times
+dq/dy, right? And so we've
+
+132
+00:09:33,208 --> 00:09:36,438
+computed both of those expressions, in
+particular df/dq, we know, is
+
+133
+00:09:36,438 --> 00:09:42,519
+-4, so that's the effect of the
+influence of q on f, is df/dq, which is
+
+134
+00:09:42,519 --> 00:09:46,619
+-4, and now we know the local,
+we'd like to know the local influence
+
+135
+00:09:46,619 --> 00:09:52,449
+of y on q, and that local influence
+of y on q is 1, because that's the local,
+
+136
+00:09:52,450 --> 00:09:58,969
+what I'll refer to as the local derivative of y
+for the plus gate, and so the chain rule
+
+137
+00:09:58,970 --> 00:10:02,019
+tells us that the correct thing to do to
+chain these two gradients, the local
+
+138
+00:10:02,019 --> 00:10:06,139
+gradient of y on q, and the,
+kind of global gradient of q on the
+
+139
+00:10:06,139 --> 00:10:10,948
+output of the circuit, is to multiply
+them. So we'll get -4 times 1
+
+140
+00:10:10,948 --> 00:10:14,588
+And so, this is kind of the, the crux of how
+backpropagation works. This is a very
+
+141
+00:10:14,589 --> 00:10:18,209
+important to understand here that, we have
+these two pieces that we keep
+
+142
+00:10:18,210 --> 00:10:24,289
+multiplying through when we perform the chain rule.
+We have q computed as x + y, and
+
+143
+00:10:24,289 --> 00:10:29,379
+the derivatives on x and y, with respect to that
+single expression, are 1 and 1. So keep
+
+144
+00:10:29,379 --> 00:10:32,749
+in mind the interpretation of the gradient.
+What that's saying is that x and y have a
+
+145
+00:10:32,749 --> 00:10:38,509
+positive influence on q, with a slope
+of 1. So increasing x by h
+
+146
+00:10:38,509 --> 00:10:44,548
+will increase q by h, and what we'd eventually
+like is, we'd like the influence of y
+
+147
+00:10:44,548 --> 00:10:49,980
+on the final output of the circuit. And so
+the way this ends up working is, you take
+
+148
+00:10:49,980 --> 00:10:53,480
+the influence of y on q, and we know
+the influence of q on the final loss
+
+149
+00:10:53,480 --> 00:10:57,058
+which is what we are recursively
+computing here through this graph, and
+
+150
+00:10:57,058 --> 00:11:00,350
+the correct thing to do is to multiply
+them, so we end up with -4 times 1
+
+151
+00:11:00,351 --> 00:11:05,189
+gets you -4. And so the way this
+works out is, basically what this is
+
+152
+00:11:05,190 --> 00:11:08,649
+saying is that the influence of y on
+the final output of the circuit is -4
+
+153
+00:11:08,649 --> 00:11:14,649
+so increasing y should decrease the
+output of the circuit by -4 times the
+
+154
+00:11:14,649 --> 00:11:18,230
+little change that you've made. And the way
+that ends up working out is, y has a
And the way +that end up working out is y has a + +155 +00:11:18,230 --> 00:11:21,810 +positive influence on q, so increasing +y, slightly increases q + +156 +00:11:21,810 --> 00:11:27,959 +which slightly decreases the output of the circuit, +okay? So chain rule is kind of giving us this + +157 +00:11:27,960 --> 00:11:29,320 +correspondence. Go ahead. + +158 +00:11:29,320 --> 00:11:38,360 +(Student is asking question) + +159 +00:11:38,360 --> 00:11:42,559 +Yeap, thank you. So we're going to get into this. +You'll see many, basically this entire class + +160 +00:11:42,559 --> 00:11:45,259 +is about this, so you'll see many +many instantiations of this and + +161 +00:11:45,259 --> 00:11:48,889 +I'll drill this into you by the end of this +class and you'll understand it. You will not + +162 +00:11:48,889 --> 00:11:51,870 +have any symbolic expressions anywhere +once we compute this, once we're actually + +163 +00:11:51,870 --> 00:11:54,639 +implementing this and you'll see +implementations of it later in this. + +164 +00:11:54,639 --> 00:11:57,009 +It will always be just be vectors and numbers. + +165 +00:11:57,009 --> 00:12:02,230 +Raw vectors, numbers. Okay, and looking at x, we +have a very similar thing that happens. + +166 +00:12:02,230 --> 00:12:05,889 +We want df/dx. That's our final objective, + but, and we have to combine it. + +167 +00:12:05,889 --> 00:12:09,799 +We know what the x's, what is x's influence on q +and what is q's influence + +168 +00:12:09,799 --> 00:12:13,979 +on the end of the circuit, and so that +ends up being the chain rule, so you take + +169 +00:12:13,980 --> 00:12:19,240 +-4 times 1 and gives you -4, okay? +So the way this works, to generalize a + +170 +00:12:19,240 --> 00:12:23,289 +bit from this example and the way to think +about it is as follows. You are a gate + +171 +00:12:23,289 --> 00:12:28,429 +embedded in a circuit and this is a very +large computational graph or circuit and + +172 +00:12:28,429 --> 00:12:32,250 +you receive some inputs, some +particular numbers x and y come in + +173 +00:12:32,250 --> 00:12:39,059 +and you perform some operation f on them and +compute some output z. And now this + +174 +00:12:39,059 --> 00:12:43,019 +value of z goes into computational graph and +something happens to it but you're just + +175 +00:12:43,019 --> 00:12:46,169 +a gate hanging out in a circuit and +you're not sure what happens, but by the + +176 +00:12:46,169 --> 00:12:50,939 +end of the circuit the loss gets computed, okay? And +that's the forward pass and then we're + +177 +00:12:50,940 --> 00:12:56,250 +proceeding recursively in the reverse +order backwards, but before that actually, + +178 +00:12:56,250 --> 00:13:01,120 +before I get to that part, right away when I get +x and y, the thing I'd like to point out that + +179 +00:13:01,120 --> 00:13:05,279 +during the forward pass, if you're this +gate and you get to your values x and y + +180 +00:13:05,279 --> 00:13:08,500 +you compute your output z, and there's another +thing you can compute right away and + +181 +00:13:08,500 --> 00:13:10,230 +that is the local gradients on x and y. + +182 +00:13:10,230 --> 00:13:14,789 +So I can compute those right away +because I'm just a gate and I know what + +183 +00:13:14,789 --> 00:13:18,009 +I'm performing, like say addition or +multiplication, so I know the influence that + +184 +00:13:18,009 --> 00:13:24,259 +x and y have on my output value, so I can +compute those guys right away, okay? 
But then + +185 +00:13:24,259 --> 00:13:25,389 +what happens + +186 +00:13:25,389 --> 00:13:29,769 +near the end, so the loss gets computed +and now we're going backwards, I'll eventually learn + +187 +00:13:29,769 --> 00:13:32,499 +about what is my influence on + +188 +00:13:32,499 --> 00:13:37,839 +the final output of the circuit, the loss. +So I'll learn what is dL/dz in there. + +189 +00:13:37,839 --> 00:13:41,419 +The gradient will flow into me and what I +have to do is I have to chain that + +190 +00:13:41,419 --> 00:13:45,278 +gradient through this recursive case, so +I have to make sure to chain the + +191 +00:13:45,278 --> 00:13:48,778 +gradient through my operation that I performed +and it turns out that the correct thing + +192 +00:13:48,778 --> 00:13:52,068 +to do here by chain rule, really what it's +saying, is that the correct thing to do is to + +193 +00:13:52,068 --> 00:13:56,068 +multiply your local gradient with that +gradient and that actually gives you the + +194 +00:13:56,068 --> 00:13:57,838 +dL/dx that gives you the + +195 +00:13:57,839 --> 00:14:02,739 +influence of x on the final output of +the circuit. So really, chain rule is just + +196 +00:14:02,739 --> 00:14:08,229 +this added multiplication. where we take our, +what I'll call, global gradient of this + +197 +00:14:08,229 --> 00:14:12,669 +gate on the output, and we chain it +through the local gradient, and the same + +198 +00:14:12,669 --> 00:14:18,509 +thing goes for y. So it's just a +multiplication of that guy, that gradient + +199 +00:14:18,509 --> 00:14:22,889 +by your local gradient if you're a gate. +And then remember that these x's and y's + +200 +00:14:22,889 --> 00:14:27,229 +they are coming from different gates, right? +So you end up with recursing + +201 +00:14:27,229 --> 00:14:31,899 +this process through the entire computational +circuit, and so these gates + +202 +00:14:31,899 --> 00:14:36,808 +just basically communicate to each other +the influence on the final loss, so they + +203 +00:14:36,808 --> 00:14:39,688 +tell each other, okay if this is a positive +gradient that means you're positively + +204 +00:14:39,688 --> 00:14:43,198 +influencing the loss, if it's a negative +gradient you're negatively + +205 +00:14:43,198 --> 00:14:46,788 +influencing the loss, and these just get all +multiplied through the circuit by these + +206 +00:14:46,788 --> 00:14:51,019 +local gradients and you end up with, and +this process is called backpropagation. + +207 +00:14:51,019 --> 00:14:54,489 +It's a way of computing through a +recursive application of chain rule + +208 +00:14:54,489 --> 00:14:58,399 +through computational graph, the influence +of every single intermediate value in + +209 +00:14:58,399 --> 00:15:02,158 +that graph on the final loss function. +So we'll see many examples of this + +210 +00:15:02,158 --> 00:15:06,918 +throughout the lecture. I'll go into a +specific example that is a slightly + +211 +00:15:06,918 --> 00:15:11,298 +larger and we'll work through it in detail. +But I don't know if there are any questions at + +212 +00:15:11,298 --> 00:15:13,000 +this point that anyone would like to ask. +Go ahead. + +213 +00:15:13,001 --> 00:15:16,000 +What happens if z is used by two other nodes? + +214 +00:15:16,001 --> 00:15:19,000 +If z is used by multiple nodes, I'm going to come back to that. + +215 +00:15:19,000 --> 00:15:23,537 +You add the gradients. The gradient, the +correct thing to do is you add them. 
So if z is being influenced in multiple places
+in the circuit, the backward flows will add.
+
+217
+00:15:29,928 --> 00:15:31,539
+I will come back to that point. Go ahead.
+
+218
+00:15:31,539 --> 00:15:53,038
+(Student is asking question)
+
+219
+00:15:53,039 --> 00:15:59,139
+Yeap. So I think, I would've repeated your question,
+but you're jumping ahead like 100 slides.
+
+220
+00:15:59,539 --> 00:16:03,139
+So we're going to get to all of those
+issues and we're going to see, you're
+
+221
+00:16:03,139 --> 00:16:05,769
+going to get what we call vanishing
+gradient problems and so on.
+
+222
+00:16:05,769 --> 00:16:10,669
+We'll see. Okay, let's go through another
+example to make this more concrete.
+
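A sketch of the gate behavior described so far: in the forward pass a gate computes its output (and can already compute its local gradients), and in the backward pass it multiplies those local gradients by the gradient arriving from above; where one value fans out into several gates, the backward contributions add, as noted a moment ago. A minimal multiply gate:

~~~python
def multiply_forward(x, y):
    return x * y

def multiply_backward(x, y, dout):
    # chain rule: local gradient of each input, times the gradient from above
    dx = y * dout
    dy = x * dout
    return dx, dy   # if x or y also feeds other gates, sum those gradients
~~~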
+223
+00:16:10,669 --> 00:16:14,318
+So here we have another circuit. It happens
+to be computing a little two-dimensional
+
+224
+00:16:14,318 --> 00:16:18,179
+sigmoid neuron, but for now don't worry about
+that interpretation. Just think of this
+
+225
+00:16:18,179 --> 00:16:22,849
+as, that's an expression so one-over-
+one-plus-e-to-the-whatever, so the number of
+
+226
+00:16:22,850 --> 00:16:29,000
+inputs here is five, and we're computing that function
+and we have a single output over there, okay?
+
+227
+00:16:29,000 --> 00:16:32,490
+And I've translated that mathematical expression
+into this computational graph form, so
+
+228
+00:16:32,490 --> 00:16:35,769
+we have to recursively from inside out
+compute this expression so we first do
+
+229
+00:16:35,769 --> 00:16:42,129
+all the little w times x's, and then
+we add them all up and then we take a
+
+230
+00:16:42,129 --> 00:16:46,129
+negative of it and then we exponentiate
+that and then we add one and then we
+
+231
+00:16:46,129 --> 00:16:49,769
+finally divide and we get the result of
+the expression. And so what we're going to do
+
+232
+00:16:49,769 --> 00:16:52,409
+now is we're going to backpropagate
+through this expression. We're going to
+
+233
+00:16:52,409 --> 00:16:56,500
+compute what the influence of every
+single input value is on the output of
+
+234
+00:16:56,500 --> 00:16:59,230
+this expression, what is the gradient here.
+Yeap, go ahead.
+
+235
+00:16:59,231 --> 00:17:10,229
+(Student is asking question)
+
+236
+00:17:10,230 --> 00:17:15,229
+So for now, so you're concerned about the
+interpretation of plus, maybe, in these circles.
+
+237
+00:17:15,230 --> 00:17:22,039
+For now, let's just assume that this plus is a binary
+plus. It's a binary plus gate, and we have the
+
+238
+00:17:22,039 --> 00:17:26,519
+plus-one gate. I'm making up these gates on
+the spot, and we'll see that what is a
+
+239
+00:17:26,519 --> 00:17:31,000
+gate or is not a gate is kind of up to
+you. I'll come back to this point in a bit.
+
+240
+00:17:31,001 --> 00:17:35,639
+So for now, we just have several more
+gates that we're using throughout, and so
+
+241
+00:17:35,640 --> 00:17:38,650
+I'd just like to write out as we go
+through this example several of these
+
+242
+00:17:38,650 --> 00:17:42,720
+derivatives. So we have exponentiation and
+we know for every little local gate what these
+
+243
+00:17:42,720 --> 00:17:49,048
+local gradients are, right? So we can derive
+that using calculus. So e^x derivative is e^x and
+
+244
+00:17:49,048 --> 00:17:52,900
+so on. So these are all the operations
+and also addition and multiplication
+
+245
+00:17:52,900 --> 00:17:56,040
+which I'm assuming that you have
+memorized in terms of what the gradients
+
+246
+00:17:56,040 --> 00:17:58,970
+look like. So we're going to start
+off at the end of the circuit and I've
+
+247
+00:17:58,970 --> 00:18:03,450
+already filled in a 1.00
+in the back because that's how we always
+
+248
+00:18:03,450 --> 00:18:04,890
+start this recursion with a 1.0
+
+249
+00:18:04,891 --> 00:18:10,519
+right, since that's the gradient
+on the identity function. Now we're going
+
+250
+00:18:10,519 --> 00:18:17,849
+to backpropagate through this 1/x
+operation, okay? So the derivative of 1/x
+
+251
+00:18:17,849 --> 00:18:22,048
+the local gradient is -1/(x^2),
+so that 1/x gate
+
+252
+00:18:22,048 --> 00:18:27,119
+during the forward pass received
+input 1.37 and right away that 1/x gate
+
+253
+00:18:27,119 --> 00:18:30,759
+could have computed what the
+local gradient was. The local gradient was
+
+254
+00:18:30,759 --> 00:18:35,048
+-1/(x^2) and now during backpropagation,
+it has to, by chain rule,
+
+255
+00:18:35,048 --> 00:18:40,750
+multiply that local gradient by the
+gradient of it on the final output of the circuit
+
+256
+00:18:40,750 --> 00:18:44,789
+which is easy because it happens
+to be at the end. So what ends up being the
+
+257
+00:18:44,789 --> 00:18:49,349
+expression for the backpropagated
+gradient here, from the 1/x gate?
+
+258
+00:18:54,049 --> 00:18:59,048
+The chain rule always has two pieces: local
+gradient times the gradient from the top
+
+259
+00:18:59,049 --> 00:19:01,300
+or from above.
+
+260
+00:19:04,301 --> 00:19:08,069
+(Student is answering)
+
+261
+00:19:08,301 --> 00:19:12,500
+Um, yeah. Okay. Yeah, so that's correct.
+
+262
+00:19:12,501 --> 00:19:18,069
+So we get -1/x^2, which is the gradient df/dx.
+So that is the local gradient.
+
+263
+00:19:18,069 --> 00:19:23,480
+-1/1.37^2 and then multiplied by 1.0 which is
+
+264
+00:19:23,480 --> 00:19:27,940
+the gradient from above, which is really just 1
+because we've just started, and I'm applying
+
+265
+00:19:27,940 --> 00:19:34,850
+chain rule right away here and the output is
+-0.53. So that's the gradient on
+
+266
+00:19:34,850 --> 00:19:38,798
+that piece of the wire, where this value
+was flowing, okay. So it has a negative
+
+267
+00:19:38,798 --> 00:19:43,889
+effect on the output. And you might expect
+that right, because if you were to
+
+268
+00:19:43,890 --> 00:19:47,850
+increase this value and then it goes
+through a gate of 1/x, then if you
+
+269
+00:19:47,851 --> 00:19:50,939
+increase this, 1/x gets smaller, so
+that's why you're seeing a negative
+
+270
+00:19:50,940 --> 00:19:55,620
+gradient, right. So we're going to continue
+backpropagation here. The next gate
+
+271
+00:19:55,621 --> 00:19:58,400
+in the circuit, it's adding a constant of 1,
+
+272
+00:19:58,400 --> 00:20:01,048
+so the local gradient, if you look at
+
+273
+00:20:01,048 --> 00:20:06,960
+adding a constant to a value, the
+gradient on x is just 1, right,
+
+274
+00:20:06,961 --> 00:20:13,169
+from basic calculus. And so the chained
+gradient here that we continue along the wire
+
+275
+00:20:13,169 --> 00:20:17,868
+will be...
+(Student is answering)
+
+276
+00:20:17,869 --> 00:20:22,940
+We have a local gradient, which is
+1, times the gradient from above the
+
+277
+00:20:22,940 --> 00:20:28,590
+gate, which it has just learned is -0.53, okay?
+So -0.53 continues along the
+
+278
+00:20:28,590 --> 00:20:34,709
+wire unchanged. And intuitively that
+makes sense right, because this value
+
+279
+00:20:34,710 --> 00:20:38,319
+flows and it has some influence on the
+final circuit and now, if you're
+
+280
+00:20:38,319 --> 00:20:42,798
+adding 1, then its influence, its rate
+of change, its slope towards the final
+
+281
+00:20:42,798 --> 00:20:46,970
+value doesn't change. If you increase
+this by some amount, the effect at the
+
+282
+00:20:46,970 --> 00:20:51,548
+end will be the same, because the rate of
+change doesn't change through the +1 gate.
+
+283
+00:20:51,548 --> 00:20:57,859
+It's just a constant offset, okay? We continue
+derivation here. So the gradient of e^x is
+
+284
+00:20:57,859 --> 00:21:01,599
+e^x, so to continue backpropagation
+we're going to perform,
+
+285
+00:21:01,599 --> 00:21:05,000
+so this gate saw input of -1.
+
+286
+00:21:05,000 --> 00:21:08,329
+It right away could have computed its
+local gradient, and now it knows that the
+
+287
+00:21:08,329 --> 00:21:12,259
+gradient from above is -0.53.
+So to continue backpropagation
+
+288
+00:21:12,259 --> 00:21:15,000
+here and apply chain rule, we would receive...
+
+289
+00:21:15,000 --> 00:21:17,400
+(Student is answering)
+
+290
+00:21:17,400 --> 00:21:20,000
+Okay, so these are mostly
+rhetorical questions so I'm
+
+291
+00:21:20,000 --> 00:21:25,119
+not sure, but yeah, basically
+e^(-1) which is the e^x,
+
+292
+00:21:25,119 --> 00:21:30,569
+the x input to this exp gate times the chain rule,
+right, so the gradient from above is -0.53
+
+293
+00:21:30,569 --> 00:21:35,269
+so we keep multiplying that on. So what
+is the effect on me and what do I have an
+
+294
+00:21:35,269 --> 00:21:39,069
+effect on the final end of the circuit,
+those are being always multiplied. So we
+
+295
+00:21:39,069 --> 00:21:46,859
+get -0.2 at this point. So now we
+have a *(-1) gate. So what
+
+296
+00:21:46,859 --> 00:21:50,279
+ends up happening, what happens to the
+gradient when you do a times -1 in the
+
+297
+00:21:50,279 --> 00:21:53,139
+computational graph?
+
+298
+00:21:53,139 --> 00:21:57,139
+It flips around, right? Because we have
+basically, a constant multiply of input
+
+299
+00:21:57,140 --> 00:22:02,038
+which happened to be a constant of
+-1, so 1 * -1
+
+300
+00:22:02,038 --> 00:22:05,548
+gave us -1 in the forward pass,
+and so now we have to
+
+301
+00:22:05,548 --> 00:22:09,569
+multiply by -1, that's the local gradient,
+times the gradient from above which is -0.2
+
+302
+00:22:09,569 --> 00:22:14,879
+so we end up with just +0.2 now.
+So now we're continuing backpropagation
+
+303
+00:22:14,880 --> 00:22:21,110
+We're backpropagating '+' and this '+' operation
+has multiple inputs here, the gradient,
+
+304
+00:22:21,110 --> 00:22:25,599
+the local gradient for the plus gate is 1
+and 1, so what ends up happening to,
+
+305
+00:22:25,599 --> 00:22:27,359
+what gradients flow along the output wires?
+
+306
+00:22:42,359 --> 00:22:48,089
+So the plus gate's local gradient on all
+of its inputs will always be just one, right, because
+
+307
+00:22:48,089 --> 00:22:53,769
+if you just have a function, you know,
+x+y, then for that function
+
+308
+00:22:53,769 --> 00:22:58,109
+the gradient on either x or y is just one
+and so what you end up getting is just
+
+309
+00:22:58,109 --> 00:23:03,619
+1 * 0.2. 
And so, in fact for a
+plus gate, you always see the same fact
+
+310
+00:23:03,619 --> 00:23:07,469
+where the local gradient of all of its
+inputs is 1, and so whatever gradient it
+
+311
+00:23:07,470 --> 00:23:11,289
+gets from above, it just always
+distributes gradient equally to all of
+
+312
+00:23:11,289 --> 00:23:14,339
+its inputs, because in the chain rule,
+they'll get multiplied and when you multiply by 1,
+
+313
+00:23:14,339 --> 00:23:18,129
+something remains unchanged. So a '+'
+gate, it's kind of like a gradient
+
+314
+00:23:18,130 --> 00:23:22,170
+distributor, where if something flows in
+from the top, it will just spread out all
+
+315
+00:23:22,170 --> 00:23:26,560
+the gradients equally to all of its
+children. And so we've already received
+
+316
+00:23:26,560 --> 00:23:32,139
+one of the inputs' gradient, 0.2, here
+on the very final output of the circuit
+
+317
+00:23:32,140 --> 00:23:35,970
+and so this influence has been computed
+through a series of applications of
+
+318
+00:23:35,970 --> 00:23:42,450
+chain rule along the way. There was another
+plus gate that I've skipped over, and so this
+
+319
+00:23:42,450 --> 00:23:47,090
+0.2 kind of distributes to both,
+0.2 and 0.2, equally so we've already done a
+
+320
+00:23:47,090 --> 00:23:51,750
+plus gate, and there's a multiply gate there,
+and so now we're going to backpropagate
+
+321
+00:23:51,750 --> 00:23:55,940
+through that multiply operation.
+And so the local grad, so the,
+
+322
+00:23:55,940 --> 00:24:02,450
+so what will be the gradients for w0 and x0?
+What will be the gradient for w0, specifically?
+
+323
+00:24:02,450 --> 00:24:06,450
+(Student is answering)
+
+324
+00:24:06,450 --> 00:24:17,059
+Did someone say 0? 0 will be wrong. It will be,
+so the gradient on w1 will be, w0 sorry, will be
+
+325
+00:24:17,059 --> 00:24:24,389
+-1 * 0.2. Good. And the gradient on x0 will
+be, there is a bug, by the way, in the slide
+
+326
+00:24:24,390 --> 00:24:27,840
+that I just noticed like a few minutes
+before I actually created the class.
+
+327
+00:24:27,840 --> 00:24:34,289
+Created the, started the class. So you see
+0.39 there, it should be 0.4. It's
+
+328
+00:24:34,289 --> 00:24:37,480
+because of a bug in the visualization
+because I'm truncating at 2-decimal
+
+329
+00:24:37,480 --> 00:24:41,190
+digits, but anyways, basically that should be
+0.4 because the way you get that
+
+330
+00:24:41,190 --> 00:24:45,400
+is 2 * 0.2 gives you 0.4
+just like I've written out over there.
+
+331
+00:24:45,400 --> 00:24:50,980
+So that's what the output should be there.
+Okay, so we've backpropagated this
+
+332
+00:24:50,980 --> 00:24:55,190
+circuit here and we've backpropagated through
+this expression and so you might imagine in
+
+333
+00:24:55,190 --> 00:24:59,289
+our actual downstream applications,
+we'll have data and all the parameters as inputs;
+
+334
+00:24:59,289 --> 00:25:03,450
+the loss function is at the top at the
+end, so we'll do forward pass to evaluate
+
+335
+00:25:03,450 --> 00:25:06,440
+the loss function and then we'll backpropagate
+through every piece of
+
+336
+00:25:06,440 --> 00:25:10,450
+computation we've done along the way, and
+we'll backpropagate through every gate to
+
+337
+00:25:10,450 --> 00:25:14,150
+get our inputs, and backpropagate just
+means apply chain rule many many times
+
+338
+00:25:14,150 --> 00:25:18,220
+and we'll see how that is implemented in a bit.
+Sorry, did you have a question? 
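+
+As a quick aside, the staged walkthrough above fits in a few lines of Python.
+This is only a hedged sketch: the input values (w0=2, x0=-1, w1=-3, x1=-2,
+w2=-3) are the ones implied by the numbers quoted from the slide (the sum is
+1.0, e^(-1) is about 0.37, and so on), and the variable names are made up for
+illustration.
+
+~~~python
+import math
+
+w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0   # assumed slide values
+
+# forward pass, one gate at a time
+dot = w0*x0 + w1*x1 + w2          # 1.0
+e = math.exp(-dot)                # 0.37
+denom = 1.0 + e                   # 1.37
+f = 1.0 / denom                   # 0.73
+
+# backward pass: local gradient times gradient from above, in reverse
+df = 1.0                          # the 1.0 that starts the recursion
+ddenom = (-1.0 / denom**2) * df   # 1/x gate: -0.53
+de = 1.0 * ddenom                 # +1 gate passes the gradient through
+ddot = -1.0 * (math.exp(-dot) * de)  # exp gate, then *-1 flips sign: +0.2
+dw0, dx0 = x0 * ddot, w0 * ddot   # multiply gate: -0.2 and 0.4
+dw1, dx1 = x1 * ddot, w1 * ddot   # -0.4 and -0.6
+dw2 = ddot                        # plus gate distributes: 0.2
+~~~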
+
+339
+00:25:18,220 --> 00:25:20,520
+(Student is asking question)
+
+340
+00:25:20,521 --> 00:25:23,021
+Oh yes, so I'm going to skip
+that because it's the same.
+
+341
+00:25:23,021 --> 00:25:27,821
+So I'm going to skip the other times gate.
+Any other questions at this point?
+
+342
+00:25:27,821 --> 00:25:32,969
+(Student is asking question)
+
+343
+00:25:32,969 --> 00:25:37,200
+That's right. So the costs of forward and
+backward propagation are roughly equal.
+
+344
+00:25:37,200 --> 00:25:44,100
+(Student is asking question)
+
+345
+00:25:44,100 --> 00:25:45,869
+Well, it should be, it almost always ends
+
+346
+00:25:45,869 --> 00:25:49,500
+up being basically equal when you look
+at timings, usually the backward pass is slightly
+
+347
+00:25:49,500 --> 00:25:52,000
+slower, but yeah.
+
+348
+00:25:55,000 --> 00:25:58,710
+Okay, so let's see, one thing I
+wanted to point out, before we move on, is that
+
+349
+00:25:58,710 --> 00:26:02,350
+the setting of these gates, like these
+gates are arbitrary, so one thing I could
+
+350
+00:26:02,350 --> 00:26:06,509
+have done, for example, is, some of you
+may know this, I can collapse these gates
+
+351
+00:26:06,509 --> 00:26:10,549
+into one gate if I wanted to. For example,
+there is something called the sigmoid function
+
+352
+00:26:10,549 --> 00:26:14,069
+which has that particular form, so a sigma of x
+which is the sigmoid function
+
+353
+00:26:14,069 --> 00:26:19,460
+computes 1/(1+e^(-x))
+and so I could have rewritten that
+
+354
+00:26:19,460 --> 00:26:22,650
+expression and I could have collapsed all of
+those gates that made up the sigmoid
+
+355
+00:26:22,650 --> 00:26:27,769
+gate into a single sigmoid gate. And so there's a
+sigmoid gate here, and I could have done
+
+356
+00:26:27,769 --> 00:26:32,440
+that in a single go, sort of, and what I
+would have had to do, if I wanted to have
+
+357
+00:26:32,440 --> 00:26:37,980
+that gate, is I need to compute an
+expression for how this, so what is the
+
+358
+00:26:37,980 --> 00:26:41,670
+local gradient for the sigmoid gate
+basically? So what is the gradient on the
+
+359
+00:26:41,670 --> 00:26:44,470
+sigmoid gate on its input and I have to go
+through some math which I'm not going to
+
+360
+00:26:44,470 --> 00:26:46,980
+go into detail but you end up with that
+expression over there.
+
+361
+00:26:46,980 --> 00:26:51,750
+It ends up being (1-sigmoid(x)) * sigmoid(x).
+That's the local gradient and that
+
+362
+00:26:51,750 --> 00:26:55,450
+allows me to now put this piece into a
+computational graph, because once I know
+
+363
+00:26:55,450 --> 00:26:58,819
+how to compute the local gradient
+everything else is defined just through
+
+364
+00:26:58,819 --> 00:27:02,389
+chain rule and multiply everything
+together. So we can backpropagate
+
+365
+00:27:02,390 --> 00:27:06,720
+through this sigmoid gate now, and the way
+that would look is, the input to the
+
+366
+00:27:06,720 --> 00:27:11,750
+sigmoid gate was 1.0, that's what
+went into the sigmoid gate, and 0.73 went out.
+
+367
+00:27:11,750 --> 00:27:18,759
+So 0.73 is sigma of x, okay? 
And now we want
+the local gradient which is, as we've seen
+
+368
+00:27:18,759 --> 00:27:22,559
+from the math that I performed there
+(1 - sigma(x)) * sigma(x)
+
+369
+00:27:22,559 --> 00:27:26,450
+so you get, sigma(x) is 0.73, multiplying (1 - 0.73)
+
+370
+00:27:26,450 --> 00:27:31,170
+that's the local gradient and then times,
+we happen to be at the end
+
+371
+00:27:31,170 --> 00:27:34,170
+of the circuit, so times 1.0,
+which I'm not even writing.
+
+372
+00:27:34,170 --> 00:27:36,330
+So we end up with 0.2. And of course we
+
+373
+00:27:36,330 --> 00:27:37,649
+get the same answer
+
+374
+00:27:37,650 --> 00:27:42,220
+0.2, as we received before, 0.2,
+because calculus works, but basically we
+
+375
+00:27:42,220 --> 00:27:44,480
+could have broken
+this expression down and
+
+376
+00:27:44,480 --> 00:27:47,450
+done one piece at a time or we could just
+have a single sigmoid gate and that's
+
+377
+00:27:47,450 --> 00:27:51,569
+kind of up to us at what level of hierarchy
+we break these expressions
+
+378
+00:27:51,569 --> 00:27:52,339
+and so you'd like to
+
+379
+00:27:52,339 --> 00:27:55,829
+intuitively, cluster these expressions
+into single gates if it's very efficient
+
+380
+00:27:55,829 --> 00:27:59,800
+or easy to derive the local gradients
+because then those become your pieces.
+
+381
+00:28:00,000 --> 00:28:05,819
+(Student is asking question)
+
+382
+00:28:05,819 --> 00:28:10,529
+Yes. So the question is, do libraries typically
+do that? Do they worry about, you know,
+
+383
+00:28:10,529 --> 00:28:14,058
+what's easy or convenient to
+compute and the answer is yeah, I would say so.
+
+384
+00:28:14,058 --> 00:28:17,480
+So if you notice that there's some
+piece of operation you'd like to do over
+
+385
+00:28:17,480 --> 00:28:20,798
+and over again, and it has a very simple
+local gradient, then that's something very
+
+386
+00:28:20,798 --> 00:28:24,900
+appealing to actually create a single
+unit out of, and we'll see some of those
+
+387
+00:28:24,900 --> 00:28:30,230
+examples actually in a bit I think.
+Okay, I'd like to also point out that once you,
+
+388
+00:28:30,230 --> 00:28:32,490
+the reason I like to think about these
+computational graphs, is it really helps
+
+389
+00:28:32,490 --> 00:28:36,289
+your intuition to think about how gradients
+flow in a neural network. It's not just,
+
+390
+00:28:36,289 --> 00:28:39,369
+you don't want this to be a black
+box to you, you want to understand
+
+391
+00:28:39,369 --> 00:28:43,959
+intuitively how this happens, and you
+start to develop after a while of
+
+392
+00:28:43,960 --> 00:28:47,850
+looking at computational graphs intuitions
+about how these gradients flow, and this,
+
+393
+00:28:47,850 --> 00:28:52,029
+by the way, helps you debug some issues like,
+say, the vanishing gradient problem we'll get to:
+
+394
+00:28:52,029 --> 00:28:55,950
+it's much easier to understand exactly
+what's going wrong in your optimization
+
+395
+00:28:55,950 --> 00:28:59,250
+if you understand how gradients flow
+in networks. It will help you debug these
+
+396
+00:28:59,250 --> 00:29:02,740
+networks much more efficiently. And so
+some intuitions for example, we already
+
+397
+00:29:02,740 --> 00:29:07,609
+saw the add gate. It has a local
+gradient of 1 to all of its inputs, so
+
+398
+00:29:07,609 --> 00:29:11,279
+it's just a gradient distributor. 
That's
+like a nice way to think about it
+
+399
+00:29:11,279 --> 00:29:14,548
+whenever you have a plus operation
+anywhere in your score function or your
+
+400
+00:29:14,548 --> 00:29:18,740
+ConvNet or anywhere else. It just
+distributes gradients equally. The max gate is
+
+401
+00:29:18,740 --> 00:29:23,009
+instead, a gradient router, and the way this
+works is, if you look at the expression
+
+402
+00:29:23,009 --> 00:29:30,970
+like, we have. Great, these markers don't
+work. So if you have a very simple binary
+
+403
+00:29:30,970 --> 00:29:38,410
+expression of max(x, y), so this is a gate.
+Then, the gradient on x and y, if you
+
+404
+00:29:38,410 --> 00:29:42,570
+think about it, the gradient on the larger
+one of your inputs, whichever one was larger,
+
+405
+00:29:42,570 --> 00:29:46,389
+the gradient on that guy is one,
+and the smaller one has a gradient of 0.
+
+406
+00:29:46,390 --> 00:29:50,630
+And intuitively, that's because if one
+of these was smaller, then wiggling it has no
+
+407
+00:29:50,630 --> 00:29:53,220
+effect on the output because the other
+guy is larger and that's what ends up
+
+408
+00:29:53,220 --> 00:29:57,009
+propagating through the gate.
+So you end up with a gradient of 1 on the
+
+409
+00:29:57,009 --> 00:30:03,140
+larger one of the inputs, and so that's
+why max gate is a gradient router. If I'm
+
+410
+00:30:03,140 --> 00:30:06,420
+a max gate and I have received several
+inputs, one of them was the largest of
+
+411
+00:30:06,420 --> 00:30:09,550
+all of them and that's the value that I
+propagated through the circuit.
+
+412
+00:30:09,550 --> 00:30:12,909
+At backpropagation time, I'm just going to
+receive my gradient from above and I'm
+
+413
+00:30:12,910 --> 00:30:16,590
+going to route it to whoever was my
+largest input. So it's a gradient router.
+
+414
+00:30:17,000 --> 00:30:22,569
+And the multiply gate is a gradient switcher.
+Actually I don't think that's a very good
+
+415
+00:30:22,569 --> 00:30:26,960
+way to look at it, but I'm referring to
+the fact that it's not actually,
+
+416
+00:30:26,960 --> 00:30:28,150
+never mind about that part.
+
+417
+00:30:29,560 --> 00:30:30,860
+Go ahead.
+
+418
+00:30:30,860 --> 00:30:36,650
+(Student is asking question)
+
+419
+00:30:36,650 --> 00:30:39,150
+So your question is what happens if the two
+
+420
+00:30:39,150 --> 00:30:41,470
+inputs are equal when
+you go through max gate.
+
+421
+00:30:44,150 --> 00:30:46,150
+Yeah, what happens?
+
+422
+00:30:46,150 --> 00:30:48,470
+(Student is answering)
+
+423
+00:30:48,470 --> 00:30:50,000
+Yeah, you pick one. Yeah.
+
+424
+00:30:52,300 --> 00:30:53,470
+Yeah, I don't think it's
+
+425
+00:30:53,470 --> 00:30:57,559
+correct to distribute it to all of them. I
+think you'd have to pick one.
+
+426
+00:30:58,259 --> 00:31:01,990
+But that basically never
+happens in actual practice.
+
+427
+00:31:05,559 --> 00:31:07,990
+Okay, so max gradient here, I actually
+
+428
+00:31:07,990 --> 00:31:13,019
+have an example. So z, here, was larger
+than w, so only z has an influence on
+
+429
+00:31:13,019 --> 00:31:16,839
+the output of this max gate, right?
+So when 2 flows into the max gate
+
+430
+00:31:16,839 --> 00:31:20,879
+it gets routed to z, and w gets a 0 gradient
+because its effect on the circuit is
+
+431
+00:31:20,880 --> 00:31:25,360
+nothing. 
The gradient is 0, because when you
+change it, it doesn't matter when you change
+
+432
+00:31:25,360 --> 00:31:29,689
+it, because z is the larger value
+going through the computational graph.
+
+433
+00:31:29,690 --> 00:31:33,100
+I have another note that is related to
+backpropagation which we already
+
+434
+00:31:33,100 --> 00:31:36,490
+addressed through a question. I just wanted
+to briefly point out with a terribly
+
+435
+00:31:36,490 --> 00:31:40,440
+bad looking figure that if you have
+these circuits and sometimes you have a
+
+436
+00:31:40,440 --> 00:31:43,330
+value that branches out into a circuit
+and is used in multiple parts of the
+
+437
+00:31:43,330 --> 00:31:47,179
+circuit, the correct thing to do by
+multivariate chain rule, is to actually
+
+438
+00:31:47,180 --> 00:31:51,110
+add up the contributions at the operation.
+
+439
+00:31:51,110 --> 00:31:55,110
+So gradients add when they backpropagate
+
+440
+00:31:55,110 --> 00:32:00,009
+backwards through the circuit. If they
+ever meet, they add up in these backward flows.
+
+441
+00:32:00,009 --> 00:32:04,879
+All right. We're going to go into
+implementation very soon. I'll just take some
+
+442
+00:32:04,880 --> 00:32:05,700
+more questions.
+
+443
+00:32:05,700 --> 00:32:08,820
+(Student is asking question)
+
+444
+00:32:08,820 --> 00:32:11,620
+Thank you for the question. The question
+is, is there ever, like a loop in these
+
+445
+00:32:11,620 --> 00:32:15,839
+graphs. There will never be loops, so there
+are never any loops. You might think that
+
+446
+00:32:15,839 --> 00:32:18,589
+if you use a recurrent neural network,
+that there are loops in there
+
+447
+00:32:18,589 --> 00:32:21,658
+but there are actually no loops because what
+we'll do is we'll take a recurrent neural
+
+448
+00:32:21,659 --> 00:32:26,230
+network and we will unfold it through time
+steps and this will all become, there
+
+449
+00:32:26,230 --> 00:32:30,530
+will never be a loop in the unfolded graph where
+we've copy-pasted that small recurrent net piece
+
+450
+00:32:30,530 --> 00:32:31,259
+over time.
+
+451
+00:32:31,259 --> 00:32:35,059
+You'll see that more when we actually
+get into it but these are always DAGs.
+
+452
+00:32:35,059 --> 00:32:36,338
+There are no loops.
+
+453
+00:32:38,059 --> 00:32:39,538
+Okay, awesome.
+
+454
+00:32:39,538 --> 00:32:42,220
+So let's look at how this
+is actually implemented in practice and
+
+455
+00:32:42,220 --> 00:32:46,990
+I think it will help make this more
+concrete as well. So we always have these
+
+456
+00:32:46,990 --> 00:32:48,938
+graphs, computational graphs.
+
+457
+00:32:48,938 --> 00:32:52,038
+These are the best way to
+think about structuring neural networks.
+
+458
+00:32:52,038 --> 00:32:56,929
+And so what we end up with is, all these
+gates that we're going to see in a bit, but
+
+459
+00:32:56,929 --> 00:33:00,059
+on top of the gates, there's something that
+needs to maintain the connectivity structure
+
+460
+00:33:00,059 --> 00:33:03,490
+of this entire graph, what gates are
+connected to each other. And so usually
+
+461
+00:33:03,490 --> 00:33:09,710
+that's handled by a graph or a net object,
+usually a net, and the net object has these
+
+462
+00:33:09,710 --> 00:33:13,679
+two main pieces, which is the forward
+and the backward piece. 
And this is just pseudo
+
+463
+00:33:13,679 --> 00:33:19,929
+code, so this won't run, but basically,
+roughly the idea is that in the forward pass
+
+464
+00:33:19,929 --> 00:33:23,759
+we're iterating over all the gates in the circuit,
+and they're sorted in topological
+
+465
+00:33:23,759 --> 00:33:27,980
+order. What that means is that all the
+inputs must come to every node before
+
+466
+00:33:27,980 --> 00:33:32,099
+the output can be consumed. So these are just
+ordered from left to right and we're just
+
+467
+00:33:32,099 --> 00:33:35,969
+forwarding, we're calling a forward on every
+single gate along the way so we iterate
+
+468
+00:33:35,970 --> 00:33:39,600
+over that graph and we just go forward in
+every single piece and this net object will
+
+469
+00:33:39,600 --> 00:33:43,189
+just make sure that happens in the
+proper connectivity pattern. In backward
+
+470
+00:33:43,190 --> 00:33:46,620
+pass, we're going in the exact reverse
+order and we're calling backward on
+
+471
+00:33:46,620 --> 00:33:49,709
+every single gate and these gates will
+end up communicating gradients to each
+
+472
+00:33:49,710 --> 00:33:53,429
+other and they all get chained up,
+computing the analytic gradient at the back.
+
+473
+00:33:53,429 --> 00:33:57,860
+So really a net object is a very thin
+wrapper around all these gates, or as we
+
+474
+00:33:57,860 --> 00:34:01,879
+will see they're called layers, layers or
+gates. I'm going to use those interchangeably
+
+475
+00:34:01,880 --> 00:34:05,700
+and they're just very thin wrappers
+around the connectivity structure of these
+
+476
+00:34:05,700 --> 00:34:09,369
+gates and calling a forward and backward
+function on them. And then let's look at
+
+477
+00:34:09,369 --> 00:34:12,950
+a specific example of one of the gates
+and how this might be implemented.
+
+478
+00:34:12,950 --> 00:34:16,759
+And this is not just pseudo code.
+This is actually more like a correct
+
+479
+00:34:16,760 --> 00:34:18,730
+implementation. Something like this might run
+
+480
+00:34:18,730 --> 00:34:23,769
+at the end. So let's consider a multiply
+gate and how it could be implemented.
+
+481
+00:34:23,769 --> 00:34:27,690
+A multiply gate, in this case, is just a
+binary multiply, so it receives two inputs
+
+482
+00:34:27,690 --> 00:34:33,780
+x and y. It computes their multiplication,
+z = x * y and it returns z.
+
+483
+00:34:33,780 --> 00:34:38,950
+And all these gates must basically satisfy this
+API of a forward call and a backward call. How
+
+484
+00:34:38,950 --> 00:34:42,529
+do you behave in a forward pass, and how
+do you behave in a backward pass. And
+
+485
+00:34:42,530 --> 00:34:46,019
+in a forward pass, we just compute whatever.
+In a backward pass, we eventually end up
+
+486
+00:34:46,019 --> 00:34:52,639
+learning about what is our gradient on
+the final loss. So dL/dz is what
+
+487
+00:34:52,639 --> 00:34:55,628
+we learn. That's represented in this
+variable dz, and right now
+
+488
+00:34:55,628 --> 00:35:00,639
+everything here is scalars, so x, y, z are
+numbers here. dz is also a number
+
+489
+00:35:00,639 --> 00:35:03,639
+telling the influence on the end of the circuit.
+
+490
+00:35:03,639 --> 00:35:07,799
+And what this gate is in charge
+of in this backward pass is
+
+491
+00:35:07,800 --> 00:35:11,550
+performing the little piece of chain rule.
+So what we have to compute is how do you
+
+492
+00:35:11,550 --> 00:35:14,550
+chain this gradient dz into your inputs x and y.
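+
+To make the forward backward API concrete, here is a minimal runnable sketch
+of a binary multiply gate along the lines just described. The class and
+variable names are illustrative rather than the lecture's exact code; the
+backward expressions are exactly the chain rule pieces discussed below.
+
+~~~python
+class MultiplyGate(object):
+    def forward(self, x, y):
+        z = x * y
+        # stash the inputs: the backward pass needs them for the chain rule
+        self.x, self.y = x, y
+        return z
+
+    def backward(self, dz):
+        # dz is dL/dz, the gradient from above; the local gradients of
+        # z = x * y are y (with respect to x) and x (with respect to y)
+        dx = self.y * dz
+        dy = self.x * dz
+        return [dx, dy]
+
+gate = MultiplyGate()
+z = gate.forward(3.0, -4.0)   # forward pass: -12.0
+dx, dy = gate.backward(2.0)   # backward pass: dx = -8.0, dy = 6.0
+~~~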
+
+493
+00:35:14,550 --> 00:35:16,550
+In other words, we have to compute
+dx and dy and we have to
+
+494
+00:35:16,550 --> 00:35:19,820
+return those in the backward pass. And then
+the computational graph will make sure
+
+495
+00:35:19,820 --> 00:35:23,720
+that these get routed properly to all
+the other gates. And if there are any
+
+496
+00:35:23,720 --> 00:35:28,820
+edges that add up, the computational graph
+might add all those gradients together.
+
+497
+00:35:30,220 --> 00:35:35,650
+Okay, so how would we implement
+the dx and dy? So for example, what is
+
+498
+00:35:35,650 --> 00:35:40,300
+dx in this case? What would it be equal to,
+in the implementation?
+
+499
+00:35:43,300 --> 00:35:49,460
+y * dz. Great. And, so y * dz.
+Additional point to make here by the way,
+
+500
+00:35:49,460 --> 00:35:53,659
+note that I've added some lines in the forward
+pass. We have to remember these values of
+
+501
+00:35:53,659 --> 00:35:57,509
+x and y, because we end up using them in the
+backward pass, so I'm assigning them to a
+
+502
+00:35:57,510 --> 00:36:01,000
+'self.' because I need to remember
+what x and y are because I need access to
+
+503
+00:36:01,000 --> 00:36:04,949
+them in my backward pass. In general, in
+backpropagation, when we build these,
+
+504
+00:36:04,949 --> 00:36:09,359
+when you actually do forward pass, every
+single gate must remember its inputs and
+
+505
+00:36:09,360 --> 00:36:13,430
+any kind of intermediate calculations it has
+performed that it needs
+
+506
+00:36:13,430 --> 00:36:17,069
+access to in the backward pass. So basically
+when we end up running these networks at
+
+507
+00:36:17,070 --> 00:36:20,050
+runtime, just always keep in mind that as
+you're doing this forward pass, a huge
+
+508
+00:36:20,050 --> 00:36:22,890
+amount of stuff gets cached in your
+memory, and that all has to stick around
+
+509
+00:36:22,890 --> 00:36:25,909
+because during backpropagation, you might
+need access to some of those variables.
+
+510
+00:36:25,909 --> 00:36:30,779
+And so, your memory ends up ballooning up
+during the forward pass, and then in backward pass,
+
+511
+00:36:30,780 --> 00:36:33,690
+it gets all consumed and we need all those
+intermediates to actually compute the
+
+512
+00:36:33,690 --> 00:36:36,000
+proper backward pass. So that's...
+
+513
+00:36:36,000 --> 00:36:41,089
+(Student is asking question)
+
+514
+00:36:41,089 --> 00:36:43,189
+Yes, so if you don't, if you know you
+don't want to do backward pass,
+
+515
+00:36:43,189 --> 00:36:45,289
+then you can get rid of
+many of these things and you
+
+516
+00:36:45,289 --> 00:36:49,710
+don't have to compute, you don't need to cache
+them. So you can save memory for sure.
+
+517
+00:36:49,710 --> 00:36:54,110
+But I don't think most implementations
+actually worry about that. I don't
+
+518
+00:36:54,110 --> 00:36:58,280
+think there's a lot of logic that deals with that.
+Usually we end up remembering it anyway.
+
+519
+00:37:00,280 --> 00:37:05,870
+(Student is asking question)
+
+520
+00:37:05,870 --> 00:37:09,369
+I see. Yes, so I think if you're on an
+embedded device for example, and you worry
+
+521
+00:37:09,369 --> 00:37:11,949
+really about your memory constraints, this is
+something that you might take advantage
+
+522
+00:37:11,949 --> 00:37:15,539
+of. 
If you know that a neural network only
+has to run in test time, then you might
+
+523
+00:37:15,539 --> 00:37:18,750
+want to make sure to go into the code to
+make sure nothing gets cached, since
+
+524
+00:37:18,750 --> 00:37:22,030
+you won't be doing a backward pass.
+Questions. Yes.
+
+525
+00:37:22,030 --> 00:37:30,990
+(Student is asking question)
+
+526
+00:37:30,990 --> 00:37:33,130
+You're saying if we remember the local gradients in
+
+527
+00:37:33,130 --> 00:37:39,250
+the forward pass, then we don't have to
+remember the other intermediates?
+
+528
+00:37:39,250 --> 00:37:45,269
+I think that might only be the case in
+some simple expressions like this one. I'm
+
+529
+00:37:45,269 --> 00:37:49,170
+not actually sure if that's true in general.
+But I mean, you're in charge of remembering
+
+530
+00:37:49,170 --> 00:37:54,950
+whatever you need to perform the
+backward pass, on a gate-by-gate basis.
+
+531
+00:37:54,950 --> 00:37:58,509
+You can remember whatever
+you feel like if it has a lower footprint and so on.
+
+532
+00:37:58,510 --> 00:38:04,420
+You can be clever with that. Okay, so just to give
+you guys an example of what this looks like in
+
+533
+00:38:04,420 --> 00:38:08,250
+practice, we're going to look at specific
+examples, say, in Torch. Torch is a deep
+
+534
+00:38:08,250 --> 00:38:11,480
+learning framework, which we might
+go into a bit near the end of the class.
+
+535
+00:38:11,480 --> 00:38:16,750
+Some of you might end up using it for
+your projects. If you go into the GitHub repo
+
+536
+00:38:16,750 --> 00:38:20,320
+for Torch and you look at it,
+basically, it's just a giant collection
+
+537
+00:38:20,320 --> 00:38:24,580
+of these layer objects and these are the
+gates. Layers, gates, the same thing. So there's
+
+538
+00:38:24,580 --> 00:38:27,429
+all these layers. That's really what a
+deep learning framework is. It's just a
+
+539
+00:38:27,429 --> 00:38:31,559
+whole bunch of layers and a very thin
+computational graph thing that keeps track
+
+540
+00:38:31,559 --> 00:38:36,420
+of all the layer connectivity. And so really,
+the image to have in mind is all these
+
+541
+00:38:36,420 --> 00:38:42,639
+things are your Lego blocks, and then we're
+building up these computational graphs out of
+
+542
+00:38:42,639 --> 00:38:44,829
+your Lego blocks, out of the layers.
+You're putting them together in various
+
+543
+00:38:44,829 --> 00:38:47,549
+ways depending on what you want to
+achieve, so you end up building all
+
+544
+00:38:47,550 --> 00:38:51,519
+kinds of stuff. So that's how you work
+with neural networks. So every library is
+
+545
+00:38:51,519 --> 00:38:54,809
+just a whole set of layers that you
+might want to compute, and every layer is
+
+546
+00:38:54,809 --> 00:38:58,840
+just implementing a small function piece, and
+that function piece knows how to do a
+
+547
+00:38:58,840 --> 00:39:02,670
+forward and it knows how to do a backward.
+So just to view a specific example, let's
+
+548
+00:39:02,670 --> 00:39:10,150
+look at the MulConstant layer in
+Torch. The MulConstant layer performs
+
+549
+00:39:10,150 --> 00:39:16,039
+just a scaling by a scalar. So it takes
+some tensor X. 
So this is not a scalar
+
+550
+00:39:16,039 --> 00:39:19,300
+but it's actually like an array of
+numbers basically, because when we
+
+551
+00:39:19,300 --> 00:39:22,410
+actually work with these, we do a lot of
+vectorized operations so we receive a tensor
+
+552
+00:39:22,410 --> 00:39:28,289
+which is really just an n-dimensional
+array, and we scale it by a constant. And you
+
+553
+00:39:28,289 --> 00:39:31,980
+can see that this layer actually just has 40
+lines. There's some initialization stuff.
+
+554
+00:39:31,980 --> 00:39:35,940
+This is Lua, by the way, if this looks
+somewhat foreign to you, but there's
+
+555
+00:39:35,940 --> 00:39:40,510
+initialization, where you actually
+pass in the a that you want to use as
+
+556
+00:39:40,510 --> 00:39:44,630
+your scaling, and then during the
+forward pass which they call updateOutput
+
+557
+00:39:44,630 --> 00:39:49,170
+in a forward pass all they do is
+they just multiply aX and return it. And
+
+558
+00:39:49,170 --> 00:39:53,760
+in the backward pass which they call
+updateGradInput, there's an if statement
+
+559
+00:39:53,760 --> 00:39:56,510
+here but really when you look at these
+three lines, they're the most important. You can
+
+560
+00:39:56,510 --> 00:39:59,690
+see that all it's doing is it's copying into a
+variable gradInput
+
+561
+00:39:59,690 --> 00:40:03,539
+which it needs to compute. That's your gradient
+that you're passing up. The gradInput is,
+
+562
+00:40:03,539 --> 00:40:08,309
+you're copying gradOutput. gradOutput is
+your gradient on the final loss.
+
+563
+00:40:08,309 --> 00:40:11,989
+You're copying that over into gradInput
+and you're multiplying by the scalar,
+
+564
+00:40:11,989 --> 00:40:15,629
+which is what you should be doing
+because your local gradient is just a
+
+565
+00:40:15,630 --> 00:40:19,980
+and so you take the output you have, you
+take the gradient from above and you just
+
+566
+00:40:19,980 --> 00:40:23,150
+scale it by a, which is what these three
+lines are doing. And that's your gradInput
+
+567
+00:40:23,150 --> 00:40:27,849
+and that's what you return. So
+that's one of the hundreds of layers
+
+568
+00:40:27,849 --> 00:40:32,110
+that are in Torch. We can also look
+at examples in Caffe. Caffe is also a
+
+569
+00:40:32,110 --> 00:40:36,140
+deep learning framework, specifically for
+images, that you might be working with. Again, if
+
+570
+00:40:36,140 --> 00:40:39,690
+you go into the layers directory on GitHub,
+you just see all these layers. All of them implement
+
+571
+00:40:39,690 --> 00:40:43,490
+the forward backward API. So just to give
+you an example, there's a sigmoid layer in Caffe.
+
+572
+00:40:43,490 --> 00:40:51,269
+So the sigmoid layer takes a blob. So Caffe likes to
+call these tensors blobs. So it takes a
+
+573
+00:40:51,269 --> 00:40:54,219
+blob. It's just an n-dimensional array of
+numbers, and it passes it
+
+574
+00:40:54,219 --> 00:40:57,949
+elementwise through a sigmoid function. And so
+it's computing in a forward pass a
+
+575
+00:40:57,949 --> 00:41:04,379
+sigmoid, which you can see there. Let me use my
+pointer. Okay, so there, it's calling, so a lot of
+
+576
+00:41:04,380 --> 00:41:07,840
+this stuff is just boilerplate, getting
+pointers to all the data, and then we
+
+577
+00:41:07,840 --> 00:41:11,730
+have a bottom blob, and we're calling a
+sigmoid function on the bottom and
+
+578
+00:41:11,730 --> 00:41:14,829
+that's just a sigmoid function right there.
+So that's what we compute. 
And in the
+
+579
+00:41:14,829 --> 00:41:18,719
+backward pass, some boilerplate stuff, but
+really what's important is we need to
+
+580
+00:41:18,719 --> 00:41:23,369
+compute the gradient times the chain
+rule here, so that's what you see in this
+
+581
+00:41:23,369 --> 00:41:26,150
+line. That's where the magic happens,
+where we take the diff,
+
+582
+00:41:26,150 --> 00:41:32,048
+so they call the gradients diffs. And you
+compute: the bottom diff is the top diff
+
+583
+00:41:32,048 --> 00:41:36,869
+times this piece, which is really
+the local gradient, so this is
+
+584
+00:41:36,869 --> 00:41:41,960
+chain rule happening right here through
+that multiplication. And that's it. So every
+
+585
+00:41:41,960 --> 00:41:45,179
+single layer is just a forward backward API
+and then you have a computational graph
+
+586
+00:41:45,179 --> 00:41:52,288
+on top, or a net object, that keeps track of all the
+connectivity. Any questions about some of
+
+587
+00:41:52,289 --> 00:41:54,000
+these implementations and so on? Go ahead.
+
+588
+00:41:54,000 --> 00:42:00,849
+(Student is asking question)
+
+589
+00:42:00,849 --> 00:42:04,759
+Yes, thank you. So the question is, do we have to
+go through forward and backward for every update.
+
+590
+00:42:04,759 --> 00:42:09,259
+The answer is yes, because when you
+want to do an update, you need the gradient,
+
+591
+00:42:09,259 --> 00:42:11,849
+and so you need to do forward
+on your sample minibatch.
+
+592
+00:42:11,849 --> 00:42:15,559
+You do a forward. Right away you do a backward.
+And now you have your analytic gradient.
+
+593
+00:42:15,559 --> 00:42:19,369
+And now I can do an update, where I take my
+analytic gradient and I change my weights a tiny
+
+594
+00:42:19,369 --> 00:42:24,960
+bit in the direction, the negative direction
+of your gradient. So forward computes
+
+595
+00:42:24,960 --> 00:42:28,858
+the loss, backward computes your gradient,
+and then the update uses the gradient to
+
+596
+00:42:28,858 --> 00:42:33,000
+increment your weights a bit. So that's what keeps
+happening in the loop. When you train a neural
+
+597
+00:42:33,000 --> 00:42:36,318
+network that's all that's happening. Forward,
+backward, update. Forward, backward, update.
+
+598
+00:42:36,318 --> 00:42:38,808
+We'll see that in a bit. Go ahead.
+
+599
+00:42:38,808 --> 00:42:43,808
+(Student is asking question)
+
+600
+00:42:44,808 --> 00:42:47,008
+You're asking about a for-loop.
+
+601
+00:42:49,208 --> 00:42:51,808
+Oh, is there a for-loop here?
+I didn't even notice. Okay.
+
+602
+00:42:51,809 --> 00:42:57,160
+Yeah, they have a for-loop. Yes, so you'd like
+this to be vectorized and that actually...
+
+603
+00:42:57,160 --> 00:43:03,679
+Because this is C++, so I think they just do it.
+Go for it.
+
+604
+00:43:06,679 --> 00:43:10,899
+Yeah, so this is a CPU implementation by
+the way. I should mention that this is a
+
+605
+00:43:10,900 --> 00:43:14,599
+CPU implementation of a sigmoid layer.
+There's a second file that implements the
+
+606
+00:43:14,599 --> 00:43:19,420
+sigmoid layer on GPU and that's CUDA code.
+And so that's a separate file. It
+
+607
+00:43:19,420 --> 00:43:22,280
+would be sigmoid.cu or
+something like that. I'm not showing you that.
+
+608
+00:43:23,580 --> 00:43:30,349
+Any questions? Okay, great. So one point I'd like to
+make is, we'll be of course working with
+
+609
+00:43:30,349 --> 00:43:33,519
+vectors, so these things flowing along our
+graphs are not just scalars. 
They're going
+
+610
+00:43:33,519 --> 00:43:38,449
+to be entire vectors. And so nothing
+changes. The only thing that is different
+
+611
+00:43:38,449 --> 00:43:43,529
+now, since these are vectors, x, y, and z are
+vectors, is that these local gradients,
+
+612
+00:43:43,530 --> 00:43:47,530
+which before used to be just scalars,
+now they're in general, for general
+
+613
+00:43:47,530 --> 00:43:51,290
+expressions, full Jacobian matrices.
+And so the Jacobian matrix is this
+
+614
+00:43:51,290 --> 00:43:54,670
+two-dimensional matrix that basically
+tells you what is the influence of every
+
+615
+00:43:54,670 --> 00:43:58,010
+single element in x on every single
+element of z,
+
+616
+00:43:58,010 --> 00:44:01,880
+and that's what the Jacobian matrix
+stores, and the gradient is the same
+
+617
+00:44:01,880 --> 00:44:10,960
+expression as before, but now, say here,
+dz/dx is a vector and dL/dz is... sorry.
+
+618
+00:44:11,560 --> 00:44:16,079
+dL/dz is a vector and dz/dx is an
+entire Jacobian matrix, so you end up with
+
+619
+00:44:16,079 --> 00:44:20,130
+an entire matrix-vector multiply to
+actually chain the gradient backwards.
+
+620
+00:44:20,130 --> 00:44:29,130
+(Student is asking question)
+
+621
+00:44:31,530 --> 00:44:36,380
+No. So I'll come back to this point in a bit.
+You never actually end up forming the full
+
+622
+00:44:36,380 --> 00:44:40,119
+Jacobian. You'll never actually do this
+matrix multiply most of the time. This is
+
+623
+00:44:40,119 --> 00:44:43,730
+just a general way of looking at, you
+know, an arbitrary function, and I need to
+
+624
+00:44:43,730 --> 00:44:46,260
+keep track of this. And I think that
+these two are actually out of order
+
+625
+00:44:46,260 --> 00:44:49,569
+because dz/dx is the Jacobian,
+which should be on the left side, so
+
+626
+00:44:49,569 --> 00:44:53,859
+I think that's a mistake in the slide because
+this should be a matrix-vector multiply.
+
+627
+00:44:53,859 --> 00:44:57,618
+So I'll show you why you don't actually
+need to ever form those Jacobians. So let's
+
+628
+00:44:57,619 --> 00:45:02,119
+work with a specific example that is
+relatively common in neural networks.
+
+629
+00:45:02,119 --> 00:45:06,869
+Suppose we have this nonlinearity max(0, x).
+So really what this operation
+
+630
+00:45:06,869 --> 00:45:11,068
+is doing is it's receiving a vector, say
+4096 numbers, which is a typical thing
+
+631
+00:45:11,068 --> 00:45:12,308
+you might want to do.
+
+632
+00:45:12,309 --> 00:45:14,630
+4096 numbers, real valued, come in
+
+633
+00:45:14,630 --> 00:45:19,630
+and you're computing an element-wise
+thresholding at 0, so anything that is lower
+
+634
+00:45:19,630 --> 00:45:24,680
+than 0 gets clamped to 0, and that's your
+function that you're computing. And so the output
+
+635
+00:45:24,680 --> 00:45:28,588
+vector is of the same dimension. So
+the question here I'd like to ask is,
+
+636
+00:45:28,588 --> 00:45:32,068
+what is the size of the
+Jacobian matrix for this layer?
+
+637
+00:45:37,588 --> 00:45:40,268
+4096 by 4096. In principle,
+
+638
+00:45:40,268 --> 00:45:45,018
+every single number in here could have
+influenced every single number in there.
+
+639
+00:45:45,018 --> 00:45:49,459
+But that's not the case necessarily, right?
+So the second question is, so this
+
+640
+00:45:49,460 --> 00:45:52,949
+is a huge matrix, 16 million numbers,
+but why would you never form it?
+
+641
+00:45:52,949 --> 00:45:54,719
+What does the Jacobian actually look like?
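+
+The answer that follows can be captured in two lines of numpy. This is a
+hedged sketch rather than code from the lecture: it shows the element-wise
+max(0, x) gate on a 4096-dimensional vector, where the 4096 by 4096 Jacobian
+is diagonal, so instead of a matrix-vector multiply you simply zero out the
+gradient wherever the forward-pass input was negative.
+
+~~~python
+import numpy as np
+
+x = np.random.randn(4096)       # stand-in input vector to the gate
+out = np.maximum(0, x)          # forward pass: threshold at zero
+
+dout = np.random.randn(4096)    # gradient from above, dL/dout
+dx = dout * (x > 0)             # backward pass: kill gradients where x < 0
+~~~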
+
+642
+00:45:54,719 --> 00:45:59,019
+(Student is asking question)
+
+643
+00:45:59,019 --> 00:46:02,719
+No, the Jacobian will always be a matrix,
+because every one of these 4096
+
+644
+00:46:02,719 --> 00:46:09,949
+could have influenced every... It is, so the
+Jacobian is still a giant 4096 by 4096
+
+645
+00:46:09,949 --> 00:46:14,558
+matrix, but it has special structure, right?
+And what is that special structure?
+
+646
+00:46:14,558 --> 00:46:17,558
+(Student is answering)
+
+647
+00:46:17,559 --> 00:46:20,420
+Yeah, so this Jacobian is huge.
+
+648
+00:46:21,259 --> 00:46:27,420
+So it's a 4096 by 4096 matrix, but
+there are only elements on the diagonal
+
+649
+00:46:27,420 --> 00:46:33,700
+because this is an element-wise operation,
+and moreover, they're not just 1's, but
+
+650
+00:46:33,700 --> 00:46:38,129
+for whichever element was less than 0,
+it was clamped to 0, so some of these 1's
+
+651
+00:46:38,130 --> 00:46:42,798
+actually are zeros, in whichever elements
+had a lower-than-zero value during the
+
+652
+00:46:42,798 --> 00:46:47,429
+forward pass. And so the Jacobian would
+just be almost an identity matrix but
+
+653
+00:46:47,429 --> 00:46:52,250
+some of them are actually zero. So you
+never actually would want to form the
+
+654
+00:46:52,250 --> 00:46:55,429
+full Jacobian because that's silly and
+so you never actually want to carry out
+
+655
+00:46:55,429 --> 00:47:00,808
+this operation as a matrix-vector
+multiply, because of the special structure
+
+656
+00:47:00,809 --> 00:47:04,150
+that we want to take advantage of. And so
+in particular, the gradient, the backward
+
+657
+00:47:04,150 --> 00:47:09,269
+pass for this operation is very very
+easy because you just want to look at
+
+658
+00:47:09,269 --> 00:47:14,159
+all the dimensions where your input was
+less than zero and you want to kill the
+
+659
+00:47:14,159 --> 00:47:17,210
+gradient in those dimensions. You want to
+set the gradient to 0 in those dimensions.
+
+660
+00:47:17,210 --> 00:47:21,650
+So you take the grad output here, and
+whichever numbers were less than zero,
+
+661
+00:47:21,650 --> 00:47:25,910
+just set them to 0. Set those gradients to 0
+and then you continue the backward pass.
+
+662
+00:47:26,209 --> 00:47:30,209
+So very simple operations in the
+end in terms of efficiency.
+
+663
+00:47:30,209 --> 00:47:36,809
+(Student is asking question)
+
+664
+00:47:36,809 --> 00:47:37,300
+That's right.
+
+665
+00:47:37,300 --> 00:47:45,930
+(Student is asking question)
+
+666
+00:47:45,930 --> 00:47:51,830
+So the question is, the communication between the
+gates is always just vectors. That's right.
+
+667
+00:47:51,830 --> 00:47:55,940
+So this Jacobian, if you wanted to, you can form
+that, but that's internal to you inside the gate.
+
+668
+00:47:55,940 --> 00:47:59,670
+And you can use that to do backprop, but
+what's going back to other gates, they
+
+669
+00:47:59,670 --> 00:48:02,870
+only care about the gradient vector.
+
+670
+00:48:02,870 --> 00:48:09,070
+(Student is asking question)
+
+671
+00:48:09,070 --> 00:48:12,070
+Yes, so the question is, unless
+you end up having multiple outputs,
+
+672
+00:48:12,070 --> 00:48:15,070
+because then for each output,
+we have to do this, so yeah.
+
+673
+00:48:15,070 --> 00:48:17,380
+So we'll never actually run into that case
+
+674
+00:48:17,380 --> 00:48:20,430
+because we almost always have a single
+output, scalar value at the end
+
+675
+00:48:20,430 --> 00:48:24,129
+because we're interested in loss
+functions. 
So we just have a single
+
+676
+00:48:24,130 --> 00:48:27,318
+number at the end that we're interested
+in computing gradients with respect to. If we had
+
+677
+00:48:27,318 --> 00:48:30,949
+multiple outputs, then we have to keep
+track of all of those as well
+
+678
+00:48:30,949 --> 00:48:35,769
+in parallel when we do the backpropagation.
+But we just have a scalar-valued loss
+
+679
+00:48:35,769 --> 00:48:38,580
+function so we don't have to worry about that.
+
+680
+00:48:40,269 --> 00:48:46,080
+Okay, makes sense? So I want
+to also make the point that actually
+
+681
+00:48:46,080 --> 00:48:51,230
+4096 dimensions is not even crazy. Usually
+we use minibatches, so say, a minibatch of
+
+682
+00:48:51,230 --> 00:48:54,929
+100 elements going through at the same
+time, and then you end up with 100
+
+683
+00:48:54,929 --> 00:48:59,038
+4096-dimensional vectors that are all
+coming in parallel, but all the examples
+
+684
+00:48:59,039 --> 00:49:02,539
+in the minibatch are processed independently of
+each other in parallel, and so this Jacobian matrix
+
+685
+00:49:02,539 --> 00:49:08,869
+really ends up being 400 million, 400,000 by 400,000.
+So huge, so you never form these,
+
+686
+00:49:08,869 --> 00:49:14,160
+basically. And you take care to
+actually take advantage of the sparsity
+
+687
+00:49:14,160 --> 00:49:17,538
+structure in the Jacobian and you hand-
+code operations, so you don't actually write a
+
+688
+00:49:17,539 --> 00:49:25,819
+fully generalized chain rule inside
+any gate implementation. Okay cool. So I'd like
+
+689
+00:49:25,819 --> 00:49:30,788
+to point out that in your assignment, you'll be
+writing SVMs and Softmax and so on, and I just kind
+
+690
+00:49:30,789 --> 00:49:33,680
+of would like to give you a hint on the design
+of how you actually should approach this
+
+691
+00:49:33,680 --> 00:49:39,769
+problem. What you should do is just think
+about it as backpropagation, even if
+
+692
+00:49:39,769 --> 00:49:44,108
+you're doing this for linear classification
+optimization. So roughly, your structure
+
+693
+00:49:44,108 --> 00:49:50,048
+should look something like this where...
+again, stage your computation in units that
+
+694
+00:49:50,048 --> 00:49:53,960
+you know the local gradient of, and then
+do backprop when you actually evaluate these
+
+695
+00:49:53,960 --> 00:49:57,679
+gradients in your assignment. So at the
+top, your code will look something like
+
+696
+00:49:57,679 --> 00:49:59,679
+this where we don't have any graph
+structure because you're doing
+
+697
+00:49:59,679 --> 00:50:04,038
+everything inline. So no crazy edges
+or anything like that that you have to do.
+
+698
+00:50:04,039 --> 00:50:07,200
+You will do that in the second assignment.
+You'll actually come up with a graph
+
+699
+00:50:07,200 --> 00:50:10,509
+object and you'll implement your layers. But in
+the first assignment, you're just doing it inline,
+
+700
+00:50:10,510 --> 00:50:15,579
+just a straight up vanilla setup. And so
+compute your scores based on W and X,
+
+701
+00:50:15,579 --> 00:50:21,798
+compute these margins which are max of 0
+and the score differences, compute the
+
+702
+00:50:21,798 --> 00:50:26,239
+loss, and then do backprop. And in
+particular, I would really advise you to
+
+703
+00:50:26,239 --> 00:50:30,949
+have these intermediate scores that you
+create. It's a matrix. And then compute the
+
+704
+00:50:30,949 --> 00:50:34,769
+gradient on scores before you compute
+the gradient on your weights. 
And so
+
+705
+00:50:34,769 --> 00:50:40,179
+chain, use the chain rule here. Otherwise, you might
+be tempted to try to just derive W, the
+
+706
+00:50:40,179 --> 00:50:43,798
+gradient on W equals, and then implement
+that, and that's an unhealthy way of
+
+707
+00:50:43,798 --> 00:50:47,349
+approaching the problem. So stage your
+computation and do backprop through these
+
+708
+00:50:47,349 --> 00:50:49,900
+scores and that will help you out.
+
+709
+00:50:51,500 --> 00:50:52,800
+Okay. Cool.
+
+710
+00:50:54,300 --> 00:50:59,570
+So, let's see. Summary so far.
+Neural networks get hopelessly large,
+
+711
+00:50:59,570 --> 00:51:01,570
+so we end up with these
+computational structures and these
+
+712
+00:51:01,570 --> 00:51:05,470
+intermediate nodes, a forward backward API
+for both the nodes and also for the
+
+713
+00:51:05,470 --> 00:51:08,869
+graph structure. And the graph structure is
+usually a very thin wrapper around all these
+
+714
+00:51:08,869 --> 00:51:12,059
+layers and it handles all the
+communication between them. And this
+
+715
+00:51:12,059 --> 00:51:16,380
+communication is always along, like,
+vectors being passed around. In practice,
+
+716
+00:51:16,380 --> 00:51:19,289
+when we write these implementations, what
+we're passing around are these
+
+717
+00:51:19,289 --> 00:51:23,079
+n-dimensional tensors. Really what that
+means is just an n-dimensional array.
+
+718
+00:51:23,079 --> 00:51:28,059
+So like a numpy array. Those are what goes
+between the gates, and then internally, every single
+
+719
+00:51:28,059 --> 00:51:33,529
+gate knows what to do in the forward and
+the backward pass. Okay, so at this point, I'm
+
+720
+00:51:33,530 --> 00:51:37,690
+going to end with backpropagation and
+I'm going to go into neural networks. So
+
+721
+00:51:37,690 --> 00:51:40,390
+any questions before we move on from
+backprop? Go ahead.
+
+722
+00:51:40,390 --> 00:51:51,860
+(Student is asking a question)
+
+723
+00:51:51,860 --> 00:51:55,530
+The summation inside Li = blah?
+Yes, there is a sum there.
+
+724
+00:51:55,530 --> 00:52:00,130
+So you want that to be a vectorized operation that
+you... Yeah so basically, the challenge in your
+
+725
+00:52:00,130 --> 00:52:03,130
+assignment almost is,
+how do you make sure that you do all
+
+726
+00:52:03,130 --> 00:52:06,750
+this efficiently and nicely with matrix vector operations
+in numpy, so that's going to be some of the
+
+727
+00:52:06,750 --> 00:52:09,750
+brain teaser stuff that you guys are
+going to have to do.
+
+728
+00:52:09,750 --> 00:52:14,250
+(Student is asking a question)
+
+729
+00:52:14,250 --> 00:52:20,030
+Yes, so it's up to you what you want your gates
+to be like, and what you want them to be.
+
+730
+00:52:20,030 --> 00:52:22,490
+(Student is asking a question)
+
+731
+00:52:22,490 --> 00:52:24,490
+Yeah, I don't think you'd want to do that.
+
+732
+00:52:25,490 --> 00:52:30,739
+Yeah, I'm not sure. Maybe that works. I don't know.
+But it's up to you to design this and to
+
+733
+00:52:30,739 --> 00:52:38,609
+backprop through it. Yeah, so that's fun. Okay.
+So we're going to go to neural networks. This is
+
+734
+00:52:38,610 --> 00:52:44,010
+exactly what they look like. So you'll be
+implementing these, and this is just what happens
+
+735
+00:52:44,010 --> 00:52:46,770
+when you search on Google Images for
+neural networks. This is I think the first
+
+736
+00:52:46,770 --> 00:52:51,590
+result or something like that. So let's
+look at neural networks. 
And before we dive
+
+737
+00:52:51,590 --> 00:52:55,100
+into neural networks actually, I'd like
+to do it first without all the brain
+
+738
+00:52:55,100 --> 00:52:58,329
+stuff. So forget that they're neural. Forget
+that they have any relation whatsoever
+
+739
+00:52:58,329 --> 00:53:03,170
+to a brain. They don't, but forget it if you
+thought that they did. Let's
+
+740
+00:53:03,170 --> 00:53:07,309
+just look at score functions. Well,
+before, we saw that f=Wx is what
+
+741
+00:53:07,309 --> 00:53:11,079
+we've been working with so far. But now
+as I said, we're going to start to make
+
+742
+00:53:11,079 --> 00:53:14,590
+that f more complex. And so if you wanted
+to use a neural network then you're
+
+743
+00:53:14,590 --> 00:53:20,309
+going to change that equation to this. So
+this is a two-layer neural network, and
+
+744
+00:53:20,309 --> 00:53:24,820
+that's what it looks like, and it's just
+a more complex mathematical expression of x.
+
+745
+00:53:24,820 --> 00:53:30,230
+And so what's happening here is, you
+receive your input x, and you
+
+746
+00:53:30,230 --> 00:53:32,369
+multiply it by a matrix, just like we did before.
+
+747
+00:53:32,369 --> 00:53:36,619
+Now, what comes next
+is a nonlinearity or activation function,
+
+748
+00:53:36,619 --> 00:53:39,710
+and we're going to go into several choices
+that you might make for these. In this
+
+749
+00:53:39,710 --> 00:53:43,800
+case, I'm using the thresholding at 0 as an
+activation function. So basically, we're
+
+750
+00:53:43,800 --> 00:53:47,780
+doing a matrix multiply, we threshold
+everything negative to 0, and then we do
+
+751
+00:53:47,780 --> 00:53:52,240
+one more matrix multiply, and that gives us
+our scores. And so if I was to draw this,
+
+752
+00:53:52,240 --> 00:53:58,169
+say in the case of CIFAR-10, with 3072 numbers
+going in, those are the pixel values,
+
+753
+00:53:58,170 --> 00:54:02,110
+and before, we just went one single matrix
+multiply to scores. We went right away
+
+754
+00:54:02,110 --> 00:54:05,899
+to 10 numbers. But now, we get to go
+through this intermediate representation
+
+755
+00:54:05,900 --> 00:54:13,019
+of hidden state. We'll call them hidden layers.
+So a hidden vector h of a hundred numbers, say,
+
+756
+00:54:13,019 --> 00:54:16,849
+or whatever you want your size of the neural
+network to be. So this is a hyperparameter,
+
+757
+00:54:16,849 --> 00:54:21,109
+that's, say, a hundred, and we go through
+this intermediate representation. So a matrix
+
+758
+00:54:21,109 --> 00:54:24,319
+multiply gives us a hundred
+numbers, threshold at zero, and
+
+759
+00:54:24,320 --> 00:54:28,559
+then one more matrix multiply to get the scores.
+And since we have more numbers, we have
+
+760
+00:54:28,559 --> 00:54:33,820
+more wiggle room to do more interesting
+things. So one particular example
+
+761
+00:54:33,820 --> 00:54:36,330
+of something interesting you might
+think that a neural network
+
+762
+00:54:36,330 --> 00:54:40,210
+could do is, going back to this
+example of interpreting linear
+
+763
+00:54:40,210 --> 00:54:45,690
+classifiers on CIFAR-10, and we saw that the
+car class has this red car that tries to
+
+764
+00:54:45,690 --> 00:54:51,280
+merge all the modes of different cars
+facing in different directions. 
And so in
+
+765
+00:54:51,280 --> 00:54:57,980
+this case, one single layer, one single
+linear classifier had to go across all
+
+766
+00:54:57,980 --> 00:55:02,250
+those modes, and we couldn't deal with,
+for example, cars of different colors. That
+
+767
+00:55:02,250 --> 00:55:05,190
+wasn't very natural to do. But now we
+have a hundred numbers in this
+
+768
+00:55:05,190 --> 00:55:08,289
+intermediate, and so you might imagine,
+for example, that one of those numbers
+
+769
+00:55:08,289 --> 00:55:11,539
+could be just picking up on a red
+car facing forward. It's just classifying:
+
+770
+00:55:11,539 --> 00:55:14,750
+is there a red car facing
+forward? Another one could be a red car
+
+771
+00:55:14,750 --> 00:55:16,280
+facing slightly to the left,
+
+772
+00:55:16,280 --> 00:55:20,650
+a red car facing slightly to the right, and
+those elements of h would only become
+
+773
+00:55:20,650 --> 00:55:24,358
+positive if they find that thing in the image,
+
+774
+00:55:24,358 --> 00:55:28,029
+otherwise, they stay at zero. And so
+another h might look for green cars
+
+775
+00:55:28,030 --> 00:55:31,180
+or yellow cars or whatever else in
+different orientations. So now we can
+
+776
+00:55:31,180 --> 00:55:35,669
+have a template for all these different
+modes. And so these neurons turn on or
+
+777
+00:55:35,670 --> 00:55:41,869
+off if they find the thing they're looking
+for, a car of some specific type, and then
+
+778
+00:55:41,869 --> 00:55:46,660
+this W2 matrix can sum across all
+those little car templates. So now we
+
+779
+00:55:46,660 --> 00:55:50,719
+have, like, say twenty car templates of
+what cars could look like, and now, to compute
+
+780
+00:55:50,719 --> 00:55:54,149
+the score of the car classifier, there's an
+additional matrix multiply, so we have a choice
+
+781
+00:55:54,150 --> 00:55:58,700
+of doing a weighted sum over them. And so if
+any one of them turns on, then through my
+
+782
+00:55:58,700 --> 00:56:02,269
+weighted sum, with positive weights
+presumably, I would be adding up and
+
+783
+00:56:02,269 --> 00:56:07,358
+getting a higher score. And so now I can
+have this multimodal car classifier
+
+784
+00:56:07,358 --> 00:56:13,098
+through this additional hidden layer in between
+there. So that's a handwavy reason for why
+
+785
+00:56:13,099 --> 00:56:14,720
+these would do something more interesting.
+
+786
+00:56:15,520 --> 00:56:16,509
+Was there a question? Yeah.
+
+787
+00:56:16,509 --> 00:56:26,350
+(Student is asking a question)
+
+788
+00:56:26,350 --> 00:56:32,509
+So the question is, if h had less than 10 units, would
+it be inferior to a linear classifier? I think that's...
+
+789
+00:56:33,200 --> 00:56:39,509
+that's actually not obvious to me. It's an interesting
+question. I think... you could make that work.
+
+790
+00:56:39,509 --> 00:56:40,509
+I think you could make it work.
+
+791
+00:56:43,509 --> 00:56:47,509
+Yeah, I think that would actually work. Someone
+should try that for extra points in the assignment.
+
+792
+00:56:47,509 --> 00:56:49,509
+So you'll have a section on the
+assignment to do something fun or extra,
+
+793
+00:56:49,510 --> 00:56:53,220
+and so you get to come up with whatever you
+think is an interesting experiment and we'll
+
+794
+00:56:53,220 --> 00:56:56,699
+give you some bonus points. So that's a good
+candidate for something you might
+
+795
+00:56:56,699 --> 00:56:59,659
+want to investigate, whether that works or not.
+
+796
+00:56:59,659 --> 00:57:00,929
+Any other questions? Go ahead.
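+
+For concreteness, the two-layer score function being described can be
+written in a few lines of numpy. This is only a sketch: the shapes match the
+CIFAR-10 example above (3072 pixels in, a hidden layer of 100, 10 class
+scores out), and the weight scales and names are made up for illustration,
+not taken from the lecture.
+
+~~~python
+import numpy as np
+
+x = np.random.randn(3072)               # stand-in for a stretched-out image
+W1 = 0.01 * np.random.randn(100, 3072)  # first layer weights
+W2 = 0.01 * np.random.randn(10, 100)    # second layer weights
+
+h = np.maximum(0, W1.dot(x))   # hidden layer: matrix multiply, threshold at 0
+scores = W2.dot(h)             # one more matrix multiply: 10 class scores
+~~~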
+
+797
+00:57:01,329 --> 00:57:11,329
+(Student is asking a question)
+
+798
+00:57:11,329 --> 00:57:13,589
+Sorry, I don't think I understood the question.
+
+799
+00:57:13,589 --> 00:57:26,989
+(Student is asking question)
+
+800
+00:57:26,989 --> 00:57:28,000
+I see.
+
+801
+00:57:28,900 --> 00:57:32,389
+So you're really asking about the layout of the
+h vector and how it gets allocated over the
+
+802
+00:57:32,389 --> 00:57:34,989
+different modes of the dataset,
+and I don't have a good
+
+803
+00:57:34,989 --> 00:57:39,500
+answer for that. Since we're going
+to train this fully with backpropagation,
+
+804
+00:57:39,500 --> 00:57:42,690
+I think it's like naive to think that
+there will be an exact template for, say, a
+
+805
+00:57:42,690 --> 00:57:46,539
+left car facing, red car facing left. You
+probably won't find that. You'll find
+
+806
+00:57:46,539 --> 00:57:50,690
+these kind of like mixes, and weird
+things, intermediates, and so on.
+
+807
+00:57:50,690 --> 00:57:54,390
+So this neural network will come in and it will
+optimally find a way to carve up your data
+
+808
+00:57:54,390 --> 00:57:55,630
+with its linear boundaries
+
+809
+00:57:55,630 --> 00:57:59,809
+and these weights will all get adjusted
+just to make it come out right. So it's
+
+810
+00:57:59,809 --> 00:58:03,809
+really hard to say. It will all become
+tangled up I think. Go ahead.
+
+811
+00:58:03,809 --> 00:58:09,500
+(Student is asking question)
+
+812
+00:58:09,500 --> 00:58:10,579
+That's right. So that's the
+
+813
+00:58:10,579 --> 00:58:14,579
+size of a hidden layer, and it's a hyperparameter.
+We get to choose that. So I chose a
+
+814
+00:58:14,579 --> 00:58:18,719
+hundred. Usually that's going to be, usually,
+you'll see that with neural networks. We'll go into
+
+815
+00:58:18,719 --> 00:58:22,739
+this a lot, but usually you want them to
+be as big as possible, as long as it fits in your
+
+816
+00:58:22,739 --> 00:58:27,659
+computer and so on, so more is better.
+We'll go into that. Go ahead.
+
+817
+00:58:27,659 --> 00:58:33,659
+(Student is asking question)
+
+818
+00:58:33,659 --> 00:58:38,639
+So you're asking, do we always take max of 0 and h,
+and we don't, and I'll get, it's like five slides
+
+819
+00:58:38,639 --> 00:58:44,359
+away. So I'm going to go into neural networks.
+I guess maybe I should preemptively just go
+
+820
+00:58:44,360 --> 00:58:48,390
+ahead and take questions near the end.
+If you wanted this to be a three-layer
+
+821
+00:58:48,390 --> 00:58:50,940
+neural network by the way, there's a very
+simple way in which we just extend
+
+822
+00:58:50,940 --> 00:58:53,710
+this, right? So we just keep continuing
+the same pattern where we have all these
+
+823
+00:58:53,710 --> 00:58:57,159
+intermediate hidden nodes, and then we
+can keep making our network deeper and
+
+824
+00:58:57,159 --> 00:58:59,750
+deeper, and you can compute more
+interesting functions because you're
+
+825
+00:58:59,750 --> 00:59:03,369
+giving yourself more time to compute
+something interesting in a handwavy way.
+
+826
+00:59:03,369 --> 00:59:09,559
+Now, one other slide I wanted to flash is
+that, training a two-layer neural network,
+
+827
+00:59:09,559 --> 00:59:12,690
+I mean, it's actually quite simple when
+it comes down to it.
So this is a slide
+
+828
+00:59:12,690 --> 00:59:17,349
+borrowed from a blog post I found, and
+basically it suffices roughly eleven lines of
+
+829
+00:59:17,349 --> 00:59:21,980
+Python to implement a two-layer neural
+network, doing binary classification on
+
+830
+00:59:21,980 --> 00:59:27,570
+what is this, two-dimensional data. So you
+have a two-dimensional data matrix X. You
+
+831
+00:59:27,570 --> 00:59:32,580
+have, sorry it's three-dimensional. And you
+have binary labels for y, and then
+
+832
+00:59:32,580 --> 00:59:36,579
+syn0 and syn1 are your weight matrices,
+weight1 and weight2. And so I think they're
+
+833
+00:59:36,579 --> 00:59:41,150
+called syn for synapse but I'm not sure. And
+then this is the optimization loop here
+
+834
+00:59:41,150 --> 00:59:46,269
+and what you're seeing here, I should
+use my pointer more, what you're
+
+835
+00:59:46,269 --> 00:59:50,139
+seeing here is we're computing the first
+layer activations, but this is using
+
+836
+00:59:50,139 --> 00:59:54,069
+a sigmoid nonlinearity, not a max of 0 and X.
+And we'll go into a bit of what
+
+837
+00:59:54,070 --> 00:59:58,650
+these nonlinearities might be. So sigmoid is
+one form. It's computing the first layer,
+
+838
+00:59:58,650 --> 01:00:03,059
+and then it's computing the second layer, and then
+it's computing here right away the backward
+
+839
+01:00:03,059 --> 01:00:08,130
+pass. So this is the l2_delta. It's the gradient
+on l2, the gradient on l1, and the
+
+840
+01:00:08,130 --> 01:00:13,390
+gradient, and this is an update here.
+So right away he's doing an update at
+
+841
+01:00:13,390 --> 01:00:17,150
+the same time as doing the final piece
+of backprop here, where he's formulating the
+
+842
+01:00:17,150 --> 01:00:22,519
+gradient on the W, and right away he's
+adding in the gradient here. And so really
+
+843
+01:00:22,519 --> 01:00:24,630
+eleven lines suffice to train a
+neural network to do binary
+
+844
+01:00:24,630 --> 01:00:29,710
+classification. The reason that this loss
+might look slightly different from what
+
+845
+01:00:29,710 --> 01:00:33,500
+you've seen right now, is that this is a
+logistic regression loss. So you saw a
+
+846
+01:00:33,500 --> 01:00:37,159
+generalization of it which is a softmax
+classifier into multiple dimensions. But
+
+847
+01:00:37,159 --> 01:00:40,149
+this is basically a logistic loss being
+updated here and you can go through this
+
+848
+01:00:40,150 --> 01:00:43,500
+in more detail by yourself. But the
+logistic regression loss looks slightly
+
+849
+01:00:43,500 --> 01:00:50,539
+different, and that's inside
+there. But otherwise, yes, so this is not too
+
+850
+01:00:50,539 --> 01:00:55,320
+crazy of a computation, and very few
+lines of code suffice to actually train
+
+851
+01:00:55,320 --> 01:00:58,900
+these networks. Everything else is fluff.
+How do you make it efficient, how do
+
+852
+01:00:58,900 --> 01:01:03,019
+you... there's a cross-validation pipeline
+that you need to have and all this stuff
+
+853
+01:01:03,019 --> 01:01:07,050
+that goes on top to actually give these
+large code bases, but the kernel of it is
+
+854
+01:01:07,050 --> 01:01:11,019
+quite simple. We compute these layers, do
+forward pass, we do backward pass, we do an
+
+855
+01:01:11,019 --> 01:01:14,540
+update, we keep iterating this over and over again.
+Go ahead.
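For reference, the kind of program being walked through here looks roughly like the sketch below, in the spirit of the "11 lines of Python" blog post: sigmoid layers l1 and l2, the deltas computed immediately in the backward pass, and the weight update fused into the last step of backprop. The toy data and the iteration count are our own choices, not a verbatim copy of the slide:

~~~python
import numpy as np

X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])  # 4 examples, 3-D inputs
y = np.array([[0,1,1,0]]).T                      # binary labels
syn0 = 2 * np.random.random((3,4)) - 1           # weight matrix 1 ("synapse 0")
syn1 = 2 * np.random.random((4,1)) - 1           # weight matrix 2 ("synapse 1")
for j in range(10000):
    l1 = 1 / (1 + np.exp(-X.dot(syn0)))          # first layer, sigmoid
    l2 = 1 / (1 + np.exp(-l1.dot(syn1)))         # second layer, sigmoid
    l2_delta = (y - l2) * (l2 * (1 - l2))        # gradient on l2
    l1_delta = l2_delta.dot(syn1.T) * (l1 * (1 - l1))  # gradient on l1
    syn1 += l1.T.dot(l2_delta)                   # update fused with backprop
    syn0 += X.T.dot(l1_delta)
~~~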
+
+856
+01:01:14,540 --> 01:01:16,240
+(Student is asking a question)
+
+857
+01:01:16,240 --> 01:01:18,840
+The random function is creating
+your first initial random
+
+858
+01:01:18,840 --> 01:01:24,170
+weights, so you need to start somewhere
+so you generate a random W.
+
+859
+01:01:24,170 --> 01:01:29,150
+Okay. Now I wanted to mention that you'll also
+be training a two-layer neural network
+
+860
+01:01:29,150 --> 01:01:32,070
+in this class, so you'll be doing
+something very similar to this but
+
+861
+01:01:32,070 --> 01:01:34,950
+you're not using logistic regression and
+you might have different activation
+
+862
+01:01:34,950 --> 01:01:39,149
+functions. But again, just my advice to
+you when you implement this is, stage
+
+863
+01:01:39,150 --> 01:01:42,789
+your computation into these intermediate
+results, and then do proper
+
+864
+01:01:42,789 --> 01:01:46,909
+backpropagation into every intermediate
+result. So you might have, you compute
+
+865
+01:01:46,909 --> 01:01:54,460
+your... Let's see. You compute, you receive these
+weight matrices and also the biases. I don't
+
+866
+01:01:54,460 --> 01:01:59,940
+believe you have biases actually in your SVM and in
+your softmax, but here you'll have biases. So
+
+867
+01:01:59,940 --> 01:02:03,269
+take your weight matrices and the biases,
+compute the first hidden layer, compute your scores,
+
+868
+01:02:03,269 --> 01:02:08,429
+compute your loss, and then do backward
+pass. So backprop into scores, then
+
+869
+01:02:08,429 --> 01:02:13,739
+backprop into the weights at the second
+layer, and backprop into this h1 vector,
+
+870
+01:02:13,739 --> 01:02:18,849
+and then through h1, backprop into the first
+weight matrices and the first biases. Okay, so do
+
+871
+01:02:18,849 --> 01:02:22,929
+proper backpropagation here. Otherwise, if
+you try to right away, just say, what
+
+872
+01:02:22,929 --> 01:02:26,739
+is dW1, what is the gradient on W1. If you
+just try to make it a single expression
+
+873
+01:02:26,739 --> 01:02:31,099
+for it, it will be way too large and you'll have
+headaches. So do it through a series of
+
+874
+01:02:31,099 --> 01:02:32,619
+steps of backpropagation.
+
+875
+01:02:32,619 --> 01:02:36,119
+That's just a hint.
+
+876
+01:02:36,119 --> 01:02:39,940
+Okay. So now I'd like to, so that was the
+presentation of neural networks without
+
+877
+01:02:39,940 --> 01:02:43,940
+all the brain stuff and it looks fairly
+simple. So now we're going to make it
+
+878
+01:02:43,940 --> 01:02:47,740
+slightly more insane by folding in all
+kinds of like motivations, mostly
+
+879
+01:02:47,740 --> 01:02:51,219
+historical, about like how this came
+about, that it's related to the brain at all.
+
+880
+01:02:51,219 --> 01:02:54,939
+And so, we have neural networks and we
+have neurons inside these neural
+
+881
+01:02:54,940 --> 01:02:59,440
+networks. So this is what neurons look like.
+This is just what happens when you search on
+
+882
+01:02:59,440 --> 01:03:03,800
+image search 'neurons', so there you go. Now
+your actual biological neurons don't
+
+883
+01:03:03,800 --> 01:03:09,030
+look like this. Unfortunately, they look
+more like that. And so a neuron,
+
+884
+01:03:09,030 --> 01:03:11,880
+just very briefly, just to give you an
+idea about where this is all coming from
+
+885
+01:03:11,880 --> 01:03:17,220
+you have the cell body, or soma as people like
+to call it, and it's got all these dendrites
+
+886
+01:03:17,220 --> 01:03:21,049
+that are connected to other neurons.
So
+there's a cluster of other neurons and
+
+887
+01:03:21,050 --> 01:03:25,450
+cell bodies over here. And dendrites are
+really these appendages that listen to
+
+888
+01:03:25,450 --> 01:03:30,869
+them. So these are your inputs to a neuron,
+and then it's got a single axon that
+
+889
+01:03:30,869 --> 01:03:35,839
+comes out of the neuron that carries the
+output of the computation that this neuron performs.
+
+890
+01:03:35,840 --> 01:03:40,579
+So usually, usually you have this
+neuron that receives inputs. If many of them
+
+891
+01:03:40,579 --> 01:03:46,179
+align, then this cell, this neuron, can
+choose to spike. It sends an action
+
+892
+01:03:46,179 --> 01:03:50,199
+potential down the axon and then this
+actually like diverges out to
+
+893
+01:03:50,199 --> 01:03:54,659
+connect to dendrites of other neurons that
+are downstream. So there are other
+
+894
+01:03:54,659 --> 01:03:57,639
+neurons here and their dendrites
+connect to the axons of these guys.
+
+895
+01:03:57,639 --> 01:04:02,299
+So basically, just neurons connected through
+these synapses in between, and we have these
+
+896
+01:04:02,300 --> 01:04:05,840
+dendrites that are the input to a neuron and
+this axon that actually carries the
+
+897
+01:04:05,840 --> 01:04:10,410
+output of a neuron. And so basically, you
+can come up with a very crude model of a
+
+898
+01:04:10,410 --> 01:04:16,769
+neuron, and it will look something like this.
+We have an axon, so this is the cell body
+
+899
+01:04:16,769 --> 01:04:20,909
+here of a neuron. And just imagine an
+axon coming from a different neuron,
+
+900
+01:04:20,909 --> 01:04:24,730
+somewhere in the network, and this neuron is
+connected to that neuron through this
+
+901
+01:04:24,730 --> 01:04:29,840
+synapse. And every one of these synapses
+has a weight associated with it
+
+902
+01:04:29,840 --> 01:04:35,350
+of how much this neuron likes that
+neuron basically. And so the axon carries
+
+903
+01:04:35,350 --> 01:04:39,769
+this x. It interacts in the synapse and
+they multiply in this crude model. So you
+
+904
+01:04:39,769 --> 01:04:44,989
+get w0x0 flowing to the soma.
+And then that happens for many neurons
+
+905
+01:04:44,989 --> 01:04:45,849
+so you have lots of
+
+906
+01:04:45,849 --> 01:04:51,500
+inputs of w times x flowing in. And the
+cell body here, it just performs a sum, offset by
+
+907
+01:04:51,500 --> 01:04:56,940
+a bias, and then it passes through an
+
+908
+01:04:56,940 --> 01:05:02,800
+activation function to actually compute
+the output of this axon. Now in
+
+909
+01:05:02,800 --> 01:05:06,570
+biological models, historically people
+liked to use the sigmoid nonlinearity
+
+910
+01:05:06,570 --> 01:05:09,430
+for the activation function.
+The reason for that is because
+
+911
+01:05:09,430 --> 01:05:11,730
+you get a number between 0 and 1, and
+
+912
+01:05:11,730 --> 01:05:15,420
+you can interpret that as the rate at
+which this neuron is firing for that
+
+913
+01:05:15,420 --> 01:05:19,809
+particular input. So it's a rate between
+0 and 1 that's going through the
+
+914
+01:05:19,809 --> 01:05:23,889
+activation function. So if this neuron has
+seen something it likes in the neurons
+
+915
+01:05:23,889 --> 01:05:27,900
+that are connected to it, it will start to
+spike a lot, and the rate is described by
+
+916
+01:05:27,900 --> 01:05:33,139
+f of the input. Okay, so that's the crude
+model of the neuron.
If I wanted to implement it
+
+917
+01:05:33,139 --> 01:05:38,819
+it would look something like this. So a
+neuron_tick function, a forward pass: it receives
+
+918
+01:05:38,820 --> 01:05:44,500
+some inputs. This is a vector and we form
+a sum at the cell body, so just a linear sum.
+
+919
+01:05:44,500 --> 01:05:49,980
+And we compute the firing rate as a sigmoid
+of the cell body sum and return the firing
+
+920
+01:05:49,980 --> 01:05:53,579
+rate. And then this can plug into
+different neurons, right? So you can
+
+921
+01:05:53,579 --> 01:05:56,710
+imagine, you can actually see that this
+looks very similar to a linear
+
+922
+01:05:56,710 --> 01:06:02,750
+classifier, right? We're forming a linear sum here,
+a weighted sum, and we're passing that through
+
+923
+01:06:02,750 --> 01:06:07,050
+a nonlinearity. So every single neuron in
+this model is really like a small linear
+
+924
+01:06:07,050 --> 01:06:11,530
+classifier, but these linear classifiers plug into
+each other, and they can work together to
+
+925
+01:06:11,530 --> 01:06:16,650
+do interesting things. Now one note to make
+about neurons is that they're very, they're
+
+926
+01:06:16,650 --> 01:06:21,300
+not like biological neurons. Biological
+neurons are super complex, so if you go
+
+927
+01:06:21,300 --> 01:06:24,670
+around and you start saying that neural
+networks work like the brain, people will
+
+928
+01:06:24,670 --> 01:06:28,849
+start to frown
+at you, and that's because neurons are
+
+929
+01:06:28,849 --> 01:06:33,650
+complex, dynamical systems. There are many
+different types of neurons. They function
+
+930
+01:06:33,650 --> 01:06:38,550
+differently. These dendrites, they
+can perform lots of interesting
+
+931
+01:06:38,550 --> 01:06:42,140
+computation. A good review article is
+Dendritic Computation, which I really
+
+932
+01:06:42,140 --> 01:06:46,069
+enjoyed. These synapses are complex
+dynamical systems. They're not just a
+
+933
+01:06:46,070 --> 01:06:49,720
+single weight. And we're not really sure
+if the brain uses rate code to
+
+934
+01:06:49,720 --> 01:06:54,689
+communicate, so it's a very crude mathematical
+model; don't push this analogy too much.
+
+935
+01:06:54,690 --> 01:06:57,960
+But it's good for, kind of like, media articles,
+
+936
+01:06:57,960 --> 01:07:01,990
+so I suppose that's why this keeps
+coming up again and again as we
+
+937
+01:07:01,990 --> 01:07:04,989
+explain that this works like your brain.
+But I'm not going to go too deep into
+
+938
+01:07:04,989 --> 01:07:09,829
+this. To go back to a question that was
+asked before, there's an entire set of
+
+939
+01:07:09,829 --> 01:07:11,859
+nonlinearities that we can choose from.
+
+940
+01:07:14,559 --> 01:07:17,559
+So historically, sigmoid has been used
+
+941
+01:07:17,559 --> 01:07:20,210
+quite a bit, and we're going to go into
+much more detail over what these
+
+942
+01:07:20,210 --> 01:07:23,690
+nonlinearities are, what are their
+tradeoffs, and why you might want to use
+
+943
+01:07:23,690 --> 01:07:27,838
+one or the other, but for now, I'd just like to
+flash them and mention that there are many things to
+
+944
+01:07:27,838 --> 01:07:28,579
+choose from.
+
+945
+01:07:28,579 --> 01:07:33,940
+Historically, people used sigmoid and tanh.
+As of 2012, ReLU became quite popular.
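Before going on, here is a minimal sketch of the crude neuron model just described; the cell-body sum and the sigmoid firing rate follow the description, while the class layout, weights, and inputs are invented for illustration:

~~~python
import numpy as np

class Neuron:
    # One synaptic weight per input, plus a bias (the crude model above).
    def __init__(self, weights, bias):
        self.weights = np.asarray(weights)
        self.bias = bias

    def forward(self, inputs):
        # Weighted linear sum at the cell body, offset by the bias...
        cell_body_sum = np.sum(self.weights * inputs) + self.bias
        # ...passed through a sigmoid, read as a firing rate in [0, 1].
        return 1.0 / (1.0 + np.exp(-cell_body_sum))

n = Neuron(weights=[0.5, -1.2, 0.3], bias=0.1)
rate = n.forward(np.array([1.0, 0.5, -0.5]))
~~~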
+
+946
+01:07:33,940 --> 01:07:38,429
+It makes your networks converge quite a bit
+faster, so right now, if you wanted a
+
+947
+01:07:38,429 --> 01:07:41,429
+default choice for nonlinearity, use ReLU.
+
+948
+01:07:41,429 --> 01:07:45,679
+That's the current default recommendation.
+And then there are a few, kind of, hipster
+
+949
+01:07:45,679 --> 01:07:51,489
+activation functions here. And so Leaky ReLUs
+were proposed a few years ago. Maxout is
+
+950
+01:07:51,489 --> 01:07:54,989
+interesting. And very recently ELU.
+And so you can come up with different
+
+951
+01:07:54,989 --> 01:07:58,319
+activation functions and you can
+describe why these might work better or
+
+952
+01:07:58,320 --> 01:08:01,789
+not. And so this is an active area of
+research. It's trying to come up with these
+
+953
+01:08:01,789 --> 01:08:05,949
+activation functions that perform, that
+have better properties in one way or
+
+954
+01:08:05,949 --> 01:08:10,909
+another. So we're going to go into this with much
+more detail soon in the class. But for
+
+955
+01:08:10,909 --> 01:08:15,980
+now, we have these neurons, we have a
+choice of activation function, and then
+
+956
+01:08:15,980 --> 01:08:19,259
+we arrange these neurons into neural
+networks, right? So we just connect them
+
+957
+01:08:19,259 --> 01:08:23,140
+together so they can talk to each other.
+And so here is an example of a
+
+958
+01:08:23,140 --> 01:08:27,170
+2-layer neural net or 3-layer neural net. When
+you want to count the number of layers in a
+
+959
+01:08:27,170 --> 01:08:30,829
+neural net, you count the number of
+layers that have weights. So here, the
+
+960
+01:08:30,829 --> 01:08:35,449
+input layer does not count as a layer,
+because there's no... These neurons are just
+
+961
+01:08:35,449 --> 01:08:39,729
+single values. They don't actually do any
+computation. So we have two layers here
+
+962
+01:08:39,729 --> 01:08:45,068
+that have weights. So it's a 2-layer net. And
+we call these layers fully connected
+
+963
+01:08:45,069 --> 01:08:50,870
+layers, and so, remember that I showed you that a
+single neuron computes this little
+
+964
+01:08:50,870 --> 01:08:54,750
+weighted sum, and then passes that through
+a nonlinearity. In a neural network, the
+
+965
+01:08:54,750 --> 01:08:58,829
+reason we arrange these into layers is
+because arranging them into layers allows
+
+966
+01:08:58,829 --> 01:09:01,759
+us to perform the computation much more
+efficiently. So instead of having an
+
+967
+01:09:01,759 --> 01:09:04,460
+amorphous blob of neurons where every one
+of them has to be computed independently,
+
+968
+01:09:04,460 --> 01:09:08,699
+having them in layers allows us to use
+vectorized operations. And so we can
+
+969
+01:09:08,699 --> 01:09:10,139
+compute an entire set of
+
+970
+01:09:10,140 --> 01:09:14,410
+neurons in a single hidden layer with just
+a single matrix multiply. And
+
+971
+01:09:14,410 --> 01:09:17,619
+that's why we arrange them in these
+layers, where neurons inside a layer can be
+
+972
+01:09:17,619 --> 01:09:21,119
+evaluated completely in parallel, and they
+all see the same input. So it's a
+
+973
+01:09:21,119 --> 01:09:25,519
+computational trick to arrange them in
+layers. So this is a 3-layer neural net
+
+974
+01:09:25,520 --> 01:09:30,500
+and this is how you would compute it.
+Just a bunch of matrix multiplies
+
+975
+01:09:30,500 --> 01:09:35,550
+followed by activation functions.
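As a sketch of that computation: a batch of inputs pushed through a 3-layer fully connected net, where each hidden layer is one matrix multiply plus a nonlinearity, so a whole layer of neurons is evaluated in parallel. All sizes and the initialization scale below are invented:

~~~python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((256, 3072))             # a batch of 256 inputs
W1, b1 = 0.01 * rng.standard_normal((3072, 100)), np.zeros(100)
W2, b2 = 0.01 * rng.standard_normal((100, 100)), np.zeros(100)
W3, b3 = 0.01 * rng.standard_normal((100, 10)), np.zeros(10)

H1 = np.maximum(0, X @ W1 + b1)   # entire first hidden layer in one matmul
H2 = np.maximum(0, H1 @ W2 + b2)  # second hidden layer, same pattern
scores = H2 @ W3 + b3             # (256, 10) class scores; no ReLU on top
~~~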
+So now I'd
+
+976
+01:09:35,550 --> 01:09:40,520
+like to show you a demo of how these
+neural networks work. So this is a JavaScript demo
+
+977
+01:09:40,520 --> 01:09:44,770
+that I'll show you in a bit. But
+basically, this is an example of a
+
+978
+01:09:44,770 --> 01:09:50,080
+two-layer neural network classifying a,
+doing a binary classification task. So we have two
+
+979
+01:09:50,080 --> 01:09:54,119
+classes, red and green. And so we have these
+points in two dimensions, and I'm drawing
+
+980
+01:09:54,119 --> 01:09:58,109
+the decision boundaries of the neural
+network. And so what you can see is, when
+
+981
+01:09:58,109 --> 01:10:01,969
+I train a neural network on this data,
+the more hidden neurons I have in my
+
+982
+01:10:01,970 --> 01:10:05,770
+hidden layer, the more wiggle your neural
+network has, right? The more it can compute
+
+983
+01:10:05,770 --> 01:10:12,290
+crazy functions. And just to show you the effect
+also of regularization strength. So this is the
+
+984
+01:10:12,290 --> 01:10:17,069
+regularization of how much you penalize
+large Ws. So you can see that when you insist
+
+985
+01:10:17,069 --> 01:10:22,340
+that your Ws are very small, you end up with
+very smooth functions, so they don't
+
+986
+01:10:22,340 --> 01:10:27,050
+have as much variance. So these neural
+networks, there's not as much wiggle
+
+987
+01:10:27,050 --> 01:10:31,090
+that they can give you, and then as you
+decrease the regularization, these neural
+
+988
+01:10:31,090 --> 01:10:34,090
+networks can do more and more complex
+tasks, so they can kind of get in and get
+
+989
+01:10:34,090 --> 01:10:38,710
+these little squeezed out points to cover
+them in the training data. So let me show
+
+990
+01:10:38,710 --> 01:10:41,489
+you what this looks like
+
+991
+01:10:41,489 --> 01:10:47,079
+during training. Okay.
+
+992
+01:10:47,079 --> 01:10:53,010
+So there's some stuff to explain here.
+Let me first actually... So you can play with
+
+993
+01:10:53,010 --> 01:10:56,060
+this because it's all in JavaScript.
+
+994
+01:10:56,060 --> 01:11:04,060
+Okay. All right. So what we're doing here is we have
+six neurons, and this is a binary
+
+995
+01:11:04,060 --> 01:11:09,000
+classification dataset with circle
+data. And so we have a little cluster of
+
+996
+01:11:09,000 --> 01:11:13,520
+green dots separated by red dots. And we're
+training a neural network to classify
+
+997
+01:11:13,520 --> 01:11:18,080
+this dataset. So if I restart the neural
+network, it just starts off with a
+
+998
+01:11:18,080 --> 01:11:20,949
+random W, and then it converges the
+decision boundary to actually classify
+
+999
+01:11:20,949 --> 01:11:26,289
+the data. What I'm showing on the right, which is the
+cool part, this visualization, is one interpretation of
+
+1000
+01:11:26,289 --> 01:11:29,529
+the neural network here: I'm
+taking this grid here and I'm
+
+1001
+01:11:29,529 --> 01:11:33,909
+showing how this space gets warped by
+the neural network. So you can interpret
+
+1002
+01:11:33,909 --> 01:11:37,619
+what the neural network is doing is it's
+using its hidden layer to transform your
+
+1003
+01:11:37,619 --> 01:11:41,159
+input data in such a way that the second
+layer can come in with a linear
+
+1004
+01:11:41,159 --> 01:11:47,059
+classifier and classify your data. So
+here, you see that the neural network
+
+1005
+01:11:47,060 --> 01:11:51,920
+arranges your space.
It warps it such
+that the second layer, which is really a
+
+1006
+01:11:51,920 --> 01:11:56,779
+linear classifier on top of the first
+layer, can put a plane through it, okay?
+
+1007
+01:11:56,779 --> 01:11:59,939
+So it's warping the space so that you
+can put a plane through it and
+
+1008
+01:11:59,939 --> 01:12:06,259
+separate out the points. So let's look at
+this again. So you can roughly see
+
+1009
+01:12:06,260 --> 01:12:10,940
+how this gets warped so that you can
+linearly classify the data. This is
+
+1010
+01:12:10,940 --> 01:12:13,569
+something that people sometimes also
+refer to as the kernel trick. It's
+
+1011
+01:12:13,569 --> 01:12:19,149
+changing your data representation to a
+space where it's linearly separable. Okay.
+
+1012
+01:12:19,149 --> 01:12:23,079
+Now, here's a question. If we'd like to
+separate, so right now we have six
+
+1013
+01:12:23,079 --> 01:12:27,809
+neurons here in the intermediate layer,
+and it allows us to separate out these
+
+1014
+01:12:27,810 --> 01:12:33,580
+data points. So you can see actually those six
+neurons roughly. You can see these lines
+
+1015
+01:12:33,580 --> 01:12:36,869
+here, like they're kind of like these
+functions of one of these neurons. So
+
+1016
+01:12:36,869 --> 01:12:40,349
+here's a question for you: what is the
+minimum number of neurons for which this
+
+1017
+01:12:40,350 --> 01:12:45,570
+dataset is separable with a neural
+network? If I want the neural network
+
+1018
+01:12:45,570 --> 01:12:49,089
+to correctly classify this, how many neurons do
+I need in the hidden layer as a minimum?
+
+1019
+01:12:57,890 --> 01:13:04,270
+Four? I heard some threes, some fours.
+Binary search.
+
+1020
+01:13:04,270 --> 01:13:08,870
+So intuitively, the way this
+would work is, let's see, four.
+
+1021
+01:13:12,270 --> 01:13:15,270
+So what happens with four is, there is one
+
+1022
+01:13:15,270 --> 01:13:18,910
+neuron here that went from this way to
+that way, this way to that way, this way
+
+1023
+01:13:18,910 --> 01:13:22,689
+to that way. There's four neurons that
+are cutting up this plane. And then
+
+1024
+01:13:22,689 --> 01:13:27,039
+there's an additional layer that's doing a
+weighted sum. So in fact, the lowest
+
+1025
+01:13:27,039 --> 01:13:34,739
+number here would be three, which
+would work. So with three neurons... So
+
+1026
+01:13:34,739 --> 01:13:39,189
+one plane, second plane, third plane. So
+three linear functions with a nonlinearity,
+
+1027
+01:13:39,189 --> 01:13:45,649
+and then you can basically with three
+lines, you can carve out the space so
+
+1028
+01:13:45,649 --> 01:13:50,329
+that the second layer can just combine
+them when their numbers are 1 and not 0.
+
+1029
+01:13:50,329 --> 01:13:52,429
+(Student is asking question)
+
+1030
+01:13:52,430 --> 01:13:57,850
+At two? Certainly. So at two, this will break
+because two lines are not enough. I
+
+1031
+01:13:57,850 --> 01:14:03,900
+suppose this works... Not going to look very
+good here. So with two, basically it will find
+
+1032
+01:14:03,900 --> 01:14:07,239
+the optimal way of just using these two
+lines. They're kind of creating this
+
+1033
+01:14:07,239 --> 01:14:11,239
+tunnel and that's the best you can do. Okay?
+
+1034
+01:14:11,239 --> 01:14:14,599
+(Student is asking question)
+
+1035
+01:14:18,600 --> 01:14:25,400
+The curve, I think... Which nonlinearity am I using?
+tanh? Yeah, I'm not sure exactly how that works out.
+
+1036
+01:14:25,400 --> 01:14:31,300
+If I was using ReLU, I think it would be much,
+so ReLU is the... Let me change to ReLU, and I
+
+1037
+01:14:31,300 --> 01:14:41,460
+think you'd see sharp boundaries. Yeah.
+Yes, this is three. You can do four. So let's do...
+
+1038
+01:14:41,460 --> 01:14:47,460
+(Student is asking question)
+
+1039
+01:14:47,460 --> 01:14:50,460
+Yeah, that's because, it's
+
+1040
+01:14:50,460 --> 01:14:52,130
+because in some of these parts
+
+1041
+01:14:52,130 --> 01:14:57,819
+there's more than one of those ReLUs
+active, and so you end up with...
+
+1042
+01:14:57,819 --> 01:15:02,359
+There are really three lines. I think like one, two,
+three, but then in some of the corners two ReLU
+
+1043
+01:15:02,359 --> 01:15:05,689
+neurons are active and so these
+weights will add up. It's kind of funky. You
+
+1044
+01:15:05,689 --> 01:15:12,649
+have to think about it a bit. But okay. So let's
+look at, say, twenty here. So I changed to twenty
+
+1045
+01:15:12,649 --> 01:15:16,670
+so we have lots of space there, and let's
+look at different datasets like say spiral.
+
+1046
+01:15:16,670 --> 01:15:22,390
+So you can see how this thing just, as I'm
+doing this update, it will just go in there
+
+1047
+01:15:22,390 --> 01:15:32,800
+and figure that out. Very simple dataset
+is not... Spiral. Circle, and then random...
+
+1048
+01:15:33,200 --> 01:15:39,880
+so random data, and so it kind
+of goes in there, like covers up the green
+
+1049
+01:15:39,880 --> 01:15:48,039
+ones and the red ones. And yeah. And with
+fewer, say like five... I'm going to break this
+
+1050
+01:15:48,039 --> 01:15:54,890
+now. I'm not going to... Okay. So with five... Yes.
+So this will start working worse and worse
+
+1051
+01:15:54,890 --> 01:15:58,770
+because you don't have enough capacity
+to separate out this data. So you can
+
+1052
+01:15:58,770 --> 01:16:05,270
+play with this in your free time.
+Okay. And so as a summary,
+
+1053
+01:16:05,270 --> 01:16:10,690
+we arrange these neurons in neural
+networks into fully connected layers.
+
+1054
+01:16:10,690 --> 01:16:14,579
+We've looked at backprop and how this gets
+chained in computational graphs. And they're
+
+1055
+01:16:14,579 --> 01:16:19,149
+not really neural. And as we'll see soon,
+the bigger the better, and we'll go into
+
+1056
+01:16:19,149 --> 01:16:23,510
+that a lot. I want to take questions before I end.
+Just sorry. Were there any questions? Go ahead.
+
+1057
+01:16:23,510 --> 01:16:27,710
+(Student is asking question)
+
+1058
+01:16:27,710 --> 01:16:29,359
+We have two more minutes. Sorry.
+
+1059
+01:16:29,359 --> 01:16:35,710
+(Student is asking question)
+
+1060
+01:16:35,710 --> 01:16:36,899
+Yes, thank you.
+
+1061
+01:16:36,899 --> 01:16:41,119
+So is it always better to have more neurons
+in your neural network? The answer to
+
+1062
+01:16:41,119 --> 01:16:48,809
+that is yes. More is always better. It's
+usually a computational constraint, so more will
+
+1063
+01:16:48,810 --> 01:16:52,510
+always work better, but then you have to
+be careful to regularize it properly. So
+
+1064
+01:16:52,510 --> 01:16:55,810
+the correct way to constrain your neural
+network to not overfit your data is not by
+
+1065
+01:16:55,810 --> 01:16:58,940
+making the network smaller.
+The correct way to do it is to increase the
+
+1066
+01:16:58,940 --> 01:17:03,079
+regularization.
So you always want to use
+as large a network as you want, but then
+
+1067
+01:17:03,079 --> 01:17:06,269
+you have to make sure to properly
+regularize it. But most of the time
+
+1068
+01:17:06,270 --> 01:17:09,320
+because of computational reasons, you have a finite
+amount of time, you don't want to wait forever to
+
+1069
+01:17:09,320 --> 01:17:14,980
+train your networks. You'll use smaller
+ones for practical reasons. Question?
+
+1070
+01:17:14,980 --> 01:17:17,780
+(Student is asking question)
+
+1071
+01:17:17,780 --> 01:17:19,980
+Do you regularize each layer equally?
+
+1072
+01:17:19,980 --> 01:17:25,509
+Usually you do, as a simplification.
+Yeah. Most of the, often when you see
+
+1073
+01:17:25,510 --> 01:17:28,030
+networks get trained in practice, they will
+be regularized the same way throughout.
+
+1074
+01:17:28,030 --> 01:17:31,030
+But you don't have to necessarily. Go ahead.
+
+1075
+01:17:31,030 --> 01:17:35,710
+(Student is asking question)
+
+1076
+01:17:35,710 --> 01:17:40,500
+Is there any value to using second derivatives, using
+the Hessian, in optimizing neural networks? There is value
+
+1077
+01:17:40,500 --> 01:17:44,859
+sometimes when your datasets are small.
+You can use things like L-BFGS which I
+
+1078
+01:17:44,859 --> 01:17:47,729
+didn't go into too much, and that's a
+second-order method, but usually the datasets
+
+1079
+01:17:47,729 --> 01:17:50,500
+are really large and that's when
+L-BFGS doesn't work very well.
+
+1080
+01:17:50,500 --> 01:17:57,039
+So when you have millions of data points, you can't do
+L-BFGS for various reasons. Yeah. And L-BFGS is
+
+1081
+01:17:57,039 --> 01:18:01,970
+not very good with minibatches. You always
+have to do full batch by default. Question.
+
+1082
+01:18:01,970 --> 01:18:09,950
+(Student is asking question)
+
+1083
+01:18:09,950 --> 01:18:13,650
+So what is the tradeoff between depth and
+size roughly, like how do you allocate?
+
+1084
+01:18:13,650 --> 01:18:16,450
+Not a good answer for that unfortunately.
+
+1085
+01:18:16,450 --> 01:18:20,899
+So you want, depth is good, but maybe after
+like ten layers maybe, if you have a simple dataset
+
+1086
+01:18:20,899 --> 01:18:25,219
+it's not really adding too much. We have
+one more minute so I can still take some
+
+1087
+01:18:25,220 --> 01:18:26,620
+questions. You had a question for a while.
+
+1088
+01:18:26,620 --> 01:18:31,520
+(Student is asking question)
+
+1089
+01:18:31,520 --> 01:18:35,990
+Yeah, so the tradeoff between
+where do I allocate my
+
+1090
+01:18:35,990 --> 01:18:40,019
+capacity, do I want it to be deeper or do
+I want it to be wider, there's not a very good
+
+1091
+01:18:40,020 --> 01:18:41,860
+answer to that.
+
+1092
+01:18:41,860 --> 01:18:44,560
+(Student is asking question)
+
+1093
+01:18:44,560 --> 01:18:47,860
+Yes, usually, especially with
+images, we find that more layers are
+
+1094
+01:18:47,860 --> 01:18:51,199
+critical. But sometimes when you have
+simple datasets, like 2D or some
+
+1095
+01:18:51,199 --> 01:18:55,359
+other things, depth is not as
+critical, and so it's kind of slightly
+
+1096
+01:18:55,359 --> 01:18:59,670
+data dependent. We had a question over there.
+
+1097
+01:18:59,670 --> 01:19:05,670
+(Student is asking question)
+
+1098
+01:19:05,670 --> 01:19:10,050
+Different activation functions for different layers,
+does that help? Usually it's not done. Usually we
+
+1099
+01:19:10,050 --> 01:19:15,960
+just kind of pick one and go with it.
+So say, for ConvNets for example, we'll see that
+
+1100
+01:19:15,960 --> 01:19:19,279
+most of them are trained just with ReLUs.
+And so you just use that throughout and
+
+1101
+01:19:19,279 --> 01:19:22,389
+there's no real benefit to switch
+them around. People don't play with that
+
+1102
+01:19:22,390 --> 01:19:26,660
+too much, but in principle, there's
+nothing preventing you. So it is 4:20,
+
+1103
+01:19:26,660 --> 01:19:29,789
+so we're going to end here, but we'll see
+lots more neural networks, so a lot of
+
+1104
+01:19:29,789 --> 01:19:31,738
+these questions, we'll go through them.
\ No newline at end of file
diff --git a/captions/En/Lecture5_en.srt b/captions/En/Lecture5_en.srt
new file mode 100644
index 00000000..1ced507c
--- /dev/null
+++ b/captions/En/Lecture5_en.srt
@@ -0,0 +1,5289 @@
+1
+00:00:00,000 --> 00:00:05,299
+horizon but it would be a seminar most
+of you finished and unfinished but
+
+2
+00:00:05,299 --> 00:00:11,109
+against ok get some decent ok I'll be
+holding makeup office hours right
+
+3
+00:00:11,109 --> 00:00:15,660
+after this class. Assignment 2 will be
+released tomorrow or the day after tomorrow;
+
+4
+00:00:15,660 --> 00:00:19,710
+we haven't fully finalized the date, we're
+still working on it, and we're changing
+
+5
+00:00:19,710 --> 00:00:23,050
+it from last year, and so we are in the
+process of developing it and we hope to
+
+6
+00:00:23,050 --> 00:00:24,580
+have it as soon as possible.
+
+7
+00:00:24,579 --> 00:00:31,469
+It's meaty but educational, so you do
+want to get started on that ASAP once
+
+8
+00:00:31,469 --> 00:00:36,039
+it's released. We might be adjusting the
+due date or something too, because it is
+
+9
+00:00:36,039 --> 00:00:41,850
+slightly larger, and yes, so we'll be
+shuffling some of these things around,
+
+10
+00:00:41,850 --> 00:00:46,219
+and also the grading scheme of this stuff
+is just tentative and subject to change,
+
+11
+00:00:46,219 --> 00:00:48,929
+because we're still trying to figure out
+the course, it's still relatively new, and
+
+12
+00:00:48,929 --> 00:00:53,899
+a lot of it is changing. So those are
+just some heads-ups before we start. In
+
+13
+00:00:53,899 --> 00:00:57,829
+terms of your project proposal, by the
+way, which is due in roughly 10 days, I
+
+14
+00:00:57,829 --> 00:01:00,799
+wanted to just bring up a few points,
+because you'll be thinking about your
+
+15
+00:01:00,799 --> 00:01:05,890
+projects, and some of you might have some
+misconceptions about what makes a good
+
+16
+00:01:05,890 --> 00:01:11,159
+or bad project. So just two of them: the
+most common one probably is that people
+
+17
+00:01:11,159 --> 00:01:14,570
+are hesitant to work with datasets that
+are small, because they think that these
+
+18
+00:01:14,569 --> 00:01:17,669
+require a huge amount of training data,
+and this is true, there's hundreds of
+
+19
+00:01:17,670 --> 00:01:21,450
+millions of parameters in a ConvNet
+and they need training, but actually for
+
+20
+00:01:21,450 --> 00:01:25,019
+your purposes in the project this is
+kind of a myth; this is not something you
+
+21
+00:01:25,019 --> 00:01:28,579
+have to worry about a lot, you can work
+with smaller datasets, it's OK. The reason
+
+22
+00:01:28,579 --> 00:01:32,188
+it's OK is that we have this process,
+that we'll go into in much more detail later
+
+23
+00:01:32,188 --> 00:01:35,938
+in the class, called fine-tuning, and the
+thing is that in practice you rarely
+
+24
+00:01:35,938 --> 00:01:41,039
+ever train these giant ConvNets from
+scratch; you almost always do this
pretraining
+
+25
+00:01:41,040 --> 00:01:43,729
+and fine-tuning process. So the way this
+will work,
+
+26
+00:01:43,728 --> 00:01:47,590
+what this looks like is, you almost always take a
+convolutional network, pretrain it on some
+
+27
+00:01:47,590 --> 00:01:51,520
+large dataset, say ImageNet, which has a
+huge amount of data, and then you're
+
+28
+00:01:51,519 --> 00:01:54,618
+interested in some other dataset right
+there, and instead of training your ConvNet on
+
+29
+00:01:54,618 --> 00:01:58,430
+your small dataset, we'll train it
+here and then we'll transfer it over
+
+30
+00:01:58,430 --> 00:02:01,240
+there, and the way this transfer works
+is like this:
+
+31
+00:02:01,239 --> 00:02:05,359
+so here's a schematic of a convolutional
+network; we start with the image on top
+
+32
+00:02:05,359 --> 00:02:09,000
+and we'll go through a series of layers
+down to a classifier, so you're used to
+
+33
+00:02:09,000 --> 00:02:12,150
+this, but we haven't of course talked
+about the specific layers here. But we
+
+34
+00:02:12,150 --> 00:02:16,120
+take that ImageNet-pretrained network,
+we train it on ImageNet, and then we
+
+35
+00:02:16,120 --> 00:02:20,129
+chop off the top layer, the classifier;
+it's chopped off, taken away, and we
+
+36
+00:02:20,129 --> 00:02:24,150
+treat the entire convolutional network as
+a fixed feature extractor, and so you can
+
+37
+00:02:24,150 --> 00:02:27,219
+put that feature extractor on top of
+your new dataset and you're just going
+
+38
+00:02:27,219 --> 00:02:30,739
+to swap in a different layer that
+performs the classification on top. And so
+
+39
+00:02:30,739 --> 00:02:34,810
+depending on how much data you have, you're
+only going to train the last layer of
+
+40
+00:02:34,810 --> 00:02:38,159
+your network, or you can do fine-tuning,
+where you actually backpropagate
+
+41
+00:02:38,159 --> 00:02:41,379
+through some portions of the ConvNet, and as you
+get more data you're going to do back
+
+42
+00:02:41,379 --> 00:02:47,229
+propagation deeper through the network.
+And in particular, this pretraining
+
+43
+00:02:47,229 --> 00:02:51,649
+step on, say, ImageNet, people do this for you,
+so there's a huge number of people who've
+
+44
+00:02:51,650 --> 00:02:55,400
+trained convolutional networks, which
+take a long time, weeks, on different
+
+45
+00:02:55,400 --> 00:02:58,939
+datasets, and then they upload the weights
+of the ConvNet online. There's
+
+46
+00:02:58,939 --> 00:03:02,229
+something called the Caffe Model Zoo, for
+example, and these are all these
+
+47
+00:03:02,229 --> 00:03:05,629
+convolutional networks that have been pretrained
+on large datasets; they already have
+
+48
+00:03:05,629 --> 00:03:09,310
+lots of the parameters learned, and you
+just take those, swap in your
+
+49
+00:03:09,310 --> 00:03:12,769
+dataset, and you fine-tune through the
+network. So basically, if you don't have a
+
+50
+00:03:12,769 --> 00:03:16,799
+lot of data, that's okay, and you just
+take a pretrained ConvNet and just fine-
+
+51
+00:03:16,799 --> 00:03:20,500
+tune it; and so don't be afraid to work
+with small datasets, it's going to work
+
+52
+00:03:20,500 --> 00:03:27,239
+out. The second thing that we saw some
+problems with last time is that people
+
+53
+00:03:27,239 --> 00:03:31,209
+think they have infinite compute, and
+this is also a myth I'd just like to point
+
+54
+00:03:31,209 --> 00:03:35,000
+out: don't be overly ambitious in what
+you propose, these things take a while to
+
+55
+00:03:35,000 --> 00:03:37,959
+train, you don't have too many GPUs,
+you're going to have to do hyperparameter
+
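To make the fixed-feature-extractor recipe above concrete, here is a rough numpy sketch. Everything in it is a stand-in for illustration: the frozen random "pretrained" weights play the role of a ConvNet pretrained on ImageNet (e.g. from the Caffe Model Zoo), and only the freshly swapped-in classifier layer gets trained:

~~~python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained network with its top classifier chopped off.
W_frozen = 0.01 * rng.standard_normal((3072, 512))  # never updated

def extract_features(X):
    # Fixed feature extractor: forward pass only, no learning here.
    return np.maximum(0, X @ W_frozen)

# A small new dataset: 100 images, 5 classes (all synthetic here).
X_small = rng.standard_normal((100, 3072))
y_small = rng.integers(0, 5, size=100)

feats = extract_features(X_small)     # (100, 512) features
W_new = np.zeros((512, 5))            # swapped-in classification layer

for step in range(200):               # train only the new top layer
    scores = feats @ W_new
    scores -= scores.max(axis=1, keepdims=True)        # numeric stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    dscores = probs
    dscores[np.arange(len(y_small)), y_small] -= 1     # softmax gradient
    W_new -= 0.5 * (feats.T @ dscores) / len(y_small)  # gradient step
~~~

With more data, the same idea extends to fine-tuning: instead of keeping the extractor frozen, you would backpropagate into some or all of its layers as well.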
+56
+00:03:37,959 --> 00:03:41,780
+optimization; there's a few things you
+have to worry about here. So we had some
+
+57
+00:03:41,780 --> 00:03:45,840
+projects last year where people proposed
+projects of training on very large data-
+
+58
+00:03:45,840 --> 00:03:51,889
+sets, and you just don't have the time, so
+be mindful of that, and yeah, you'll get a
+
+59
+00:03:51,889 --> 00:03:54,980
+better sense as we go through the class
+of what is or is not possible given
+
+60
+00:03:54,979 --> 00:03:59,949
+your compute constraints. OK, we're going
+to dive into the lecture; are there any
+
+61
+00:03:59,949 --> 00:04:02,780
+administrative things that I may have left
+out that you'd like to ask about?
+
+62
+00:04:02,780 --> 00:04:07,068
+OK good, so we're going to dive into the
+material, we have quite a bit of it today.
+
+63
+00:04:07,068 --> 00:04:12,138
+So just as a reminder of where we are
+right now, the setting we're in is
+
+64
+00:04:12,139 --> 00:04:13,189
+that of training neural
+
+65
+00:04:13,189 --> 00:04:16,750
+networks, and basically the four-step
+process of training a neural network is as
+
+66
+00:04:16,750 --> 00:04:21,589
+simple as 1, 2, 3, 4: you sample your data,
+so a batch of your data from a dataset,
+
+67
+00:04:21,589 --> 00:04:25,079
+you forward it through your network to
+compute the loss,
+
+68
+00:04:25,079 --> 00:04:29,339
+you backpropagate to compute your gradients, and
+then you do a parameter update where you tweak your
+
+69
+00:04:29,339 --> 00:04:33,529
+weights slightly in the direction of the
+gradients. And so you end up
+
+70
+00:04:33,529 --> 00:04:36,519
+repeating this process, and really what
+this comes down to is an optimization
+
+71
+00:04:36,519 --> 00:04:39,909
+problem where in weight space we're
+converging into areas of the weight space
+
+72
+00:04:39,910 --> 00:04:42,990
+where we have low loss, and that means we're
+correctly classifying our training set.
+
+73
+00:04:42,990 --> 00:04:48,590
+And we saw that these get very large, and I
+flashed this image of a Neural Turing Machine;
+
+74
+00:04:48,589 --> 00:04:51,589
+basically these are huge computational
+graphs and we need to do back-
+
+75
+00:04:51,589 --> 00:04:54,699
+propagation through them. And so we
+talked about the intuition of back-
+
+76
+00:04:54,699 --> 00:04:57,289
+propagation and the fact that it's
+really just a recursive application of
+
+77
+00:04:57,290 --> 00:05:01,220
+chain rule from the back of the circuit to the
+front, where we're chaining gradients
+
+78
+00:05:01,220 --> 00:05:05,110
+through all the local operations. We
+looked at some implementations of this
+
+79
+00:05:05,110 --> 00:05:10,350
+concretely, with the forward/backward
+API, both of the computational graph and
+
+80
+00:05:10,350 --> 00:05:14,379
+also in terms of its nodes, which also
+implement the same API and do forward
+
+81
+00:05:14,379 --> 00:05:18,750
+propagation and backward propagation. We
+looked at specific examples in Torch and
+
+82
+00:05:18,750 --> 00:05:22,199
+Caffe, and I drew this analogy that these
+are kind of like your Lego blocks;
+
+83
+00:05:22,199 --> 00:05:26,159
+these layers or gates are your little
+blocks from which you build out these
+
+84
+00:05:26,160 --> 00:05:30,280
+entire convolutional networks. Then we
+talked about neural networks, first
+
+85
+00:05:30,279 --> 00:05:33,329
+without the brain stuff, and basically
+what that amounts to is we're making
+
+86
+00:05:33,329 --> 00:05:37,990
+this function which goes from your image to class
+scores more complex, and then we looked
+
+87
+00:05:37,990 --> 00:05:41,800
+at neural networks from the brain stuff
perspective, where we have this crude model
+
+88
+00:05:41,800 --> 00:05:47,168
+of a neuron, and what we're doing is we're
+stacking these neurons into layers. OK, so
+
+89
+00:05:47,168 --> 00:05:49,370
+that's roughly what we're doing right
+now, and we're going to talk in this
+
+90
+00:05:49,370 --> 00:05:54,959
+class about this process of training
+neural networks effectively. OK, so we're
+
+91
+00:05:54,959 --> 00:05:58,049
+going to go into that. Before I dive into
+the details of it, I just wanted to kind
+
+92
+00:05:58,050 --> 00:06:02,280
+of pull out and give you a zoomed-out
+view, a bit of a history of how
+
+93
+00:06:02,279 --> 00:06:06,918
+this evolved over time. If you try to
+find where this field comes from,
+
+94
+00:06:06,918 --> 00:06:09,870
+where these were first proposed and so on,
+
+95
+00:06:09,870 --> 00:06:15,269
+you probably will go back to roughly
+the 1960s. Frank Rosenblatt in 1957 was
+
+96
+00:06:15,269 --> 00:06:18,899
+playing around with something called
+perceptrons, and the perceptron basically
+
+97
+00:06:18,899 --> 00:06:24,379
+ended up being this implementation
+in hardware; so unlike today, where
+
+98
+00:06:24,379 --> 00:06:28,269
+we just write code, they
+actually had to build these things out
+
+99
+00:06:28,269 --> 00:06:37,099
+of circuits and electronics in those
+times for the most part. And so the
+
+100
+00:06:37,100 --> 00:06:42,450
+perceptron roughly was this function
+here, and it looks very similar to what
+
+101
+00:06:42,449 --> 00:06:46,110
+we are familiar with, just written slightly
+differently, but the activation
+
+102
+00:06:46,110 --> 00:06:49,930
+function, which we're used to as a sigmoid,
+that activation function was actually a
+
+103
+00:06:49,930 --> 00:06:54,439
+step function; it was either 1 or 0, it was a
+binary step function. And so since this
+
+104
+00:06:54,439 --> 00:06:57,459
+is a binary step function, you'll notice
+that this is not a differentiable
+
+105
+00:06:57,459 --> 00:07:01,649
+operation, so they were not able to back-
+propagate through this; in fact the concept
+
+106
+00:07:01,649 --> 00:07:04,139
+of backpropagation for training
+neural networks had to come much later.
+
+107
+00:07:04,139 --> 00:07:08,169
+And so they had these binary
+stepwise functions in the perceptron, and they
+
+108
+00:07:08,170 --> 00:07:12,449
+came up with these learning rules, and so
+this is kind of an ad hoc, specified
+
+109
+00:07:12,449 --> 00:07:17,110
+learning rule that tweaked the weights
+to make the outcome from the
+
+110
+00:07:17,110 --> 00:07:22,240
+perceptron match the true
+desired outputs, but there was no
+
+111
+00:07:22,240 --> 00:07:25,490
+concept of a loss function, there was no
+concept of backpropagation; these are ad
+
+112
+00:07:25,490 --> 00:07:28,949
+hoc rules which, when you look at them,
+kind of almost do backprop, but
+
+113
+00:07:28,949 --> 00:07:32,779
+it's kind of funny because of the step
+function, which is not differentiable. And
+
+114
+00:07:32,779 --> 00:07:36,809
+then people started to stack these, so in
+1960, with the advent of ADALINE and
+
+115
+00:07:36,810 --> 00:07:42,110
+MADALINE by Widrow and Hoff, they started
+to take these perceptron-like things and
+
+116
+00:07:42,110 --> 00:07:46,470
+stack them into the first multilayer
+perceptron networks, and this was still
+
+117
+00:07:46,470 --> 00:07:51,980
+all done in electronics,
+actually building out the hardware,
+
+118
+00:07:51,980 --> 00:07:55,830
+but still there's no backpropagation at
this time; this was all of these rules
+
+119
+00:07:55,829 --> 00:07:59,060
+that they came up with in terms of, like,
+trying to flip things and
+
+120
+00:07:59,060 --> 00:08:02,949
+seeing if it works better or not, and
+there was no view of
+
+121
+00:08:02,949 --> 00:08:06,430
+backpropagation at this time. And so
+roughly in 1960 people got very
+
+122
+00:08:06,430 --> 00:08:09,560
+excited about building these circuits, and
+they thought that, you know, this could go
+
+123
+00:08:09,560 --> 00:08:12,930
+really far, we can have these circuits
+that learn. You have to remember that
+
+124
+00:08:12,930 --> 00:08:17,829
+back then the concept of programming was
+very explicit: you write a series of
+
+125
+00:08:17,829 --> 00:08:20,689
+instructions for a computer, and this is
+the first time that people were thinking
+
+126
+00:08:20,689 --> 00:08:24,379
+about this kind of data-driven approach
+where you have some kind of a circuit
+
+127
+00:08:24,379 --> 00:08:29,019
+that can learn, and so this was at the
+time a huge conceptual leap that people
+
+128
+00:08:29,019 --> 00:08:33,179
+were very excited about. But these networks
+did not actually end up working
+
+129
+00:08:33,179 --> 00:08:37,528
+very well right away; in the 1960s, for
+example, they got slightly overexcited
+
+130
+00:08:37,528 --> 00:08:41,088
+and overpromised and then slightly under-
+delivered, and so throughout the period
+
+131
+00:08:41,089 --> 00:08:45,660
+of the 1970s the
+field was actually very quiet and not much
+
+132
+00:08:45,659 --> 00:08:52,958
+research was done. The next boost
+actually came about roughly in 1986, and in
+
+133
+00:08:52,958 --> 00:08:57,179
+1986 there was this influential
+paper that is basically the first
+
+134
+00:08:57,179 --> 00:09:03,069
+time that you see backpropagation-like
+rules in a nicely presented format, and
+
+135
+00:09:03,070 --> 00:09:07,910
+so this is Rumelhart, Hinton and Williams,
+and they were playing with multilayer
+
+136
+00:09:07,909 --> 00:09:11,129
+perceptrons, and this is the first time,
+when you go to the paper, that we actually see
+
+137
+00:09:11,129 --> 00:09:13,879
+something that looks like back-
+propagation, and so at this point they
+
+138
+00:09:13,879 --> 00:09:17,830
+had already discarded this idea of ad hoc
+rules; they actually have a loss
+
+139
+00:09:17,830 --> 00:09:20,589
+function and talk about back-
+propagation, gradient descent and so on.
+
+140
+00:09:20,589 --> 00:09:25,390
+And so at this time people got excited
+again, in 1986, because they felt that
+
+141
+00:09:25,389 --> 00:09:30,610
+they now had a principled, nice credit
+assignment kind of scheme by
+
+142
+00:09:30,610 --> 00:09:35,000
+backpropagation, and they could train
+networks. The problem unfortunately was
+
+143
+00:09:35,000 --> 00:09:37,690
+that when they tried to scale up these
+networks to make them deeper or larger,
+
+144
+00:09:37,690 --> 00:09:41,089
+they didn't work very well compared to
+some of the other things that might be
+
+145
+00:09:41,089 --> 00:09:44,620
+in your machine learning toolkits, and so
+they just did not give very good
+
+146
+00:09:44,620 --> 00:09:49,339
+results at this time; training would
+get stuck, and the optimization was
+
+147
+00:09:49,339 --> 00:09:52,170
+basically not working very well,
+especially if you wanted to have large
+
+148
+00:09:52,169 --> 00:09:56,199
+networks. And this was the case for
+actually roughly twenty years, where
+
+149
+00:09:56,200 --> 00:09:58,940
+again there was less
research on neural
+networks, because somehow it wasn't
+
+150
+00:09:58,940 --> 00:10:04,370
+working very well and you couldn't train
+them. And in 2006 the research was
+
+151
+00:10:04,370 --> 00:10:08,440
+once again reinvigorated with a
+paper in Science by Hinton and
+
+152
+00:10:08,440 --> 00:10:14,190
+Ruslan Salakhutdinov, I can't quite say his
+name, but basically what they found here
+
+153
+00:10:14,190 --> 00:10:17,430
+was, this was roughly the first time we
+could actually have, like, a ten-layer
+
+154
+00:10:17,429 --> 00:10:22,549
+neural network that trains properly, and
+what they did was instead of training
+
+155
+00:10:22,549 --> 00:10:26,319
+all the layers, like 10 layers, by
+backpropagation in a single pass, they came
+
+156
+00:10:26,320 --> 00:10:29,230
+up with this unsupervised pre-training
+scheme using what's called restricted
+
+157
+00:10:29,230 --> 00:10:32,139
+Boltzmann machines, and so what this
+amounts to is you train your first layer
+
+158
+00:10:32,139 --> 00:10:35,860
+using an unsupervised objective and then
+you train your second layer on top of it,
+
+159
+00:10:35,860 --> 00:10:39,850
+and then third and fourth, and then once
+all of these are trained then you put
+
+160
+00:10:39,850 --> 00:10:42,959
+them all together and then you start
+backpropagation, then you start the
+
+161
+00:10:42,958 --> 00:10:46,479
+fine-tuning step. It was a two-step
+process: first pre-train
+
+162
+00:10:46,480 --> 00:10:49,860
+stepwise through the layers, and then we
+put them together, and then backpropagation
+
+163
+00:10:49,860 --> 00:10:53,459
+works. And so this was the first time that
+backpropagation
+
+164
+00:10:53,458 --> 00:10:56,250
+needed basically this initialization
+from the unsupervised pre-training;
+
+165
+00:10:56,250 --> 00:10:59,490
+otherwise these would not work
+from scratch, and we're going to see
+
+166
+00:10:59,490 --> 00:11:03,680
+why in this lecture; it's kind of tricky
+to get these deep networks to train
+
+167
+00:11:03,679 --> 00:11:07,769
+from scratch using just backprop, and you
+have to really think about it. And so it
+
+168
+00:11:07,769 --> 00:11:11,100
+turned out later that you actually don't
+need the unsupervised process, you can just
+
+169
+00:11:11,100 --> 00:11:14,199
+train with backprop right away, but you
+have to be very careful with
+
+170
+00:11:14,198 --> 00:11:18,109
+initialization, and they used sigmoid
+networks at this point, and sigmoids are
+
+171
+00:11:18,110 --> 00:11:23,389
+just not a great option to use; and so
+basically backprop works, but you have to
+
+172
+00:11:23,389 --> 00:11:29,250
+be careful in how you use it. And so this
+was in 2006, so a bit more research
+
+173
+00:11:29,250 --> 00:11:32,600
+kind of came back to the area, and it was
+rebranded as deep learning, but really
+
+174
+00:11:32,600 --> 00:11:39,610
+it's still neural networks, they're synonymous,
+it's just a better word. And
+
+175
+00:11:39,610 --> 00:11:43,990
+basically at this point things started to
+work properly and people could
+
+176
+00:11:43,990 --> 00:11:48,940
+actually train networks. Now, still not
+too many people paid attention, and when
+
+177
+00:11:48,940 --> 00:11:53,310
+people started to really pay attention was
+roughly, I think, around 2010 and 2012. So
+
+178
+00:11:53,309 --> 00:11:56,379
+specifically in 2010 there were the
+first really big results where neural
+
+179
+00:11:56,379 --> 00:11:59,669
+networks really worked really well
+compared to everything else that you had
+
+180
+00:11:59,669 --> 00:12:01,078
+in your machine learning toolkit,
+
+181
+00:12:01,078 --> 00:12:07,888
+kernels or SVMs and so on, and this
+was specifically the speech recognition
+
+182
+00:12:07,889 --> 00:12:12,839
+area, where they took this GMM-HMM
+framework and they swapped out one part
+
+183
+00:12:12,839 --> 00:12:17,800
+for a neural network, and that would
+give them huge improvements in 2010, and
+
+184
+00:12:17,799 --> 00:12:21,068
+this was work from Microsoft, and so
+people started to pay attention, because
+
+185
+00:12:21,068 --> 00:12:26,189
+this was the first time that networks
+really delivered large improvements,
+
+186
+00:12:26,190 --> 00:12:30,550
+and then we saw that again in 2012, where
+it played out even more dramatically in
+
+187
+00:12:30,549 --> 00:12:36,039
+the domain of visual recognition and
+computer vision, where basically we had
+
+188
+00:12:36,039 --> 00:12:44,448
+this 2012 network by Alex Krizhevsky
+and Hinton, and it basically crushed the
+
+189
+00:12:44,448 --> 00:12:48,719
+competition of all the feature-based approaches, and
+there was a really large improvement
+
+190
+00:12:48,720 --> 00:12:52,810
+from these neural networks that we
+witnessed, and that's when people really
+
+191
+00:12:52,809 --> 00:12:56,629
+started to pay attention, and since then
+the field has kind of exploded and
+
+192
+00:12:56,629 --> 00:12:58,370
+there's a lot of activity in this field now,
+
+193
+00:12:58,370 --> 00:13:03,110
+and so we'll go into the details, I think, a
+bit later, of possibly why it started
+
+194
+00:13:03,110 --> 00:13:04,589
+to work around 2010;
+
+195
+00:13:04,589 --> 00:13:08,860
+it's a combination of things, but I think
+it's that we figured out
+
+196
+00:13:08,860 --> 00:13:12,710
+better ways of initializing and getting
+these things to work, better activation
+
+197
+00:13:12,710 --> 00:13:16,690
+functions, and we had GPUs and we have
+much more data, and so really a lot of
+
+198
+00:13:16,690 --> 00:13:19,710
+the stuff before didn't quite work
+because it was just not there in terms
+
+199
+00:13:19,710 --> 00:13:26,028
+of compute and data, and some of the ideas
+just needed tweaking. And so that's the rough
+
+200
+00:13:26,028 --> 00:13:30,750
+historical setting: so we basically went
+through overpromising, underdelivering, over-
+
+201
+00:13:30,750 --> 00:13:34,700
+promising, underdelivering, and now it seems
+like things are actually starting to work
+
+202
+00:13:34,700 --> 00:13:37,028
+really well, and so that's where we are
+at this point.
+
+203
+00:13:37,028 --> 00:13:42,210
+OK, I'm going to dive into the specifics,
+and we'll see exactly how we actually
+
+204
+00:13:42,210 --> 00:13:45,550
+get networks to work and how we
+train them properly. So the overview of
+
+205
+00:13:45,549 --> 00:13:49,139
+what we're going to cover over the
+course of the next few lectures is a
+
+206
+00:13:49,139 --> 00:13:52,809
+whole bunch of independent things, so
+I'll just be peppering you with all
+
+207
+00:13:52,809 --> 00:13:55,989
+these little areas that we have to
+understand and see what people do in
+
+208
+00:13:55,990 --> 00:13:59,409
+practice, and we'll go through the pros
+and cons of all the trade-offs, and how you
+
+209
+00:13:59,409 --> 00:14:05,659
+actually properly train neural
+networks on real-world datasets. So the
+
+210
+00:14:05,659 --> 00:14:06,730
+first thing we're going to talk about is
+
+211
+00:14:06,730 --> 00:14:14,450
+activation functions, as I promised I
+think a lecture or so ago. So this is this
+
+212
+00:14:14,450 --> 00:14:19,320
+function at the top of the neuron, and we
+saw that it can have many different
that it can take many different forms. These are all different proposals for what the activation function can look like. We're going to go through some pros and cons and think about what the desirable properties of an activation function are.

Historically, the one that has been used the most is the sigmoid nonlinearity, which looks like this. It's basically a squashing function: it takes a real-valued number and squashes it to be between 0 and 1. The first problem with the sigmoid, as was pointed out a few lectures ago, is that saturated neurons, ones whose output is either very close to 0 or very close to 1, kill gradients during backpropagation. I'd like to expand on exactly what this means, because it contributes to something we're going to call the vanishing gradient problem. Look at a sigmoid gate in a circuit: it receives some input x, a signal sigma(x) comes out, and in backprop we get dL/dsigma and we'd like to backprop it through the sigmoid gate using the chain rule, so that we have dL/dx at the end. The chain rule basically tells us to multiply those two quantities. So think about what happens when this sigmoid gate receives an input of -10, or 0, or 10: it computes its value, it's getting some gradient from the top, and what happens to that gradient as you backprop through the circuit in each of these cases? Where is the possible problem? [A student answers.] Right, you're saying that the gradient is very low when x is -10 or 10. The way to see this is that we have this local gradient, dsigma/dx, that will be multiplied with the upstream gradient. When you're at -10, the local gradient is basically zero, because the slope at that point is zero, and the gradient at 10 will also be near zero. So the issue is that your gradient drops in from above, but if you're in the saturated regime, basically outputting 0 or 1, the gradient gets killed: it's multiplied by a very tiny number, and gradient flow stops at the sigmoid neuron. You can imagine that if you have a large network of sigmoid neurons, and many of them are in a saturated regime where they're either 0 or 1, then gradients can't backpropagate through the network, because they'll be stopped. Gradients only flow if you're in the safe zone, what we call the active region of the sigmoid. So that's kind of a problem; we'll see more about this soon.

Another problem with sigmoids is that they are not zero-centered. We'll talk about preprocessing soon, but when you process your data you always want to make sure it's zero-centered. In this case, suppose you have a big network of several layers of sigmoid neurons outputting these non-zero-centered values between 0 and 1, feeding into more linear classifiers stacked on top of each other. Here's roughly the problem with non-zero-centered outputs; I'll just try to give you a bit of intuition on what goes wrong. Consider a neuron that computes f = w^T x + b, and think about what you can say about the gradients on w during backpropagation if your x's are all positive, in this case between 0 and 1; maybe this is a neuron somewhere deep in the network. What can you say about the gradients on the weights if all the x's are positive numbers? [A student answers.] Right, they're constrained: the gradients on w are either all positive or all negative. That's because the gradient flows in from the top, and if you look at the expression for the w gradients, they're basically x times the upstream gradient. So if the gradient on the output of the neuron is positive, then all your w gradients will be positive, and vice versa. Suppose you have just two weights: as you compute gradients on the weights, they're either all positive or all negative, so you're constrained in the kind of update you can make, and you end up with an undesirable zigzagging path if you want to reach parts of weight space outside those two quadrants. This is a slightly hand-wavy argument, but just to give you intuition; you can also see it empirically: when you train with things that are not zero-centered, you observe slower convergence, and this is a bit of a hand-wavy reason for why that might happen. If you actually want to go much deeper into this you can, and people have written about it, but then you have to reason about Fisher matrices and natural gradients, and it gets more complex. I just wanted to give you the intuition that you want zero-centered things at the input and zero-centered things throughout the network, and everything trains more nicely. So that's another downside of the sigmoid neuron.

The last one is that the exp() function inside this expression is kind of expensive to compute compared to some of the alternative nonlinearities. It's just a small detail: when you actually train these large convolutional networks, most of the compute time is in the convolutions and dot products, not in this exp, so it's a vanishingly small contribution, but it's still a bit of a downside compared to the other choices. I'll take a few questions in a moment.

The tanh is an attempt to fix one of these problems, in particular the fact that the sigmoid is not zero-centered. LeCun, in 1991, wrote a very nice paper on how you optimize neural networks (I link to it from the syllabus) and recommended that people use the tanh nonlinearity instead. A tanh is basically like two sigmoids put together: you end up with outputs between -1 and 1, so you end up zero-centered, but otherwise you still suffer from the other problems; for example, you still have these regions where, if you get saturated, no gradients flow. So we haven't really fixed that at this point, but the tanh is, I think, strictly preferred to the sigmoid, because it has all the same problems except for the centering one.
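To make the saturation point concrete, here is a minimal numpy sketch (not from the lecture, just an illustration) that evaluates the sigmoid's local gradient, sigma'(x) = sigma(x) * (1 - sigma(x)), at the three inputs discussed above:

~~~python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Local gradient of the sigmoid: d(sigma)/dx = sigma(x) * (1 - sigma(x)).
# Whatever gradient arrives from above gets multiplied by this number.
for x in [-10.0, 0.0, 10.0]:
    s = sigmoid(x)
    print('x = %+5.1f   sigma = %.6f   local grad = %.6f' % (x, s, s * (1 - s)))

# x = -10 or +10: local grad is ~0.000045, so the neuron kills the gradient.
# x =   0: local grad is 0.25, its maximum, so even the active region
#          shrinks the gradient a little at every layer.
~~~

The tanh behaves the same way in its tails; only the zero-centering differs.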
Then, around 2012, in the paper by Krizhevsky et al., the first big convolutional networks paper, they noticed that this nonlinearity, max(0, x), used instead of the sigmoid or tanh, just makes your networks converge much quicker; in their experiments, roughly by a factor of 6. We can go back and try to think about why this is. You can see that it works better in practice, but explaining it is not as easy; here are some of the reasons people have hypothesized for why it works so much better. One is that this neuron does not saturate, at least in the positive region, so at least in that region you don't have the vanishing gradient problem where your gradients just die. With the sigmoid or tanh, the neurons are only active in a small region that is bounded from both sides, but these neurons are active, in the sense of backpropagating correctly, in at least half of their input region. They're also much more computationally efficient: you're just thresholding. And experimentally, you can see that networks just converge much, much faster. This is called the ReLU, the rectified linear unit. It was pointed out in that paper for the first time that this works much better, and it's kind of the default recommendation for what you should use at this point.

At the same time, there are several problems with the ReLU neuron. One, again, is that its output is not zero-centered, so it's not completely ideal, perhaps. A slight annoyance of the ReLU neuron that we can talk through is: what happens when the input is negative, say -10? What happens during backpropagation if a ReLU does not become active in the forward pass, if it stays inactive, then during backprop? [A student answers.] It kills the gradient, right. The way to see this is, again, the same picture: if the input was negative, say -10, then your local gradient will just be zero; there's identically zero gradient there. It's not that you squish the gradient down, you actually kill it completely. So a neuron that does not activate will not backpropagate downwards: its weights will not be updated, and nothing happens below it, at least for its contribution. And if the input was, say, +10, the local gradient is just 1, so it just passes the gradient through. So a ReLU acts kind of like a gradient gate: if its input was positive, it passes the gradient through, otherwise it kills it.

By the way, what happens when x is 0? What is your gradient at that point? It's actually undefined, that's right: the gradient does not exist at that point. Whenever I talk about gradients, just assume I always mean a subgradient, which is a generalization of the gradient to functions that are sometimes not differentiable. Here the limit does not exist, but there's a whole set of subgradients, which could be 0 or 1, and that's what we use in practice. This distinction doesn't really matter too much. The same comes up in the case of, say, max(x, y): someone asked what happens if x and y are equal, and in that case you also have a kink in the function and it's not differentiable, but in practice these things don't really matter; just pick one, so your gradient is 0 or 1 there, and things will work just fine. That's roughly because it's very unlikely that you end up exactly at those points.

OK, so here's the issue with ReLU that comes up in practice. If you train with ReLU units, one thing you have to be aware of is that neurons that don't output anything won't get any gradient and won't get any update. So what's the issue? Suppose that when you initialize your ReLU neurons, you initialize them in a not very lucky way. Say this is your data cloud of inputs to your ReLU neuron; what you can end up with is what we call a dead ReLU. If this neuron only activates in a region outside of your data cloud, then this dead ReLU will never become activated, and then it will never update. This can happen in one of two ways. Either during initialization you were really, really unlucky, and you happened to sample weights for a ReLU neuron in such a way that the neuron will never turn on; in that case the neuron will never train. But more often it happens during training: if your learning rate is high, think about these neurons as jittering around, and sometimes, by chance, they just get knocked off the data manifold. When that happens, they will never get activated again, and they will not come back to the data manifold. You actually see this in practice: sometimes you train a big neural net with ReLUs, it seems to work fine, and then you stop the training, pass your entire training dataset through the network, and look at the statistics of every single neuron, and what you can find is that as much as 10 or 20 percent of your network is dead: neurons that never turned on for anything in the training data. This can actually happen, and usually it's because your learning rate was high. Those are just dead parts of your network. You can concoct hacky schemes for reinitializing these neurons and so on; people don't usually do that much, but it's something to be aware of, and it's a problem with this nonlinearity.

Because of this dead ReLU problem, especially at initialization, what people like to do is, instead of initializing the biases with zero, initialize them with slightly positive numbers like 0.01, because that makes it more likely that at initialization these ReLU neurons output positive numbers and get updates; it makes it less likely that a neuron just never becomes activated throughout training. But I should say this is actually a bit of a controversial point: some people claim it helps, and some people say it doesn't help at all. So it's just something to think about. Any questions at this point? Otherwise we'll go into some of the other variants.

OK, so let's look at how people have tried to fix ReLUs. One issue with ReLUs is that these dead neurons are not ideal, so here's one proposal, called the Leaky ReLU. The idea of the Leaky ReLU is basically: we want the kink, we want the piecewise linearity, and we want the efficiency, but the issue is that in the negative region your gradients die. So instead, let's make the function slightly sloped in that region. You end up with this function, and that's called a Leaky ReLU. Some people have shown that this works slightly better, since you don't have the issue of neurons dying, but I think it's not completely established that it always works better. Then some people played with this even more: right now this slope is 0.01, but it can actually be an arbitrary parameter, and then you get something called a parametric rectifier, or PReLU. The idea here is that this 0.01 becomes a parameter, alpha, in your network, and it can be learned: you can backprop into it, so each of these neurons can basically
choose what slope to have in its negative region. They can become a plain ReLU if they want to, or they can become leaky; every neuron roughly has the choice. These are the kinds of things people play with when they try to design a good nonlinearity. [A student asks how alpha is handled.] It just goes into your computational graph in a very normal way; every neuron will have its own alpha, just like it has its own bias. Go ahead. [A student asks what happens if alpha becomes one.] If alpha is one, then you get an identity, and that's probably not something that backpropagation will want, in the sense that if it were an identity it wouldn't be computationally useful, so you might expect that backpropagation should not actually take you to those regions of the space. If I remember correctly, there's no place in the paper where people really worried about that too much, but I could be wrong; I read the paper a while ago, and I don't use these too much in my own work.

So these are different schemes for fixing the dead ReLU neurons. There's another paper that only came out roughly two months ago, which gives you a sense of how new this field is: there are papers coming out just two months ago trying to propose new activation functions. One of them is the exponential linear unit, the ELU, which I mention just to give you an idea of what people play with. It tries to keep all the benefits of the ReLU but also get rid of the downside of being non-zero-centered. What they end up with is this blue function here, which looks like a ReLU, but in the negative region it doesn't just go to zero, and it doesn't just go down like a leak; it has this funny shape, and there are two pages of math in the paper partly justifying why you want that. Roughly, when you do this, you end up with zero-mean outputs, and they claim that this trains better; I think there's some controversy about that. So we're basically all still trying to figure this out; it's an active area of research, and we're not sure what to do yet, but ReLUs are right now a safe recommendation, if you're careful with them.
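Here's a minimal numpy sketch of the whole family we just discussed, just to make the shapes concrete (the 0.01 and 1.0 defaults are the common choices mentioned above, not anything mandated):

~~~python
import numpy as np

def relu(x):
    # kills both the output and the gradient for x < 0
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # small fixed slope in the negative region, so gradients never die
    return np.where(x > 0, x, alpha * x)

# PReLU uses the same formula as leaky_relu, except alpha is a learned
# parameter (typically one per neuron) that you backprop into.

def elu(x, alpha=1.0):
    # smooth negative region; outputs end up roughly zero-mean
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.linspace(-3, 3, 7)
print(relu(x))
print(leaky_relu(x))
print(elu(x))
~~~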
So that's the ReLU family. One more I'd like to mention, because it's relatively common and you'll see it if you read about networks, is the maxout neuron, from Goodfellow et al. This one is very different: it's not just an activation function that looks different, it actually changes what the neuron computes. It doesn't just have the form w^T x + b; it has two sets of weights, and it computes max(w1^T x + b1, w2^T x + b2). So you end up with these two hyperplanes that you take a max over, and that's what the neuron computes. You can see there are many ways of playing with these activation functions. This one doesn't have some of the downsides we discussed, since it doesn't die, it's still piecewise linear, and it's still efficient, but now every single neuron has two sets of weights, so you've doubled the number of parameters per neuron, and maybe that's not ideal. Some people use this, but I would say it's not super common; ReLUs are still the most common.
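As a one-liner, the maxout unit looks something like this (the shapes here are hypothetical, just for illustration):

~~~python
import numpy as np

D, H = 4, 3                       # hypothetical input / output sizes
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((H, D)), np.zeros(H)
W2, b2 = rng.standard_normal((H, D)), np.zeros(H)
x = rng.standard_normal(D)

# Elementwise max over two affine pieces: twice the parameters of a
# plain neuron, but no region where the unit can die.
h = np.maximum(W1 @ x + b1, W2 @ x + b2)
print(h)
~~~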
[A student asks whether the two weight vectors end up the same.] No, the two W's will be different; you end up with different weights for sure. [Another student asks why some of these work better.] It's complicated. A lot of the optimization process is not just about the loss function but about the dynamics of the backward flow of gradients, and we'll see a bit about that in next week's lectures; you have to really think about it dynamically, more than just the loss landscape, so it's complex. Also, you're specifically using stochastic gradient descent, which has a particular form, and some nonlinearities play more nicely with the fact that the update is tied into all of this as well. It's all interacting: the choice of activation function and the choice of update are kind of coupled, and it's very unclear what you're actually optimizing, so it's a complex thing.

So the takeaway here is: use ReLU. You can try out these other variants, but you shouldn't expect too much from them; I don't think people use them too much right now. And don't use the sigmoid, because the tanh is basically just strictly better, so you won't see people using sigmoids anymore. Of course, we do use sigmoids in things like long short-term memory units, LSTMs, and so on, which we'll go into a bit later with recurrent neural networks, but there are specific reasons why we use them there, which we'll see later in class; they're used differently than in what we've covered so far, this basic fully-connected sandwich of matrix multiplies and nonlinearities. OK, so that's everything I wanted to say about activation functions. Basically, this is one of the primary things we worry about; there's active research on it and we haven't fully figured it out, and there are pros and cons, many of which come down to thinking about how the gradient flows through your network. We discussed issues like dead ReLUs, and you really need to understand gradient flow if you try to debug your networks and want to know what's going on.

Next, let's look at data preprocessing, very briefly. Suppose you have a cloud of original data, two-dimensional here. It's very common to zero-center your data, which just means that along every single feature you subtract the mean. When you go through the machine learning literature, people sometimes also normalize the data: in every single dimension you normalize, say, by the standard deviation, or you standardize so that the min and max are within some range, and so on; there are several schemes for doing this. In images it's not as common, because you don't have separate features that could be in different units; everything is just pixels, all bounded between 0 and 255, so it's not as common to normalize the data, but it's very common to zero-center it. You can go further: normally in machine learning, your data has some covariance structure by default, and you can make that covariance matrix diagonal, for example by applying PCA, or you can go even further and whiten your data, which means that after applying PCA you also squish your data so that the covariance matrix becomes the identity. That's another form of preprocessing you'll see people talk about. I go into both of these in much more detail in the class notes on PCA and whitening, but I don't want to go into too many details here, because it turns out that in images we don't actually end up using these, even though they're common elsewhere in machine learning. For images specifically, what's common is just mean centering, and then a particular variant of mean centering that is slightly more convenient in practice. So, for mean centering, say we have 32x32x3 images, so CIFAR-10:
if you want to center your data, then for every single pixel you compute its mean over the training data and subtract that out. What you end up with is this mean image, which itself has dimensions 32x32x3. That mean image, for example for this kind of image data, is just an orange-ish blob, and you subtract it from every single image to center your data, so that you get better training dynamics. The other form, which is slightly more convenient, is subtracting just a per-channel mean: you go into the red, green, and blue channels, compute the mean across all of space, and you end up with basically three numbers, the means of the red, green, and blue channels, and you subtract those out. Some networks use that instead. So those are the two common schemes, and the second one is more convenient because you only have to worry about those three numbers; you don't have to carry around a giant array of means everywhere when you actually deploy this.

There's not too much more I want to say about this: basically, subtract the mean; in computer vision applications things don't get much more complex than that. In particular, PCA and whitening used to be slightly more common, but you can't apply them to entire images, because images are very high-dimensional objects with lots of pixels, so these covariance matrices would be huge. People tried things like whitening only locally, sliding a whitening filter through the image spatially. That used to be done several years ago, but it's not common now; it doesn't seem to matter too much.
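In code, the two common schemes look roughly like this (the array X here is a hypothetical stand-in for a real training set such as CIFAR-10):

~~~python
import numpy as np

# Hypothetical stand-in for a training set: N images of 32x32x3.
X = np.random.rand(5000, 32, 32, 3).astype(np.float32)

# Scheme 1: subtract the mean image (a full 32x32x3 array).
mean_image = X.mean(axis=0)
X_centered = X - mean_image

# Scheme 2: subtract a per-channel mean (just three numbers).
mean_rgb = X.mean(axis=(0, 1, 2))   # shape (3,)
X_centered = X - mean_rgb           # broadcasts over N, H, W
~~~

In either case the mean is computed once on the training set and reused as-is at validation and test time.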
OK, on to weight initialization, a very, very important topic. One of the reasons that I think early neural networks didn't work as well is that people were not careful enough with this. One of the first things we'll look at is how not to do initialization. In particular, you might be tempted to just say: OK, let's start with all the weights equal to zero, say in a 10-layer neural network where you set every weight to zero. Why doesn't that work? Why isn't that a good idea? Go ahead. [A student answers.] Right: basically, all your neurons compute the same thing, and in backprop they all behave the same way, so there's nothing to, what do we call it, break the symmetry. All the neurons compute the same outputs, they compute the same gradients, and so on, so that's not the best.

Instead, what people use is small random numbers. One relatively common way to do that is to sample from a Gaussian with zero mean and 0.01 standard deviation: small random numbers, and that's how you initialize your W matrix. Now, the issue with this initialization is that it works OK, but you'll find that it only works OK if you have small networks. As you start to go deeper and deeper, you have to be much more careful about the initialization, and I'd like to go into exactly what breaks, how it breaks, and why it breaks when you try these naive initialization strategies with deep networks. So let's look at what goes wrong. What I've written here is a small notebook, and we'll step through it briefly. I'm sampling a dataset of 1,000 points that are 500-dimensional, and then I'm creating a whole bunch of hidden layers and nonlinearities; say right now we have 10 layers of 500 units each, and we're using tanh. Then I'm basically taking unit Gaussian data and forward-propagating it through the network with this particular initialization strategy, which right now is the one from the previous slide: sample from a Gaussian, scaled by 0.01. So I'm forward-propagating through this network, which is made up of a series of layers of the same size, 10 layers of 500 units, and what I want to look at is what happens to the statistics of the hidden neurons' activations throughout the network with this initialization. Specifically, we'll look at the mean and the standard deviation, we'll plot them, and we'll plot the histograms: we push all the data through, and then, say at the fifth layer, we look at what values the activations took on inside the fifth, or sixth, or seventh layer, and we make histograms of those.
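The notebook itself isn't reproduced in the captions, but the experiment is simple enough that a minimal sketch of it looks something like this (layer sizes and the 0.01 scale as described above):

~~~python
import numpy as np

X = np.random.randn(1000, 500)        # 1,000 unit-Gaussian points, 500-D
H = X
for layer in range(10):               # 10 tanh layers of 500 units each
    W = 0.01 * np.random.randn(500, 500)   # the naive initialization
    H = np.tanh(H @ W)
    print('layer %2d: mean %+.6f, std %.6f' % (layer + 1, H.mean(), H.std()))
# The printed std shrinks layer by layer and collapses toward zero.
~~~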
With this initialization, if you run the experiment, it ends up looking as follows. Here I'm printing it out: we start off with a mean of zero and a standard deviation of one; that's our data. Now I'm forward-propagating, and as I go toward the tenth layer, look at the mean: we're using tanh, which is symmetric, so as you might expect, the mean stays around zero. But look at what happens to the standard deviation: it starts at 1, drops to 0.2, then 0.04, and it just plummets down to zero; the standard deviation of these neurons' activations goes to zero. Looking at the histograms, at the first layer the histogram is reasonable: a spread of numbers between -1 and 1. And then it just collapses to a tight distribution at exactly zero. So what ends up happening with this initialization in a deep network is that all the tanh neurons end up outputting zero: at the last layers these are tiny numbers near zero, and all the activations basically become zero.

Why is this an issue? Think about what happens to the dynamics of the backward pass, to the gradients, when you have tiny numbers in the activations: your x's are tiny numbers in the last few layers. What do the gradients on the weights in those layers look like, and what happens in the backward pass? First of all, suppose there's a layer here that looks at some layer before it, and almost all of its inputs are tiny numbers. What do you expect the gradients on W to be in that case, for those layers? [A student answers: very small.] And why would they be very small? Because the gradient on W is equal to x times the gradient from the top. So if your x's are tiny numbers, then your gradients on W are tiny numbers as well, and those layers will accumulate almost no gradient.

We can also look at what happens with these weight matrices themselves. We took data that was distributed as a unit Gaussian at the beginning, and we kept multiplying it by W and applying the activation function, and we saw that basically everything goes to zero; it collapses over time. Now think about the backward pass: as we chain the gradient through these layers in backpropagation, part of the gradient forks off into the gradient on W, but as we backpropagate into the x's, what we end up doing on the way down is multiplying by W again and again at every single layer. If you take unit Gaussian data and multiply it by W at this 0.01 scale, you saw that everything goes to zero, and the same thing happens in the backward pass, where we're successively multiplying by W as we backprop into x at every single layer. So this gradient, which started off with reasonable numbers from your loss function, ends up going toward zero as you keep doing this, and you end up with gradients that are basically tiny, tiny numbers: very, very low gradients throughout the network. This is what we refer to as a vanishing gradient: as the gradient travels down through the network with this particular initialization, its magnitude just keeps shrinking.

So we can try the other extreme: instead of scaling by 0.01, we can try a different scale for the W matrix at initialization. Suppose I try 1.0. You'll see another funny thing happen, because now we've overshot the other way. Maybe it's best to look at the distributions here: you can see that everything is completely saturated; these tanh units are either all -1 or all 1. The distributions show that everything is super-saturated: your entire network of neurons, throughout the network, outputs either -1 or 1, because the weights are too large; the scores that go into the nonlinearity are just very large, so everything is super-saturated. So what do the gradients flowing through your network look like? Just terrible, a complete disaster: basically zeros everywhere, exponentially so, and you die. You can train for a very long time, and what you'll see when this happens is that your loss just doesn't move at all, because nothing is backpropagating: all the neurons are saturated and nothing is being updated. So this initialization scale, as you might expect, is actually super tricky to set; in this particular case it needs to be
somewhere between 0.01 and 1.0. Now, you can be slightly more principled instead of just trying different values, and there are papers written on this. For example, in 2010 there was a proposal for what we now call Xavier initialization, from Glorot et al. They went through the expression for the variance of your neurons, and you can work it out and derive a specific initialization strategy for how to scale your weights. So I don't have to try 0.01, and I don't have to try 1 or whatever else: they recommend this kind of initialization, where we divide by the square root of the number of inputs to every single neuron. If you have lots of inputs, you end up with smaller weights, and intuitively that makes sense: more stuff goes into your weighted sum, so you want less of an interaction with each input. If you have a small number of units feeding into your layer, you want larger weights, because there are only a few of them and you still want to end up with a variance of one. To back up a bit, the idea here is that they were looking at a single neuron with no activation function, just a linear neuron, and all they're saying is: if your data comes in as input and you'd like this linear neuron to have an output variance of one, then you should initialize your weights with this amount. In the notes I go into exactly how this is derived; it's just a couple of lines of manipulating variances. So this is a reasonable initialization, and if I use it here, the distributions end up being more sensible: again looking at the histograms between -1 and 1 of these tanh units, you get more sensible numbers, and you're actually within the active region of all these tanh units. So you can expect this to be a much better initialization: things are in the active regions, and things will train right from the start; nothing is super-saturated in the beginning. The reason this doesn't come out perfectly, the reason the distributions still contract as you go down, is that this paper doesn't take into account the nonlinearities, in this case the tanh. The tanh nonlinearity ends up deforming the statistics of your variance as you go through, so even if you start it off right, it still does something to the distribution; in this case the standard deviation still goes down, but not as dramatically as when you set the scale by trial and error. So this is a reasonable initialization to use in neural networks compared to just setting the scale to 0.01, and people do end up using it in practice.

So this works in the tanh case; it does something reasonable. It turns out that if you try to put it into a rectified-linear-unit network, it doesn't work as well, and the shrinking of the distributions is much more rapid. Looking at a ReLU network: the first layer has some distribution, and then, as you can see, the distributions get more and more peaky at zero; more and more neurons are putting out zeros with this initialization. So using the Xavier initialization in a ReLU network does not do good things. Again, thinking about that paper: they don't account for the nonlinearities. These ReLU neurons compute the weighted sum, which is within the paper's framework, but then you apply the ReLU on top, which they don't model: you kill half of the distribution, you set it to zero, and intuitively that roughly halves the variance of your outputs. So it turns out, and this was proposed in a paper just last year, in fact, that someone said: look, there's a factor of two you're not accounting for, because these ReLU neurons effectively halve the variance each time; you take your Gaussian inputs, put them through the nonlinearity, and you get half the variance out, so you have to account for it with a factor of two. When you do that, you get proper distributions, specifically for ReLU neurons. So with this initialization, if you're using ReLU nets, you have to worry about that extra factor of two, and then everything comes out nicely; you won't get this factor of two that keeps compounding and screws up your activations exponentially. So basically, this is tricky, tricky stuff, and it really matters in practice: in their paper, for example, they compare having the factor of two versus not having it, and it matters a lot once your networks get really deep.
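In numpy, the two recipes for a layer with fan_in inputs look like this (fan_in and fan_out are just the layer's input and output sizes):

~~~python
import numpy as np

fan_in, fan_out = 500, 500

# Xavier/Glorot (2010): aims for unit output variance for a linear neuron.
W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

# He et al. (2015): the extra factor of 2 that compensates for the ReLU
# zeroing out half of the distribution.
W_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
~~~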
In this case I think they had a few dozen layers: if you account for the factor of two, you converge; if you don't, it does nothing, the loss just stays flat. OK, so: very important stuff. You really need to think it through and be careful with initialization; if it's set incorrectly, bad things happen. Specifically, in the case where you have a network with ReLU units, there is basically a correct answer to use, and it's this initialization from Kaiming He et al. This is partly why these networks didn't work for a long time: I think people just didn't fully appreciate how difficult this was to get right and how tricky it is. And I'd like to point out that proper initialization is basically an active area of research; papers are still being published on this, a large number of papers proposing different ways of initializing your networks. The last few are interesting as well, because they don't give you a formula for initializing: they have these data-driven ways of initializing networks. You take a batch of data, you forward it through your network, which is now an arbitrary network, and you look at the variances at every single point in the network. Intuitively, you don't want your variances to go to zero, and you don't want them to explode; you want everything to be roughly, say, unit Gaussian throughout your network. So they iteratively scale the weights in your network so that you have activations roughly on that order everywhere. So there are also data-driven techniques, a whole line of work on how to properly initialize.
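One way such a data-driven scheme can look, as a rough sketch of the idea just described (assuming a tanh network), not any specific paper's algorithm:

~~~python
import numpy as np

def rescale_weights(Ws, X, iters=5):
    """Nudge each layer's weight scale so that activations on a real
    batch X have roughly unit std. A sketch of the data-driven idea
    described above, not a specific published algorithm."""
    H = X
    for W in Ws:
        for _ in range(iters):
            std = np.tanh(H @ W).std()
            W *= 1.0 / std            # rescale toward unit std
        H = np.tanh(H @ W)
    return Ws
~~~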
vision and it was only +proposed last year and so i cant even + +717 +00:51:26,630 --> 00:51:30,809 +covered this last year in this class but +now I can actually helps a lot + +718 +00:51:30,809 --> 00:51:37,119 +ok and the basic idea maximization paper +is ok you want roughly unit gotten + +719 +00:51:37,119 --> 00:51:42,039 +activations in every single part of your +network and so just just do that just + +720 +00:51:42,039 --> 00:51:46,369 +just make them you know caution ok you +can do that because making something + +721 +00:51:46,369 --> 00:51:50,720 +unit caution is a completely different +function and so it's ok you can + +722 +00:51:50,719 --> 00:51:54,980 +propagate through it and see what they +do is you taking me back from your data + +723 +00:51:54,980 --> 00:51:57,480 +and you're picking through your network +we're going to meet + +724 +00:51:57,480 --> 00:52:00,900 +inserting these specialization layers +into your network and the best + +725 +00:52:00,900 --> 00:52:06,400 +normalization layers they take your +input X and they make sure that every + +726 +00:52:06,400 --> 00:52:10,420 +single feature dimension across the +batch you have unit gushing activations + +727 +00:52:10,420 --> 00:52:15,909 +so he had a batch of hundred examples +going through the network maybe this is + +728 +00:52:15,909 --> 00:52:19,779 +a good example here is even better +activation so many things in your money + +729 +00:52:19,780 --> 00:52:25,530 +back and have D features or deactivation +of neurons that are at some point some + +730 +00:52:25,530 --> 00:52:28,869 +part and this is an input your back +later + +731 +00:52:28,869 --> 00:52:32,550 +so this is a major subjects of +activations and nationalization + +732 +00:52:32,550 --> 00:52:39,390 +effectively evaluate the empirical mean +and variance along every single feature + +733 +00:52:39,389 --> 00:52:44,989 +and it just divided by it so whatever +your ex was just make sure that every + +734 +00:52:44,989 --> 00:52:49,088 +single column here has unit is a +Univision and so that's a perfectly + +735 +00:52:49,088 --> 00:52:54,219 +differentiable function and just applies +it at every single feature or activation + +736 +00:52:54,219 --> 00:53:02,818 +independently across the batch so you +can do that turns out to be a very good + +737 +00:53:02,818 --> 00:53:08,548 +idea now one problem with this team so +this is the way this will work as well + +738 +00:53:08,548 --> 00:53:11,670 +have normally we have followed by +nonlinearity + +739 +00:53:11,670 --> 00:53:15,900 +party network of this now we're going to +be inserting these nationalization + +740 +00:53:15,900 --> 00:53:19,670 +layers right after political heirs or +equivalently after convolutional layers + +741 +00:53:19,670 --> 00:53:24,490 +as well CCNA with commercial networks +and basically we can start them there + +742 +00:53:24,489 --> 00:53:28,159 +and they make sure that everything is +gushing at every single step of the + +743 +00:53:28,159 --> 00:53:30,190 +network because we just make it so + +744 +00:53:30,190 --> 00:53:36,500 +and one problem I think up with this +this is that it seems like a unnecessary + +745 +00:53:36,500 --> 00:53:41,088 +constraint so when you put it back here +after that the outputs will definitely + +746 +00:53:41,088 --> 00:53:45,389 +be gushing because you normalize them +but it's not clear that 10 H actually + +747 +00:53:45,389 --> 00:53:50,288 +once to recede unit caution inputs so if +you think about the the form of 10 H it + +748 +00:53:50,289 
+744
+00:53:30,190 --> 00:53:36,500
+And one problem you might think of with this is that it seems like an unnecessary
+
+745
+00:53:36,500 --> 00:53:41,088
+constraint: when you put a batch norm layer here, the outputs after it will definitely
+
+746
+00:53:41,088 --> 00:53:45,389
+be Gaussian, because you normalized them, but it's not clear that tanh actually
+
+747
+00:53:45,389 --> 00:53:50,288
+wants to receive unit Gaussian inputs: if you think about the form of tanh, it
+
+748
+00:53:50,289 --> 00:53:54,450
+has a specific scale to it, and it's not clear that the network wants to
+
+749
+00:53:54,449 --> 00:53:59,730
+have this hard constraint of making sure that the outputs are exactly unit Gaussian
+
+750
+00:53:59,730 --> 00:54:06,009
+before the tanh, because you'd like the network to pick whether it wants your tanh
+
+751
+00:54:06,009 --> 00:54:10,429
+outputs to be more or less diffuse, more or less saturated, and right now it
+
+752
+00:54:10,429 --> 00:54:14,268
+wouldn't be able to. So a small patch on top of this, and this is the second part of
+
+753
+00:54:14,268 --> 00:54:19,429
+batch normalization: not only do we normalize X, but after normalization we allow the network
+
+754
+00:54:19,429 --> 00:54:25,068
+to scale by gamma and shift by beta, for every single feature, and so this allows
+
+755
+00:54:25,068 --> 00:54:28,358
+the network to do that. And these are parameters: gamma and beta here are
+
+756
+00:54:28,358 --> 00:54:33,869
+parameters that we're going to backpropagate into, and they just allow the
+
+757
+00:54:33,869 --> 00:54:38,690
+network, after you've normalized to unit Gaussian, they allow this blob to shift
+
+758
+00:54:38,690 --> 00:54:44,108
+and scale if the network wants to. And so we initialize these presumably with one and zero, or
+
+759
+00:54:44,108 --> 00:54:48,250
+something like that, and then the network can choose to adjust them; and by
+
+760
+00:54:48,250 --> 00:54:51,239
+adjusting these, you can imagine that once we feed into tanh,
+
+761
+00:54:51,239 --> 00:54:54,719
+the network can choose, through the backprop signal, to make it more or
+
+762
+00:54:54,719 --> 00:54:58,618
+less peaky or saturated in whatever way it wants, but you're not going to get into
+
+763
+00:54:58,619 --> 00:55:01,910
+this trouble where things just completely die or explode in the
+
+764
+00:55:01,909 --> 00:55:06,359
+beginning of optimization, and so things will train right away, and then back-
+
+765
+00:55:06,360 --> 00:55:10,579
+propagation can take over and can fine-tune it over time. And now one more
+
+766
+00:55:10,579 --> 00:55:16,170
+important feature is that if you set these gammas and betas - if you train them with
+
+767
+00:55:16,170 --> 00:55:20,230
+backpropagation - it can happen that they end up taking on the empirical variance and
+
+768
+00:55:20,230 --> 00:55:24,829
+mean, and you can see that basically the network then has the capacity to undo the
+
+769
+00:55:24,829 --> 00:55:30,519
+batch normalization: this part can learn to undo that part, and so that's why batch
+
+770
+00:55:30,519 --> 00:55:34,059
+normalization can act as an identity function, or can learn to be an
+
+771
+00:55:34,059 --> 00:55:37,599
+identity, whereas before it couldn't. And so when you have these batch norm
+
+772
+00:55:37,599 --> 00:55:42,460
+layers in there, the network can, through backpropagation, learn to take it out, or
+
+773
+00:55:42,460 --> 00:55:45,110
+it can learn to take advantage of it if it finds it helpful;
+
+774
+00:55:45,110 --> 00:55:51,010
+through the backprop this will kind of work out, so that's just a nice point to have.
+
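Continuing the same sketch, the learnable scale and shift described here might look like this:

~~~python
gamma = np.ones(D)   # learned per-feature scale, initialized to 1
beta = np.zeros(D)   # learned per-feature shift, initialized to 0

out = gamma * X_hat + beta  # network may re-scale and re-shift every feature
# if backprop drove gamma toward sqrt(var) and beta toward mu, the layer
# would reproduce X, i.e. batch norm can learn to act as the identity
~~~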
+775
+00:55:51,010 --> 00:55:58,470
+And so basically there are several nice properties; this is the full algorithm
+
+776
+00:55:58,469 --> 00:56:03,639
+as I described it. The properties are that it improves the gradient flow
+
+777
+00:56:03,639 --> 00:56:09,049
+through the network; it allows for higher learning rates, so your network can learn
+
+778
+00:56:09,050 --> 00:56:13,080
+faster; it reduces - this is an important one - it reduces the strong dependence on
+
+779
+00:56:13,079 --> 00:56:16,269
+initialization: as you sweep through different choices of your initialization
+
+780
+00:56:16,269 --> 00:56:19,659
+scale, you'll see that with and without batch norm there's a huge difference;
+
+781
+00:56:19,659 --> 00:56:23,469
+with batch norm you'll see that many more things work, for much larger
+
+782
+00:56:23,469 --> 00:56:27,539
+settings of the initial scale, and so you don't have to worry about it as much; it
+
+783
+00:56:27,539 --> 00:56:34,139
+really helps out with this pain point. And one more subtle thing to point out here
+
+784
+00:56:34,139 --> 00:56:39,299
+is that it kind of acts as a funny form of regularization, and it reduces the need for
+
+785
+00:56:39,300 --> 00:56:43,900
+dropout, which we'll go into a bit later in class. But the way it acts as a
+
+786
+00:56:43,900 --> 00:56:51,559
+funny regularizer is: when you have some kind of an input X and it goes through
+
+787
+00:56:51,559 --> 00:56:55,849
+the network, then its representation at some layer of the network is basically
+
+788
+00:56:55,849 --> 00:56:59,858
+not only a function of it, but it's also a function of whatever other examples
+
+789
+00:56:59,858 --> 00:57:02,049
+happen to be in the batch. So
+
+790
+00:57:02,050 --> 00:57:05,570
+whatever other examples are with you in that batch - normally processed completely
+
+791
+00:57:05,570 --> 00:57:09,840
+independently, in parallel - batch norm actually ties them together, and so your
+
+792
+00:57:09,840 --> 00:57:12,880
+representation at, say, the fifth layer of the network is actually a function
+
+793
+00:57:12,880 --> 00:57:16,539
+of whatever batch you happen to be sampled in, and what that does is jitter your
+
+794
+00:57:16,539 --> 00:57:19,809
+place in the representation space at that layer, and this actually has a nice
+
+795
+00:57:19,809 --> 00:57:26,139
+regularizing effect. This stochastic jittering from the batch that you
+
+796
+00:57:26,139 --> 00:57:31,609
+happen to be in has this effect, and so batch normalization actually seems to
+
+797
+00:57:31,610 --> 00:57:33,920
+help out a bit with regularization.
+
+798
+00:57:33,920 --> 00:57:38,950
+OK, and at test time the batch norm layer, by the way, functions a bit differently:
+
+799
+00:57:38,949 --> 00:57:42,699
+at test time you want this to be a deterministic function, so just a
+
+800
+00:57:42,699 --> 00:57:46,500
+quick point: at test time you use the batch norm layer differently. In
+
+801
+00:57:46,500 --> 00:57:52,019
+particular, you have this mu and sigma that you keep normalizing by, so at test
+
+802
+00:57:52,019 --> 00:57:55,519
+time just remember your mu and sigma across the dataset: you can either
+
+803
+00:57:55,519 --> 00:57:59,250
+compute - like, what is the mean and sigma at every single point in the
+
+804
+00:57:59,250 --> 00:58:02,309
+network - you can compute that once over your entire training set, or you can
+
+805
+00:58:02,309 --> 00:58:05,759
+just keep a running sum of mus and sigmas while you're training, and then
+
+806
+00:58:05,760 --> 00:58:08,800
+make sure to remember them in the batch norm layer, because at test time you don't
+
+807
+00:58:08,800 --> 00:58:12,460
+want to estimate the empirical mean and variance across your batch: you
+
+808
+00:58:12,460 --> 00:58:17,000
+want to just use those stored values directly, so you're not computing
+
+809
+00:58:17,000 --> 00:58:26,179
+them on the fly at test time. So that's just a small detail.
+
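A sketch of the running-statistics bookkeeping being described, continuing the sketch above; the 0.9 decay is a typical choice assumed here, not a value given in the lecture:

~~~python
running_mu = np.zeros(D)   # persistent buffers stored in the batch norm layer
running_var = np.zeros(D)

# inside the training loop, after computing mu/var on the current batch:
running_mu = 0.9 * running_mu + 0.1 * mu
running_var = 0.9 * running_var + 0.1 * var

# at test time: a deterministic function, no batch statistics involved
def bn_test(X_test):
    X_hat_test = (X_test - running_mu) / np.sqrt(running_var + 1e-5)
    return gamma * X_hat_test + beta
~~~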
+
+810
+00:58:26,179 --> 00:58:29,049
+Any questions about batch normalization? So this is a good thing:
+
+811
+00:58:29,050 --> 00:58:35,559
+use it, and you'll actually implement it in your assignment.
+
+812
+00:58:35,559 --> 00:58:41,039
+[question] So the question is: does it slow things down at all? It does; there is a
+
+813
+00:58:41,039 --> 00:58:44,219
+runtime penalty that you have to pay, unfortunately. I don't know exactly
+
+814
+00:58:44,219 --> 00:58:49,088
+how expensive it is - I heard someone say like 30 percent, even - so I don't know;
+
+815
+00:58:49,088 --> 00:58:54,318
+actually, I haven't fully checked this, but basically there is a penalty, because
+
+816
+00:58:54,318 --> 00:58:58,548
+you have to do this normalization; it's very common to put it after every
+
+817
+00:58:58,548 --> 00:59:02,458
+convolutional layer, and if you have 150 conv-like layers, you end up having all this
+
+818
+00:59:02,458 --> 00:59:16,719
+stuff build up. [question] It's the price we pay, I suppose. So yes, when
+
+819
+00:59:16,719 --> 00:59:20,249
+can you tell that you maybe need batch norm? I think I'll come back to that in a
+
+820
+00:59:20,248 --> 00:59:24,228
+few slides: we'll see how you can detect that your network is not healthy,
+
+821
+00:59:24,228 --> 00:59:30,318
+and then maybe you want to try batch norm. OK, so the learning process. I have 20
+
+822
+00:59:30,318 --> 00:59:36,489
+minutes; I think I can do this, so I think we're fine. So we've preprocessed
+
+823
+00:59:36,489 --> 00:59:41,420
+our data, and we've decided on a setup: for the purposes of
+
+824
+00:59:41,420 --> 00:59:44,719
+these experiments I'm going to work with CIFAR-10, and I'm going to use a
+
+825
+00:59:44,719 --> 00:59:48,688
+two-layer neural network with 50 hidden neurons, and I'd like to give you an idea
+
+826
+00:59:48,688 --> 00:59:51,538
+of how this looks in practice when you're training neural networks:
+
+827
+00:59:51,539 --> 00:59:52,699
+how do you play with it,
+
+828
+00:59:52,699 --> 00:59:56,849
+how do you actually cross-validate your hyperparameters, what does this
+
+829
+00:59:56,849 --> 00:59:59,380
+process of playing with the data and getting things to work look like in
+
+830
+00:59:59,380 --> 01:00:03,019
+practice. And so I decided to try out a small neural network:
+
+831
+01:00:03,018 --> 01:00:08,248
+I preprocess my data, and the first kinds of things that I would look at, if
+
+832
+01:00:08,248 --> 01:00:11,728
+I want to make sure that my implementation is correct and things are working:
+
+833
+01:00:11,728 --> 01:00:16,028
+first of all, I'm going to be initializing here a two-layer neural
+
+834
+01:00:16,028 --> 01:00:19,679
+network - so weights and biases - initialized with just naive
+
+835
+01:00:19,679 --> 01:00:23,969
+initialization here, because this is just a very small network, so I can afford to
+
+836
+01:00:23,969 --> 01:00:28,259
+maybe just sample naively from a Gaussian. And then this is a function
+
+837
+01:00:28,259 --> 01:00:31,329
+that's basically going to train a neural network - I'm not showing you the
+
+838
+01:00:31,329 --> 01:00:35,949
+implementation, obviously - but the one thing that matters is it returns your loss
+
+839
+01:00:35,949 --> 01:00:39,170
+and returns your gradients on your model parameters. And so the first
+
+840
+01:00:39,170 --> 01:00:42,869
+thing I try, for example, is: I disable the regularization, that's passed in at the
+
+841
+01:00:42,869 --> 01:00:45,818
+end, and I make sure that my loss comes out
+
+842
+01:00:45,818 --> 01:00:49,358
+right. So I mentioned this in previous lectures: say I have 10 classes, it's
+
+843
+01:00:49,358 --> 01:00:53,318
+CIFAR-10 and I'm using a softmax classifier, so I know that I'm expecting a loss of
+
+844
+01:00:53,318 --> 01:00:59,099
+negative log of one over 10, because that's the expression for the loss,
+
+845
+01:00:59,099 --> 01:01:03,180
+and that turns out to be 2.3. So I run this and I get a loss of 2.3, so I
+
+846
+01:01:03,179 --> 01:01:05,708
+know that basically the neural network is currently giving me a diffuse
+
+847
+01:01:05,708 --> 01:01:09,728
+distribution over the classes, because it doesn't know anything - we've just initialized it. So
+
+848
+01:01:09,728 --> 01:01:12,778
+that checks out. The next thing I might check is that, for example, I crank up
+
+849
+01:01:12,778 --> 01:01:17,318
+the regularization, and of course I expect my loss to go up, right, because now we
+
+850
+01:01:17,318 --> 01:01:20,380
+have this additional term in the objective; and so that checks out too, so
+
+851
+01:01:20,380 --> 01:01:20,940
+that's nice.
+
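The initial-loss sanity check just described, as a small sketch:

~~~python
import numpy as np

num_classes = 10
expected_initial_loss = -np.log(1.0 / num_classes)
print(expected_initial_loss)  # ~2.30: what a softmax classifier should report
                              # at initialization with regularization disabled;
                              # then crank regularization up and confirm the
                              # reported loss increases
~~~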
+852
+01:01:20,940 --> 01:01:25,409
+Then the next thing I would usually try - it's a very good sanity check
+
+853
+01:01:25,409 --> 01:01:28,478
+when you're working with neural networks - is to take a small piece of your data
+
+854
+01:01:28,478 --> 01:01:32,139
+and make sure you can overfit it, just that small
+
+855
+01:01:32,139 --> 01:01:36,608
+piece. So then I take, say, a sample of like 20 training examples and
+
+856
+01:01:36,608 --> 01:01:41,858
+their 20 labels, and I just make sure that I train on that small piece and I just
+
+857
+01:01:41,858 --> 01:01:45,179
+make sure that I can get a loss of basically near zero: I can fully overfit
+
+858
+01:01:45,179 --> 01:01:48,379
+that, because if I can't overfit a tiny piece of my data, then things are
+
+859
+01:01:48,380 --> 01:01:54,608
+definitely broken. And so here I am starting the training, and I'm starting
+
+860
+01:01:54,608 --> 01:01:58,969
+with some random parameters here - I'm not going to go into full
+
+861
+01:01:58,969 --> 01:02:04,150
+detail there - but basically I make sure that my cost can go down to zero and
+
+862
+01:02:04,150 --> 01:02:08,519
+that I'm getting 100% accuracy on this tiny piece of data, and that gives me
+
+863
+01:02:08,518 --> 01:02:12,659
+confidence that probably backprop is working, probably the update is working,
+
+864
+01:02:12,659 --> 01:02:16,798
+the learning rate is set somehow reasonably; so I can overfit a small
+
+865
+01:02:16,798 --> 01:02:21,190
+dataset, I'm happy at this point, and maybe I'm thinking about scaling up to a
+
+866
+01:02:21,190 --> 01:02:28,079
+larger dataset.
+
+867
+01:02:28,079 --> 01:02:33,960
+[question] So you should be able to overfit it; sometimes you can try, say, one or
+
+868
+01:02:33,960 --> 01:02:37,409
+two or three examples - you can really crank it down - and you should be able to
+
+869
+01:02:37,409 --> 01:02:40,460
+overfit even with smaller networks, and so that's a very good sanity check, because
+
+870
+01:02:40,460 --> 01:02:45,289
+you can afford to use small networks, and if you can't overfit,
+
+871
+01:02:45,289 --> 01:02:49,039
+your implementation is probably incorrect; something very funky is wrong. So you
+
+872
+01:02:49,039 --> 01:02:52,039
+should not be scaling up to your full dataset before you can pass this
+
+873
+01:02:52,039 --> 01:03:02,380
+sanity check.
+
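A sketch of this overfit-a-tiny-piece check; init_two_layer_model and train_step are hypothetical stand-ins for whatever training code you have, not functions from the lecture:

~~~python
# take a tiny subset and check that the loss can be driven to ~0
X_tiny, y_tiny = X_train[:20], y_train[:20]   # 20 examples and their 20 labels
model = init_two_layer_model()                # hypothetical model constructor

for step in range(200):
    loss = train_step(model, X_tiny, y_tiny, reg=0.0)  # hypothetical helper

# expect: loss near zero and 100% training accuracy on the 20 examples;
# if not, backprop, the update rule, or the learning rate is likely broken
~~~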
+874
+01:03:02,380 --> 01:03:05,990
+So basically the way I try to approach this is: take a small piece of
+
+875
+01:03:05,989 --> 01:03:10,049
+data, make sure I can overfit it, and now I'm scaling up to the
+
+876
+01:03:10,050 --> 01:03:13,289
+bigger dataset and trying to find a learning rate that works, and you have to
+
+877
+01:03:13,289 --> 01:03:17,219
+really play with this, right - you can't just eyeball the learning rate, you have to find
+
+878
+01:03:17,219 --> 01:03:22,559
+the scale roughly. So I'm trying first a small learning rate, like one e negative
+
+879
+01:03:22,559 --> 01:03:27,509
+six, and I see that the loss is barely, barely going down, so this
+
+880
+01:03:27,510 --> 01:03:30,250
+learning rate of 1e-6 is probably too small, right, nothing is
+
+881
+01:03:30,250 --> 01:03:34,409
+changing. Of course there could be many other things wrong, because the loss
+
+882
+01:03:34,409 --> 01:03:38,339
+can be stuck for like a million reasons, but we passed the small sanity checks, so
+
+883
+01:03:38,340 --> 01:03:43,130
+I'm thinking that the learning rate is probably too low and I need to increase it. By
+
+884
+01:03:43,130 --> 01:03:48,280
+the way, this is a fine example here of something funky going on that is fun to
+
+885
+01:03:48,280 --> 01:03:54,000
+think about: my loss just barely went down, but actually my training accuracy
+
+886
+01:03:54,000 --> 01:03:58,050
+shot up to 20% from the default 10%. How does that make any sense? How can it be
+
+887
+01:03:58,050 --> 01:04:08,130
+that my loss just barely changed, but my accuracy is so good,
+
+888
+01:04:08,130 --> 01:04:38,860
+well, much much better than 10%? Is that even possible?
+
+889
+01:04:38,860 --> 01:04:46,120
+OK, maybe not quite. So think about how accuracy is computed and how this cost is
+
+890
+01:04:46,119 --> 01:05:04,799
+computed. Right: what's happening is you're training, so these scores are all shifting a tiny bit;
+
+891
+01:05:04,800 --> 01:05:08,769
+your scores are still roughly diffuse and give roughly the same loss, but now
+
+892
+01:05:08,769 --> 01:05:12,619
+your correct answers are a tiny bit more probable, and so when we actually
+
+893
+01:05:12,619 --> 01:05:16,210
+compute the accuracy, the argmax class ends up being the correct one more often.
+
+894
+01:05:16,210 --> 01:05:19,530
+These are some of the fun things you run into when you actually train some
+
+895
+01:05:19,530 --> 01:05:24,900
+of this stuff; you do have to think through the expressions. OK, so now: I
+
+896
+01:05:24,900 --> 01:05:27,619
+tried a very low learning rate, things are barely happening, so now I'm going to go to
+
+897
+01:05:27,619 --> 01:05:30,719
+the other extreme and I'm going to try a learning rate of a million - what could
+
+898
+01:05:30,719 --> 01:05:36,199
+possibly go wrong? So what happens in that case: you get some weird errors,
+
+899
+01:05:36,199 --> 01:05:40,429
+things explode, you get NaNs; really fun stuff happens. So OK, a million
+
+900
+01:05:40,429 --> 01:05:44,639
+is probably too high, is what I'm thinking at this point. So then I try
+
+901
+01:05:44,639 --> 01:05:48,179
+to narrow in on the rough region that actually gives me a decrease in my cost -
+
+902
+01:05:48,179 --> 01:05:51,409
+that's what I'm trying to do with my binary search here - and so at some
+
+903
+01:05:51,409 --> 01:05:54,739
+point I get some idea about, you know, roughly where I should be cross-
+
+904
+01:05:54,739 --> 01:05:55,929
+validating.
+
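A sketch of this bracketing procedure - probe learning rates orders of magnitude apart and read the loss behavior; train is a hypothetical helper and the probed values are illustrative:

~~~python
# probe a few learning rates spaced by orders of magnitude
for lr in [1e-6, 1e-4, 1e-2, 1e0]:
    loss_history = train(model, X_train, y_train, learning_rate=lr, num_epochs=1)
    print(lr, loss_history[-1])
# loss barely moves      -> learning rate too low (the 1e-6 case above)
# loss becomes NaN / inf -> learning rate too high (the "million" case)
# then binary-search the region in between where the cost actually decreases
~~~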
+905
+01:05:55,929 --> 01:06:00,019
+So now, like a proper optimization: at this point I'm trying to find the best hyperparameters
+
+906
+01:06:00,019 --> 01:06:04,030
+for my network. What we like to do in practice is go with a coarse-to-fine
+
+907
+01:06:04,030 --> 01:06:07,820
+strategy: first I just get a rough idea, by playing with it, of where the learning
+
+908
+01:06:07,820 --> 01:06:11,550
+rate should be; then I do a coarse search over learning rates, over a bigger
+
+909
+01:06:11,550 --> 01:06:16,180
+range; and then I repeat this process: I look at what works, and then I narrow in
+
+910
+01:06:16,179 --> 01:06:20,500
+on the regions that work well. When you do this, one tip: in your code, for
+
+911
+01:06:20,500 --> 01:06:23,719
+example, detect explosions and break out early - it's a nice tip in terms of
+
+912
+01:06:23,719 --> 01:06:28,339
+implementation. So what I'm doing effectively here is I have a loop where
+
+913
+01:06:28,340 --> 01:06:31,579
+I sample my hyperparameters - in this case the regularization and learning
+
+914
+01:06:31,579 --> 01:06:36,849
+rate - I sample them, I train, I get some results here. So these are the accuracies
+
+915
+01:06:36,849 --> 01:06:40,179
+on the validation data, and these are the hyperparameters that produced them, and
+
+916
+01:06:40,179 --> 01:06:44,440
+for some of the accuracies you can see that they work quite well - 50%, 40% - and some of
+
+917
+01:06:44,440 --> 01:06:47,409
+them don't work well at all. So this gives me an idea about what range of
+
+918
+01:06:47,409 --> 01:06:50,659
+learning rates and regularizations is working relatively well.
+
+919
+01:06:50,659 --> 01:06:55,079
+And when you do this optimization, you can start out first with just a small
+
+920
+01:06:55,079 --> 01:06:58,090
+number of epochs - you don't need to run for a very long time; just run for a few
+
+921
+01:06:58,090 --> 01:07:02,680
+minutes and you can already get a sense of what's working better than other
+
+922
+01:07:02,679 --> 01:07:08,259
+things. And also one note: when you're optimizing over regularization and learning
+
+923
+01:07:08,260 --> 01:07:12,320
+rate, it's best to sample in log space: you don't just want to sample from a uniform
+
+924
+01:07:12,320 --> 01:07:16,510
+distribution, because these learning rates and regularizations act
+
+925
+01:07:16,510 --> 01:07:20,180
+multiplicatively on the dynamics of your backpropagation, and so that's why
+
+926
+01:07:20,179 --> 01:07:25,319
+you want to do this in log space. So you can see that I'm sampling from negative 3 to negative 6:
+
+927
+01:07:25,320 --> 01:07:28,350
+I sample the exponent of the learning rate, and then I'm raising 10 to the power of
+
+928
+01:07:28,349 --> 01:07:33,319
+it. And so you don't want to just be sampling from a
+
+929
+01:07:33,320 --> 01:07:38,610
+uniform range of, say, 0.001 to a hundred, because then most of your samples are in a
+
+930
+01:07:38,610 --> 01:07:41,820
+bad region, because the learning rate is a multiplicative interaction - something to be aware of.
+
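The log-space sampling being described, as a short sketch (the ranges are illustrative):

~~~python
import numpy as np

# sample hyperparameters in log space, not uniformly in raw space
lr = 10 ** np.random.uniform(-6, -3)   # exponent uniform in [-6, -3]
reg = 10 ** np.random.uniform(-5, 5)

# NOT this: uniform in raw space wastes most samples on one order of magnitude
# lr = np.random.uniform(0.001, 100)
~~~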
+931
+01:07:41,820 --> 01:07:50,050
+Once I see what works relatively well, I'm doing a second pass
+
+932
+01:07:50,050 --> 01:07:52,950
+where I'm kind of going in and adjusting these ranges again a bit, and I'm
+
+933
+01:07:52,949 --> 01:07:58,139
+looking at what works. So I find that I can now get up to 53%; some of these work
+
+934
+01:07:58,139 --> 01:08:02,460
+really well. One thing to be aware of: sometimes you get a result like this -
+
+935
+01:08:02,460 --> 01:08:06,920
+53% is working quite well - and there's actually something worrying here. If I see this, I'm
+
+936
+01:08:06,920 --> 01:08:11,440
+actually worried at this point, because going through this cross-validation,
+
+937
+01:08:11,440 --> 01:08:14,490
+I have a result here, and there's something actually wrong about this
+
+938
+01:08:14,489 --> 01:08:21,880
+result that hints at some issue.
+
+939
+01:08:21,880 --> 01:08:31,279
+What's the problem?
+
+940
+01:08:31,279 --> 01:08:54,109
+[student answer] Right: I'm sampling the learning
+
+941
+01:08:54,109 --> 01:08:58,759
+rate between 10 to the negative 3 and 10 to the negative 4, and I end up with a very good result that is
+
+942
+01:08:58,760 --> 01:09:00,690
+just at the boundary of what I'm
+
+943
+01:09:00,689 --> 01:09:06,960
+optimizing over: this is almost 1e-3, it's almost 0.001, which is
+
+944
+01:09:06,960 --> 01:09:10,510
+really the boundary of what I'm searching over. So I'm getting a really good result
+
+945
+01:09:10,510 --> 01:09:14,780
+at an edge of what I'm looking at, and that's not good, because maybe the range,
+
+946
+01:09:14,779 --> 01:09:18,719
+the way I've defined it, is not actually optimal, and so I want to make sure that
+
+947
+01:09:18,720 --> 01:09:21,560
+I spot these things and adjust my ranges, because there might be even better
+
+948
+01:09:21,560 --> 01:09:22,520
+results
+
+949
+01:09:22,520 --> 01:09:26,390
+slightly outside the range. So maybe I want to change negative 3 to negative 2 or
+
+950
+01:09:26,390 --> 01:09:32,570
+negative 2.5. But for regularization I see that this range is working quite well, so maybe I'm
+
+951
+01:09:32,569 --> 01:09:38,529
+in a good spot there and I'm not worried about it. One thing I'd like to
+
+952
+01:09:38,529 --> 01:09:42,739
+point out is: you'll see me sample these randomly - 10 to the power of a uniform
+
+953
+01:09:42,739 --> 01:09:46,639
+sample - so I'm sampling random regularizations and learning rates. Doing this, what you
+
+954
+01:09:46,640 --> 01:09:49,829
+might see sometimes is people doing what's called a grid search. So really
+
+955
+01:09:49,829 --> 01:09:53,920
+the difference here is: instead of sampling randomly, people like to go in
+
+956
+01:09:53,920 --> 01:09:58,789
+steps of fixed amounts in both the learning rate and the regularization, and so
+
+957
+01:09:58,789 --> 01:10:02,519
+you end up with this double loop here over some settings of learning rate and some
+
+958
+01:10:02,520 --> 01:10:03,740
+settings of regularization,
+
+959
+01:10:03,739 --> 01:10:07,590
+trying to be exhaustive. And this is actually a bad idea: it doesn't actually
+
+960
+01:10:07,590 --> 01:10:12,720
+work as well as if you sample randomly, and it's unintuitive, but you actually always
+
+961
+01:10:12,720 --> 01:10:16,280
+want to sample randomly; you don't want to go in exact steps. And here's the reason
+
+962
+01:10:16,279 --> 01:10:23,319
+for that - it's kind of subtle to think about. Here is the grid search case: I sampled at
+
+963
+01:10:23,319 --> 01:10:31,579
+set intervals, and I kind of, you know, sweep out the space; and here's
+
+964
+01:10:31,579 --> 01:10:35,090
+random sampling, where I just randomly sample over the two. The issue is that in
+
+965
+01:10:35,090 --> 01:10:38,930
+optimization and training neural networks, what often happens is that
+
+966
+01:10:38,930 --> 01:10:41,800
+one of the parameters can be much, much more important than the other
+
+967
+01:10:41,800 --> 01:10:43,039
+parameter.
+
+968
+01:10:43,039 --> 01:10:45,989
+So say that the x-axis is the important parameter: the
+
+969
+01:10:45,989 --> 01:10:49,349
+performance of your loss function is not really a function of the y dimension,
+
+970
+01:10:49,350 --> 01:10:52,510
+it's really a function of the x dimension: you get much better results in
+
+971
+01:10:52,510 --> 01:10:58,699
+a specific range along the x-axis. And if this is true, which is often the
+
+972
+01:10:58,699 --> 01:11:02,170
+case, then with random sampling you're actually going to end up sampling lots of
+
+973
+01:11:02,170 --> 01:11:06,300
+different x values, and you end up finding a better spot than here, where you've
+
+974
+01:11:06,300 --> 01:11:09,850
+sampled at exact spots and you're not getting as much information across
+
+975
+01:11:09,850 --> 01:11:14,910
+the x-axis, if that makes sense. So always use random sampling, because in these cases, which are
+
+976
+01:11:14,909 --> 01:11:24,220
+common, random will actually give you more bang for the buck.
+
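A toy sketch of why random search probes the important axis better than a grid, under the assumption that only one of the two hyperparameters matters:

~~~python
import numpy as np

# 9 trials each way over one hyperparameter axis that matters
grid_x = np.repeat([0.1, 0.5, 0.9], 3)    # a 3x3 grid projects to 3 x values
rand_x = np.random.uniform(0, 1, size=9)  # random search probes 9 x values
print(np.unique(grid_x).size, np.unique(rand_x).size)  # 3 vs (almost surely) 9
~~~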
+977
+01:11:24,220 --> 01:11:28,520
+So, the hyperparameters you want to play with: the most common ones
+
+978
+01:11:28,520 --> 01:11:32,920
+are probably the learning rate, maybe the update type - we're going to
+
+979
+01:11:32,920 --> 01:11:36,899
+go into this in a bit - the regularization, and the dropout
+
+980
+01:11:36,899 --> 01:11:42,979
+amount, which we're also going to go into. So this is really - it's so much fun. In practice,
+
+981
+01:11:42,979 --> 01:11:46,679
+the way this looks is: we have, for example, a computer vision cluster - we have a
+
+982
+01:11:46,680 --> 01:11:49,829
+bunch of machines - so I can just distribute my training across all these
+
+983
+01:11:49,829 --> 01:11:53,100
+machines, and I've written for myself, for example, a command center interface, where
+
+984
+01:11:53,100 --> 01:11:56,880
+these are all the loss functions on all the different machines in the
+
+985
+01:11:56,880 --> 01:12:01,270
+cluster, these are the hyperparameters I'm searching over, and I can see basically
+
+986
+01:12:01,270 --> 01:12:04,370
+what's working and what isn't, and I can
+
+987
+01:12:04,369 --> 01:12:07,399
+send commands to my workers, so I can say: OK, this isn't working at all, just
+
+988
+01:12:07,399 --> 01:12:10,960
+resample, you're not doing well at all; and some of these are doing very well,
+
+989
+01:12:10,960 --> 01:12:14,020
+and I look at what exactly is working well, and I adjust. It's a dynamic
+
+990
+01:12:14,020 --> 01:12:17,490
+process that I have to go through to actually get this stuff to work well,
+
+991
+01:12:17,489 --> 01:12:21,569
+because you just have too much stuff to optimize over and you can't afford to just
+
+992
+01:12:21,569 --> 01:12:25,759
+spray and pray; you have to work with it.
+
+993
+01:12:25,760 --> 01:12:29,289
+OK, so as you're optimizing, you're looking at loss functions.
+
+994
+01:12:29,289 --> 01:12:34,510
+Loss functions can take various different forms, and you need to be able
+
+995
+01:12:34,510 --> 01:12:38,289
+to read into what they mean, so you'll get quite good at looking at
+
+996
+01:12:38,289 --> 01:12:42,409
+loss functions and interpreting what's happening. This one, for example, as I was
+
+997
+01:12:42,409 --> 01:12:47,359
+pointing out in the previous lecture, is not as exponential as I'm maybe used to with my
+
+998
+01:12:47,359 --> 01:12:50,949
+loss functions: it looks a little too linear, and so that
+
+999
+01:12:50,949 --> 01:12:53,069
+maybe tells me that the learning rate is perhaps slightly too low. It doesn't
+
+1000
+01:12:53,069 --> 01:12:54,359
+mean the learning rate is too low; it just means that I might want to consider trying
+
+1001
+01:12:54,359 --> 01:12:58,549
+a higher one. Sometimes you get all kinds of funny things: you can have a plateau,
+
+1002
+01:12:58,550 --> 01:13:04,199
+where at some point the loss decides that now it wants to start optimizing. So
+
+1003
+01:13:04,198 --> 01:13:15,948
+what is the prime suspect in these kinds of cases? Just a guess. [student answer] Yeah, I think
+
+1004
+01:13:15,948 --> 01:13:19,388
+the prime suspect is incorrect initialization: the gradients are barely
+
+1005
+01:13:19,389 --> 01:13:23,579
+flowing, but at some point they add up and the thing suddenly starts training. So,
+
+1006
+01:13:23,579 --> 01:13:27,420
+lots of fun - in fact it's so much fun that I started an entire Tumblr a while
+
+1007
+01:13:27,420 --> 01:13:34,260
+ago on loss functions, so you can go through these; people contribute them,
+
+1008
+01:13:34,260 --> 01:13:38,300
+which is nice. Some of these, I think, are from training recurrent networks -
+
+1009
+01:13:38,300 --> 01:13:43,550
+we're going to go into that - all kinds of exotic shapes, and I don't exactly
+
+1010
+01:13:43,550 --> 01:13:48,730
+know - at some point you're not really sure what any of this means; it's going
+
+1011
+01:13:48,729 --> 01:13:52,569
+so well.
+
+1012
+01:13:52,569 --> 01:14:04,469
+Yeah, so here there are several tasks being trained at the same time. And this one,
+
+1013
+01:14:04,469 --> 01:14:08,139
+by the way - I know what happened here: this is actually training a
+
+1014
+01:14:08,139 --> 01:14:11,170
+reinforcement learning agent. The problem in reinforcement learning is you don't
+
+1015
+01:14:11,170 --> 01:14:14,679
+have a stationary data distribution; you don't have a fixed dataset - you have an agent
+
+1016
+01:14:14,679 --> 01:14:17,800
+interacting with the environment - so if your policy changes and you end up,
+
+1017
+01:14:17,800 --> 01:14:21,199
+like, staring at a wall, or you end up looking at different parts of your space,
+
+1018
+01:14:21,198 --> 01:14:24,629
+you end up with different data distributions. And so suddenly I'm
+
+1019
+01:14:24,630 --> 01:14:27,109
+looking at something very different from what I used to be looking at while I'm
+
+1020
+01:14:27,109 --> 01:14:30,098
+training my agent, and the loss goes up, because the agent is unfamiliar with
+
+1021
+01:14:30,099 --> 01:14:33,569
+that kind of input; so you have all kinds of fun stuff happening there.
+
+1022
+01:14:33,569 --> 01:14:40,578
+And then this one is one of my favorites: I have no idea what happened
+
+1023
+01:14:40,578 --> 01:14:45,988
+here - this loss oscillates but roughly descends, and then it just explodes;
+
+1024
+01:14:45,988 --> 01:14:53,238
+clearly something was not right in this case. And also here, one just suddenly
+
+1025
+01:14:53,238 --> 01:14:57,789
+decides to converge, and I have no idea what was going on. So you get all kinds of funny
+
+1026
+01:14:57,789 --> 01:15:01,368
+things; if you end up with funny plots in your assignment, please do send them to
+
+1027
+01:15:01,368 --> 01:15:02,948
+the lossfunctions Tumblr.
+
+1028
+01:15:02,948 --> 01:15:06,219
+OK. So as you're training,
+
+1029
+01:15:06,219 --> 01:15:09,899
+don't only look at the loss function; another thing to look at is your accuracy,
+
+1030
+01:15:09,899 --> 01:15:14,929
+especially your accuracies, for example. You sometimes prefer looking at the accuracy
+
+1031
+01:15:14,929 --> 01:15:18,248
+over loss functions, because accuracies are interpretable: I know what these
+
+1032
+01:15:18,248 --> 01:15:22,519
+classification accuracies mean in absolute terms, whereas a loss
+
+function is maybe not as interpretable.
+
+1033
+01:15:22,519 --> 01:15:27,369
+And so in particular I plot the accuracy for my
+
+1034
+01:15:27,368 --> 01:15:31,589
+validation data and my training data, and so for example in this case I'm seeing that
+
+1035
+01:15:31,590 --> 01:15:35,288
+my training data accuracy is getting much, much better while the validation accuracy
+
+1036
+01:15:35,288 --> 01:15:38,929
+has stopped improving, and so this gap can give you hints on what
+
+1037
+01:15:38,929 --> 01:15:42,380
+might be going on under the hood. In this particular case there's a huge gap here,
+
+1038
+01:15:42,380 --> 01:15:44,440
+so maybe I'm thinking I'm overfitting -
+
+1039
+01:15:44,439 --> 01:15:48,069
+I'm not 100% sure, but I might be overfitting, so I might want to regularize more strongly.
+
+1040
+01:15:48,069 --> 01:15:57,038
+One thing you might also be looking at is tracking the difference between the
+
+1041
+01:15:57,038 --> 01:16:01,988
+scale of your parameters and the scale of your updates to those parameters. So
+
+1042
+01:16:01,988 --> 01:16:06,748
+say, suppose that your weights are on the order of unit Gaussian;
+
+1043
+01:16:06,748 --> 01:16:10,599
+then intuitively, the updates that you're incrementing your weights by in
+
+1044
+01:16:10,599 --> 01:16:14,349
+backpropagation - you don't want those updates to be much larger than the
+
+1045
+01:16:14,349 --> 01:16:16,679
+weights, obviously, nor do you want them to be tiny:
+
+1046
+01:16:16,679 --> 01:16:20,529
+you don't want your updates to be on the order of 1e-7 when your weights are on the order of
+
+1047
+01:16:20,529 --> 01:16:25,359
+1e-2. And so look at the update that you're about to increment
+
+1048
+01:16:25,359 --> 01:16:29,439
+onto your weights - just look at its norm, for example - and
+
+1049
+01:16:29,439 --> 01:16:34,129
+compare it to the scale of your parameters, and usually a good rule of
+
+1050
+01:16:34,130 --> 01:16:38,550
+thumb is that this ratio should be roughly 1e-3: so basically with every update you're
+
+1051
+01:16:38,550 --> 01:16:41,360
+modifying on the order of, like, the third significant digit of every single
+
+1052
+01:16:41,359 --> 01:16:44,118
+parameter, right; you're not making huge updates, you're not making very small
+
+1053
+01:16:44,118 --> 01:16:49,708
+updates. So that's one thing to look at: roughly 1e-3 usually works OK. If this is
+
+1054
+01:16:49,708 --> 01:16:53,038
+too high, I maybe want to decrease my learning rate; if it's way too low, like say
+
+1055
+01:16:53,038 --> 01:17:00,069
+it's 1e-7, maybe I want to increase my learning rate.
+
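A sketch of that rule-of-thumb check; W, dW and learning_rate stand in for one weight matrix, its gradient, and your step size:

~~~python
# ratio of update magnitude to parameter magnitude (rule of thumb: ~1e-3)
param_scale = np.linalg.norm(W.ravel())
update = -learning_rate * dW                  # the step you are about to apply
update_scale = np.linalg.norm(update.ravel())
print(update_scale / param_scale)  # ~1e-3 is healthy; ~1e-7 suggests raising
                                   # the learning rate, much higher suggests
                                   # lowering it
~~~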
+1056
+01:17:00,069 --> 01:17:05,308
+And so in summary: today we looked at a whole bunch of things to do with training neural networks.
+
+1057
+01:17:05,309 --> 01:17:09,729
+The TL;DRs of all of them are basically: use ReLU, subtract the mean, use Xavier
+
+1058
+01:17:09,729 --> 01:17:11,869
+initialization,
+
+1059
+01:17:11,869 --> 01:17:15,750
+or if you think you have a small network you can maybe get away with just
+
+1060
+01:17:15,750 --> 01:17:20,399
+choosing your scale to be 0.01, or maybe you want to play with that a bit - there's
+
+1061
+01:17:20,399 --> 01:17:26,719
+no strong recommendation here - and I think just use batch normalization. And when you're doing
+
+1062
+01:17:26,720 --> 01:17:34,110
+hyperparameter optimization, make sure to sample your hyperparameters, and do it in log space when appropriate;
+
+1063
+01:17:34,109 --> 01:17:39,449
+that's something to be aware of. And this is what we still have to cover, and that
+
+1064
+01:17:39,449 --> 01:17:44,269
+will be next time. We do have two more minutes, so I will take questions if there are
+
+1065
+01:17:44,270 --> 01:18:01,520
+any.
+
+1066
+01:18:01,520 --> 01:18:11,120
+[question about a] correlation between...
+
+1067
+01:18:11,119 --> 01:18:15,729
+I don't think there's anything obvious I can recommend there; you have to get a
+
+1068
+01:18:15,729 --> 01:18:18,769
+feel for it. I don't think anything jumps out at me that's obvious.
+
+1069
+01:18:18,770 --> 01:18:35,210
+A couple more questions? OK, great.
+
+1070
+01:18:35,210 --> 01:18:35,949
+[question regarding...]
+
diff --git a/captions/En/Lecture6_en.srt b/captions/En/Lecture6_en.srt
new file mode 100644
index 00000000..5c5c51f7
--- /dev/null
+++ b/captions/En/Lecture6_en.srt
@@ -0,0 +1,4497 @@
+1
+00:00:00,000 --> 00:00:07,009
+OK, so let's get started. First, today we'll talk about training neural networks again, and
+
+2
+00:00:07,009 --> 00:00:10,449
+then I'll give you a bit of an intro to convolutional neural networks. Before we dive
+
+3
+00:00:10,449 --> 00:00:15,489
+into the material, just some administrative things. First: I
+
+4
+00:00:15,490 --> 00:00:18,618
+didn't get a chance to actually introduce Justin last lecture. Justin is
+
+5
+00:00:18,618 --> 00:00:21,579
+your instructor also for this class, and he was missing for the first two weeks,
+
+6
+00:00:21,579 --> 00:00:28,409
+and you can ask him anything about anything - he's very knowledgeable; maybe
+
+7
+00:00:28,410 --> 00:00:29,428
+that's an understatement.
+
+8
+00:00:29,428 --> 00:00:37,960
+OK, and assignment 2 is out. As a reminder, it's quite long, so I encourage you to start
+
+9
+00:00:37,960 --> 00:00:43,850
+early; it's due basically next Friday, so get started on it as soon as
+
+10
+00:00:43,850 --> 00:00:47,679
+possible. You'll implement neural networks with a proper API of forward/
+
+11
+00:00:47,679 --> 00:00:50,429
+backward passes, and you'll see the abstraction of a computational graph;
+
+12
+00:00:50,429 --> 00:00:54,820
+you'll go into batch normalization and dropout, and then you'll actually implement
+
+13
+00:00:54,820 --> 00:00:57,770
+convolutional networks. So by the end of this assignment you'll actually have a
+
+14
+00:00:57,770 --> 00:01:00,770
+fairly good understanding of all the low-level details of how convolutional
+
+15
+00:01:00,770 --> 00:01:06,530
+network classifiers work. OK, so where we are in this class, just as a reminder:
+
+16
+00:01:06,530 --> 00:01:10,140
+again, we're training neural networks, and it turns out that training networks is
+
+17
+00:01:10,140 --> 00:01:15,590
+really a four-step process. You have an entire dataset of images and labels; we
+
+18
+00:01:15,590 --> 00:01:18,920
+sample a small batch from the dataset; we forward propagate it through the network
+
+19
+00:01:18,920 --> 00:01:23,060
+to get the loss, which is telling us how well we're currently classifying
+
+20
+00:01:23,060 --> 00:01:26,390
+this batch of data; and we backpropagate to compute the gradient on all the
+
+21
+00:01:26,390 --> 00:01:29,969
+weights, and this gradient is telling us how we should nudge every single weight
+
+22
+00:01:29,969 --> 00:01:33,789
+in the network so that we're better classifying these images. And then once
+
+23
+00:01:33,790 --> 00:01:36,700
+we have the gradient, we can use it for a parameter update, where we actually do that
+
+24
+00:01:36,700 --> 00:01:38,930
+small nudge.
+
+25
+00:01:38,930 --> 00:01:42,659
+Last class we looked into activation functions, and a whole zoo of activation
+
+26
+00:01:42,659 --> 00:01:45,368
+functions, and some pros and cons of using any of these inside a neural
+
+27
+00:01:45,368 --> 00:01:49,060
+network. A good question came up on Piazza, where someone asked: why would you even
+
+28
+00:01:49,060 --> 00:01:53,939
+use an activation function, why not just skip it? The question was posed,
+
+29
+00:01:53,938 --> 00:01:57,618
+and it got addressed really nicely in the last lecture: basically,
+
+30
+00:01:57,618 --> 00:02:00,790
+if you don't use an activation function, then your entire neural network ends up
+
+31
+00:02:00,790 --> 00:02:05,500
+being one single linear sandwich, and so your capacity is equal to that of just a
+
+32
+00:02:05,500 --> 00:02:10,080
+linear classifier. So those activation functions are really critical to have
+
+33
+00:02:10,080 --> 00:02:13,880
+in between; they are the ones that give you all this wiggle that you can use
+
+34
+00:02:13,879 --> 00:02:17,490
+to actually fit your data. We talked briefly about the preprocessing
+
+35
+00:02:17,490 --> 00:02:21,860
+techniques - but very briefly - and we also looked at the activation functions and
+
+36
+00:02:21,860 --> 00:02:24,830
+their distributions throughout the neural network, and so the problem here, if you
+
+37
+00:02:24,830 --> 00:02:31,370
+recall, is we have to choose these initial weights, and in particular the
+
+38
+00:02:31,370 --> 00:02:34,930
+scale: how large you want those weights to be in the beginning. And we saw
+
+39
+00:02:34,930 --> 00:02:38,260
+that if those weights are too small, then your activations in a neural
+
+40
+00:02:38,259 --> 00:02:41,909
+network, as you go deeper, go toward zero; and if you set that scale
+
+41
+00:02:41,909 --> 00:02:45,129
+slightly too high, then all of them will explode instead. And so you end up with
+
+42
+00:02:45,129 --> 00:02:48,939
+either super-saturated networks, or you end up with networks that output all
+
+43
+00:02:48,939 --> 00:02:54,189
+zeros, and so that scale is a very, very tricky thing to set. We looked into the Xavier
+
+44
+00:02:54,189 --> 00:02:59,579
+initialization, which gives you a reasonable kind of formula to use, in that
+
+45
+00:02:59,580 --> 00:03:03,290
+form, and that gives you basically roughly good activations, or
+
+46
+00:03:03,289 --> 00:03:06,459
+distributions of activations, throughout the network in the beginning of training.
+
+47
+00:03:06,459 --> 00:03:10,959
+And then we went into batch normalization, which is this thing that alleviates a lot
+
+48
+00:03:10,959 --> 00:03:14,120
+of these headaches with actually setting that scale properly, and so batch
+
+49
+00:03:14,120 --> 00:03:16,689
+normalization makes this a much more robust choice, so you don't have to
+
+50
+00:03:16,689 --> 00:03:20,550
+precisely get that initial scale correct, and we went through all of its pros and cons,
+
+51
+00:03:20,550 --> 00:03:23,620
+and we talked about that for a while. And then we talked about the learning
+
+52
+00:03:23,620 --> 00:03:26,920
+process, where I tried to show you some tips and tricks for how you actually baby-
+
+53
+00:03:26,919 --> 00:03:29,809
+sit these neural networks, how you get them to train properly, and also how you
+
+54
+00:03:29,810 --> 00:03:34,860
+run cross-validations and how you slowly, over time, hone in on the hyperparameters that work. So
+
+55
+00:03:34,860 --> 00:03:37,769
+we talked about all that last time, so this time we're going to go into some of
+
+56
+00:03:37,769 -->
particular parameter up the + +57 +00:03:41,060 --> 00:03:44,989 +schemes I think most part and then we'll +talk a bit about my l'ensemble dropout + +58 +00:03:44,989 --> 00:03:49,480 +and so on so before I dive into that any +administrative things my way that I'm + +59 +00:03:49,479 --> 00:03:53,509 +forgetting not necessarily so + +60 +00:03:53,509 --> 00:03:58,030 +primary updates because there's a +process to training a neural network and + +61 +00:03:58,030 --> 00:04:01,199 +this is a pseudocode really in what it +looks like that about you violate the + +62 +00:04:01,199 --> 00:04:04,419 +law severely the gradient and performer +primary update when I talk about + +63 +00:04:04,419 --> 00:04:08,030 +parameter updates were specifically +looking at this last line in here where + +64 +00:04:08,030 --> 00:04:12,129 +we are trying to make that more complex +where so right now what we're doing in + +65 +00:04:12,129 --> 00:04:17,129 +school just reading the st. where we +take that break into my computer and we + +66 +00:04:17,129 --> 00:04:21,639 +just multiply it scaled by the learning +rate on to our primary factor we can be + +67 +00:04:21,639 --> 00:04:23,159 +much more elaborate with how we + +68 +00:04:23,160 --> 00:04:27,960 +on that date and so I flash this image +briefly in the last few lectures where + +69 +00:04:27,959 --> 00:04:30,759 +you can see different parameter update +schemes and how quickly they actually + +70 +00:04:30,759 --> 00:04:35,129 +optimize this simple loss function here +and so in particular can see that STD + +71 +00:04:35,129 --> 00:04:38,550 +which is what we're using right now in +the fourth line here that's a speedy and + +72 +00:04:38,550 --> 00:04:41,710 +read to you can see that that's actually +the slowest one of all of them so + +73 +00:04:41,709 --> 00:04:45,139 +practice you rarely ever use just basic +custody and are better schemes that we + +74 +00:04:45,139 --> 00:04:48,979 +can use we're going to go into those in +the structure so let's look at what the + +75 +00:04:48,980 --> 00:04:54,810 +problem is with Sgt why is it so slow so +consider this particular slightly + +76 +00:04:54,810 --> 00:04:58,589 +contrived example here where we have a +loss function surface level sets of our + +77 +00:04:58,589 --> 00:05:02,099 +loss as opposed to elevated long one +direction much more than another + +78 +00:05:02,100 --> 00:05:05,500 +direction so basically this loss +function here is very shallow + +79 +00:05:05,500 --> 00:05:10,199 +horizontally but very steep vertically +and we want to of course minimize this + +80 +00:05:10,199 --> 00:05:13,469 +and right now we're at the Rex Baltimore +trying to get to the minimum denoted by + +81 +00:05:13,470 --> 00:05:19,240 +the smiley face that's where we're happy +but think about what's the trajectory of + +82 +00:05:19,240 --> 00:05:22,980 +this is both X&Y directions + +83 +00:05:22,980 --> 00:05:30,650 +judy if we try to optimize this +landscape with that look like so what + +84 +00:05:30,649 --> 00:05:35,729 +would it look like horizontally and +vertically I see someone's butt so what + +85 +00:05:35,730 --> 00:05:43,540 +are you planning out there and why is it +so I'm going to bounce up and down like + +86 +00:05:43,540 --> 00:05:52,030 +that and why is it not making a lot of +progress right is basically has this + +87 +00:05:52,029 --> 00:05:56,969 +forum where when we look at the gradient +horizontally we see that the radiant is + +88 +00:05:56,970 --> 00:06:00,680 +very small because this is a shallow 
+67
+00:04:21,639 --> 00:04:23,159
+We can be much more elaborate with how we
+
+68
+00:04:23,160 --> 00:04:27,960
+perform that update. And so I flashed this image briefly in the last few lectures, where
+
+69
+00:04:27,959 --> 00:04:30,759
+you can see different parameter update schemes and how quickly they actually
+
+70
+00:04:30,759 --> 00:04:35,129
+optimize this simple loss function here. And so in particular you can see that SGD,
+
+71
+00:04:35,129 --> 00:04:38,550
+which is what we're using right now - that's SGD in
+
+72
+00:04:38,550 --> 00:04:41,710
+red - you can see that that's actually the slowest one of all of them. So in
+
+73
+00:04:41,709 --> 00:04:45,139
+practice you rarely ever use just basic SGD; there are better schemes that we
+
+74
+00:04:45,139 --> 00:04:48,979
+can use, and we're going to go into those in this lecture. So let's look at what the
+
+75
+00:04:48,980 --> 00:04:54,810
+problem is with SGD: why is it so slow? So consider this particular, slightly
+
+76
+00:04:54,810 --> 00:04:58,589
+contrived example here, where we have a loss function surface, the level sets of our
+
+77
+00:04:58,589 --> 00:05:02,099
+loss, that is elongated along one direction much more than the other
+
+78
+00:05:02,100 --> 00:05:05,500
+direction. So basically this loss function here is very shallow
+
+79
+00:05:05,500 --> 00:05:10,199
+horizontally but very steep vertically, and we want, of course, to minimize this,
+
+80
+00:05:10,199 --> 00:05:13,469
+and right now we're at the red X, trying to get to the minimum denoted by
+
+81
+00:05:13,470 --> 00:05:19,240
+the smiley face - that's where we're happy. But think about what the trajectory of
+
+82
+00:05:19,240 --> 00:05:22,980
+SGD looks like in both the x and y directions
+
+83
+00:05:22,980 --> 00:05:30,650
+if we try to optimize over this landscape. What would that look like? So what
+
+84
+00:05:30,649 --> 00:05:35,729
+would it look like horizontally and vertically? I see some hands. So what
+
+85
+00:05:35,730 --> 00:05:43,540
+are you pointing out there? Right: it's going to bounce up and down like
+
+86
+00:05:43,540 --> 00:05:52,030
+that. And why is it not making a lot of progress? It basically has this
+
+87
+00:05:52,029 --> 00:05:56,969
+form where, when we look at the gradient horizontally, we see that the gradient is
+
+88
+00:05:56,970 --> 00:06:00,680
+very small, because this is a shallow function horizontally, but we have a
+
+89
+00:06:00,680 --> 00:06:03,439
+large gradient vertically, because it's a very steep function. And so what's going to happen
+
+90
+00:06:03,439 --> 00:06:06,389
+when you run SGD in these kinds of cases is you end up with this
+
+91
+00:06:06,389 --> 00:06:10,250
+kind of pattern where you're going way too slow in the horizontal direction but
+
+92
+00:06:10,250 --> 00:06:13,300
+way too fast in the vertical direction, and so you end up with this jitter
+
+93
+00:06:13,300 --> 00:06:17,918
+here. So one way of remedying this kind of situation is what we call the momentum
+
+94
+00:06:17,918 --> 00:06:22,189
+update. So the momentum update changes our update in the following way:
+
+95
+00:06:22,189 --> 00:06:25,319
+right now we're just taking the gradient
+
+96
+00:06:25,319 --> 00:06:28,409
+and we're incrementing our current position by the
+
+97
+00:06:28,410 --> 00:06:34,220
+gradient in the update. Instead, we're going to take the gradient that we computed, and
+
+98
+00:06:34,220 --> 00:06:36,449
+instead of incrementing the position directly,
+
+99
+00:06:36,449 --> 00:06:40,840
+we're going to increment this variable v, which I call v for velocity;
+
+100
+00:06:40,839 --> 00:06:44,049
+we're going to see why that is in a bit. So we increment this
+
+101
+00:06:44,050 --> 00:06:48,020
+velocity variable v, and we're basically building up this
+
+102
+00:06:48,019 --> 00:06:53,278
+exponentially decaying sum of gradients from the past, and that's what increments the position.
+
+103
+00:06:53,278 --> 00:06:58,610
+This mu here is a hyperparameter, and mu is a number between 0 and 1,
+
+104
+00:06:58,610 --> 00:07:03,629
+and what we're doing is decaying the previous v and adding on the current gradient. So
+
+105
+00:07:03,629 --> 00:07:07,180
+what's nice about the momentum update is you can interpret it in very physical
+
+106
+00:07:07,180 --> 00:07:14,310
+terms, in the following way: basically, using the momentum update corresponds to
+
+107
+00:07:14,310 --> 00:07:18,899
+interpreting this loss surface as a landscape with a ball rolling around on it,
+
+108
+00:07:18,899 --> 00:07:22,459
+and the gradient in this case is the force that the particle is
+
+109
+00:07:22,459 --> 00:07:26,408
+feeling. So this particle is feeling some force due to the gradient, and instead of
+
+110
+00:07:26,408 --> 00:07:31,158
+directly integrating the position, this force - in physics, force is equivalent
+
+111
+00:07:31,158 --> 00:07:36,019
+to acceleration - so acceleration is what we're computing, and
+
+112
+00:07:36,019 --> 00:07:39,938
+the velocity gets integrated by the acceleration here, and then the mu times
+
+113
+00:07:39,939 --> 00:07:43,039
+v has the interpretation of friction, because at every single
+
+114
+00:07:43,038 --> 00:07:47,759
+iteration we're slightly slowing down; and intuitively, if this mu times v was not
+
+115
+00:07:47,759 --> 00:07:51,550
+there, then this ball would never come to rest: it would just roll around the loss
+
+116
+00:07:51,550 --> 00:07:54,509
+surface forever, with no loss of energy, and it would never settle at
+
+117
+00:07:54,509 --> 00:07:58,158
+a minimum of the loss function. And so the momentum update is taking this
+
+118
+00:07:58,158 --> 00:08:01,810
+physical interpretation of optimization, where we have a ball rolling around
+
+119
+00:08:01,810 --> 00:08:08,249
+and slowing down over time.
+
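The momentum update just described, as a sketch replacing the last line of the loop above; v starts at zeros and mu is the friction-like hyperparameter (typical values are discussed just below):

~~~python
v = mu * v - learning_rate * dx  # integrate velocity: decay (friction) + force
x += v                           # integrate position by the velocity
~~~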
+120
+00:08:08,249 --> 00:08:11,669
+So the way this works - and what's very nice about this update - is you end up building
+
+121
+00:08:11,668 --> 00:08:14,959
+up this velocity, and in particular in the shallow directions: it's very easy to see that if you have a shallow but
+
+122
+00:08:14,959 --> 00:08:18,449
+consistent direction, then the momentum update will slowly build up the velocity
+
+123
+00:08:18,449 --> 00:08:21,360
+vector in that direction, and you end up speeding up more and more across the shallow
+
+124
+00:08:21,360 --> 00:08:24,999
+direction. But in very steep directions, what's going to happen is you start off
+
+125
+00:08:24,999 --> 00:08:28,919
+oscillating, but then you're always being pulled back the other
+
+126
+00:08:28,918 --> 00:08:32,429
+direction, toward the center, and with the damping you end up kind of oscillating to
+
+127
+00:08:32,429 --> 00:08:36,338
+the middle; so it's damping these oscillations in the steep directions
+
+128
+00:08:36,339 --> 00:08:41,139
+and it's encouraging progress along consistent
+
+129
+00:08:41,139 --> 00:08:44,889
+shallow directions, and that's why it ends up improving the convergence in
+
+130
+00:08:44,889 --> 00:08:49,600
+many cases. So for example, here in this visualization the momentum update is in
+
+131
+00:08:49,600 --> 00:08:53,459
+green, and you can see what happens with the green one:
+
+132
+00:08:53,458 --> 00:08:57,008
+it overshoots, because it built up all this velocity;
+
+133
+00:08:57,009 --> 00:09:00,909
+it overshoots the minimum, but then it eventually ends up converging, and
+
+134
+00:09:00,909 --> 00:09:04,169
+of course it overshot, but once it converges there, you can see that it's
+
+135
+00:09:04,169 --> 00:09:07,879
+converging much quicker than the basic SGD update. So you end up
+
+136
+00:09:07,879 --> 00:09:11,230
+building up too much speed, but you eventually get there quicker than if
+
+137
+00:09:11,230 --> 00:09:17,110
+you did not have the velocity. We'll get to a
+
+138
+00:09:17,110 --> 00:09:20,430
+particular variation of the momentum update in a bit; I just wanted to ask for
+
+139
+00:09:20,429 --> 00:09:34,289
+questions about the momentum update. [question] Mu is a single number, a hyperparameter, and
+
+140
+00:09:34,289 --> 00:09:40,078
+usually it takes values of roughly 0.5 to 0.9, and sometimes -
+
+141
+00:09:40,078 --> 00:09:43,219
+it's not super common - but people sometimes anneal it from 0.5 to 0.99
+
+142
+00:09:43,220 --> 00:09:54,200
+slowly over time; but it's just a single number.
+
+143
+00:09:54,200 --> 00:09:57,180
+[question] Yes, so you could avoid those oscillations with a smaller learning rate, but then the issue
+
+144
+00:09:57,179 --> 00:10:03,000
+is that a smaller learning rate is applied globally, to all directions of
+
+145
+00:10:03,000 --> 00:10:06,070
+the gradient, and so then you would basically make no progress in the
+
+146
+00:10:06,070 --> 00:10:09,390
+horizontal direction, right: you wouldn't get as much bouncing, but then it would take you
+
+147
+00:10:09,389 --> 00:10:12,710
+forever to go horizontally if you use a small learning rate, so there's a trade-off
+
+148
+00:10:12,710 --> 00:10:25,350
+there. [question] So the question is how to initialize
+
+149
+00:10:25,350 --> 00:10:29,050
+the velocity: you usually use zero, and it doesn't matter too much, because you end up
+
+150
+00:10:29,049 --> 00:10:32,490
+building it up in the first few steps; and if you
+
+151
+00:10:32,490 --> 00:10:35,480
+expand out this recurrence, you'll see that basically it's an exponentially
+
+152
+00:10:35,480 --> 00:10:39,330
+decaying sum of your previous gradients, and so the initialization washes out after
+
+153
+00:10:39,330 --> 00:10:46,020
+the first several steps. So, a particular variation of momentum is something called
+
+154
+00:10:46,019 --> 00:10:53,449
+Nesterov momentum, or Nesterov accelerated gradient, and the idea here is: we have the
+
+155
+00:10:53,450 --> 00:10:57,550
+ordinary momentum equation here, and the way to think about it is that your
+
+156
+00:10:57,549 --> 00:10:59,789
+x is incremented by really two parts:
+
+157
+00:10:59,789 --> 00:11:03,279
+there's a part where you've built up some momentum in a particular direction - so
+
+158
+00:11:03,279 --> 00:11:06,799
+that's the momentum step in green, that's the mu times v, and that's where the
+
+159
+00:11:06,799 --> 00:11:09,959
+momentum is currently trying to carry you - and then you have the second
+
+160
+00:11:09,960 --> 00:11:12,610
+contribution, from the gradient: the gradient is pulling you this way, towards
+
+161
+00:11:12,610 --> 00:11:17,450
+the decrease of the loss function, and the actual step ends up being the vector sum
+
+162
+00:11:17,450 --> 00:11:21,350
+of the two; so the blue, what you end up with, is just the green plus the red.
+
+163
+00:11:21,350 --> 00:11:24,840
+And the idea of Nesterov momentum - and this ends up working better in practice -
+
+164
+00:11:24,840 --> 00:11:29,629
+is the following: we know at this point, regardless of what the current gradient turns out
+
+165
+00:11:29,629 --> 00:11:33,439
+to be - so we haven't computed the gradient yet - we know that we've built up some
+
+166
+00:11:33,440 --> 00:11:37,240
+momentum and we know we're definitely going to take this green direction, OK? So
+
+167
+00:11:37,240 --> 00:11:41,220
+we're definitely going to take this green step. So instead of evaluating the gradient at our
+
+168
+00:11:41,220 --> 00:11:45,310
+current spot, Nesterov momentum wants to look ahead, and instead
+
+169
+00:11:45,309 --> 00:11:49,379
+evaluates the gradient at this point, the point at the tip of the arrow. So
+
+170
+00:11:49,379 --> 00:11:53,679
+what you end up with is the following difference: we know we're going to
+
+171
+00:11:53,679 --> 00:11:57,089
+go this way anyway, so why not just look ahead to that part of the
+
+172
+00:11:57,090 --> 00:12:00,420
+objective and evaluate the gradient at that point? And of course your
+
+173
+00:12:00,419 --> 00:12:02,309
+gradient is going to be slightly different, because you're at a different
+
+174
+00:12:02,309 --> 00:12:05,669
+position in the loss function, and this one step of look-ahead gives you a slightly better
+
+175
+00:12:05,669 --> 00:12:06,259
+direction
+
+176
+00:12:06,259 --> 00:12:11,109
+over there and gets you a slightly different update. Now, you can
+
+177
+00:12:11,109 --> 00:12:14,379
+theoretically show that this actually enjoys better theoretical guarantees on
+
+178
+00:12:14,379 --> 00:12:18,069
+convergence rates, and not only is it true in theory, but also in practice it
+
+179
+00:12:18,068 --> 00:12:23,068
+almost always works better than plain momentum. OK, so the difference, roughly,
+
+180
+00:12:23,068 --> 00:12:28,358
+is the following: here I've written it in math notation instead of code, but we still
+
+181
+00:12:28,359 --> 00:12:29,589
+have the velocity:
+
+182
+00:12:29,589 --> 00:12:33,089
+mu times the previous velocity vector, plus the gradient that you're currently
+
+183
+00:12:33,089 --> 00:12:37,629
+evaluating; and then we do an update here. And so in the Nesterov update, the only
+
+184
+00:12:37,629 --> 00:12:41,720
+difference is we're appending here this mu times the previous v: when we
+
+185
+00:12:41,720 --> 00:12:44,949
+evaluate the gradient, we evaluate it at a slightly different position, at this
+
+186
+00:12:44,948 --> 00:12:48,278
+look-ahead position. And so that's really Nesterov momentum; it almost
+
+187
+00:12:48,278 --> 00:12:51,698
+always works better. Now, there's a slight technicality here, which I don't
+
+188
+00:12:51,698 --> 00:12:57,068
+think I'm going to go into too much, but it's slightly inconvenient: the fact that
+
+189
+00:12:57,068 --> 00:13:00,418
+normally we think about just doing a forward and a backward pass, so what we end
+
+190
+00:13:00,418 --> 00:13:04,288
+up with is we have a parameter vector theta and the gradient at that point; but
+
+191
+00:13:04,288 --> 00:13:09,088
+Nesterov wants us to have the parameters and the gradient at a
+
+192
+00:13:09,089 --> 00:13:12,600
+different point, so it doesn't quite fit with a simple API where you only
+
+193
+00:13:12,600 --> 00:13:16,019
+have the gradient at your current parameters in your code. It turns out that there's a way - and I don't want to
+
+194
+00:13:16,019 --> 00:13:19,899
+spend too much time on this - but there's a way to basically do a variable
+
+195
+00:13:19,899 --> 00:13:23,379
+transform: you define a look-ahead variable phi, do some rearrangement, and then you get
+
+196
+00:13:23,379 --> 00:13:26,079
+something that looks much more like a vanilla update that you can just
+
+197
+00:13:26,078 --> 00:13:29,538
+swap in for your momentum update, because you end up with
+
+198
+00:13:29,538 --> 00:13:34,119
+only needing the gradient at phi, and you update phi; and this phi is
+
+199
+00:13:34,119 --> 00:13:35,209
+really the look-ahead
+
+200
+00:13:35,208 --> 00:13:38,159
+version of the parameters instead of the raw parameter vector. That's
+
+201
+00:13:38,159 --> 00:13:40,608
+just a technicality; you can go into the notes to check this out.
+
+202
+00:13:40,609 --> 00:13:46,709
+OK, so here Nesterov accelerated gradient is in magenta, and you can see the
+
+203
+00:13:46,708 --> 00:13:50,208
+original momentum here overshoots quite a lot, but because Nesterov accelerated
+
+204
+00:13:50,208 --> 00:13:53,958
+momentum has this one step of look-ahead, you'll see that it curls around much more
+
+205
+00:13:53,958 --> 00:13:57,738
+quickly, and that's because all these tiny contributions of a slightly better
+
+206
+00:13:57,739 --> 00:14:01,619
+gradient, at where you're about to be, end up adding up, and you almost always
+
+207
+00:14:01,619 --> 00:14:08,600
+converge faster. So that's Nesterov. So, until recently, SGD with momentum was the
+
+208
+00:14:08,600 --> 00:14:11,329
+standard default way of training convolutional networks, and many people
+
+209
+00:14:11,328 --> 00:14:14,658
+still train using just the momentum update; this is a common thing to see in
+
+210
+00:14:14,658 --> 00:14:17,610
+practice. And even better is Nesterov,
+
+211
+00:14:17,610 --> 00:14:20,990
+so NAG here stands for Nesterov accelerated gradient.
+
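A sketch of the Nesterov update in its raw, look-ahead form, before the variable transform mentioned above; evaluate_gradient_at is a hypothetical helper:

~~~python
x_ahead = x + mu * v                      # where momentum will carry us anyway
dx_ahead = evaluate_gradient_at(x_ahead)  # gradient at the look-ahead position
v = mu * v - learning_rate * dx_ahead
x += v
~~~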
+215
+00:14:50,350 --> 00:14:53,670
+That's actually not the correct way to look
+at it; it's only a correct picture to have
+
+216
+00:14:53,669 --> 00:14:56,278
+in your mind when you have very small
+neural networks. People used to think
+
+217
+00:14:56,278 --> 00:14:59,769
+that local minima are an issue in
+optimizing these networks, but it actually turns
+
+218
+00:14:59,769 --> 00:15:04,269
+out, with a lot of recent theoretical
+work, that as you scale up your models,
+
+219
+00:15:04,269 --> 00:15:10,740
+these local minima become less and
+less of an issue. So the picture to
+
+220
+00:15:10,740 --> 00:15:14,389
+have in mind is: there are lots of local
+minima, but they're all at about the same
+
+221
+00:15:14,389 --> 00:15:18,958
+actual loss — that's a better way to
+look at it. So these loss functions of neural
+
+222
+00:15:18,958 --> 00:15:22,078
+networks in practice end up
+looking much more like a bowl
+
+223
+00:15:22,078 --> 00:15:25,599
+instead of the crazy ravine landscape,
+and you can show that as you scale up
+
+224
+00:15:25,600 --> 00:15:28,360
+the neural network, the difference
+between your worst and your best
+
+225
+00:15:28,360 --> 00:15:29,259
+local minima
+
+226
+00:15:29,259 --> 00:15:32,448
+actually kind of shrinks down,
+and some researchers argue that
+
+227
+00:15:32,448 --> 00:15:36,120
+basically there are no bad local minima —
+those only happen in very small networks.
+
+228
+00:15:36,120 --> 00:15:41,409
+And in fact, in practice, what you find
+is that if you initialize with different
+
+229
+00:15:41,409 --> 00:15:44,610
+random initializations, you almost always end
+up getting the same answer, like the same
+
+230
+00:15:44,610 --> 00:15:48,009
+loss in the end; so there are no
+bad local minima that you
+
+231
+00:15:48,009 --> 00:15:57,429
+land in, especially when you have big
+networks. [student question]
+
+232
+00:15:57,429 --> 00:16:10,849
+You're asking whether Nesterov has an
+oscillating feature — which part?
+
+233
+00:16:10,850 --> 00:16:14,819
+OK, I think you're jumping ahead, maybe by
+several slides; we're going to go into
+
+234
+00:16:14,818 --> 00:16:19,849
+second-order methods in a bit. Okay, let
+me jump into another update that is very
+
+235
+00:16:19,850 --> 00:16:23,069
+common to see in practice. It's called
+Adagrad, and it was originally developed
+
+236
+00:16:23,068 --> 00:16:25,969
+in the convex optimization literature, and
+then it was kind of ported over to
+
+237
+00:16:25,970 --> 00:16:30,019
+neural networks, and people sometimes use
+it. So the Adagrad update looks as
+
+238
+00:16:30,019 --> 00:16:30,560
+follows:
+
+239
+00:16:30,559 --> 00:16:35,619
+we have this update as we normally see in
+basic stochastic gradient descent
+
+240
+00:16:35,620 --> 00:16:37,500
+here, learning rate times the
+
+241
+00:16:37,500 --> 00:16:42,259
+gradient, but now we're scaling this
+gradient by this additional variable
+
+242
+00:16:42,259 --> 00:16:47,589
+that we keep accumulating. Note here that
+this cache which we're building up is
+
+243
+00:16:47,589 --> 00:16:52,199
+the sum of gradients squared; this cache
+contains positive numbers only,
+
+244
+00:16:52,198 --> 00:16:55,599
+and note that the cache variable here is
+a giant vector of the same size as your
+
+245
+00:16:55,600 --> 00:17:00,730
+parameter vector, and so this cache ends up
+building up per dimension: we're
+
+246
+00:17:00,730 --> 00:17:03,839
+keeping track of the sum of squares of
+the gradients, or, as we sometimes like to
+247
+00:17:03,839 --> 00:17:07,679
+call it, the second moment of the gradient,
+the uncentered second moment. And so we keep
+
+248
+00:17:07,679 --> 00:17:12,409
+building up this cache, and then we divide,
+elementwise, this step here by the
+
+249
+00:17:12,409 --> 00:17:21,709
+square root of the cache. And so what ends up
+happening here — that's the reason that
+
+250
+00:17:21,709 --> 00:17:26,189
+people call it a per-parameter
+adaptive learning rate method — is that
+
+251
+00:17:26,189 --> 00:17:31,090
+every single parameter, every single
+dimension of your parameter space, now
+
+252
+00:17:31,089 --> 00:17:34,569
+has its own kind of learning rate
+that is scaled dynamically based on what
+
+253
+00:17:34,569 --> 00:17:39,079
+kinds of gradients it's seeing, in terms
+of their scale. So, with this
+
+254
+00:17:39,079 --> 00:17:42,859
+interpretation, what happens with
+Adagrad in this particular case — if we
+
+255
+00:17:42,859 --> 00:17:47,019
+do this, what happens in the horizontal
+and vertical direction with this kind of
+
+256
+00:17:47,019 --> 00:17:51,359
+dynamics?
+
+257
+00:17:51,359 --> 00:18:03,789
+What you'll see is: we have a large
+gradient vertically, and that large
+
+258
+00:18:03,789 --> 00:18:07,259
+gradient will be added to the cache, and
+then we end up dividing by larger and
+
+259
+00:18:07,259 --> 00:18:11,359
+larger numbers, so we'll get smaller and
+smaller updates in the vertical direction. So,
+
+260
+00:18:11,359 --> 00:18:14,798
+since we're seeing lots of large gradients
+vertically, this will decay the learning
+
+261
+00:18:14,798 --> 00:18:18,859
+rate and we'll make smaller and smaller
+steps in the vertical direction. But in
+
+262
+00:18:18,859 --> 00:18:22,009
+the horizontal direction, it's a very
+shallow direction, so we end up with
+
+263
+00:18:22,009 --> 00:18:25,750
+smaller numbers in the denominator, and
+you'll see that, relative to the y
+
+264
+00:18:25,750 --> 00:18:29,058
+dimension, we're going to end up making
+faster progress. So we have this equalizing
+
+265
+00:18:29,058 --> 00:18:35,058
+effect of accounting for the steepness
+in steep or shallow directions: you
+
+266
+00:18:35,058 --> 00:18:40,319
+end up with a much larger effective learning
+rate in the shallow direction than in the vertical
+
+267
+00:18:40,319 --> 00:18:48,048
+direction. But there's one problem
+with Adagrad: think about what
+
+268
+00:18:48,048 --> 00:18:53,009
+happens to the step size over time.
+If we want to
+
+269
+00:18:53,009 --> 00:18:55,900
+train an entire deep neural network, that
+takes a long time, and we're
+
+270
+00:18:55,900 --> 00:19:01,970
+training this thing for a long time — what's
+going to happen in Adagrad over the course of training? So
+
+271
+00:19:01,970 --> 00:19:05,169
+your cache ends up building up all the
+time: you add all these positive numbers, it
+
+272
+00:19:05,169 --> 00:19:09,100
+goes into the denominator, your step size
+literally just decays to zero, and you end up stopping
+
+273
+00:19:09,099 --> 00:19:14,579
+learning completely. And
+that's OK in convex problems —
+
+274
+00:19:14,579 --> 00:19:17,970
+perhaps we just have a bowl and you just kind
+of decay down to the optimum and you're
+
+275
+00:19:17,970 --> 00:19:21,919
+done — but a neural network is more like a
+dynamic system shuttling around; it's
+
+276
+00:19:21,919 --> 00:19:24,549
+continually adapting based on the data —
+that's a better way to think of it — and so this
+
+277
+00:19:24,548 --> 00:19:28,329
+thing needs continuous energy to keep
+learning, and you don't want it to decay to a halt.
+
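A minimal numpy sketch of the Adagrad update just described, reusing the same toy objective as before (again an illustrative assumption, not from the lecture):

~~~python
import numpy as np

def grad(x):  # toy gradient (stand-in objective)
    return np.array([1.0, 50.0]) * x

learning_rate = 0.1
x = np.array([1.0, 1.0])
cache = np.zeros_like(x)  # per-parameter running sum of squared gradients

for step in range(100):
    dx = grad(x)
    cache += dx ** 2  # only ever grows...
    # ...so each parameter's effective step shrinks over time, and the
    # steep (large-gradient) dimension is damped relative to the shallow one;
    # the small constant below just prevents division by zero.
    x -= learning_rate * dx / (np.sqrt(cache) + 1e-7)
~~~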
+278
+00:19:28,329 --> 00:19:33,009
+So there's a very simple change
+to Adagrad that was
+
+279
+00:19:33,009 --> 00:19:37,829
+proposed by Geoff Hinton recently, and the
+idea here is that instead of keeping
+
+280
+00:19:37,829 --> 00:19:42,289
+the complete sum of squares in every
+dimension, as I was mentioning, we make that
+
+281
+00:19:42,289 --> 00:19:46,250
+counter a leaky counter. So instead we
+end up with this decay-rate hyper-
+
+282
+00:19:46,250 --> 00:19:52,500
+parameter, which we set to something like
+0.99; we still sum squares, but the sum of squares is
+
+283
+00:19:52,500 --> 00:19:57,750
+leaking slowly. And that's OK: we
+still maintain this nice equalizing
+
+284
+00:19:57,750 --> 00:20:01,569
+effect on the step sizes in
+steep or shallow directions,
+
+285
+00:20:01,569 --> 00:20:05,869
+but we're not going to just converge
+completely to zero updates. So that's RMS-
+
+286
+00:20:05,869 --> 00:20:10,299
+Prop. A fun piece of historical context about
+RMSProp is the way it was
+
+287
+00:20:10,299 --> 00:20:11,430
+introduced to us:
+
+288
+00:20:11,430 --> 00:20:14,340
+you'd think that it would be a paper that
+proposed this method, but in fact it was
+
+289
+00:20:14,339 --> 00:20:18,789
+a slide in Geoff Hinton's Coursera
+class just a few years ago. So Geoff
+
+290
+00:20:18,789 --> 00:20:22,240
+was giving this Coursera class and
+flashed a slide of, like, "this is
+
+291
+00:20:22,240 --> 00:20:25,630
+unpublished, but it usually works well
+in practice, do this" — and it's
+
+292
+00:20:25,630 --> 00:20:29,920
+basically RMSProp. And so I
+implemented it, and I saw better
+
+293
+00:20:29,920 --> 00:20:34,060
+results on my optimization right away,
+and I thought that was really funny. And
+
+294
+00:20:34,059 --> 00:20:37,769
+so, in fact, in papers — not only my
+papers but many other papers — people
+
+295
+00:20:37,769 --> 00:20:44,559
+have cited that slide from the Coursera
+class, like "lecture 6, slide such-and-such", for RMS-
+
+296
+00:20:44,559 --> 00:20:48,389
+Prop. Since then there's actually now
+an actual paper, and there are more results
+
+297
+00:20:48,390 --> 00:20:52,300
+on exactly what it's doing and so on,
+but for a while this was really funny.
+
+298
+00:20:52,299 --> 00:20:57,609
+And so, in this optimization picture, we can
+see Adagrad here in blue, and RMS-
+
+299
+00:20:57,609 --> 00:20:58,579
+Prop is this one
+
+300
+00:20:58,579 --> 00:21:02,490
+in black, and we can see that both of
+them converge quite quickly down here
+
+301
+00:21:02,490 --> 00:21:07,519
+this way. In this particular case,
+Adagrad is converging slightly faster than
+
+302
+00:21:07,519 --> 00:21:11,589
+RMSProp, but that's not always the
+case; usually what you see in
+
+303
+00:21:11,589 --> 00:21:15,839
+practice when you train deep neural net-
+works is that Adagrad stops too early, and RMS-
+
+304
+00:21:15,839 --> 00:21:21,329
+Prop usually ends up winning out
+between these methods. Any questions
+
+305
+00:21:21,329 --> 00:21:24,509
+about RMSProp? Go ahead.
+
+306
+00:21:24,509 --> 00:21:55,150
+[student question] The issue is that, in very
+steep directions, you probably don't want to
+
+307
+00:21:55,150 --> 00:21:58,800
+make very fast updates; this method is
+saying to slow yourself down there. So maybe in
+
+308
+00:21:58,799 --> 00:22:02,220
+this particular case you'd like to go
+faster, but you're kind of reading into
+
+309
+00:22:02,220 --> 00:22:05,019
+this particular example, and that's not
+generally true of neural network loss landscapes.
+
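A minimal numpy sketch of the RMSProp update just described — Adagrad's counter made leaky with the decay rate set to 0.99 as in the transcript; the toy objective is again an illustrative stand-in:

~~~python
import numpy as np

def grad(x):  # toy gradient (stand-in objective)
    return np.array([1.0, 50.0]) * x

learning_rate, decay_rate = 0.01, 0.99
x = np.array([1.0, 1.0])
cache = np.zeros_like(x)

for step in range(500):
    dx = grad(x)
    # Leaky counter: old squared gradients slowly leak away, so the
    # equalizing effect remains but steps never decay all the way to zero.
    cache = decay_rate * cache + (1 - decay_rate) * dx ** 2
    x -= learning_rate * dx / (np.sqrt(cache) + 1e-7)
~~~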
+310
+00:22:05,019 --> 00:22:09,940
+In the optimization landscapes that neural
+networks are made of, it's a good strategy to apply
+
+311
+00:22:09,940 --> 00:22:22,930
+in those cases. [student question
+about the epsilon]
+
+312
+00:22:22,930 --> 00:22:25,730
+Oh, by the way, I skipped over this
+epsilon of 1e-7, but you guys can
+
+313
+00:22:25,730 --> 00:22:30,380
+hopefully see that the 1e-7 is there just to
+prevent division by zero; it's
+
+314
+00:22:30,380 --> 00:22:34,550
+a hyperparameter — usually we set it
+to 1e-5, or -6, or -7, or
+
+315
+00:22:34,549 --> 00:22:39,139
+something like that — because in the beginning
+your cache is 0, so otherwise you'd divide by zero.
+
+316
+00:22:39,140 --> 00:22:46,540
+[student question about the learning rate]
+What you get is this adaptive behavior, but the scale
+
+317
+00:22:46,539 --> 00:22:50,420
+of it is still in your control; the
+absolute scale of it is still under your
+
+318
+00:22:50,420 --> 00:22:57,370
+control via the learning rate. This
+adaptive thing
+
+319
+00:22:57,369 --> 00:23:00,989
+you can look at more as a relative thing,
+with respect to the different parameters — how
+
+320
+00:23:00,990 --> 00:23:12,190
+you're equalizing the steps — but the
+absolute global step size is still up to you.
+
+321
+00:23:12,190 --> 00:23:18,710
+[student question] Yes — it's effectively
+doing what you're describing, right,
+
+322
+00:23:18,710 --> 00:23:23,038
+because it ends up forgetting
+gradients from very long ago, and
+
+323
+00:23:23,038 --> 00:23:27,750
+its expression at time t is really
+only a function of the last few
+
+324
+00:23:27,750 --> 00:23:36,480
+gradients, but as an exponentially
+decaying weighted sum. We're going to
+
+325
+00:23:36,480 --> 00:23:43,819
+go into the last update, Adam, in a moment.
+
+326
+00:23:43,819 --> 00:24:03,039
+[student question about keeping a window of
+previous gradients] That would be similar to this exponentially
+
+327
+00:24:03,039 --> 00:24:09,789
+weighted way — so you want a finite window
+on this? I don't think people have really tried it; you could, yeah,
+
+328
+00:24:09,789 --> 00:24:19,889
+but it takes too much memory. When you're
+optimizing neural networks — we'll see that some, for
+
+329
+00:24:19,890 --> 00:24:23,560
+example, have well over a hundred million
+parameters — that's taking up quite a lot of memory, and so
+
+330
+00:24:23,559 --> 00:24:29,659
+you don't want to keep track of 10
+previous gradient vectors as well. Okay, then we're
+
+331
+00:24:29,660 --> 00:24:37,540
+going to go into — sure: what if you combine
+Adagrad and momentum? Thank you for the
+
+332
+00:24:37,539 --> 00:24:45,269
+question, because that's this slide. So
+roughly what's happening is that Adam — this
+
+333
+00:24:45,269 --> 00:24:49,119
+last update, which was actually
+proposed very recently — has
+
+334
+00:24:49,119 --> 00:24:52,959
+elements of both. As you'll notice,
+momentum is keeping track of the
+
+335
+00:24:52,960 --> 00:24:57,190
+first moment of your gradient:
+it sums up the raw gradients,
+
+336
+00:24:57,190 --> 00:25:02,350
+keeping this exponential sum; and Adagrad
+is keeping track of the second
+
+337
+00:25:02,349 --> 00:25:07,869
+moment of the gradient. And what you end
+up with in the Adam update is you end
+
+338
+00:25:07,869 --> 00:25:13,389
+up with a step that's basically —
+yeah, it's kind of like
+
+339
+00:25:13,390 --> 00:25:16,980
+RMSProp with momentum, a bit. So you
+end up with this thing that looks like
+
+340
+00:25:16,980 --> 00:25:21,650
+it's basically keeping track of the
+velocity in a decaying way, and that's
+
+341
+00:25:21,650 --> 00:25:25,420
+your step; but
+then you're also scaling it
+down by this exponentially accumulating,
+
+342
+00:25:25,420 --> 00:25:29,490
+leaky counter of your squared gradients,
+and so you end up with both in the same
+
+343
+00:25:29,490 --> 00:25:36,009
+formula. That's the Adam update, combining
+those two: you're doing both momentum and
+
+344
+00:25:36,009 --> 00:25:41,759
+this adaptive scaling. And let's see —
+here's RMSProp;
+
+345
+00:25:41,759 --> 00:25:44,789
+actually, I should have flashed this
+earlier so we could compare. This
+
+346
+00:25:44,789 --> 00:25:46,339
+is basically RMSProp:
+
+347
+00:25:46,339 --> 00:25:52,079
+the red part is the same thing as here, except
+we've replaced dx, which there was just the
+
+348
+00:25:52,079 --> 00:25:56,220
+gradient at the current step; now we're
+replacing this gradient dx
+
+349
+00:25:56,220 --> 00:25:56,630
+with m,
+
+350
+00:25:56,630 --> 00:26:01,170
+which is this running counter of our dx.
+So if you imagine, for example, one way to
+
+351
+00:26:01,170 --> 00:26:04,090
+look at it is: you're in a stochastic
+setting, you're sampling mini-batches,
+
+352
+00:26:04,089 --> 00:26:07,359
+there's going to be lots of randomness in the
+forward pass, and you get all these noisy
+
+353
+00:26:07,359 --> 00:26:10,990
+gradients. So instead of using the gradient
+at every single time step directly, we're
+
+354
+00:26:10,990 --> 00:26:14,309
+actually going to be using this decaying
+sum of previous gradients, and it can
+
+355
+00:26:14,309 --> 00:26:19,139
+stabilize your gradient direction a bit;
+that's the function of the momentum
+
+356
+00:26:19,140 --> 00:26:23,720
+here. And the scaling here is to make
+sure that the step sizes work out relative
+
+357
+00:26:23,720 --> 00:26:29,940
+to each other in steep and shallow directions.
+[student question about beta1 and beta2] Those are
+
+358
+00:26:29,940 --> 00:26:31,269
+hyperparameters:
+
+359
+00:26:31,269 --> 00:26:36,119
+beta1 is usually 0.9, beta2
+usually 0.995,
+
+360
+00:26:36,119 --> 00:26:42,869
+somewhere around there — so it's a hyperparameter
+to cross-validate. In my own work I found
+
+361
+00:26:42,869 --> 00:26:45,719
+that these are relatively robust
+settings; I don't actually usually
+
+362
+00:26:45,720 --> 00:26:50,690
+end up tweaking them, I just set them to
+those values, but you can play
+
+363
+00:26:50,690 --> 00:27:04,259
+with them a bit, and sometimes it can
+help. [student question] We saw that Nesterov
+
+364
+00:27:04,259 --> 00:27:08,789
+momentum works better — can we do that here?
+Yes, you can; I actually read something
+
+365
+00:27:08,789 --> 00:27:12,849
+about this just yesterday, and it actually wasn't
+a paper, it was a project report from CS229 —
+
+366
+00:27:12,849 --> 00:27:17,149
+someone tried it. I'm not sure
+if there's a paper about it, but you can
+
+367
+00:27:17,150 --> 00:27:20,250
+play with that; it's simply not
+being done here.
+
+368
+00:27:20,250 --> 00:27:25,759
+OK, and one more thing: I have to
+make Adam slightly more complex here,
+
+369
+00:27:25,759 --> 00:27:30,849
+because as you see it, it's incomplete, so let
+me just put in the complete version of Adam.
+
+370
+00:27:30,849 --> 00:27:33,949
+There's one more thing where you might
+be confused when you see it: there's this
+
+371
+00:27:33,950 --> 00:27:38,220
+thing called bias correction inserted
+there, and this bias correction —
+
+372
+00:27:38,220 --> 00:27:40,920
+the reason I'm writing out the loop is
+that the bias correction depends on your
+
+373
+00:27:40,920 --> 00:27:46,940
+absolute time step t, so t is used here,
+and the reason for
+that is — what this is doing is
+kind of a minor point,
+
+374
+00:27:46,940 --> 00:27:49,730
+and I don't want you to be
+confused about it too much, but
+
+375
+00:27:49,730 --> 00:27:54,049
+basically it's compensating
+for the fact that m and v
+
+376
+00:27:54,049 --> 00:27:58,659
+are initialized at zero, so their statistics are
+incorrect in the beginning. And so what it's doing is
+
+377
+00:27:58,660 --> 00:28:01,269
+really just scaling up your m and v in
+
+378
+00:28:01,269 --> 00:28:04,250
+the first few iterations, so you don't
+end up with a very biased
+
+379
+00:28:04,250 --> 00:28:07,359
+estimate of the first and the second
+moment. So don't worry about that
+
+380
+00:28:07,359 --> 00:28:11,279
+too much; this is only
+changing your update in the very first
+
+381
+00:28:11,279 --> 00:28:15,190
+few time steps, as Adam is
+warming up, and it's done in a proper
+
+382
+00:28:15,190 --> 00:28:18,210
+way in terms of the statistics of m and v —
+
+383
+00:28:18,210 --> 00:28:23,380
+I won't go too much into that. OK, so we
+talked about several different updates,
+
+384
+00:28:23,380 --> 00:28:26,710
+and we saw that all these updates still have
+this learning rate hyperparameter in them,
+
+385
+00:28:26,710 --> 00:28:31,279
+and so I just want to briefly talk about
+the fact that all of them still require a
+
+386
+00:28:31,279 --> 00:28:34,369
+learning rate. We saw what happens with
+different settings of learning rates for all
+
+387
+00:28:34,369 --> 00:28:37,639
+of these methods, and the question I'd
+like to pose is: which one of these
+
+388
+00:28:37,640 --> 00:28:47,290
+learning rates is best to use
+
+389
+00:28:47,289 --> 00:28:55,509
+when you're training neural networks? — this
+is a slide about learning rate decay — the
+
+390
+00:28:55,509 --> 00:28:59,819
+trick answer is that none of those
+is the one good learning rate to use. What
+
+391
+00:28:59,819 --> 00:29:04,259
+you should do is use the high
+learning rate first, because it optimizes
+
+392
+00:29:04,259 --> 00:29:07,869
+faster than the good learning rate — you see
+that you make very fast progress — but
+
+393
+00:29:07,869 --> 00:29:10,779
+at some point you're going to be too
+stochastic and you can't converge into
+
+394
+00:29:10,779 --> 00:29:13,829
+your minimum very nicely, because you
+have too much energy in your system and
+
+395
+00:29:13,829 --> 00:29:17,869
+you can't settle down into the narrow, nice
+parts of your loss function. And so what
+
+396
+00:29:17,869 --> 00:29:21,399
+you do then is: you decay your learning rate,
+and then you can kind of ride this
+
+397
+00:29:21,400 --> 00:29:26,269
+envelope of decreasing learning rates and
+do best of all of them. There are many
+
+398
+00:29:26,269 --> 00:29:28,670
+different ways that people decay learning
+rates over time, and you should
+
+399
+00:29:28,670 --> 00:29:32,400
+also decay it in your assignment. There's
+step decay, which is perhaps the
+
+400
+00:29:32,400 --> 00:29:36,810
+simplest one: one epoch
+of training data refers to having
+
+401
+00:29:36,809 --> 00:29:41,619
+seen every single training sample one
+time, so after, say, one epoch you decay
+
+402
+00:29:41,619 --> 00:29:45,219
+the learning rate to 0.9 of itself, or
+something like that. You can also use
+
+403
+00:29:45,220 --> 00:29:49,600
+exponential decay, or 1/t decay —
+there are several of them, and in
+
+404
+00:29:49,599 --> 00:29:54,379
+the notes I expand slightly on some of the
+theoretical properties that people prove
+
+405
+00:29:54,380 --> 00:29:58,260
+about these different decays.
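A minimal numpy sketch of the complete Adam update just described, including the bias correction that depends on the absolute time step t, with beta1 = 0.9 and beta2 = 0.995 as quoted in the transcript; the toy objective and the step-decay comment are illustrative assumptions:

~~~python
import numpy as np

def grad(x):  # toy gradient (stand-in objective)
    return np.array([1.0, 50.0]) * x

learning_rate, beta1, beta2, eps = 1e-2, 0.9, 0.995, 1e-7
x = np.array([1.0, 1.0])
m = np.zeros_like(x)  # first moment: momentum-like decaying sum of gradients
v = np.zeros_like(x)  # second moment: RMSProp-like leaky sum of squares

for t in range(1, 501):  # t starts at 1: it is used by the bias correction
    dx = grad(x)
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx ** 2
    # Bias correction: m and v start at zero, so scale them up while the
    # estimates warm up during the first few steps.
    mt = m / (1 - beta1 ** t)
    vt = v / (1 - beta2 ** t)
    x -= learning_rate * mt / (np.sqrt(vt) + eps)
    # A decay schedule (e.g. step decay) could additionally shrink
    # learning_rate here, say learning_rate *= 0.9 once per epoch.
~~~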
+Unfortunately, not many of
+them apply here, because I think
+
+406
+00:29:58,259 --> 00:30:01,150
+they're mostly from the convex optimization
+literature, and we're dealing with very
+
+407
+00:30:01,150 --> 00:30:05,160
+different objectives; usually, in practice,
+I just use step decay or something like that.
+
+408
+00:30:05,160 --> 00:30:12,330
+Was there a question?
+
+409
+00:30:12,329 --> 00:30:25,259
+[student question about switching between
+schedules during training, not committing to any one of them]
+
+410
+00:30:25,259 --> 00:30:28,470
+Yeah, I don't think that's standard
+at all —
+
+411
+00:30:28,470 --> 00:30:32,990
+an interesting point; I'm not
+sure when you'd want to do that. Yeah,
+
+412
+00:30:32,990 --> 00:30:37,839
+it's not clear to me — you could try it,
+it's something to try in practice. I'd like to
+
+413
+00:30:37,839 --> 00:30:42,079
+make the point that — I find, at least,
+that in practice right now Adam is
+
+414
+00:30:42,079 --> 00:30:46,189
+usually the nice default to go with,
+so I use Adam for everything now, and it
+
+415
+00:30:46,190 --> 00:30:49,840
+seems to work quite well — better than
+momentum or RMSProp or
+
+416
+00:30:49,839 --> 00:30:56,638
+anything like that. So these are all first-
+order methods, as we call them, because they
+
+417
+00:30:56,638 --> 00:31:00,579
+only use the gradient information of
+your loss function: we've evaluated
+
+418
+00:31:00,579 --> 00:31:03,720
+the gradient, so we basically know the
+slope in every single direction, and
+
+419
+00:31:03,720 --> 00:31:05,710
+that's the only thing that we use.
+
+420
+00:31:05,710 --> 00:31:09,600
+There's an entire set of second-order
+methods for optimization that you should
+
+421
+00:31:09,599 --> 00:31:13,168
+be aware of — second-order optimization —
+I don't want to go into too much detail,
+
+422
+00:31:13,169 --> 00:31:17,919
+but they end up forming a larger
+approximation to your loss function:
+
+423
+00:31:17,919 --> 00:31:20,820
+they don't only approximate it with this
+hyperplane of which way it
+
+424
+00:31:20,819 --> 00:31:26,069
+is sloping, but you also approximate it with
+the Hessian, which is telling you how your
+
+425
+00:31:26,069 --> 00:31:29,710
+surface is curving. So you don't only need
+the gradient, you also need the Hessian,
+
+426
+00:31:29,710 --> 00:31:36,808
+and you need to compute that as well. And you
+may have seen the Newton method, say, for
+
+427
+00:31:36,808 --> 00:31:38,500
+example in CS229:
+
+428
+00:31:38,500 --> 00:31:44,190
+the Newton method is basically giving you
+an update where, once you've formed your bowl-
+
+429
+00:31:44,190 --> 00:31:47,259
+like Hessian approximation to your
+objective, you can use this update
+
+430
+00:31:47,259 --> 00:31:54,259
+rule to jump directly to the minimum
+of that approximation. So,
+
+431
+00:31:54,259 --> 00:31:58,490
+what's nice about second-order methods —
+why do people like them and use them,
+
+432
+00:31:58,490 --> 00:32:02,099
+especially the Newton method as
+presented here — what's nice about this
+
+433
+00:32:02,099 --> 00:32:05,399
+update? One thing is faster convergence.
+
+434
+00:32:05,400 --> 00:32:13,410
+And you'll notice: no learning rate, no
+hyperparameter, in this update, OK? And that's
+
+435
+00:32:13,410 --> 00:32:17,220
+because you don't just see your gradient
+in this loss function —
+
+436
+00:32:17,220 --> 00:32:20,480
+you also know the curvature at that
+place — and so if you approximate it with
+
+437
+00:32:20,480 --> 00:32:23,920
+this bowl, you know exactly where to go:
+to the minimum of your approximation. So
+
+438
+00:32:23,920 --> 00:32:26,900
+there's no need for a learning rate; you can
+jump directly to the minimum of that
+
+439
+00:32:26,900 --> 00:32:30,610
+approximating bowl. So that's a very nice
+feature. I think those are the two points I
+
+440
+00:32:30,609 --> 00:32:32,969
+had in mind: you have fast convergence
+because you're using second-order
+
+441
+00:32:32,970 --> 00:32:38,839
+information as well. Now, why is it
+impractical to use this update when
+
+442
+00:32:38,839 --> 00:32:47,069
+training neural networks? The issue, of
+course, is the Hessian: say you have a hundred-
+
+443
+00:32:47,069 --> 00:32:48,500
+million-parameter network —
+
+444
+00:32:48,500 --> 00:32:52,299
+the Hessian is a hundred-million by hundred-
+million matrix, and then you want to invert it;
+
+445
+00:32:52,299 --> 00:32:59,259
+so, good luck with that, this is not going
+to happen. So there are several
+
+446
+00:32:59,259 --> 00:33:02,480
+algorithms that I'd just like you to be
+aware of — you're not going to use them in this
+
+447
+00:33:02,480 --> 00:33:05,650
+class — there's
+something called BFGS, which basically
+
+448
+00:33:05,650 --> 00:33:08,360
+lets you get away with not inverting
+the Hessian: you build up an
+
+449
+00:33:08,359 --> 00:33:11,819
+approximation to the Hessian through
+successive updates that are all rank
+
+450
+00:33:11,819 --> 00:33:15,000
+one, and it kind of builds up the Hessian,
+but you still need to store it
+
+451
+00:33:15,000 --> 00:33:18,279
+in memory, so it's still no good for large
+networks. And then there's something
+
+452
+00:33:18,279 --> 00:33:22,710
+called L-BFGS, short for limited-memory BFGS,
+which does not actually store the full
+
+453
+00:33:22,710 --> 00:33:26,980
+Hessian or its approximation in memory, and
+that's what people use in practice
+
+454
+00:33:26,980 --> 00:33:33,549
+sometimes. Now, L-BFGS you'll see
+mentioned sometimes in the optimization literature,
+
+455
+00:33:33,549 --> 00:33:37,769
+and it works really, really
+well if you have a single, small,
+
+456
+00:33:37,769 --> 00:33:42,450
+deterministic function —
+there's no stochastic noise, there's
+
+457
+00:33:42,450 --> 00:33:47,920
+no stochasticity in it, and everything fits in
+memory — L-BFGS can usually crush such loss
+
+458
+00:33:47,920 --> 00:33:53,200
+functions very easily. But what's tricky
+is to extend L-BFGS to very, very
+
+459
+00:33:53,200 --> 00:33:56,539
+large datasets, and the reason is that
+we're subsampling these mini-batches —
+
+460
+00:33:56,539 --> 00:33:59,730
+because we can't fit all the training
+data into memory, we subsample mini-
+
+461
+00:33:59,730 --> 00:34:02,930
+batches — and then L-BFGS
+works on these mini-batches, and its
+
+462
+00:34:02,930 --> 00:34:06,810
+approximations end up being incorrect
+as you swap in different mini-batches. And
+
+463
+00:34:06,809 --> 00:34:10,449
+you also have to be careful
+with it, in that you have to make
+
+464
+00:34:10,449 --> 00:34:12,539
+sure that you fix the dropout randomness —
+
+465
+00:34:12,539 --> 00:34:17,690
+you have to make sure of that because,
+internally, L-BFGS calls your
+
+466
+00:34:17,690 --> 00:34:20,679
+function many, many different times; it's
+doing all these approximations and line
+
+467
+00:34:20,679 --> 00:34:24,480
+searches and stuff like that, it's a very
+heavy function, and so you have to make
+
+468
+00:34:24,480 --> 00:34:26,668
+sure that when you use it, you disable
+all sources of
+
+469
+00:34:26,668 --> 00:34:29,889
+randomness, because it's really not
+going to like them.
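A toy illustration of the Newton step discussed above, on a 2D quadratic of my own choosing (not from the lecture), showing both the no-learning-rate property and why the Hessian is hopeless at network scale:

~~~python
import numpy as np

# Toy quadratic objective (a stand-in): f(x) = 0.5 * x^T A x,
# so the gradient is A @ x and the Hessian is the constant matrix A.
A = np.array([[3.0, 0.2],
              [0.2, 1.0]])

x = np.array([1.0, -2.0])
g = A @ x     # gradient at x
Hess = A      # Hessian at x (2 x 2 here)

# One Newton step: no learning rate; jump straight to the minimum of the
# local quadratic approximation (exact for a quadratic objective).
x = x - np.linalg.solve(Hess, g)

# For a network with N parameters the Hessian is N x N: at 10**8
# parameters it cannot even be stored, let alone inverted, which is
# why BFGS and L-BFGS build cheap approximations instead.
~~~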
+So basically, in practice, we
+
+470
+00:34:29,889 --> 00:34:33,779
+don't use L-BFGS, because it seems to
+not work really well right
+
+471
+00:34:33,780 --> 00:34:36,970
+now compared to other methods —
+basically there's too much stuff
+
+472
+00:34:36,969 --> 00:34:41,529
+happening, and it's better to just do the
+simple, noisy first-order stuff, but do more of
+
+473
+00:34:41,530 --> 00:34:47,880
+it; that's the trade-off. So, in summary:
+Adam is a good default choice, and if you can
+
+474
+00:34:47,880 --> 00:34:51,570
+afford to do full-batch updates — maybe
+your dataset is not
+
+475
+00:34:51,570 --> 00:34:55,419
+very large and it fits in memory,
+and the forward and backward passes fit in
+
+476
+00:34:55,418 --> 00:35:00,460
+memory — then you can look into L-BFGS,
+but you won't see it used in practice in a
+
+477
+00:35:00,460 --> 00:35:05,220
+large-scale setting right now, although
+it's a research direction. Right,
+
+478
+00:35:05,219 --> 00:35:10,009
+so that concludes my discussion of the
+different parameter updates: decay your
+
+479
+00:35:10,010 --> 00:35:14,830
+learning rates, and we're not going to look
+into L-BFGS in this class. There's
+
+480
+00:35:14,829 --> 00:35:24,739
+a question in the very back?
+
+481
+00:35:24,739 --> 00:35:34,609
+[student question] You're asking: Adagrad, for
+example, automatically decays your
+
+482
+00:35:34,610 --> 00:35:38,510
+learning rate over time, so would you also
+use learning rate decay if you're
+
+483
+00:35:38,510 --> 00:35:41,930
+using Adagrad? So, usually you see
+learning rate decay when you
+
+484
+00:35:41,929 --> 00:35:55,379
+use SGD or momentum; I'm actually not sure
+if you'd use it with Adagrad or Adam. Yeah, that's
+
+485
+00:35:55,380 --> 00:36:04,900
+not a very good answer — you can
+certainly do it, but maybe with Adam it's not
+
+486
+00:36:04,900 --> 00:36:08,910
+as needed: Adam will not just monotonically
+decay your learning rate like Adagrad, because
+
+487
+00:36:08,909 --> 00:36:12,339
+it's leaky. But with Adagrad, also decaying
+the learning rate
+
+488
+00:36:12,340 --> 00:36:15,170
+probably does not make sense, because it's
+decayed automatically to zero in the end.
+
+489
+00:36:15,170 --> 00:36:22,710
+Alright, okay, we're going to go into
+model ensembles. I'd just very briefly like
+
+490
+00:36:22,710 --> 00:36:24,829
+to talk about it, because it's quite
+simple:
+
+491
+00:36:24,829 --> 00:36:28,750
+it turns out that if you train multiple
+independent models on your training data,
+
+492
+00:36:28,750 --> 00:36:32,949
+instead of just a single one, and then
+you average their results at test time, you
+
+493
+00:36:32,949 --> 00:36:39,929
+almost always get 2 percent extra
+performance. OK, so this is not really a theoretical
+
+494
+00:36:39,929 --> 00:36:43,289
+result here; it's kind of an empirical result,
+just what happens in practice:
+
+495
+00:36:43,289 --> 00:36:46,570
+basically, this is a good thing to
+do, it almost always works better.
+
+496
+00:36:46,570 --> 00:36:48,850
+The downside, of course, is that now you have
+to have all these different independent
+
+497
+00:36:48,849 --> 00:36:52,259
+models, and you need to do forward and
+backward passes on all of them, and you
+
+498
+00:36:52,260 --> 00:36:56,850
+have to train all of them, so that's not
+ideal, and presumably you slow down
+
+499
+00:36:56,849 --> 00:37:00,989
+test time linearly with the number of models
+in your ensemble. And so there are some tips
+
+500
+00:37:00,989 --> 00:37:05,689
+and tricks for making ensembles
+cheaper. So one approach, for
+
+501
+00:37:05,690 --> 00:37:08,619
+example, is: as you're training your neural
+network, you have all these different
+
+502
+00:37:08,619 --> 00:37:11,680
+checkpoints — usually you're saving them;
+every single epoch you save a checkpoint
+
+503
+00:37:11,679 --> 00:37:14,750
+and you figure out what your
+validation performance was — so one thing you
+
+504
+00:37:14,750 --> 00:37:18,119
+can do, for example — and it turns out this
+actually helps sometimes — is you
+
+505
+00:37:18,119 --> 00:37:23,420
+just take some different checkpoints of
+your model and you ensemble those, and
+
+506
+00:37:23,420 --> 00:37:26,349
+that actually turns out to sometimes improve
+things a bit. And so that way you don't
+
+507
+00:37:26,349 --> 00:37:29,730
+have to train seven independent models:
+you've trained one, but you ensemble some
+
+508
+00:37:29,730 --> 00:37:34,809
+different checkpoints of it. Related to that,
+there's a trick where —
+
+509
+00:37:34,809 --> 00:37:39,739
+let's see what's happening here: this is
+your update step that we've seen before, and
+
+510
+00:37:39,739 --> 00:37:44,709
+I'm keeping another set of parameters
+here, x_test, and this x_test is a running,
+
+511
+00:37:44,710 --> 00:37:49,590
+exponentially decaying sum of my
+actual parameter vector x, and when I use
+
+512
+00:37:49,590 --> 00:37:52,750
+x_test on validation or test data, it
+turns out that this almost always
+
+513
+00:37:52,750 --> 00:37:57,199
+performs slightly better than using x
+alone. OK, so this is kind of doing a
+
+514
+00:37:57,199 --> 00:38:00,919
+small, weighted ensemble of the
+last few parameter vectors. It's kind
+
+515
+00:38:00,920 --> 00:38:05,309
+of difficult to interpret,
+actually, but basically one way to
+
+516
+00:38:05,309 --> 00:38:08,329
+interpret it — one way I can hand-wave about
+why this is actually a good thing to do —
+
+517
+00:38:08,329 --> 00:38:12,900
+is: think about optimizing your bowl
+function, and you're stepping around
+
+518
+00:38:12,900 --> 00:38:16,849
+your minimum too much; then actually taking
+the average of all those steps gets you
+
+519
+00:38:16,849 --> 00:38:20,980
+closer to the minimum. OK, so that's a hand-wavy
+argument for why this works slightly
+
+520
+00:38:20,980 --> 00:38:25,639
+better. So that's model ensembles, which I had
+to discuss quickly, because we're going to
+
+521
+00:38:25,639 --> 00:38:29,759
+look into dropout, and this is a very
+important technique that you will be
+
+522
+00:38:29,760 --> 00:38:34,590
+using and implementing and so on. So the
+idea of dropout is very interesting:
+
+523
+00:38:34,590 --> 00:38:38,620
+what you do with dropout is, as
+you're doing your forward pass of the
+
+524
+00:38:38,619 --> 00:38:45,429
+neural network, you randomly set some
+neurons to zero in the forward pass. So, just
+
+525
+00:38:45,429 --> 00:38:49,839
+to clarify, what you do is: as you're
+doing a forward pass of your data X, you're
+
+526
+00:38:49,840 --> 00:38:52,670
+computing, say, in this function,
+
+527
+00:38:52,670 --> 00:38:57,010
+your first hidden layer as the
+nonlinearity of W1 times X plus b1, so
+
+528
+00:38:57,010 --> 00:39:02,830
+that's a hidden layer, and then we
+compute here a mask of binary numbers,
+
+529
+00:39:02,829 --> 00:39:05,230
+either 0 or 1, based on whether or not random
+
+530
+00:39:05,230 --> 00:39:09,469
+numbers between 0 and 1 are smaller
+than p, which here we set to 0.5. So
+
+531
+00:39:09,469 --> 00:39:13,469
+this U1 is a binary mask of zeros
+and ones, half and half, and then we
+
+532
+00:39:13,469 --> 00:39:17,469
+multiply it into our hidden activations,
+effectively dropping half of them. So we
+
+533
+00:39:17,469 --> 00:39:21,349
+compute all the activations of the first
+hidden layer, and then we drop half the
+
+534
+00:39:21,349 --> 00:39:25,730
+units at random, and then we do the second
+layer, and then we drop half of them at random.
+
+535
+00:39:25,730 --> 00:39:30,699
+OK, and of course this is only the
+forward pass; the backward pass has to be
+
+536
+00:39:30,699 --> 00:39:35,719
+appropriately adjusted as well, so these
+drops have to also be backpropagated
+
+537
+00:39:35,719 --> 00:39:39,309
+through — so remember to do that when you
+implement dropout. It's not only in
+
+538
+00:39:39,309 --> 00:39:41,980
+the forward pass that you drop; in the
+backward pass, the backpropagation is
+
+539
+00:39:41,980 --> 00:39:45,829
+multiplying by U2 and by U1, so you
+kill the gradients in the places
+
+540
+00:39:45,829 --> 00:39:46,559
+where you dropped.
+
+541
+00:39:46,559 --> 00:39:52,179
+OK, so you might be thinking, when I
+show you this for the first time: how
+
+542
+00:39:52,179 --> 00:39:56,799
+does this make any sense at all, and how
+is this a good idea? Why would you want to
+
+543
+00:39:56,800 --> 00:40:00,390
+compute your neurons and then set them to
+zero at random? It doesn't seem to make any sense whatsoever.
+
+544
+00:40:00,389 --> 00:40:12,369
+So, I don't know — what do you guys think?
+[student answer: it helps prevent overfitting] In
+
+545
+00:40:12,369 --> 00:40:23,880
+what sense?
+
+546
+00:40:23,880 --> 00:40:27,170
+Right, I think you're getting at the right
+intuition: you're saying it will
+
+547
+00:40:27,170 --> 00:40:31,240
+prevent overfitting, because if I'm only
+using half of my network, then roughly I
+
+548
+00:40:31,239 --> 00:40:34,500
+have smaller capacity — I'm only
+using half of my network at any one time,
+
+549
+00:40:34,500 --> 00:40:37,739
+and with smaller networks there's
+basically only so much
+
+550
+00:40:37,739 --> 00:40:40,209
+they can do, compared to
+the full network — so it's kind of
+
+551
+00:40:40,210 --> 00:40:44,798
+like controlling your variance, in
+terms of what you can represent.
+
+552
+00:40:44,798 --> 00:40:55,619
+Yeah, I would relate it to the
+bias-variance trade-off; we haven't
+
+553
+00:40:55,619 --> 00:40:59,480
+really gone into that too much,
+but with a smaller model it's harder
+
+554
+00:40:59,480 --> 00:41:08,579
+to overfit. [student answer: it's like having many
+ensembles of different neural networks] We're going
+
+555
+00:41:08,579 --> 00:41:34,289
+to go into that point in a bit.
+
+556
+00:41:34,289 --> 00:41:38,119
+OK, I have a better way of
+phrasing that point on my next slide;
+
+557
+00:41:38,119 --> 00:41:43,028
+let's look at a particular example. So, OK:
+suppose that we are trying to
+
+558
+00:41:43,028 --> 00:41:47,130
+compute the cat score in a neural
+network, and the idea here is that you
+
+559
+00:41:47,130 --> 00:41:51,380
+have all these different units, and
+dropout is forcing — there are many
+
+560
+00:41:51,380 --> 00:41:54,920
+ways to look at dropout, but one of them
+is that it's forcing your code, your
+
+561
+00:41:54,920 --> 00:41:59,608
+representation of what the image is
+about, to be redundant. You need
+
+562
+00:41:59,608 --> 00:42:03,318
+that redundancy because you're about to,
+in a way that you can't control, get half
+
+563
+00:42:03,318 --> 00:42:06,710
+of your network dropped, and so you need
+to base your cat score on many more features.
+
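A sketch of the training-time forward pass just described — binary masks U1, U2 drawn with keep probability p = 0.5 and multiplied into the hidden layers. The layer sizes and weights below are hypothetical placeholders for illustration:

~~~python
import numpy as np

p = 0.5  # probability of keeping a unit (here: drop about half)
D, H = 4, 8                                      # hypothetical layer sizes
W1, b1 = 0.01 * np.random.randn(H, D), np.zeros(H)
W2, b2 = 0.01 * np.random.randn(H, H), np.zeros(H)

def train_forward(X):
    H1 = np.maximum(0, W1 @ X + b1)      # first hidden layer (ReLU)
    U1 = np.random.rand(*H1.shape) < p   # binary mask: roughly half zeros
    H1 *= U1                             # drop: zero out ~half the units
    H2 = np.maximum(0, W2 @ H1 + b2)     # second hidden layer
    U2 = np.random.rand(*H2.shape) < p
    H2 *= U2
    # The masks must be kept for the backward pass: backprop multiplies
    # by U2 and U1 too, killing gradients where units were dropped.
    return H2, (U1, U2)

out, masks = train_forward(np.random.randn(D))
~~~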
+564
+00:42:06,710 --> 00:42:09,900
+Because, if you're going to correctly
+compute the cat score,
+
+565
+00:42:09,900 --> 00:42:14,000
+you can't rely on any one of them,
+since it might be dropped, and so
+
+566
+00:42:14,000 --> 00:42:17,068
+that's one way to look at it: in this
+case we can still classify the cat
+
+567
+00:42:17,068 --> 00:42:22,639
+properly, even if we lose access to
+any one particular feature. So
+
+568
+00:42:22,639 --> 00:42:24,768
+that's one interpretation of dropout.
+
+569
+00:42:24,768 --> 00:42:29,088
+Another interpretation of dropout, as
+was mentioned, is in terms of ensembles: so
+
+570
+00:42:29,088 --> 00:42:33,358
+dropout can effectively be looked at
+as training a large ensemble of models
+
+571
+00:42:33,358 --> 00:42:36,420
+that are basically sub-networks of
+
+572
+00:42:36,420 --> 00:42:43,099
+one large network, but they all share
+parameters in a funny way. So, to
+
+573
+00:42:43,099 --> 00:42:46,650
+understand this, you have to notice the
+following: if we do a forward pass and we
+
+574
+00:42:46,650 --> 00:42:49,970
+randomly drop some of the units, then,
+in the backward pass, think about what
+
+575
+00:42:49,969 --> 00:42:53,669
+happens with the gradient, right? Suppose
+we randomly dropped
+
+576
+00:42:53,670 --> 00:42:57,409
+these units: in the backward pass we're
+backpropagating through the masks that
+
+577
+00:42:57,409 --> 00:43:01,879
+were induced by the dropout, so, in
+particular, only the neurons that were
+
+578
+00:43:01,880 --> 00:43:05,349
+used in the forward pass will actually be
+updated, or have any gradients flowing
+
+579
+00:43:05,349 --> 00:43:09,599
+through them, because for any neuron that
+was shut off to zero, no gradient will flow
+
+580
+00:43:09,599 --> 00:43:13,650
+through it, and its weights to the
+previous layer will not be updated. So,
+
+581
+00:43:13,650 --> 00:43:18,550
+effectively, for any neuron that was dropped
+out, its connections to the previous layer
+
+582
+00:43:18,550 --> 00:43:22,750
+will not be updated, and it's
+as if it wasn't there. So really, with the
+
+583
+00:43:22,750 --> 00:43:27,230
+dropout masks, you're subsampling a part
+of your neural network, and you're only
+
+584
+00:43:27,230 --> 00:43:30,789
+training that sub-network on the single
+example that happened to come in at that
+
+585
+00:43:30,789 --> 00:43:44,980
+point in time — so it's as if each sub-network
+is one model that gets trained on only one data point.
+
+586
+00:43:44,980 --> 00:43:51,250
+OK, I can try to repeat that.
+
+587
+00:43:51,250 --> 00:44:04,239
+[student question from somewhere over here —
+let me make sure you guys understand this]
+
+588
+00:44:04,239 --> 00:44:10,789
+OK, so when you drop a neuron — I wish
+I had the example of the
+
+589
+00:44:10,789 --> 00:44:14,429
+neuron up here — if I drop it, I multiply
+its output by 0, so its effect
+
+590
+00:44:14,429 --> 00:44:17,918
+on the loss function — there's no effect,
+right? So its gradient is zero, because its
+
+591
+00:44:17,918 --> 00:44:21,668
+output was not used in computing the
+loss, and so its weights will not get an
+
+592
+00:44:21,668 --> 00:44:25,679
+update, and so it's as if we've subsampled
+a part of the network and we only train
+
+593
+00:44:25,679 --> 00:44:28,959
+it on the single data point that
+currently came through the network,
+
+594
+00:44:28,958 --> 00:44:32,348
+and every time we do a forward pass we
+subsample a different part of the
+
+595
+00:44:32,349 --> 00:44:35,899
+neural network, but they all share
+parameters. So it's kind of like a weird
+
+596
+00:44:35,898 --> 00:44:39,778
+ensemble of lots of different models, each
+training on one data point, but they all
+
+597
+00:44:39,778 --> 00:44:48,458
+share parameters. So that's roughly the
+idea here — does that make sense?
+
+598
+00:44:48,458 --> 00:45:07,108
+[student question] Usually, say, 50% is a
+very common rough choice. So in this
+
+599
+00:45:07,108 --> 00:45:09,798
+forward pass, you'll notice, we actually
+compute H:
+
+600
+00:45:09,798 --> 00:45:14,009
+we compute the activations just as we did
+before, all of them, and maybe more than half of
+
+601
+00:45:14,009 --> 00:45:17,119
+the values will get dropped to zero;
+
+602
+00:45:17,119 --> 00:45:29,250
+nothing else changes, that's fine.
+
+603
+00:45:29,250 --> 00:45:38,349
+[student question] So, instead of computing all the
+activations, you'd want to compute only the rows
+
+604
+00:45:38,349 --> 00:45:42,150
+that are not being dropped? In that case you'd
+want to do sparse updates — you could,
+
+605
+00:45:42,150 --> 00:45:44,950
+in theory, but I don't think that's
+usual; in practice we don't worry about
+
+606
+00:45:44,949 --> 00:46:12,369
+it too much. And so this is how dropout
+training works: every single iteration we
+
+607
+00:46:12,369 --> 00:46:15,469
+get a mini-batch, we sample our noise
+pattern for what we're going to drop out,
+
+608
+00:46:15,469 --> 00:46:19,359
+we do the forward and backward pass, get the
+gradient, and we keep churning this over
+
+609
+00:46:19,360 --> 00:46:31,360
+and over again. [student question] So your question
+is: could you somehow cleverly choose the binary mask, in
+
+610
+00:46:31,360 --> 00:46:35,829
+a way that best optimizes the model,
+or something like that? Not really — I don't
+
+611
+00:46:35,829 --> 00:46:44,769
+think that's done, or that anyone has looked
+into it too much. Sorry — yes, I'm going to
+
+612
+00:46:44,769 --> 00:46:47,389
+get into that in one slide, the next slide:
+
+613
+00:46:47,389 --> 00:46:57,618
+we're going to look at test time.
+I'll take one last question.
+
+614
+00:46:57,619 --> 00:47:04,519
+[student question] Can you drop out by different
+amounts in different layers? You can,
+
+615
+00:47:04,518 --> 00:47:05,459
+there's nothing stopping you;
+
+616
+00:47:05,460 --> 00:47:09,338
+intuitively, you want to apply
+stronger dropout where you need more
+
+617
+00:47:09,338 --> 00:47:12,690
+regularization. So if there's a layer that
+has a huge number of parameters — we'll see
+
+618
+00:47:12,690 --> 00:47:16,349
+that in convnets, that's one example — you
+want to hit it with strong dropout there;
+
+619
+00:47:16,349 --> 00:47:20,269
+conversely, there might be some layers —
+we'll see that in convnets the early
+
+620
+00:47:20,268 --> 00:47:24,248
+convolutional layers are very small — where
+you don't really apply as much dropout.
+
+621
+00:47:24,248 --> 00:47:27,368
+It's quite common, for example in convolutional
+networks — we're going to see this in a bit —
+
+622
+00:47:27,369 --> 00:47:30,740
+to start off with a low dropout and ramp it
+up over the layers. So the answer to that is
+
+623
+00:47:30,739 --> 00:47:38,848
+yes. And I forgot your second question —
+can you, instead of units, drop out just
+
+624
+00:47:38,849 --> 00:47:41,880
+individual weights? You can, and that's
+something called DropConnect; we won't
+
+625
+00:47:41,880 --> 00:47:46,349
+go into it too much in this class, but
+there's a way to do that as well. OK,
+
+626
+00:47:46,349 --> 00:47:52,829
+now, at test time: ideally, what we'd
+want to do is — we've introduced all this
+
+627
+00:47:52,829 --> 00:47:56,940
+noise into the forward pass, and so what
+we'd like to do at test time is
+
+628
+00:47:56,940 --> 00:48:00,349
+to integrate out all that noise. A
+Monte Carlo approximation
+
+629
+00:48:00,349 --> 00:48:03,318
+to that would be something like: you have
+a test image that you'd like to classify;
+
+630
+00:48:03,318 --> 00:48:06,909
+you can do many forward passes with many
+different settings of your binary masks,
+
+631
+00:48:06,909 --> 00:48:10,558
+each one only using a sub-network,
+and then you can average across all
+
+632
+00:48:10,559 --> 00:48:14,329
+those probability distributions. That
+would be great, but unfortunately it's not
+
+633
+00:48:14,329 --> 00:48:17,818
+very efficient. So it turns out that you
+can actually approximate this process, to
+
+634
+00:48:17,818 --> 00:48:22,338
+some degree, as Hinton pointed out when he
+first introduced dropout, and the way
+
+635
+00:48:22,338 --> 00:48:26,170
+we'll do this — intuitively, you want to
+take advantage of all your neurons, you
+
+636
+00:48:26,170 --> 00:48:29,509
+don't want to be dropping them at random —
+is we're going to try to find a way where we
+
+637
+00:48:29,509 --> 00:48:33,548
+can leave all the neurons turned on: so,
+no dropout in the forward pass on a
+
+638
+00:48:33,548 --> 00:48:39,920
+test image. But we have to actually be
+careful with how we do this.
+
+639
+00:48:39,920 --> 00:48:43,480
+So, in the forward pass on your test images,
+we're not going to drop any units, but we have
+
+640
+00:48:43,480 --> 00:48:48,028
+to be careful with something, and
+basically one way to see what the
+
+641
+00:48:48,028 --> 00:48:54,880
+issue is: suppose this is a neuron,
+and it's got two inputs, and suppose
+
+642
+00:48:54,880 --> 00:48:59,079
+that all of these inputs are present at
+test time, so we're not dropping units. So,
+
+643
+00:48:59,079 --> 00:49:02,630
+at test time, these two have some
+activations, and the output of this neuron
+
+644
+00:49:02,630 --> 00:49:06,400
+is computed to be some value x. We want
+to compare this
+
+645
+00:49:06,400 --> 00:49:12,608
+value of x to what the neuron's output
+would be during training time, in
+
+646
+00:49:12,608 --> 00:49:18,440
+expectation, OK? Because at training time
+the dropout masks vary randomly, and so
+
+647
+00:49:18,440 --> 00:49:21,170
+there are many different cases that
+could have happened, and in
+
+648
+00:49:21,170 --> 00:49:27,068
+those cases this output would be at a different
+scale, and we have to worry about this. Let
+
+649
+00:49:27,068 --> 00:49:32,259
+me show you exactly what this means.
+Say this
+
+650
+00:49:32,260 --> 00:49:35,539
+neuron computes — say there's no nonlinearity,
+we're only looking at a linear neuron —
+
+651
+00:49:35,539 --> 00:49:39,990
+during test time this activation
+becomes w0, which is the weight here, times
+
+652
+00:49:39,989 --> 00:49:44,848
+x, plus w1 times y. OK, so that's
+what I would compute at test time, and the
+
+653
+00:49:44,849 --> 00:49:48,420
+reason I have to be careful is that,
+during training time, the expected output
+
+654
+00:49:48,420 --> 00:49:51,528
+of this neuron, in this particular case, would
+have been quite different: we have four
+
+655
+00:49:51,528 --> 00:49:55,619
+possibilities — we could drop one or the
+other input, or both, or none — so in those four
+
+656
+00:49:55,619 --> 00:49:56,720
+possibilities
+
+657
+00:49:56,719 --> 00:50:00,750
+we compute different values; if you actually
+crunch through this math, you'll see that when
+
+658
+00:50:00,750 --> 00:50:01,659
+you reduce it,
+
+659
+00:50:01,659 --> 00:50:07,548
+you end up with one half of (w0 times x
++ w1 times y). So, in expectation, at training
+
+660
+00:50:07,548 --> 00:50:15,630
+time, the output of this neuron is actually
+half of its test-time output, and so when you want
+
+661
+00:50:15,630 --> 00:50:19,640
+to use all the neurons at test time, you have
+to compensate for this. And the one-
+
+662
+00:50:19,639 --> 00:50:22,730
+half is coming from the fact that
+we've dropped units with probability one
+
+663
+00:50:22,730 --> 00:50:29,219
+half, and so that's why this ends up
+being one half: with probability 0.5
+
+664
+00:50:29,219 --> 00:50:35,358
+each unit is kept in the forward pass. So, basically,
+if we did not do this, then we'd end up
+
+665
+00:50:35,358 --> 00:50:39,019
+having too large an output compared to
+what we had, in expectation, during
+
+666
+00:50:39,019 --> 00:50:42,960
+training time; the output distribution
+would change, and basically
+
+667
+00:50:42,960 --> 00:50:45,639
+things downstream in the network would break,
+because they're not used to seeing such
+
+668
+00:50:45,639 --> 00:50:49,368
+large outputs from these neurons. So you have
+to compensate for that — you have to
+
+669
+00:50:49,369 --> 00:50:53,798
+squash things down: you're now using all your
+neurons instead of just half of them,
+
+670
+00:50:53,798 --> 00:50:57,480
+so you have to scale the
+activations down to recover your
+
+671
+00:50:57,480 --> 00:51:03,099
+expected output. OK, this is actually a
+tricky point, but I was told once
+
+672
+00:51:03,099 --> 00:51:06,559
+a story that when Geoff Hinton came up
+with dropout in the beginning, he
+
+673
+00:51:06,559 --> 00:51:10,710
+actually didn't fully come up with this
+part, so he tried dropout and it didn't
+
+674
+00:51:10,710 --> 00:51:16,088
+work, and actually the reason it didn't
+work is that he missed out on this tricky
+
+675
+00:51:16,088 --> 00:51:19,340
+point, admittedly. And so you have
+to scale your activations
+
+676
+00:51:19,340 --> 00:51:24,070
+down because of this effect, and then
+everything works much better. So,
+
+677
+00:51:24,070 --> 00:51:28,500
+just to show you what this
+looks like: we basically compute these
+
+678
+00:51:28,500 --> 00:51:33,449
+neural nets as normal — we compute the first
+and second hidden layers — but now at test time we
+
+679
+00:51:33,449 --> 00:51:38,869
+have to multiply by p. So, for example, if
+p is one half, the dropping probability, we scale
+
+680
+00:51:38,869 --> 00:51:43,139
+down the activations, so that the
+expected output now is the
+
+681
+00:51:43,139 --> 00:51:46,969
+same as the expected output at training
+time, and so at test time you actually
+
+682
+00:51:46,969 --> 00:51:52,449
+recover from dropout: the expected outputs
+match, and this actually works
+
+683
+00:51:52,449 --> 00:52:18,069
+really well. [student question] So, this is
+just about the discrepancy between train and
+
+684
+00:52:18,070 --> 00:52:20,780
+test: whether you're using all your neurons
+or dropping them, there's a discrepancy,
+
+685
+00:52:20,780 --> 00:52:24,580
+so either you can correct it at test
+time, or you can use what we call in-
+
+686
+00:52:24,579 --> 00:52:29,469
+verted dropout, which I'll show you in a
+bit — we'll get to that in a bit.
+
+687
+00:52:29,469 --> 00:52:34,319
+Dropout summary: if you want dropout,
+drop your units, with
+
+688
+00:52:34,320 --> 00:52:38,210
+a keep probability of p, and at test time
+don't forget to scale the activations. If you do
+
+689
+00:52:38,210 --> 00:52:40,820
+this, your networks will work better.
+
+690
+00:52:40,820 --> 00:52:44,190
+OK, and don't forget to also back-
+propagate the masks, which I'm not showing.
+
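A sketch of the test-time forward pass just summarized — all units kept, each layer's output scaled by p so its expectation matches training time. Layer sizes and weights are the same hypothetical placeholders as in the training sketch above:

~~~python
import numpy as np

p = 0.5
D, H = 4, 8                                      # hypothetical sizes, as before
W1, b1 = 0.01 * np.random.randn(H, D), np.zeros(H)
W2, b2 = 0.01 * np.random.randn(H, H), np.zeros(H)

def predict(X):
    # No masks at test time, but scale each layer's output by p so its
    # expected value matches what the next layer saw during training.
    H1 = np.maximum(0, W1 @ X + b1) * p
    H2 = np.maximum(0, W2 @ H1 + b2) * p
    return H2
~~~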
+691
+00:52:44,190 --> 00:52:49,710
+Now, inverted dropout, by the way,
+takes care of this
+
+692
+00:52:49,710 --> 00:52:53,349
+discrepancy between train and test
+in a slightly different way. In
+
+693
+00:52:53,349 --> 00:52:57,710
+particular, what we'll do is we're
+changing this here: before, U1 was
+
+694
+00:52:57,710 --> 00:53:01,250
+a binary mask of zeros and ones; what we're
+going to do now is we're going to do the
+
+695
+00:53:01,250 --> 00:53:04,980
+scaling at training time instead. So, rather
+than scaling down the activations at
+
+696
+00:53:04,980 --> 00:53:07,960
+test time, we scale them up at training
+time, because if p is 0.5 then we're
+
+697
+00:53:07,960 --> 00:53:12,079
+boosting the activations at training time by
+two, and then at test time we can leave our code
+
+698
+00:53:12,079 --> 00:53:16,029
+untouched. Right, so we're doing the
+boosting of the activations at training time —
+
+699
+00:53:16,030 --> 00:53:20,880
+we're making everything artificially
+bigger by 2x — and then at test time,
+
+700
+00:53:20,880 --> 00:53:24,450
+where we would otherwise have to halve
+things, we just recover the clean
+
+701
+00:53:24,449 --> 00:53:27,819
+expressions, because we've done the
+scaling at training time. So now you'll
+
+702
+00:53:27,820 --> 00:53:31,010
+have properly calibrated expectations
+between train and test
+
+703
+00:53:31,010 --> 00:53:39,290
+for every neuron in the network. That's right — so
+inverted dropout is the most common variant
+
+704
+00:53:39,289 --> 00:53:42,779
+to use in practice. So, in fact, it really
+comes down to a few lines, and then the
+
+705
+00:53:42,780 --> 00:53:47,300
+backward pass changes a bit, but the
+networks almost always work better with
+
+706
+00:53:47,300 --> 00:54:15,070
+this, unless you're severely under-
+fitting. [student question about whether this is exact]
+
+707
+00:54:15,070 --> 00:54:17,230
+Right — this is, as I mentioned here, an
+
+708
+00:54:17,230 --> 00:54:22,039
+approximation — an approximation to the
+ensemble — and one of the reasons it's an
+
+709
+00:54:22,039 --> 00:54:25,029
+approximation is that once you
+actually have nonlinearities in the picture,
+
+710
+00:54:25,030 --> 00:54:27,769
+these expected outputs all get kind of
+skewed by the nonlinear
+
+711
+00:54:27,769 --> 00:54:37,500
+effects on top. Any more questions? Thank
+you for pointing that out. Go ahead.
+
+712
+00:54:37,500 --> 00:54:44,769
+[student question] I see — you're saying that
+inverted dropout and regular dropout are not
+
+713
+00:54:44,769 --> 00:54:49,039
+exactly equivalent, so doing one or the
+other matters, because of the
+
+714
+00:54:49,039 --> 00:54:59,309
+nonlinearities? I'd have to think about
+it — maybe you're right, you may be
+
+715
+00:54:59,309 --> 00:55:37,949
+right there. I think all of this is
+just about expectations: in expectation
+
+716
+00:55:37,949 --> 00:55:41,349
+you're dropping one half, and so that's the
+correct factor to use, even though there's
+
+717
+00:55:41,349 --> 00:55:44,049
+some randomness in exactly the amount
+that actually ends up being dropped.
+
+718
+00:55:44,050 --> 00:55:47,370
+Okay, great.
+
+719
+00:55:47,369 --> 00:55:51,869
+Oh yeah, I'd like to tell you a
+fun story about dropout. I was at a
+
+720
+00:55:51,869 --> 00:55:55,509
+deep learning summer school in 2012, and
+Geoff Hinton was, for the first time, or at
+
+721
+00:55:55,510 --> 00:55:56,590
+least the first time I saw it,
+
+722
+00:55:56,590 --> 00:56:00,930
+presenting dropout.
+
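A sketch of the inverted dropout variant just described — the mask itself is divided by p at training time, so the test-time code needs no scaling. As before, the layer size and weights are hypothetical placeholders:

~~~python
import numpy as np

p = 0.5
D, H = 4, 8                                      # hypothetical sizes, as before
W1, b1 = 0.01 * np.random.randn(H, D), np.zeros(H)

def train_forward(X):
    H1 = np.maximum(0, W1 @ X + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # mask scaled up by 1/p (2x here)
    return H1 * U1                            # boost activations at train time...

def predict(X):
    return np.maximum(0, W1 @ X + b1)         # ...so test-time code is untouched
~~~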
+723
+00:56:00,929 --> 00:56:04,589
+So he's basically just saying: OK, set your
+neurons to zero at random, just zero out the
+activations, and this always works
+
+724
+00:56:04,590 --> 00:56:07,750
+better. And we're like, wow, that's
+interesting. And a friend of mine sitting
+
+725
+00:56:07,750 --> 00:56:10,469
+next to me just pulled up his laptop
+right there — he had an SSH session into his
+
+726
+00:56:10,469 --> 00:56:13,959
+university machines — and implemented it
+right there during the talk, and by the
+
+727
+00:56:13,960 --> 00:56:17,340
+time Geoff Hinton finished the talk, he
+was getting better results — actually
+
+728
+00:56:17,340 --> 00:56:18,950
+state-of-the-art results —
+
+729
+00:56:18,949 --> 00:56:25,189
+on the data that he was working with. The
+fastest I've seen someone go get an
+
+730
+00:56:25,190 --> 00:56:30,490
+extra 5%: it was right there and then,
+while Geoff Hinton was still giving the talk. I
+
+731
+00:56:30,489 --> 00:56:33,589
+thought that was really funny. There are
+very few times, actually, that something
+
+732
+00:56:33,590 --> 00:56:36,590
+like this happens. Dropout is a
+great thing, because it's one of those
+
+733
+00:56:36,590 --> 00:56:42,390
+few inventions that is very simple and it
+just always works better, and there are
+
+734
+00:56:42,389 --> 00:56:45,579
+very few of those kinds of tips and
+tricks that we've picked up, and I guess
+
+735
+00:56:45,579 --> 00:56:49,659
+the question is: how many more simple
+things like dropout are there, that
+
+736
+00:56:49,659 --> 00:56:50,879
+just give you a two percent boost,
+
+737
+00:56:50,880 --> 00:56:54,140
+always? So, we don't know.
+
+738
+00:56:54,139 --> 00:57:01,199
+OK, so I was going to go on at this point
+into gradient checking, but I
+
+739
+00:57:01,199 --> 00:57:04,588
+actually decided I'm going to skip it,
+because I'm tired of all the neural
+
+740
+00:57:04,588 --> 00:57:07,130
+network details — we've been talking about
+lots of details of training neural net-
+
+741
+00:57:07,130 --> 00:57:10,180
+works, and I think you guys are tired as
+well — so I'm going to skip gradient
+
+742
+00:57:10,179 --> 00:57:13,469
+checking, because it's quite well
+described in the notes. I encourage you
+
+743
+00:57:13,469 --> 00:57:19,028
+to go through it; it's kind of a tricky
+process, it takes a bit of time to
+
+744
+00:57:19,028 --> 00:57:23,190
+appreciate all the difficulties of the
+process, so just read through it — I
+
+745
+00:57:23,190 --> 00:57:27,250
+don't think there's anything I can do
+to make it more interesting for
+
+746
+00:57:27,250 --> 00:57:29,469
+you, so I would encourage you to just
+check it out.
+
+747
+00:57:29,469 --> 00:57:33,118
+Meanwhile, we're going to jump right ahead
+into convolutional networks and
+
+748
+00:57:33,119 --> 00:57:42,358
+look at pictures. So, they look like this:
+this is LeNet-5, from roughly the nineteen-
+
+749
+00:57:42,358 --> 00:57:46,538
+nineties, and we're going to go into the
+details of how convolutional networks work,
+
+750
+00:57:46,539 --> 00:57:49,609
+but in this lecture we're not actually
+going to do any of the low-level details;
+
+751
+00:57:49,608 --> 00:57:52,768
+I'm just going to try to give you
+intuition about how this field came about,
+
+752
+00:57:52,768 --> 00:57:56,868
+some historical context, and just talk
+about convolutional networks in general. So, if
+
+753
+00:57:56,869 --> 00:57:59,559
+you'd like to talk about the history of
+convolutional networks, you have to go back
+
+754
+00:57:59,559 --> 00:58:04,910
+to roughly the 1960s and the experiments
+of Hubel and Wiesel. In particular,
+
00:58:10,449 +they were studying the primary visual +cortex and cat and they were sending an + +756 +00:58:10,449 --> 00:58:14,710 +early visual area and the cat brain as +the cat was looking at patterns on the + +757 +00:58:14,710 --> 00:58:19,500 +screen and they ended up actually +winning a Nobel Prize for this sometime + +758 +00:58:19,500 --> 00:58:23,449 +later for these experiments as we'd like +to show you what these experiments look + +759 +00:58:23,449 --> 00:58:27,518 +like just so they're really fun to look +at so I pulled up eighty video here in + +760 +00:58:27,518 --> 00:58:32,258 +and see what's going on here is the cat +is fixed in position and we're recording + +761 +00:58:32,259 --> 00:58:35,900 +from its cortex somewhere in the area of +processing which is in the back of your + +762 +00:58:35,900 --> 00:58:39,809 +brain could be one and now we're showing +different light patterns to the cat and + +763 +00:58:39,809 --> 00:58:43,519 +we're recording and sharing the neurons +fire for different stimuli let's look at + +764 +00:58:43,518 --> 00:58:48,039 +how this experience will look like + +765 +00:58:48,039 --> 00:59:14,050 +here + +766 +00:59:14,050 --> 00:59:27,410 +experiments like these cells and they +seem to turn all four edges in a + +767 +00:59:27,409 --> 00:59:30,279 +particular orientation and they get +excited about the edges and one + +768 +00:59:30,280 --> 00:59:36,360 +orientation and northern orientation +does not excite them and so like this + +769 +00:59:36,360 --> 00:59:42,150 +through a long process like a 10 minute +video so we're not going to do this for + +770 +00:59:42,150 --> 00:59:45,450 +a long time they spirited and they came +up with a model of how the visual cortex + +771 +00:59:45,449 --> 00:59:52,349 +process information in the brain and so +they can several things that ended up + +772 +00:59:52,349 --> 00:59:56,059 +leading to the Nobel Prize for example +they figured out that the cortex is + +773 +00:59:56,059 --> 00:59:56,759 +arranged + +774 +00:59:56,760 --> 01:00:02,570 +topically the visual cortex and what +that means is that she was my printer + +775 +01:00:02,570 --> 01:00:06,920 +basically nearby cells in the cortex so +this is cortical tissue unfolded nearby + +776 +01:00:06,920 --> 01:00:11,389 +salt air cortex are actually processing +nearby areas in your visual field so + +777 +01:00:11,389 --> 01:00:15,049 +you're whatever is not a recognized +processed nearby and your bring this + +778 +01:00:15,050 --> 01:00:20,510 +locality is preserved in your processing +and they also figured out that there was + +779 +01:00:20,510 --> 01:00:23,790 +an entire year of these roles what's +called the simple cells and they + +780 +01:00:23,789 --> 01:00:27,659 +responded to a particular orientation of +an edge and then there were all these + +781 +01:00:27,659 --> 01:00:31,809 +other cells that had more complex +responses so for example some cells + +782 +01:00:31,809 --> 01:00:34,949 +would be turning offer specific +orientation but were slightly + +783 +01:00:34,949 --> 01:00:38,159 +translation invariant so they don't care +about the specific position of the edge + +784 +01:00:38,159 --> 01:00:41,839 +but they only cared about the +orientation and so they hypothesize + +785 +01:00:41,840 --> 01:00:44,120 +through all of these experiments that +the visual cortex has this kind of + +786 +01:00:44,119 --> 01:00:48,269 +hierarchical organization where you end +up a simple sell their reading to other + +787 +01:00:48,269 --> 01:00:52,679 
+cells called complex cells and etc and +these cells are built on top of each + +788 +01:00:52,679 --> 01:00:56,369 +other and the simple songs in particular +have these relatively local receptive + +789 +01:00:56,369 --> 01:01:00,019 +fields and they were building up more +and more complex kind of representations + +790 +01:01:00,019 --> 01:01:04,320 +in the brain through successive layers +of representation and so these are + +791 +01:01:04,320 --> 01:01:09,240 +experienced a lot of course some people +are trying to reproduce this in + +792 +01:01:09,239 --> 01:01:14,649 +computers and trying to model the visual +cortex with code and so one of the first + +793 +01:01:14,650 --> 01:01:19,389 +examples of this was gonna drop from +Fukushima and he basically ended up + +794 +01:01:19,389 --> 01:01:20,429 +setting up + +795 +01:01:20,429 --> 01:01:26,710 +architecture with these local receptive +cells that basically look at a small + +796 +01:01:26,710 --> 01:01:31,760 +region of the impact and then he stepped +up layers and layers of these and so he + +797 +01:01:31,760 --> 01:01:34,750 +had these simple assault on the complex +also simple solves complex also the + +798 +01:01:34,750 --> 01:01:39,000 +sandwich of simple and complex also +building up into iraqi now back then + +799 +01:01:39,000 --> 01:01:41,849 +though in nineteen eighties back +propagation will still not really around + +800 +01:01:41,849 --> 01:01:45,380 +and so pushing my head and unsupervised +learning procedure for training these + +801 +01:01:45,380 --> 01:01:49,599 +networks with like a clustering scheme +but this is not back propagates at the + +802 +01:01:49,599 --> 01:01:54,150 +time but it had this idea of successive +layers small cells building up on top of + +803 +01:01:54,150 --> 01:02:00,039 +each other and then these experiments +further and he kind of built on top of + +804 +01:02:00,039 --> 01:02:04,739 +work and he kept the architectural +layout but what he did was actually + +805 +01:02:04,739 --> 01:02:09,009 +trainees network the back propagation +and so for example he trained different + +806 +01:02:09,010 --> 01:02:12,770 +classifiers four digits or letters and +so on and so trained all of it + +807 +01:02:12,769 --> 01:02:16,769 +backdrop and they actually ended up +using this in complex systems that read + +808 +01:02:16,769 --> 01:02:23,469 +to check the radar like digits from +postal mail service and so on and so + +809 +01:02:23,469 --> 01:02:27,239 +that's actually go back to quite a long +time ago to nineteen nineties and + +810 +01:02:27,239 --> 01:02:33,199 +someone who was using them back then but +they were quite small ok and so in 2012 + +811 +01:02:33,199 --> 01:02:37,559 +is when the come to start to get quite a +bit bigger so this was the paper from + +812 +01:02:37,559 --> 01:02:43,549 +that I keep referring to escape into +they took all of that and it's not as a + +813 +01:02:43,550 --> 01:02:48,200 +dataset that comes actually from our lab +so it's a million images with thousand + +814 +01:02:48,199 --> 01:02:51,339 +classes huge amount of data you take +this model which is roughly 60 million + +815 +01:02:51,340 --> 01:02:56,380 +parameters and cold in Alex net based on +the first name of Alex Kozinski these + +816 +01:02:56,380 --> 01:02:59,260 +networks were going to see that they +have names so this is Alex Knapp is a + +817 +01:02:59,260 --> 01:03:05,560 +region that has that Google at their +several minutes so just like this one is + +818 +01:03:05,559 --> 01:03:09,630 +a 
limit and so we give them names so +this was Alex net and it was the one + +819 +01:03:09,630 --> 01:03:13,090 +that actually outperformed by quite a +bit on the other algorithms what's + +820 +01:03:13,090 --> 01:03:17,530 +interesting to note historically is the +difference between Alex nothing 2012 and + +821 +01:03:17,530 --> 01:03:21,850 +the limit in nineteen nineties there's +basically very very little difference is + +822 +01:03:21,849 --> 01:03:25,940 +when you look at these two different +networks this one used I think signals + +823 +01:03:25,940 --> 01:03:31,789 +or 10 H pennies probably and this one is +real and it was bigger and deeper and + +824 +01:03:31,789 --> 01:03:33,460 +was training GPU and have more data + +825 +01:03:33,460 --> 01:03:38,889 +and that's basically it that's the only +like that's roughly the difference and + +826 +01:03:38,889 --> 01:03:41,098 +so really what we've done is we've +figured out better ways of course + +827 +01:03:41,099 --> 01:03:45,000 +initializing them and it works better +with national army and rebels work much + +828 +01:03:45,000 --> 01:03:49,480 +better but other than that it was just +killing up both the data and compute + +829 +01:03:49,480 --> 01:03:53,740 +but for the most part the actor was +quite similar and we've done a few more + +830 +01:03:53,739 --> 01:03:56,719 +tricks like for example they used a big +filters will see that we use a much + +831 +01:03:56,719 --> 01:04:01,379 +smaller filters we also now this is only +a few tens of players we now have a + +832 +01:04:01,380 --> 01:04:05,059 +hundred and fifty later come that so we +really just skill is up quite a bit in + +833 +01:04:05,059 --> 01:04:08,150 +some respects but otherwise the basic +concept of how you process information + +834 +01:04:08,150 --> 01:04:09,789 +is similar + +835 +01:04:09,789 --> 01:04:15,150 +oK so that's are now basically +everywhere so they can do all kinds of + +836 +01:04:15,150 --> 01:04:19,280 +things like classify things of course +they're very good at retrieval so if you + +837 +01:04:19,280 --> 01:04:24,119 +show them an image they can retrieve +other images like it they can also do + +838 +01:04:24,119 --> 01:04:29,809 +detection so here and there detecting +dogs or horses are people and so on + +839 +01:04:29,809 --> 01:04:33,230 +this might be used for example in some +German cars all have this in the next + +840 +01:04:33,230 --> 01:04:36,588 +line they can also do some +experimentation so every single pixel is + +841 +01:04:36,588 --> 01:04:41,409 +labeled for example the person or a road +or tree or sky rebuilding segmentation + +842 +01:04:41,409 --> 01:04:47,529 +for their use in cars for example here's +an Nvidia Tegra which is small embedded + +843 +01:04:47,530 --> 01:04:51,480 +GPU we can run come that's one reason +for example this might be useful in the + +844 +01:04:51,480 --> 01:04:55,480 +car where you can identify all the you +can be skewed perception of rounding + +845 +01:04:55,480 --> 01:04:57,219 +things around you + +846 +01:04:57,219 --> 01:05:02,039 +comments are identifying faces probably +if you some of your friends are tacked + +847 +01:05:02,039 --> 01:05:04,909 +on Facebook automatically it's almost +certainly I would guess at this point + +848 +01:05:04,909 --> 01:05:10,069 +that video classification on YouTube +identify what's inside YouTube videos + +849 +01:05:10,070 --> 01:05:14,900 +they're used in this is a project from +Google that was very successful where + +850 +01:05:14,900 --> 01:05:17,900 
+basically Google was really interested +in taking street view images and + +851 +01:05:17,900 --> 01:05:20,809 +automatically reading outhouse numbers +from them + +852 +01:05:20,809 --> 01:05:25,019 +ok and turns out this is a perfect +astrakhan that so they had lots of human + +853 +01:05:25,019 --> 01:05:30,289 +labor is at eight huge amounts of data +and then put a giant comment on it and + +854 +01:05:30,289 --> 01:05:33,429 +it ended up working almost as well as a +human and that's the thing that we'll + +855 +01:05:33,429 --> 01:05:37,710 +see throughout that this stuff works +really really well make an estimate + +856 +01:05:37,710 --> 01:05:41,730 +poses they can play computer games + +857 +01:05:41,730 --> 01:05:46,559 +they detect all kinds of cancer or +something like that and bye bye bye + +858 +01:05:46,559 --> 01:05:53,519 +images they can read Chinese characters +recognized street signs this is I think + +859 +01:05:53,519 --> 01:05:57,690 +segmentation of neural tissue they can +also do things that are not visual so + +860 +01:05:57,690 --> 01:06:02,510 +for example they can recognize speech +for speech processing they've been used + +861 +01:06:02,510 --> 01:06:07,780 +also for text documents so you can see +that text into comments as well they've + +862 +01:06:07,780 --> 01:06:11,400 +been used for to recognize different +types of galaxies they've been used to + +863 +01:06:11,400 --> 01:06:15,570 +in the recent cattle competition to +recognize different Wales this is a + +864 +01:06:15,570 --> 01:06:18,420 +particular well there was like a hundred +miles or something like that and that's + +865 +01:06:18,420 --> 01:06:24,409 +just my specific individual so this will +buy the pattern of its white spots on + +866 +01:06:24,409 --> 01:06:28,179 +its head is a particular way I'll become +it has recognized so it's amazing that + +867 +01:06:28,179 --> 01:06:32,618 +works at all they're using satellite +images quite a bit because now there are + +868 +01:06:32,619 --> 01:06:35,280 +several companies that have lots of +satellite data so this is all analyzed + +869 +01:06:35,280 --> 01:06:39,530 +with large comments in this case it's +winding roads but you can also look at + +870 +01:06:39,530 --> 01:06:43,850 +agriculture applications or someone they +can also do image capturing you might + +871 +01:06:43,849 --> 01:06:48,829 +have seen some of these results my work +included as well we take images and + +872 +01:06:48,829 --> 01:06:53,369 +captions that more sentences instead of +just a single category and they can also + +873 +01:06:53,369 --> 01:06:56,150 +be used for various artistic endeavors + +874 +01:06:56,150 --> 01:06:59,800 +so this is something called deep dream +and we're going to go into how this + +875 +01:06:59,800 --> 01:07:00,350 +works + +876 +01:07:00,349 --> 01:07:04,440 +actually implementing your third +assignment may be ok maybe you will + +877 +01:07:04,440 --> 01:07:08,099 +implement in your third assignment you +give it an image and using that you can + +878 +01:07:08,099 --> 01:07:11,349 +make it do weird stuff + +879 +01:07:11,349 --> 01:07:17,380 +particularly a lot of hallucinations of +dogs and we're going to go into why dogs + +880 +01:07:17,380 --> 01:07:20,349 +appear it has to do with the fact that +image net which is where these networks + +881 +01:07:20,349 --> 01:07:25,579 +get trained to the end up they have a +lot of dogs and so these these networks + +882 +01:07:25,579 --> 01:07:28,259 +and apple juice and eating dogs it's +kind of 
like they're used to some + +883 +01:07:28,260 --> 01:07:32,440 +patterns and then you should have a +different image you can make them put + +884 +01:07:32,440 --> 01:07:36,710 +them in the loop with the image and dole +hallucinate things so we'll see how this + +885 +01:07:36,710 --> 01:07:42,769 +works in a bit I'm not going to explain +the slide but it looks cool so you can + +886 +01:07:42,769 --> 01:07:47,559 +imagine that it's probably involved +somewhere I also want to point out that + +887 +01:07:47,559 --> 01:07:51,579 +what's interesting there's this paper +called the networks rival representation + +888 +01:07:51,579 --> 01:07:55,420 +of private I think cortex call for a +quarter of the recognition what they did + +889 +01:07:55,420 --> 01:08:00,250 +here is basically looking at I think +this was a macaque monkey and the + +890 +01:08:00,250 --> 01:08:05,280 +recording from the ITV from the cortex +here and there recording neural + +891 +01:08:05,280 --> 01:08:09,030 +activations monkeys looking at images +and then they fed the same images to + +892 +01:08:09,030 --> 01:08:12,660 +accomplish on your network and what +they're trying to do is from the popular + +893 +01:08:12,659 --> 01:08:16,960 +prom the commercial network code or from +the population of neurons only sparse + +894 +01:08:16,960 --> 01:08:21,560 +population of context they're trying to +perform classification of some concepts + +895 +01:08:21,560 --> 01:08:25,820 +and what you see is that the coating +from the idea cortex and classifying + +896 +01:08:25,819 --> 01:08:30,519 +images is almost as good as using this +neural network from 2013 in terms of the + +897 +01:08:30,520 --> 01:08:35,400 +information that they're about the image +you can do almost equal in performance + +898 +01:08:35,399 --> 01:08:40,279 +for classification perhaps even more +striking results here we're comparing + +899 +01:08:40,279 --> 01:08:43,759 +the fed a lot of images through the +competition at work and they got this + +900 +01:08:43,760 --> 01:08:46,720 +month he took a lot of images and then +you look at how these images are + +901 +01:08:46,720 --> 01:08:48,789 +represented in the brain or in the +comment + +902 +01:08:48,789 --> 01:08:53,019 +so these are two spaces representation +of how images are arranged in the space + +903 +01:08:53,020 --> 01:08:57,520 +by the comment and you can compare the +similarity matrices and statistics + +904 +01:08:57,520 --> 01:09:00,450 +you'll see that the I T cortex and the +comment + +905 +01:09:00,449 --> 01:09:04,099 +that's are basically very very similar +representation there's a mapping between + +906 +01:09:04,100 --> 01:09:08,440 +them it almost seems like similar things +are being computed the way they arranged + +907 +01:09:08,439 --> 01:09:12,399 +a visual space of different concepts and +what's closed and what's far is very + +908 +01:09:12,399 --> 01:09:16,809 +very remarkably similar to what you see +in the in the brain and so some people + +909 +01:09:16,810 --> 01:09:20,780 +think that this is just some evidence +that companies are doing something brain + +910 +01:09:20,779 --> 01:09:23,769 +like and that's very interesting so the +only question that remains then in that + +911 +01:09:23,770 --> 01:09:24,330 +case + +912 +01:09:24,329 --> 01:09:27,210 +is this work + +913 +01:09:27,210 --> 01:09:28,609 +and we'll find out the next class + diff --git a/captions/En/Lecture8_en.srt b/captions/En/Lecture8_en.srt new file mode 100644 index 00000000..62222deb --- /dev/null +++ 
b/captions/En/Lecture8_en.srt @@ -0,0 +1,4377 @@ +1 +00:00:00,000 --> 00:00:07,519 +clocks let's let's get started so I know +lecture today a little bit of a break so + +2 +00:00:07,519 --> 00:00:11,269 +today we're the last time we talked +about sort of we saw all the parts of + +3 +00:00:11,269 --> 00:00:14,439 +comments we put everything together +today we're going to see some + +4 +00:00:14,439 --> 00:00:16,250 +applications of contacts + +5 +00:00:16,250 --> 00:00:20,550 +aspect actually dive inside images and +talk about spatial localization and + +6 +00:00:20,550 --> 00:00:25,550 +detection we were we actually moved this +lecture up a little bit we had it later + +7 +00:00:25,550 --> 00:00:29,080 +on the schedule we saw a lot of guys +were interested in this type of projects + +8 +00:00:29,079 --> 00:00:31,839 +who wanted to move it earlier to kind of +give you an idea of what's what's + +9 +00:00:31,839 --> 00:00:38,378 +feasible so first couple administrative +things are the project proposals were + +10 +00:00:38,378 --> 00:00:41,988 +doing Saturday my inbox kind of exploded +over the weekend so I think most of you + +11 +00:00:41,988 --> 00:00:45,909 +submit it but if you didn't you should +probably get on that we're in the + +12 +00:00:45,909 --> 00:00:49,328 +process of looking through those will go +to make sure that the project proposals + +13 +00:00:49,329 --> 00:00:52,530 +are reasonable never once admitted one +so we'll hopefully get back to you on + +14 +00:00:52,530 --> 00:01:02,149 +your projects this week also home or two +is due on Friday so who's who's done who + +15 +00:01:02,149 --> 00:01:04,519 +stuck on patch norm + +16 +00:01:04,519 --> 00:01:09,820 +okay good good that's fewer hands then +we saw last week so we're making + +17 +00:01:09,819 --> 00:01:13,688 +progress also keep in mind that we're +asking you to actually trained a pretty + +18 +00:01:13,688 --> 00:01:17,798 +big continent on C far for this homework +so if you're starting to train on + +19 +00:01:17,799 --> 00:01:22,570 +Thursday night that might be top so +maybe start early on that last part also + +20 +00:01:22,569 --> 00:01:25,618 +homework 1 were in the process of +creating hopefully we'll have those back + +21 +00:01:25,618 --> 00:01:30,540 +to this week you can get feedback before +homework to do also keep in mind though + +22 +00:01:30,540 --> 00:01:35,450 +we actually have a in class midterm next +week on Wednesday so that's a week from + +23 +00:01:35,450 --> 00:01:41,159 +Wednesday so be ready in class should be +a lot of fun + +24 +00:01:41,159 --> 00:01:46,359 +alright so last lecture we were talking +about competition that works we can + +25 +00:01:46,358 --> 00:01:50,438 +absolve the pieces we spent a long time +understanding how this convolution + +26 +00:01:50,438 --> 00:01:53,699 +operator works how we're sort of +transforming feature maps from one to + +27 +00:01:53,700 --> 00:01:58,329 +another by running into products over by +sliding this window over the map + +28 +00:01:58,328 --> 00:02:01,809 +computing products and actually +transforming our representation through + +29 +00:02:01,810 --> 00:02:05,759 +many layers of processing and if you +remember if you remember these lower + +30 +00:02:05,759 --> 00:02:09,299 +layers of convolutions tent wherein +things like edges and colors and higher + +31 +00:02:09,299 --> 00:02:14,790 +layers tend to learn more complex object +parts we talked about pulling which is + +32 +00:02:14,789 --> 00:02:18,509 +used to some sample and downsize 
our +feature representations inside networks + +33 +00:02:18,509 --> 00:02:24,209 +that's a common ingredient we saw we +also did case studies on particular + +34 +00:02:24,209 --> 00:02:27,479 +content architectures you could see how +these things tend to get hooked up in + +35 +00:02:27,479 --> 00:02:31,568 +practice so we talk about one at which +is something from 98 it's a little fiber + +36 +00:02:31,568 --> 00:02:35,189 +content that was used four digit +recognition we talked about Alex not the + +37 +00:02:35,189 --> 00:02:38,949 +kind of kicked off the big deep deep +learning boom in 2012 by winning image + +38 +00:02:38,949 --> 00:02:45,568 +not come that we talked about ZF that +one image net classification in 2013 was + +39 +00:02:45,568 --> 00:02:51,108 +pretty similar to Alex now and then we +saw that deeper is often better for + +40 +00:02:51,109 --> 00:02:55,709 +classification we looked at Google Matt +and PGG that did really well in 2014 + +41 +00:02:55,709 --> 00:03:00,609 +competitions that were much much deeper +than Alex Natanz and a lot better and we + +42 +00:03:00,609 --> 00:03:05,430 +also saw this new fancy crazy thing for +Microsoft called the ResNet that one + +43 +00:03:05,430 --> 00:03:10,909 +just in december in 2015 with hundred +and fifty where architecture and as your + +44 +00:03:10,909 --> 00:03:14,579 +caller just over the last couple years +these different architectures have been + +45 +00:03:14,579 --> 00:03:19,109 +getting deeper and getting a lot better +but this is just for classification so + +46 +00:03:19,109 --> 00:03:23,980 +now in this lecture we're going to talk +about localisation and detection which + +47 +00:03:23,979 --> 00:03:28,500 +is actually another really big important +problem in computer vision and this idea + +48 +00:03:28,500 --> 00:03:32,699 +of deeper networks doing better chance +that all kind of will revisit that a lot + +49 +00:03:32,699 --> 00:03:37,798 +in these new attacks as well so so far +in the class we've really been talking + +50 +00:03:37,799 --> 00:03:42,639 +about classification which is sort of +given an image we want to classify which + +51 +00:03:42,639 --> 00:03:47,049 +are some number object categories it is +that's not nice basic problem in + +52 +00:03:47,049 --> 00:03:50,340 +computer vision that we've using that +were using to understand comments and + +53 +00:03:50,340 --> 00:03:53,800 +such but there's actually a lot of other +tasks that people were coming to + +54 +00:03:53,800 --> 00:03:59,350 +so some of these are classification and +localisation now instead of just + +55 +00:03:59,349 --> 00:04:03,699 +classifying an edge as well as some +category labels we also want to drop + +56 +00:04:03,699 --> 00:04:07,349 +down box in the image to say where that +class occurs + +57 +00:04:07,349 --> 00:04:11,549 +another problem people work on its +detection so here there's again some + +58 +00:04:11,550 --> 00:04:15,689 +pics number of object categories but we +actually want to find all instances of + +59 +00:04:15,689 --> 00:04:20,238 +those categories inside the image and +Dropbox around them another more recent + +60 +00:04:20,238 --> 00:04:24,189 +task but people have started to work on +a bit as this crazy thing called instant + +61 +00:04:24,189 --> 00:04:27,490 +segmentation where again you want you +have some pics number about two + +62 +00:04:27,490 --> 00:04:30,829 +categories you want to find all +instances of those categories your image + +63 +00:04:30,829 --> 00:04:35,319 +but instead of using 
a box you actually +want to draw little contour around and + +64 +00:04:35,319 --> 00:04:37,279 +identify all the pixels + +65 +00:04:37,279 --> 00:04:41,549 +belonging to each instance instance +segmentations kind of crazy so we're not + +66 +00:04:41,550 --> 00:04:44,710 +going to talk about that today just +thought you should be aware of it and + +67 +00:04:44,709 --> 00:04:47,959 +we're gonna really focus on this these +localisation and detection tasks today + +68 +00:04:47,959 --> 00:04:52,009 +and the big difference between these is +the number of objects that were finding + +69 +00:04:52,009 --> 00:04:56,250 +so and localisation there's sort of one +object or in general effects number of + +70 +00:04:56,250 --> 00:05:00,129 +objects whereas in detection we might +have multiple objects or a variable + +71 +00:05:00,129 --> 00:05:04,000 +number of objects and this seems like a +small difference but it'll turn out to + +72 +00:05:04,000 --> 00:05:05,360 +actually make a big + +73 +00:05:05,360 --> 00:05:10,480 +be pretty important for architectures so +we're gonna first talked about + +74 +00:05:10,480 --> 00:05:15,610 +classification and localisation cuz its +kind of the simplest so just to recap + +75 +00:05:15,610 --> 00:05:16,389 +what I just sad + +76 +00:05:16,389 --> 00:05:21,849 +classification one image to a category +label localisation is image to a box and + +77 +00:05:21,850 --> 00:05:26,730 +classification localisation means we're +gonna do both the same time just to give + +78 +00:05:26,730 --> 00:05:30,669 +you an idea of the kinds of dance that +people use for this we talked we've + +79 +00:05:30,668 --> 00:05:33,849 +talked about the image that +classification challenge image not also + +80 +00:05:33,850 --> 00:05:37,810 +has run a classification + localisation +challenge so here + +81 +00:05:37,810 --> 00:05:42,269 +similar to the classification task +there's a thousand classes and each + +82 +00:05:42,269 --> 00:05:46,319 +training instance in those classes +actually has one class and several + +83 +00:05:46,319 --> 00:05:51,069 +bounding boxes for that class inside the +image and now a test tinier algorithm + +84 +00:05:51,069 --> 00:05:55,709 +organics bypasses where instead of your +guesses just being class labels it's a + +85 +00:05:55,709 --> 00:05:59,370 +class label together with the bounding +box and to get it right you need to get + +86 +00:05:59,370 --> 00:06:03,288 +the class label rights and the bounding +box rights we're getting a bounding box + +87 +00:06:03,288 --> 00:06:06,589 +right just means you're close in some +thing called intersection of + +88 +00:06:06,589 --> 00:06:11,310 +that you don't need to care about too +much right now so again you get it for + +89 +00:06:11,310 --> 00:06:15,259 +image that at least you get it right if +you one of your 5 gases is correct and + +90 +00:06:15,259 --> 00:06:18,129 +this is kind of the main dataset people +work on for classification + + +91 +00:06:18,129 --> 00:06:25,159 +localisation so one really fundamental +paradigm it's really useful when + +92 +00:06:25,160 --> 00:06:28,700 +thinking about localisation is this idea +of regression so I don't know if + +93 +00:06:28,699 --> 00:06:31,219 +thinking back to a machine learning +class you kind of saw like + +94 +00:06:31,220 --> 00:06:36,160 +classification and regression may be +with me regression or something fancier + +95 +00:06:36,160 --> 00:06:39,689 +and when we're talking about +localisation it's really implies we can + +96 +00:06:39,689 --> 
00:06:42,980 +really just frame this as a regression +problem where we have an image that's + +97 +00:06:42,980 --> 00:06:46,700 +coming in that image is going to go +through some some processing and/or + +98 +00:06:46,699 --> 00:06:49,990 +eventually going to produce for +real-valued numbers that promote rise + +99 +00:06:49,990 --> 00:06:53,829 +this box there's different +parameterizations people use common is + +100 +00:06:53,829 --> 00:06:57,759 +XY coordinates of the upper left hand +corner and the width and height of the + +101 +00:06:57,759 --> 00:07:01,000 +box but you'll see some other variants +as well but always four numbers for + +102 +00:07:01,000 --> 00:07:04,680 +bounding box and then there's some +ground truth bounding box which again is + +103 +00:07:04,680 --> 00:07:08,810 +just four numbers and now we have we can +compute a loss like maybe Euclidean + +104 +00:07:08,810 --> 00:07:12,699 +losses a pretty pretty standard choice +between the numbers that we produced in + +105 +00:07:12,699 --> 00:07:16,339 +the correct numbers and now we can just +turn this thing just like we did our + +106 +00:07:16,339 --> 00:07:20,489 +classification networks where we sample +so many batch of data with some ground + +107 +00:07:20,490 --> 00:07:24,210 +truth boxes we propagate forward +computer lost between our predictions + +108 +00:07:24,209 --> 00:07:29,359 +and the correct predictions back +propagate and just update the network so + +109 +00:07:29,360 --> 00:07:33,250 +this paradigm is is really easy that's +actually makes this localization task + +110 +00:07:33,250 --> 00:07:37,269 +actually pretty easy to implement so +here's a really simple recipe for how + +111 +00:07:37,269 --> 00:07:41,289 +you could implement classification + +localisation so first you just download + +112 +00:07:41,290 --> 00:07:44,370 +some existing preteen model are you +train yourself if you're ambitious + +113 +00:07:44,370 --> 00:07:48,139 +something like Alex Knight BGG Google +met all these things we talked about + +114 +00:07:48,139 --> 00:07:53,180 +last lecture now we're going to take +those fully connected layers that were + +115 +00:07:53,180 --> 00:07:57,100 +producing our class scores were gonna +set those aside for the moment and we're + +116 +00:07:57,100 --> 00:08:00,410 +gonna attach a couple new fully +connected layers to some point in the + +117 +00:08:00,410 --> 00:08:04,840 +network this will be called call this a +regression had but I mean it's basically + +118 +00:08:04,839 --> 00:08:08,119 +the same thing as a couple fully +connected layers and then I'll puts some + +119 +00:08:08,120 --> 00:08:13,889 +real valued numbers now we train this +thing just like we train our + +120 +00:08:13,889 --> 00:08:17,209 +classification network the only +difference is that now instead of class + +121 +00:08:17,209 --> 00:08:18,359 +wars + +122 +00:08:18,360 --> 00:08:24,550 +and graduate classes we use Lt loss and +crown jewel boxes of Matt we train this + +123 +00:08:24,550 --> 00:08:28,918 +network exactly the same way now it s +time we just use both heads to do + +124 +00:08:28,918 --> 00:08:32,218 +classification and localisation we have +an image we've changed the + +125 +00:08:32,219 --> 00:08:36,700 +classification has we train +delocalization heads we pass it through + +126 +00:08:36,700 --> 00:08:40,620 +we get class course we get boxes and +when we're done like really that's all + +127 +00:08:40,620 --> 00:08:44,259 +you need to do so this is kind of a +really nice simple recipe that you 
guys + +128 +00:08:44,259 --> 00:08:50,208 +could use for classification + +localisation on your projects so other + +129 +00:08:50,208 --> 00:08:54,750 +one slight detail with this approach +there's sort of two main ways that + +130 +00:08:54,750 --> 00:08:59,990 +people do this regression task you could +imagine a class agnostic regresar or + +131 +00:08:59,990 --> 00:09:04,190 +class-specific regresar you could +imagine that no matter what class I'm + +132 +00:09:04,190 --> 00:09:07,760 +going to use the same architecture the +same weights in those fully connected + +133 +00:09:07,759 --> 00:09:11,600 +layers to produce my bounding box that +would be in your sort of outputting + +134 +00:09:11,600 --> 00:09:15,379 +always four numbers which are just the +box no matter the class I'm an + +135 +00:09:15,379 --> 00:09:19,139 +alternative you'll see sometimes it's +class-specific regression we're now + +136 +00:09:19,139 --> 00:09:23,389 +you're gonna put see times for numbers +that's sort of like one bounding box per + +137 +00:09:23,389 --> 00:09:27,569 +class and different people have found +that sometimes these work better and + +138 +00:09:27,570 --> 00:09:31,269 +different cases but it i mean +intuitively it kind of makes sense that + +139 +00:09:31,269 --> 00:09:35,470 +something that the way you might think +about localizing a cat could be a little + +140 +00:09:35,470 --> 00:09:38,129 +bit different than the way you localize +are trained so maybe you wanna have + +141 +00:09:38,129 --> 00:09:42,289 +different parts of your network that are +responsible for those things but it's + +142 +00:09:42,289 --> 00:09:45,569 +it's pretty easy venue just it changes +your back the way you came to Los a + +143 +00:09:45,570 --> 00:09:49,329 +little bit you compute loss only using +the ground truth class + +144 +00:09:49,328 --> 00:09:52,809 +the box for the ground truth class but +even that still basically the same idea + +145 +00:09:52,809 --> 00:09:57,750 +and other design choice here is where +exactly you attach the regression had + +146 +00:09:57,750 --> 00:10:01,360 +again this isn't too important different +people if you'll see different people do + +147 +00:10:01,360 --> 00:10:05,120 +it in different ways some common choices +would be to attach it right after the + +148 +00:10:05,120 --> 00:10:09,948 +fall of the last convolutional air just +sort of mean like you're really serious + +149 +00:10:09,948 --> 00:10:14,909 +initializing new fully connected layers +will see things like over feet and BG + +150 +00:10:14,909 --> 00:10:18,909 +localisation work this way another +common choice is to just attach your + +151 +00:10:18,909 --> 00:10:22,939 +aggression had actually after the last +fully connected layers from the + +152 +00:10:22,940 --> 00:10:27,310 +classification of work and you'll see +some other things like depots on our CNN + +153 +00:10:27,309 --> 00:10:31,099 +kind of work in this labor but either +one works fine + +154 +00:10:31,100 --> 00:10:38,129 +you could attach to just about anywhere +and do something so as an aside this is + +155 +00:10:38,129 --> 00:10:42,029 +we can actually generalize this +framework to localizing more than one + +156 +00:10:42,029 --> 00:10:46,610 +object so normally with this +classification localisation task that we + +157 +00:10:46,610 --> 00:10:50,440 +sort of set up an image that we care +about producing exactly one object + +158 +00:10:50,440 --> 00:10:54,620 +bounding box for the input image but in +some cases you might know ahead of time + 
+159 +00:10:54,620 --> 00:10:59,279 +that you always want to localize some +fixed number of objects so here this is + +160 +00:10:59,279 --> 00:11:03,730 +really easy to generalize now your +aggression had just outputs box for each + +161 +00:11:03,730 --> 00:11:07,039 +of those objects that you care about and +again you train the network in the same + +162 +00:11:07,039 --> 00:11:12,839 +way and this idea of actually localizing +multiple objects the same time is pretty + +163 +00:11:12,840 --> 00:11:16,790 +general and pretty powerful so for +example this kind of approach has been + +164 +00:11:16,789 --> 00:11:21,559 +used for human pose estimation so the +idea is we want to input a crime a + +165 +00:11:21,559 --> 00:11:25,299 +close-up view of a person and anyone to +figure out what's the pose of that + +166 +00:11:25,299 --> 00:11:29,789 +person so well people sort of generally +have a fixed number of joints like their + +167 +00:11:29,789 --> 00:11:34,370 +breasts and their neck and their elbows +and that sort of stuff so we just know + +168 +00:11:34,370 --> 00:11:39,060 +that we need to find all the joints so +we import our image we run it through a + +169 +00:11:39,059 --> 00:11:43,829 +convolutional network and we regress xy +coordinates for each joint location and + +170 +00:11:43,830 --> 00:11:47,490 +that gives us our action that actually +lets you predict a whole human pose + +171 +00:11:47,490 --> 00:11:52,409 +using the sort of localisation framework +in this paper and there's a paper from + +172 +00:11:52,409 --> 00:11:55,819 +Google from a year or two ago that does +this sort of approach that a couple + +173 +00:11:55,820 --> 00:11:59,740 +other bells and whistles but the basic +idea was just regressing using a CNN to + +174 +00:11:59,740 --> 00:12:05,100 +these joint sessions so overall this +idea of localisation and treating it as + +175 +00:12:05,100 --> 00:12:09,769 +regression 46 number of objects is +really really simple so I know some of + +176 +00:12:09,769 --> 00:12:12,659 +you guys on your projects have been +thinking about you want to actually run + +177 +00:12:12,659 --> 00:12:16,850 +detection cause you want to understand +like any parts of your images or find + +178 +00:12:16,850 --> 00:12:21,290 +parts inside the image and if you're +thinking of a project along those lines + +179 +00:12:21,289 --> 00:12:25,019 +I really encourage you to think about +this localization framework instead that + +180 +00:12:25,019 --> 00:12:27,750 +if there's actually a fixed number of +objects that you know you want to + +181 +00:12:27,750 --> 00:12:31,929 +localize and every image you should try +to frame it as a localization problem + +182 +00:12:31,929 --> 00:12:38,129 +that's tends to be a lot easier to setup +alright so actually the simple idea of + +183 +00:12:38,129 --> 00:12:42,019 +localisation via regression actually is +really simple it'll actually work I + +184 +00:12:42,019 --> 00:12:44,120 +would really encourage you to try it for +projects + +185 +00:12:44,120 --> 00:12:47,330 +but if you wanna win competitions like +image that you need to add a little bit + +186 +00:12:47,330 --> 00:12:52,330 +of other fancy stuff so another thing +that people do for localisation is this + +187 +00:12:52,330 --> 00:12:56,410 +idea of sliding window so we'll step +through this in more detail but the idea + +188 +00:12:56,409 --> 00:13:00,809 +is that you still have your +classification localisation two-headed + +189 +00:13:00,809 --> 00:13:04,929 +network but you're actually 
gonna run it +not once on the image but at multiple + +190 +00:13:04,929 --> 00:13:08,269 +positions on the image and you're gonna +aggregated across those different + +191 +00:13:08,269 --> 00:13:13,100 +positions and you can actually do this +in an efficient way so it took sort of + +192 +00:13:13,100 --> 00:13:17,290 +see more concretely how how this sliding +window localisation works we're gonna + +193 +00:13:17,289 --> 00:13:21,980 +look at the over-the-air architecture so +over feat was actually the winner of the + +194 +00:13:21,980 --> 00:13:25,399 +image that localisation challenge in +2013 + +195 +00:13:25,399 --> 00:13:29,730 +it this this architect this this sort of +setup looks basically like what we saw a + +196 +00:13:29,730 --> 00:13:33,839 +couple nights ago we have an Alex not at +the beginning then we have a + +197 +00:13:33,839 --> 00:13:37,820 +classification had to have a regression +had classification head is spinning out + +198 +00:13:37,820 --> 00:13:38,740 +class for us + +199 +00:13:38,740 --> 00:13:44,450 +regression had a speeding up the boxes +and this thing because it's in Alex nat + +200 +00:13:44,450 --> 00:13:51,120 +type of architecture is expecting an +input of 221 221 but actually we can run + +201 +00:13:51,120 --> 00:13:55,679 +this on larger images and this can help +sometimes so suppose we have a large + +202 +00:13:55,679 --> 00:14:02,799 +larger image of what and when I say 257 +by 257 now we could imagine taking our + +203 +00:14:02,799 --> 00:14:06,659 +classification + localisation network +and running at just on the upper corner + +204 +00:14:06,659 --> 00:14:11,799 +of this image and that'll give us some +some class score and also some summer + +205 +00:14:11,799 --> 00:14:15,979 +grass bounding box and we're gonna +repeat this take our same classification + +206 +00:14:15,980 --> 00:14:21,820 ++ localisation network and run it on all +four corners of this image and after + +207 +00:14:21,820 --> 00:14:26,230 +doing so will end up with for grass +bounding boxes one from each of those + +208 +00:14:26,230 --> 00:14:30,509 +four locations together with a class +classification score for each location + +209 +00:14:30,509 --> 00:14:35,700 +but we actually want just a single +bounding box so then they use some + +210 +00:14:35,700 --> 00:14:39,770 +heuristics to Mercedes bounding boxes in +scores and that puts a little bit ugly I + +211 +00:14:39,769 --> 00:14:42,809 +don't wanna go into the details here +they have it in the paper but the idea + +212 +00:14:42,809 --> 00:14:46,699 +is that public combining aggregating +these boxes across multiple locations + +213 +00:14:46,700 --> 00:14:50,959 +can help that can help the model sort of +credits on airs and this tends to work + +214 +00:14:50,958 --> 00:14:55,058 +really well and that mean that won them +the challenge that year + +215 +00:14:55,058 --> 00:14:58,149 +but in practice they actually use many +more than four locations + +216 +00:14:58,149 --> 00:15:08,989 +oh ya ought to be fully with them + +217 +00:15:08,989 --> 00:15:12,939 +well I mean it's actually good point so +once you're doing regression you're just + +218 +00:15:12,938 --> 00:15:15,498 +predicting for numbers you couldn't +crack you couldn't be reproduced + +219 +00:15:15,499 --> 00:15:20,149 +anywhere it doesn't have to be inside +the image although I know that brings up + +220 +00:15:20,149 --> 00:15:23,698 +a good point when you're doing this +especially when they when you're + +221 +00:15:23,698 --> 00:15:27,088 +training 
this network in this sliding +window way you actually to ship the + +222 +00:15:27,089 --> 00:15:30,429 +ground truth box in a little bit ship +ship the coordinate frame for those + +223 +00:15:30,428 --> 00:15:35,999 +different slices that's kind of an ugly +details just worried about ya but in + +224 +00:15:35,999 --> 00:15:39,428 +practice they use many more than four +image locations and they actually do + +225 +00:15:39,428 --> 00:15:43,629 +multiple scales as well as you can see +this is actually figure from their paper + +226 +00:15:43,629 --> 00:15:47,129 +I'm a left you see all the different +positions where they kind of evaluated + +227 +00:15:47,129 --> 00:15:52,058 +this network in the middle you see those +output progressed boxes one for each of + +228 +00:15:52,058 --> 00:15:55,678 +those positions on the bottom easy to +score map for each of those positions + +229 +00:15:55,678 --> 00:16:00,139 +and then I mean they're pretty noisy but +it's kinda convert their kind of + +230 +00:16:00,139 --> 00:16:03,899 +generally over the bear so they'd run +this fancy aggregation method and they + +231 +00:16:03,899 --> 00:16:07,839 +get a final box for the bear and they +decide that the same as a pair and they + +232 +00:16:07,839 --> 00:16:12,869 +actually won the challenge with this but +one problem you might anticipate is it + +233 +00:16:12,869 --> 00:16:15,759 +could be pretty expensive to actually +run the network on every one of those + +234 +00:16:15,759 --> 00:16:20,259 +crops but there's actually more +efficient with thing we could do so we + +235 +00:16:20,259 --> 00:16:23,489 +normally think of these networks as +having convolutional errors and then + +236 +00:16:23,489 --> 00:16:26,048 +fully connected Lares but when you think +about it + +237 +00:16:26,048 --> 00:16:31,108 +a fully connected larry is just 4096 +numbers right it's just a factor but + +238 +00:16:31,109 --> 00:16:34,679 +instead of thinking of it as a vector we +could think of it as just another + +239 +00:16:34,678 --> 00:16:39,269 +convolutional feature map is kinda crazy +we just transpose that added to + +240 +00:16:39,269 --> 00:16:45,019 +one-by-one dimensions so now the idea is +that we can now treat our car fully + +241 +00:16:45,019 --> 00:16:49,499 +connected layers and convert them into +convolutional there's a few imagined in + +242 +00:16:49,499 --> 00:16:54,339 +our fully connected network we had this +convolutional feature map and we had one + +243 +00:16:54,339 --> 00:16:57,749 +way from each element of that +competition will feature map to produce + +244 +00:16:57,749 --> 00:17:02,048 +each element of our 4096 dimensional +vector but we instead of thinking about + +245 +00:17:02,048 --> 00:17:06,288 +reshaping and having a fine layer that's +sort of equivalent to just having a five + +246 +00:17:06,288 --> 00:17:06,970 +by five + +247 +00:17:06,970 --> 00:17:10,120 +solution it's a little bit weird but if +you think about it it should make sense + +248 +00:17:10,119 --> 00:17:16,318 +eventually but alright so then we take +this fully connected later turns into a + +249 +00:17:16,318 --> 00:17:21,899 +five by five convolution than this than +we previously had another fully + +250 +00:17:21,900 --> 00:17:26,409 +connected mayor going from 4096 4096 +this is actually a one-by-one + +251 +00:17:26,409 --> 00:17:30,570 +convolution right that's that's kinda +weird but if you if you think hard and + +252 +00:17:30,569 --> 00:17:35,369 +work out the math on paper and go send a +quiet room you'll 
figure it out and so + +253 +00:17:35,369 --> 00:17:38,769 +we basically can't earn each of these +fully connected layers and our network + +254 +00:17:38,769 --> 00:17:43,509 +into a convolutional air and now now +this is pretty cool because now our + +255 +00:17:43,509 --> 00:17:47,589 +network is composed entirely of just +contributions and pooling and elements + +256 +00:17:47,589 --> 00:17:51,819 +operations so now we can actually run +the network on images of different sizes + +257 +00:17:51,819 --> 00:17:56,889 +and this sort of will give us very +cheaply equip the equivalent of + +258 +00:17:56,890 --> 00:18:01,840 +operating but not work independently on +different locations so to kind of see + +259 +00:18:01,839 --> 00:18:02,609 +how that works + +260 +00:18:02,609 --> 00:18:07,219 +you imagine a training time you may be +working over 14 by 14 template you run + +261 +00:18:07,220 --> 00:18:11,960 +some convolutions and then here are are +fully connected layers that we're now + +262 +00:18:11,960 --> 00:18:17,140 +re-imagining as convolutional Ayers said +and we have this by by five con block + +263 +00:18:17,140 --> 00:18:22,600 +that gets turned into these one-by-one +specially sized elements so we've sort + +264 +00:18:22,599 --> 00:18:26,449 +of eliminating not showing the depth +dimension here but these like this one + +265 +00:18:26,450 --> 00:18:30,900 +by one would be one by one by 4096 +rights or just converting these layers + +266 +00:18:30,900 --> 00:18:35,259 +into a convolutional there's now that we +know that their convolutions we could + +267 +00:18:35,259 --> 00:18:39,700 +actually run on in part of a larger size +and you can see that now we've got we've + +268 +00:18:39,700 --> 00:18:43,558 +added a couple extra pixels and now we +actually run all these things the + +269 +00:18:43,558 --> 00:18:47,869 +convolutions and get a two-by-two output +but what's really cool here is that + +270 +00:18:47,869 --> 00:18:52,058 +we're able to share computation to make +this really efficient so now our output + +271 +00:18:52,058 --> 00:18:56,428 +is four times as big but we've done much +less than four times the compute cuz if + +272 +00:18:56,429 --> 00:19:00,360 +you think about the difference between +where we're doing computation here the + +273 +00:19:00,359 --> 00:19:04,449 +only extra computation happened in these +yellow parts so now we're actually very + +274 +00:19:04,450 --> 00:19:08,610 +efficiently evaluating the network at +many many different positions without + +275 +00:19:08,609 --> 00:19:11,918 +actually spending much computation so +this is how they're able to evaluate + +276 +00:19:11,919 --> 00:19:15,240 +that network in that very very dense +multiscale way that you saw a couple + +277 +00:19:15,240 --> 00:19:19,388 +nights ago that make sense any questions +on this + +278 +00:19:19,388 --> 00:19:25,558 +ok writes actually we can look at the +classification + localisation results on + +279 +00:19:25,558 --> 00:19:30,858 +a mission over the last couple of years +so in 2012 Alex Alex Kozinski Jack + +280 +00:19:30,858 --> 00:19:36,358 +Hinton they won not only classification +but also localisation but I wasn't able + +281 +00:19:36,358 --> 00:19:40,978 +to find any published details of exactly +how they did that in 2013 was the + +282 +00:19:40,979 --> 00:19:45,249 +over-the-top that we just saw actually +improved on Alex's results a little bit + +283 +00:19:45,249 --> 00:19:50,429 +the year after we talked about VGG and +they're sort of really deep 19 
their + +284 +00:19:50,429 --> 00:19:54,009 +network they got second place on +classification but actually 1 I'm + +285 +00:19:54,009 --> 00:19:59,139 +localisation and the BGG actually used +basically exactly the same strategy that + +286 +00:19:59,138 --> 00:20:03,918 +over feat dead they just use the deeper +network and actually interesting the BGG + +287 +00:20:03,919 --> 00:20:08,288 +used fewer scales they stand pat network +out in fewer places and used fewer + +288 +00:20:08,288 --> 00:20:12,878 +skills but they actually decrease the +era quite a bit so basically the only + +289 +00:20:12,878 --> 00:20:17,868 +difference being over feet and BG here +is that BGU the deeper network so here + +290 +00:20:17,868 --> 00:20:20,858 +we could see that these really powerful +image features actually improve the + +291 +00:20:20,858 --> 00:20:24,098 +localization performance quite a bit +with enough to change the localisation + +292 +00:20:24,098 --> 00:20:28,418 +architecture at all we just swapped out +about her CNN and it improved results a + +293 +00:20:28,419 --> 00:20:34,169 +lot and then this year in 2015 Microsoft +swept everything as that'll be a theme + +294 +00:20:34,169 --> 00:20:39,239 +in this lecture as well this this +hundred fifty lair ResNet from Microsoft + +295 +00:20:39,239 --> 00:20:43,629 +crushed localisation here and drunk +proper performance from 25 all the way + +296 +00:20:43,628 --> 00:20:48,738 +down to nine but I mean this this is a +little bit and this is talk to really + +297 +00:20:48,739 --> 00:20:52,798 +isolate the deep features so yes they +did have deeper features but Microsoft + +298 +00:20:52,798 --> 00:20:56,398 +actually it's a different localization +method called rpms region proposal + +299 +00:20:56,398 --> 00:21:00,699 +networks so it's not really clear +whether this which part whether it's a + +300 +00:21:00,700 --> 00:21:04,929 +better localization strategy or whether +the better features but at any rate they + +301 +00:21:04,929 --> 00:21:10,139 +did really well that's pretty much all I +want to say about classification + +302 +00:21:10,138 --> 00:21:13,848 +localisation just consider doing it for +projects and if there's any questions + +303 +00:21:13,848 --> 00:21:19,509 +about this task we should talk about +that now before moving on ya + +304 +00:21:19,509 --> 00:21:32,890 +performance especially with a loss right +so then I'll two losses when having + +305 +00:21:32,890 --> 00:21:37,050 +outliers is actually really bad so +sometimes people don't use an L to loss + +306 +00:21:37,049 --> 00:21:40,609 +instead you can try and sell one loss +that can help with outliers a little bit + +307 +00:21:40,609 --> 00:21:45,279 +people also will do sometimes a smooth +one loss where it looks like he'll one + +308 +00:21:45,279 --> 00:21:49,339 +sort of a tales but then near zero it'll +be quadratic so actually swapping out + +309 +00:21:49,339 --> 00:21:53,319 +that regression loss function can help a +bit with outliers sometimes but also if + +310 +00:21:53,319 --> 00:21:56,399 +you have a little bit of noise sometimes +hopefully you're not just figured out + +311 +00:21:56,400 --> 00:22:14,380 +cross your fingers don't think too hard +questions questions so people do both + +312 +00:22:14,380 --> 00:22:18,560 +actually I'm so over feet actually I +don't remember I don't remember exactly + +313 +00:22:18,559 --> 00:22:23,409 +which oversee dead but BGG actually +backdrops into the entire network so + +314 +00:22:23,410 --> 00:22:27,230 +it'll be 
it'll be faster to just +actually work fine if you just trained + +315 +00:22:27,230 --> 00:22:30,289 +in the regression had but you'll tend to +get a little bit better results if you + +316 +00:22:30,289 --> 00:22:34,049 +back drop into the home network and BG +did this experiment and they got maybe + +317 +00:22:34,049 --> 00:22:37,659 +one or two points extra buyback dropping +through the whole thing but it at the + +318 +00:22:37,660 --> 00:22:41,320 +expense of a lot more competition and +training time so it so I would I would + +319 +00:22:41,319 --> 00:22:44,769 +say it like as a first thing don't just +talk tried not back dropping and the + +320 +00:22:44,769 --> 00:22:50,440 +network + +321 +00:22:50,440 --> 00:22:57,110 +generally not right because your testing +on the same classes that you saw + +322 +00:22:57,109 --> 00:23:00,839 +training time you're gonna see different +instances obviously but I mean you're + +323 +00:23:00,839 --> 00:23:04,759 +still bears a tough time in OC bears at +training time we're not expecting you to + +324 +00:23:04,759 --> 00:23:07,370 +generalize across classes I'll be pretty +hard + +325 +00:23:07,369 --> 00:23:20,638 +yea good question yes so sometimes +people will do that they'll train with + +326 +00:23:20,638 --> 00:23:24,349 +both simultaneously also sometimes +people will just end up with separate + +327 +00:23:24,349 --> 00:23:27,089 +networks one that sort of only +responsible for aggression when it's + +328 +00:23:27,089 --> 00:23:38,089 +only responsible for classification +those both work well glad you asked + +329 +00:23:38,089 --> 00:23:40,558 +that's that's actually the next thing +we're gonna talk about that's that's a + +330 +00:23:40,558 --> 00:23:50,740 +different task of object detection so + +331 +00:23:50,740 --> 00:23:56,808 +well yeah well so I mean it kinda +depends on the training strategy if + +332 +00:23:56,808 --> 00:23:59,920 +you're like if you also kind of goes +back to this idea of class agnostic + +333 +00:23:59,920 --> 00:24:03,610 +first-class Pacific regression class +agnostic regression it doesn't matter + +334 +00:24:03,609 --> 00:24:06,889 +you just regress to the boxes tomorrow +the class class specific you're sort of + +335 +00:24:06,890 --> 00:24:13,950 +training separate aggressors for each +class right let's talk about object + +336 +00:24:13,950 --> 00:24:19,220 +detection so object detection is is much +fancier much cooler but also a lot + +337 +00:24:19,220 --> 00:24:22,890 +harrier so the idea is that again we +have an input image we have some sort of + +338 +00:24:22,890 --> 00:24:26,660 +classes we want to find all instances of +those classes in that in that input + +339 +00:24:26,660 --> 00:24:31,670 +image so I mean you know regression +worked pretty well for localisation why + +340 +00:24:31,670 --> 00:24:37,470 +don't we try it for for detection to +mark as an SMS we have these these dogs + +341 +00:24:37,470 --> 00:24:41,429 +and cats and we have four things we have +16 numbers thats looks like that looks + +342 +00:24:41,429 --> 00:24:46,250 +like regression rate image in numbers +out but if we look at another image then + +343 +00:24:46,250 --> 00:24:50,609 +you know this one only has two things +coming out so it has eight numbers they + +344 +00:24:50,609 --> 00:24:54,589 +look at this one there's a whole bunch +of cats we need a bunch of numbers so I + +345 +00:24:54,589 --> 00:24:57,519 +mean it's it's kind of hard to treat +detection a straight-up regression + +346 +00:24:57,519 --> 
00:25:01,450 +because we have this problem of variable +size outputs so we're gonna have to do + +347 +00:25:01,450 --> 00:25:04,460 +something fancier although actually +there is a method will talk about later + +348 +00:25:04,460 --> 00:25:09,539 +that sort of does this anyway and does +treated as as regression but we'll get + +349 +00:25:09,539 --> 00:25:12,950 +to that we'll get to that later but in +general you wanna not treat this as + +350 +00:25:12,950 --> 00:25:18,360 +regression because you have very precise +outputs so we're really easy problem + +351 +00:25:18,359 --> 00:25:22,779 +really easy way to solve this is to +think of detection not as regression but + +352 +00:25:22,779 --> 00:25:25,960 +as classification right in machine +learning regression and classification + +353 +00:25:25,960 --> 00:25:29,929 +are your two hammers you just want to +use those to eat all your problems right + +354 +00:25:29,929 --> 00:25:34,250 +so we regression and works will do +classification instead we know how to + +355 +00:25:34,250 --> 00:25:38,558 +classify image regions we just for CNN +right we're going to do is we're gonna + +356 +00:25:38,558 --> 00:25:43,349 +take many of these input regions of the +image of a classifier there and say like + +357 +00:25:43,349 --> 00:25:46,129 +alright this region of the alleged +attack at No + +358 +00:25:46,130 --> 00:25:50,770 +as a dog know that over a little bit we +found a cat that's great but over a + +359 +00:25:50,769 --> 00:25:54,460 +little bit that's that's not anything so +then we can actually just try out a + +360 +00:25:54,460 --> 00:25:58,558 +whole bunch different image regions run +a classifier in each one and this will + +361 +00:25:58,558 --> 00:26:02,490 +basically solve our variable size output +problem + +362 +00:26:02,490 --> 00:26:11,160 +so there's there's no question so the +question of how decide how to decide + +363 +00:26:11,160 --> 00:26:14,558 +what the window size the answer is we +just tried them all right just literally + +364 +00:26:14,558 --> 00:26:18,879 +tried them all so that's that's that's +actually a big problem right because we + +365 +00:26:18,880 --> 00:26:21,910 +need to try Windows of different sizes +of different positions of different + +366 +00:26:21,910 --> 00:26:25,290 +scales me do this properly test and this +is going to be really expensive right + +367 +00:26:25,289 --> 00:26:39,089 +there's a whole lot of places we need to +look yeah also when you're doing this + +368 +00:26:39,089 --> 00:26:45,058 +you add an extra two things one you can +add an extra class to say background and + +369 +00:26:45,058 --> 00:26:49,569 +say like oh there's nothing here another +thing you can do is not is to actually + +370 +00:26:49,569 --> 00:26:54,159 +multi-label classification you cannot +put multiple positive things right + +371 +00:26:54,160 --> 00:26:56,950 +that's actually pretty easy to do and +just instead of a soft max loss you have + +372 +00:26:56,950 --> 00:27:01,390 +independent regression loss of +independent logistic regression class so + +373 +00:27:01,390 --> 00:27:05,100 +I can actually let you say yes I +multiple classes at one point but that's + +374 +00:27:05,099 --> 00:27:10,189 +just walking on a loss function so +that's that's pretty easy to do right so + +375 +00:27:10,190 --> 00:27:13,220 +actually like what we see a problem with +this approach is that there's a whole + +376 +00:27:13,220 --> 00:27:17,690 +bunch of different positions we need to +evaluate the solution sort of a couple + 
+377
+00:27:17,690 --> 00:27:21,308
+The solution a couple of years ago was to just use really fast
+
+378
+00:27:21,308 --> 00:27:26,299
+classifiers and try them all. So actually detection is this really old problem in
+
+379
+00:27:26,299 --> 00:27:29,119
+computer vision, so you should probably have a little bit more historical
+
+380
+00:27:29,119 --> 00:27:34,109
+perspective. Starting in about 2005 there was this really successful
+
+381
+00:27:34,109 --> 00:27:38,490
+approach to detection that used this feature
+
+382
+00:27:38,490 --> 00:27:42,039
+representation called histograms of oriented gradients. If you recall
+
+383
+00:27:42,039 --> 00:27:46,609
+back to homework 1, you actually used this feature on the last part to do
+
+384
+00:27:46,609 --> 00:27:50,979
+classification as well. So this was sort of the best feature that
+
+385
+00:27:50,980 --> 00:27:55,670
+we had in computer vision circa 2005. The idea is we're just gonna run linear
+
+386
+00:27:55,670 --> 00:27:59,550
+classifiers on top of this feature, and that's going to be our classifier. So
+
+387
+00:27:59,549 --> 00:28:03,460
+linear classifiers are really fast. The way this works is that we compute our
+
+388
+00:28:03,460 --> 00:28:08,250
+oriented gradient features for the whole image at multiple scales, and we run this
+
+389
+00:28:08,250 --> 00:28:12,660
+linear classifier at every scale, at every position; just do it really fast, just do
+
+390
+00:28:12,660 --> 00:28:13,210
+it everywhere,
+
+391
+00:28:13,210 --> 00:28:15,329
+because it's a linear classifier and it's fast to evaluate,
+
+392
+00:28:15,329 --> 00:28:21,029
+and this worked really well in 2005. People took this idea and worked on
+
+393
+00:28:21,029 --> 00:28:25,029
+it a little bit more over the next couple of years, so one of the most
+
+394
+00:28:25,029 --> 00:28:29,879
+important detection paradigms pre deep learning is this thing called the
+
+395
+00:28:29,880 --> 00:28:34,470
+deformable parts model. I don't want to go too much into the details here,
+
+396
+00:28:34,470 --> 00:28:39,309
+but the basic idea is that we're still working on these histogram of oriented
+
+397
+00:28:39,309 --> 00:28:42,619
+gradient features, but now our model, rather than just being a linear
+
+398
+00:28:42,619 --> 00:28:46,659
+classifier, we have this linear sort of template for the
+
+399
+00:28:46,660 --> 00:28:51,370
+object, and we also have these templates for parts that are allowed to
+
+400
+00:28:51,369 --> 00:28:57,119
+vary over spatial positions and deform a little bit, and there's some fancy
+
+401
+00:28:57,119 --> 00:29:01,939
+latent SVM machinery to learn these things, and really fancy
+
+402
+00:29:01,940 --> 00:29:07,190
+dynamic programming algorithms to actually evaluate this thing really fast at test
+
+403
+00:29:07,190 --> 00:29:11,100
+time. It's actually kind of fun if you enjoy algorithms; this
+
+404
+00:29:11,099 --> 00:29:16,119
+part is kind of fun to think about. But the end result is that it's a much
+
+405
+00:29:16,119 --> 00:29:19,209
+more powerful classifier that allows a little bit of deformability in your
+
+406
+00:29:19,210 --> 00:29:23,079
+model, and you can still evaluate it really fast. So we're still just going to
+
+407
+00:29:23,079 --> 00:29:26,490
+evaluate it everywhere: every scale, every position, every aspect ratio, just do it everywhere; it's fast.
+
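Sketching the HOG-era pipeline just described: a linear classifier slid over a feature map is nothing but a dot product at every position. This is a rough illustration with the feature extraction itself omitted and all shapes chosen arbitrarily:

~~~python
import numpy as np

def score_map(features, template, bias=0.0):
    """Evaluate a linear classifier at every position of a feature map.

    features: (H, W, C) array of cells, e.g. HOG; template: (h, w, C) weights.
    Returns an (H-h+1, W-w+1) array of detection scores.
    """
    H, W, C = features.shape
    h, w, _ = template.shape
    out = np.empty((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(features[y:y + h, x:x + w] * template) + bias
    return out

# Running this over a pyramid of rescaled images covers multiple scales.
scores = score_map(np.random.rand(40, 60, 31), np.random.rand(6, 6, 31))
~~~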
+408
+00:29:26,490 --> 00:29:33,039
+And this actually worked really well in 2010 and around there;
+
+409
+00:29:33,039 --> 00:29:37,619
+that was sort of the state of the art in detection for many problems at the time. So
+
+410
+00:29:37,619 --> 00:29:40,509
+I won't spend too much time on this, but there was a really cool paper
+
+411
+00:29:40,509 --> 00:29:45,049
+last year that argued that these DPM models are actually just a certain type
+
+412
+00:29:45,049 --> 00:29:47,480
+of convnet, right:
+
+413
+00:29:47,480 --> 00:29:51,329
+these histograms of oriented gradients are like little edge filters, like convolutions, and histogramming is kinda
+
+414
+00:29:51,329 --> 00:29:55,539
+like pooling, that sort of thing. So if
+
+415
+00:29:55,539 --> 00:30:00,349
+you're interested, check out this paper; it's kind of fun to think about. Right,
+
+416
+00:30:00,349 --> 00:30:02,250
+but we really want to
+
+417
+00:30:02,250 --> 00:30:06,259
+make this thing work with classifiers that are not fast to evaluate, like maybe
+
+418
+00:30:06,259 --> 00:30:11,809
+a CNN. So here this problem is still hard, right: we have many different
+
+419
+00:30:11,809 --> 00:30:14,940
+positions we want to try, and we probably can't actually afford to try
+
+420
+00:30:14,940 --> 00:30:19,220
+them all. So the solution is that we don't try them all; we have some other
+
+421
+00:30:19,220 --> 00:30:23,380
+thing that sort of guesses where we want to look, and then we only apply our
+
+422
+00:30:23,380 --> 00:30:28,720
+expensive classifier at that smaller number of locations. That idea
+
+423
+00:30:28,720 --> 00:30:35,419
+is called region proposals. So a region proposal method is this thing
+
+424
+00:30:35,419 --> 00:30:39,900
+that takes in an image and then outputs a whole bunch of regions where maybe,
+
+425
+00:30:39,900 --> 00:30:45,280
+possibly, an object might be located. So one way you can think about region
+
+426
+00:30:45,279 --> 00:30:48,428
+proposals is that they're kinda like a really fast,
+
+427
+00:30:48,429 --> 00:30:53,038
+class-agnostic object detector, right: they don't care about the class, they're
+
+428
+00:30:53,038 --> 00:30:56,038
+not very accurate, but they're pretty fast to run and they give us a whole
+
+429
+00:30:56,038 --> 00:31:00,769
+bunch of boxes. And the general intuition behind these region proposal
+
+430
+00:31:00,769 --> 00:31:04,639
+methods is that they're kinda looking for blob-like structures in the image, right?
+
+431
+00:31:04,640 --> 00:31:09,740
+So, like, objects are generally blobby: the dog, if you squint, looks kinda
+
+432
+00:31:09,740 --> 00:31:13,940
+like a white blob, the cat looks like a white blob, flowers are kinda blobby, the
+
+433
+00:31:13,940 --> 00:31:17,929
+eyes and nose are kinda blobby. So when you run these region proposal methods, a
+
+434
+00:31:17,929 --> 00:31:21,650
+lot of times what you'll see is they kind of put boxes around a lot of these
+
+435
+00:31:21,650 --> 00:31:27,820
+blobby regions in the image. So probably the most famous region proposal method
+
+436
+00:31:27,819 --> 00:31:31,538
+is called selective search. You don't really need to know in too much
+
+437
+00:31:31,538 --> 00:31:36,980
+detail how this works, but the idea is that you start from your pixels and you
+
+438
+00:31:36,980 --> 00:31:40,919
+kind of merge adjacent pixels together if they have similar color and texture
+
+439
+00:31:40,919 --> 00:31:45,770
+and form these connected
blob-like regions, and then
+
+440
+00:31:45,769 --> 00:31:50,740
+you merge these blob-like regions to get bigger and bigger blobby parts, and
+
+441
+00:31:50,740 --> 00:31:53,829
+then for each of these different scales you can actually convert each of these
+
+442
+00:31:53,829 --> 00:31:58,710
+blobby regions into a box by just drawing a box around it. So then, by doing this
+
+443
+00:31:58,710 --> 00:32:02,548
+over multiple scales, you end up with a whole bunch of boxes around sort of a
+
+444
+00:32:02,548 --> 00:32:06,359
+lot of blobby stuff in the image, and it's reasonably fast to compute and
+
+445
+00:32:06,359 --> 00:32:11,500
+actually cuts down the search space quite a lot. But selective search certainly
+
+446
+00:32:11,500 --> 00:32:14,720
+isn't the only game in town, it's just maybe the most famous; there's a whole bunch
+
+447
+00:32:14,720 --> 00:32:18,319
+of different region proposal methods that people have developed. There was
+
+448
+00:32:18,319 --> 00:32:21,509
+this paper last year that actually did a really cool, thorough scientific
+
+449
+00:32:21,509 --> 00:32:25,890
+evaluation of all these different region proposal methods and sort of gave you
+
+450
+00:32:25,890 --> 00:32:29,950
+the pros and the cons of each and all that kind of stuff, but my
+
+451
+00:32:29,950 --> 00:32:33,620
+takeaway from this paper was: just use EdgeBoxes if you had to pick one. So
+
+452
+00:32:33,619 --> 00:32:37,459
+it's really fast; you can run it in about a third of a second
+
+453
+00:32:37,460 --> 00:32:40,950
+per image, compared to about 10 seconds for selective search,
+
+454
+00:32:40,950 --> 00:32:49,000
+but more stars is better, and it gets a lot of stars, so it's good. Right, so now
+
+455
+00:32:49,000 --> 00:32:51,970
+that we have this idea of region proposals and we have this idea of a CNN
+
+456
+00:32:51,970 --> 00:32:56,679
+classifier, let's just put everything together. So this
+
+457
+00:32:56,679 --> 00:33:02,830
+idea was sort of first put together in a really nice way in 2014 in this method
+
+458
+00:33:02,829 --> 00:33:08,740
+called R-CNN; the idea is it's a region-based CNN method. So it's
+
+459
+00:33:08,740 --> 00:33:12,179
+pretty simple, and we've seen all the pieces: we have an input image,
+
+460
+00:33:12,179 --> 00:33:17,028
+we're gonna run a region proposal method like selective search to get maybe
+
+461
+00:33:17,028 --> 00:33:21,929
+two thousand boxes of different scales and positions; I mean, 2000 is still a lot, but
+
+462
+00:33:21,929 --> 00:33:26,380
+it's a lot less than all possible boxes in the image. Now for each of those boxes
+
+463
+00:33:26,380 --> 00:33:31,510
+we're gonna crop and warp that image region to some fixed size, and then
+
+464
+00:33:31,509 --> 00:33:35,898
+run it forward through a CNN to classify, and this CNN is going to have a
+
+465
+00:33:35,898 --> 00:33:41,199
+regression head here and a classification head that uses
+
+466
+00:33:41,200 --> 00:33:46,259
+SVMs here. So the idea is that this regression head can sort of correct
+
+467
+00:33:46,259 --> 00:33:50,369
+for region proposals that were a little bit off, right. This actually works
+
+468
+00:33:50,369 --> 00:33:55,219
+really well, it's really simple, it's pretty cool. But unfortunately
+
+469
+00:33:55,220 --> 00:33:59,460
+the training pipeline is a little bit complicated.
+
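Putting the R-CNN test-time pieces just described together as a sketch; every helper here (`selective_search`, `cnn_forward`, `svm_scores`, `bbox_correction`) is a hypothetical placeholder for the components the captions name, not a real library API:

~~~python
import numpy as np

def selective_search(image):            # ~2000 (x, y, w, h) proposal boxes
    return [(0, 0, 64, 64)] * 5

def crop_and_warp(image, box, size=(227, 227)):
    x, y, w, h = box                    # crude nearest-neighbor warp
    patch = image[y:y + h, x:x + w]
    ys = np.linspace(0, h - 1, size[0]).astype(int)
    xs = np.linspace(0, w - 1, size[1]).astype(int)
    return patch[np.ix_(ys, xs)]

def cnn_forward(patch):                 # features from a pretrained CNN
    return np.random.rand(4096)

def svm_scores(feat):                   # one binary SVM per class
    return np.random.rand(20)

def bbox_correction(feat):              # regression head: (tx, ty, tw, th)
    return np.zeros(4)

def rcnn_detect(image):
    """Crop and warp each proposal, classify it, and predict a box correction."""
    results = []
    for box in selective_search(image):
        feat = cnn_forward(crop_and_warp(image, box))
        results.append((box, svm_scores(feat), bbox_correction(feat)))
    return results

dets = rcnn_detect(np.zeros((480, 640, 3)))
~~~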
+470
+00:33:59,460 --> 00:34:03,788
+So the way you end up training an R-CNN model is, you know, like many
+
+471
+00:34:03,788 --> 00:34:06,970
+models, you first start by downloading a model from the internet that works well
+
+472
+00:34:06,970 --> 00:34:13,240
+for classification; originally they were using AlexNet. Then next we
+
+473
+00:34:13,239 --> 00:34:16,868
+actually want to fine-tune this model for detection, because this
+
+474
+00:34:16,869 --> 00:34:20,780
+classification model was probably trained on ImageNet with a thousand classes, but
+
+475
+00:34:20,780 --> 00:34:24,019
+your detection dataset has a different number of classes, and the images are
+
+476
+00:34:24,019 --> 00:34:28,398
+a little bit different. So you still train this network
+
+477
+00:34:28,398 --> 00:34:29,679
+for classification,
+
+478
+00:34:29,679 --> 00:34:33,429
+but you have to add a couple new layers at the end to deal with your classes and to
+
+479
+00:34:33,429 --> 00:34:38,068
+help you deal with the slightly different statistics of your image data. So here
+
+480
+00:34:38,068 --> 00:34:41,579
+you're just doing classification still, but you're not running on whole images,
+
+481
+00:34:41,579 --> 00:34:44,869
+you're running on positive and negative regions of your images from
+
+482
+00:34:44,869 --> 00:34:49,950
+your detection dataset, right. So you initialize the new layers and you
+
+483
+00:34:49,949 --> 00:34:53,599
+train this thing again on your dataset.
+
+484
+00:34:53,599 --> 00:34:57,889
+Next we actually want to cache these features to disk. So for every
+
+485
+00:34:57,889 --> 00:35:02,230
+image in your dataset you run selective search, you
+
+486
+00:35:02,230 --> 00:35:07,079
+extract those regions, you run them down through the CNN, and you cache those features
+
+487
+00:35:07,079 --> 00:35:12,319
+to disk. Something important for this step is to have a large hard drive: the
+
+488
+00:35:12,320 --> 00:35:16,289
+PASCAL dataset is not too big, maybe on the order of a couple tens of thousands of
+
+489
+00:35:16,289 --> 00:35:20,170
+images, but extracting these features actually takes hundreds of gigabytes, so
+
+490
+00:35:20,170 --> 00:35:26,869
+that's not so great. And then next we want to train our SVMs to
+
+491
+00:35:26,869 --> 00:35:30,909
+actually be able to classify the different classes based on these
+
+492
+00:35:30,909 --> 00:35:35,649
+features. So here we want to train a bunch of
+
+493
+00:35:35,650 --> 00:35:40,760
+different binary SVMs to classify image regions as to whether or not they
+
+494
+00:35:40,760 --> 00:35:45,220
+contain or don't contain that one object. This goes back to a question a
+
+495
+00:35:45,219 --> 00:35:49,029
+little bit ago, that sometimes you actually might want one region to
+
+496
+00:35:49,030 --> 00:35:53,460
+be able to output "yes" on multiple classes for the same image
+
+497
+00:35:53,460 --> 00:35:56,889
+region, and one way that they do that is just by training a separate binary SVM
+
+498
+00:35:56,889 --> 00:36:01,579
+per class, right. So this is sort of an offline process; they just keep the
+
+499
+00:36:01,579 --> 00:36:08,230
+best SVM for each class.
+
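The disk-caching step described a few captions back, sketched under the assumption of 4096-dimensional fc7-style features; `cnn_features` is a hypothetical placeholder for the fine-tuned network:

~~~python
import numpy as np

def cnn_features(patch):
    # Hypothetical placeholder for the fine-tuned CNN's feature extractor
    # (e.g. fc7 activations for a warped crop).
    return np.random.rand(4096).astype(np.float32)

def cache_features(images, proposals_per_image, out_prefix="feats"):
    """Run the CNN over every proposal of every image and save to disk.

    With ~2000 proposals per image and 4096-d float32 features, each image
    costs roughly 32 MB on disk, which is how a modest dataset balloons
    into hundreds of gigabytes of cached features.
    """
    for i, (img, proposals) in enumerate(zip(images, proposals_per_image)):
        feats = np.stack([cnn_features(img[y:y + h, x:x + w])
                          for (x, y, w, h) in proposals])
        np.save("%s_%06d.npy" % (out_prefix, i), feats)
~~~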
+500
+00:36:08,230 --> 00:36:11,820
+So you have these features; these are maybe the positive samples for "cat". Yeah, it doesn't make any sense, right, but you get the
+
+501
+00:36:11,820 --> 00:36:14,700
+idea: you have these different image
+
+502
+00:36:14,699 --> 00:36:18,599
+regions, you have these features that you saved to disk for those regions, and then
+
+503
+00:36:18,599 --> 00:36:22,029
+you divide them into positive and negative samples for each class,
+
+504
+00:36:22,030 --> 00:36:27,269
+and you just train these binary SVMs. You do the same
+
+505
+00:36:27,269 --> 00:36:33,239
+thing for "dog", and you just do this for every class in your dataset. Right, now
+
+506
+00:36:33,239 --> 00:36:37,029
+there's another step, right: there's this idea of box regression. So
+
+507
+00:36:37,030 --> 00:36:40,450
+sometimes your region proposals aren't perfect, so what we actually want to do
+
+508
+00:36:40,449 --> 00:36:45,549
+is be able to regress from these cached features to a correction on top of the
+
+509
+00:36:45,550 --> 00:36:50,269
+region proposal, and that correction has this kind of funny parameterized,
+
+510
+00:36:50,269 --> 00:36:54,320
+normalized representation; the exact details are in the paper, but the kind of
+
+511
+00:36:54,320 --> 00:36:58,300
+intuition is that maybe this region proposal was
+
+512
+00:36:58,300 --> 00:37:02,030
+pretty good, we don't really need to make any corrections, but maybe this one
+
+513
+00:37:02,030 --> 00:37:06,250
+in the middle, that proposal was too far to the left; the
+
+514
+00:37:06,250 --> 00:37:09,510
+correct ground truth is a little bit to the right, so we want to regress to this
+
+515
+00:37:09,510 --> 00:37:12,530
+correction factor that actually tells us that we need to shift a little bit to
+
+516
+00:37:12,530 --> 00:37:15,780
+the right. Or maybe this guy is a little bit too wide:
+
+517
+00:37:15,780 --> 00:37:19,100
+it includes too much of the stuff outside the cat, so we want to regress to
+
+518
+00:37:19,099 --> 00:37:21,880
+this correction factor that tells us we need to shrink the
+
+519
+00:37:21,880 --> 00:37:26,539
+region proposal a little bit. So again, this is just linear
+
+520
+00:37:26,539 --> 00:37:30,340
+regression, which you know from 229: you have these features,
+
+521
+00:37:30,340 --> 00:37:35,490
+you have these targets, you just run linear regression; easy. So before we
+
+522
+00:37:35,489 --> 00:37:39,219
+look at the results we should talk a little bit about the different
+
+523
+00:37:39,219 --> 00:37:42,769
+datasets that people use for detection. There's kind of three that you'll see in
+
+524
+00:37:42,769 --> 00:37:48,489
+practice. One is the PASCAL VOC dataset; it was pretty important, I think,
+
+525
+00:37:48,489 --> 00:37:53,399
+in the early two thousands, but now it's a little bit small. This one's about 20
+
+526
+00:37:53,400 --> 00:37:57,820
+classes and about 20,000 images, and it has about two objects per image.
+
+527
+00:37:57,820 --> 00:38:01,550
+So because this is a relatively smallish dataset, you'll see a lot of
+
+528
+00:38:01,550 --> 00:38:05,860
+detection papers work on this just because it's easier to handle. But there's also
+
+529
+00:38:05,860 --> 00:38:09,970
+an ImageNet detection dataset.
+
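The "funny parameterized, normalized representation" mentioned above is, in the R-CNN paper, a shift measured in units of the proposal's size plus a log-space scale change. A small sketch of computing those regression targets:

~~~python
import numpy as np

def regression_targets(proposal, ground_truth):
    """Normalized correction from a proposal box to a ground-truth box.

    Boxes are (center_x, center_y, width, height). Shifts are measured in
    units of the proposal's size and scale changes are in log space, so the
    targets stay well-behaved across boxes of very different sizes.
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    tx = (gx - px) / pw          # shift left/right, in proposal widths
    ty = (gy - py) / ph          # shift up/down, in proposal heights
    tw = np.log(gw / pw)         # log scale change in width
    th = np.log(gh / ph)         # log scale change in height
    return np.array([tx, ty, tw, th])

# A proposal sitting a bit too far left of the true box:
print(regression_targets((50, 60, 40, 40), (58, 60, 40, 40)))  # [0.2, 0, 0, 0]
~~~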
+530
+00:38:09,969 --> 00:38:13,109
+ImageNet runs a whole bunch of challenges, as you've probably seen by now: we saw classification, we saw localization,
+
+531
+00:38:13,110 --> 00:38:17,820
+and there's also an ImageNet detection challenge. For detection there's only
+
+532
+00:38:17,820 --> 00:38:21,600
+two hundred classes, not the thousand from classification, but it's very
+
+533
+00:38:21,599 --> 00:38:25,619
+big, almost half a million images, so you don't see as many papers work on it, just
+
+534
+00:38:25,619 --> 00:38:29,819
+'cause it's kind of annoying to handle, and there's only about one object per image. And
+
+535
+00:38:29,820 --> 00:38:32,760
+then more recently there's this one from Microsoft called COCO,
+
+536
+00:38:32,760 --> 00:38:36,660
+which has fewer classes and images but actually has a lot more objects
+
+537
+00:38:36,659 --> 00:38:42,649
+per image, so people like to work on it now; it's more interesting, right.
+
+538
+00:38:42,650 --> 00:38:45,300
+Also, when you're talking about detection there's this
+
+539
+00:38:45,300 --> 00:38:49,000
+funny evaluation metric we use called mean average precision, and I don't really wanna
+
+540
+00:38:49,000 --> 00:38:52,000
+get too much into the details. What you really need to know is that it's a
+
+541
+00:38:52,000 --> 00:38:56,570
+number between 0 and 100, and higher is good.
+
+542
+00:38:56,570 --> 00:38:59,940
+The kind of intuition is that you want
+
+543
+00:38:59,940 --> 00:39:04,079
+true positives to get high scores, and you also have to
+
+544
+00:39:04,079 --> 00:39:08,230
+have some threshold: the boxes you produce need to be within some
+
+545
+00:39:08,230 --> 00:39:12,090
+threshold of a correct box, and usually that threshold is 0.5
+
+546
+00:39:12,090 --> 00:39:15,420
+by intersection over union, but you'll see different challenges use slightly
+
+547
+00:39:15,420 --> 00:39:19,740
+different things for that. Right, so now that we understand the
+
+548
+00:39:19,739 --> 00:39:24,679
+datasets, let's see how well R-CNN did. So this is on two
+
+549
+00:39:24,679 --> 00:39:27,779
+versions of the PASCAL dataset; like I said, it's smaller, so you'll see a lot of
+
+550
+00:39:27,780 --> 00:39:32,730
+results on this. There are different versions, one in 2007, one in 2010; you often see
+
+551
+00:39:32,730 --> 00:39:35,990
+people use those just because the test set is publicly available, so it's easy to
+
+552
+00:39:35,989 --> 00:39:37,169
+evaluate.
+
+553
+00:39:37,170 --> 00:39:42,380
+Yeah, so this deformable parts model that we saw from 2011, from a
+
+554
+00:39:42,380 --> 00:39:48,579
+couple slides ago, is getting about 30 mean average precision. There's
+
+555
+00:39:48,579 --> 00:39:52,069
+this other method called Regionlets from 2013 that was sort of the state of
+
+556
+00:39:52,070 --> 00:39:55,280
+the art that I could find right before deep learning, but it's sort of a
+
+557
+00:39:55,280 --> 00:39:58,130
+similar flavor: you have these features and classifiers on top of the features.
+
+558
+00:39:58,130 --> 00:40:02,840
+And R-CNN is this pretty simple thing we just saw, and it actually
+
+559
+00:40:02,840 --> 00:40:06,789
+improves the performance quite a lot. So the first thing to see is we had
+
+560
+00:40:06,789 --> 00:40:10,509
+a big improvement when we just switched to this pretty simple framework using CNNs.
+
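The intersection-over-union threshold that the mean-average-precision discussion above relies on can be computed in a few lines:

~~~python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes.

    Detection benchmarks typically count a predicted box as a true positive
    only if its IoU with a ground-truth box exceeds a threshold (0.5 for
    PASCAL VOC).
    """
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
~~~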
+561
+00:40:10,510 --> 00:40:15,160
+And actually, this result here is without the bounding box regressions;
+
+562
+00:40:15,159 --> 00:40:19,029
+this is only using the region proposals and SVMs. Actually, if you include this
+
+563
+00:40:19,030 --> 00:40:23,550
+additional bounding box regression step, it helps quite a bit. Another fun
+
+564
+00:40:23,550 --> 00:40:26,820
+thing to note is that if you take R-CNN and you do everything the same
+
+565
+00:40:26,820 --> 00:40:31,080
+except use VGG-16 instead of AlexNet, you get another pretty big boost in
+
+566
+00:40:31,079 --> 00:40:34,059
+performance. So this is kind of similar to what we've seen before, that just
+
+567
+00:40:34,059 --> 00:40:39,650
+using these more powerful features tends to help on a lot of different tasks. Right,
+
+568
+00:40:39,650 --> 00:40:42,840
+this is really good, right: we've made a huge improvement on
+
+569
+00:40:42,840 --> 00:40:47,829
+detection compared to 2013, that's amazing. But R-CNN is not perfect; it
+
+570
+00:40:47,829 --> 00:40:53,150
+has some problems, right. So it's pretty slow at test time, right: we saw that we
+
+571
+00:40:53,150 --> 00:40:57,110
+have maybe two thousand regions, and we need to evaluate the CNN for each region; that's
+
+572
+00:40:57,110 --> 00:41:02,910
+kinda slow. We also have this slightly subtle problem where our SVMs
+
+573
+00:41:02,909 --> 00:41:07,009
+and regressors were sort of trained offline, using these linear SVMs
+
+574
+00:41:07,010 --> 00:41:10,930
+and linear regression, so the weights of our CNN didn't really
+
+575
+00:41:10,929 --> 00:41:14,960
+have the chance to update in response to what those parts of the network, those
+
+576
+00:41:14,960 --> 00:41:19,039
+objectives, wanted to do. And we also had this kind of complicated
+
+577
+00:41:19,039 --> 00:41:24,309
+training pipeline that was a bit of a mess. So to fix these problems, a year
+
+578
+00:41:24,309 --> 00:41:29,690
+later we have this thing called Fast R-CNN. So Fast R-CNN was presented
+
+579
+00:41:29,690 --> 00:41:34,950
+pretty recently at ICCV, just in December, but the idea is really simple:
+
+580
+00:41:34,949 --> 00:41:39,819
+we're just gonna swap the order of extracting regions and running the CNN.
+
+581
+00:41:39,820 --> 00:41:43,550
+This is kind of related to the sliding window idea we saw with
+
+582
+00:41:43,550 --> 00:41:48,450
+OverFeat. So here the pipeline at test time looks kinda similar: we have
+
+583
+00:41:48,449 --> 00:41:52,299
+this input image; we're going to take this high-resolution input
+
+584
+00:41:52,300 --> 00:41:55,920
+image and run it through the convolutional layers of our network, and
+
+585
+00:41:55,920 --> 00:42:00,150
+now we're gonna get this high-resolution convolutional feature map. And now, for our
+
+586
+00:42:00,150 --> 00:42:03,940
+region proposals, we're gonna extract features for those region
+
+587
+00:42:03,940 --> 00:42:07,610
+proposals directly from this convolutional feature map, using this thing called ROI
+
+588
+00:42:07,610 --> 00:42:10,530
+pooling, and then the
+
+589
+00:42:10,530 --> 00:42:14,269
+features for those regions, the convolutional features for those regions, will be fed
+
+590
+00:42:14,269 --> 00:42:17,829
+into our fully connected layers, and we'll again have a classification head and a
+
+591
+00:42:17,829 --> 00:42:22,670
+regression head like we saw before. So this is really cool, it's pretty
+
+592
+00:42:22,670 --> 00:42:26,930
+great, it solves a lot of the problems that we just saw with R-CNN. So R-CNN
+
+593
+00:42:26,929 --> 00:42:31,039
+is really slow at test time; we solve this problem by just sharing this
+
+594
+00:42:31,039 --> 00:42:37,289
+computation of convolutional features across the region proposals.
+
+595
+00:42:37,289 --> 00:42:40,519
+R-CNN also had these problems at training time, where we had this messy
+
+596
+00:42:40,519 --> 00:42:44,920
+training pipeline; we had this problem where we're training different
+
+597
+00:42:44,920 --> 00:42:48,760
+parts of the network separately, and the solution is pretty simple: we just, you
+
+598
+00:42:48,760 --> 00:42:50,480
+know, train it all together, all at once;
+
+599
+00:42:50,480 --> 00:42:53,800
+don't have this complicated pipeline, which we can actually do now
+
+600
+00:42:53,800 --> 00:42:58,140
+that we have this pretty nice function from inputs to outputs. Right, so
+
+601
+00:42:58,139 --> 00:43:01,299
+you can see that Fast R-CNN actually solves quite a lot of the
+
+602
+00:43:01,300 --> 00:43:06,340
+problems that we saw with R-CNN. Sort of the one really interesting
+
+603
+00:43:06,340 --> 00:43:10,530
+technical bit in Fast R-CNN was this problem of region of interest
+
+604
+00:43:10,530 --> 00:43:15,519
+pooling. So the idea is that we have this input image that's probably high
+
+605
+00:43:15,519 --> 00:43:19,068
+resolution, and we have this region proposal that's coming from
+
+606
+00:43:19,068 --> 00:43:23,969
+selective search or EdgeBoxes or something like that, and we can put this
+
+607
+00:43:23,969 --> 00:43:27,199
+high-resolution image through our convolutional and pooling layers just
+
+608
+00:43:27,199 --> 00:43:30,880
+fine, because those are sort of scale invariant; they can still operate on
+
+609
+00:43:30,880 --> 00:43:34,318
+different sizes of inputs. But now the problem is that the fully connected
+
+610
+00:43:34,318 --> 00:43:39,630
+layers from our pretrained network are expecting these pretty low-res conv
+
+611
+00:43:39,630 --> 00:43:46,068
+features, whereas these features from the whole image are high-res. So now we solve
+
+612
+00:43:46,068 --> 00:43:50,038
+this problem in a pretty straightforward way: given this region proposal, we're
+
+613
+00:43:50,039 --> 00:43:53,930
+gonna project it onto sort of the spatial part of that conv feature
+
+614
+00:43:53,929 --> 00:43:59,368
+volume. Now we're going to divide that conv feature volume into a little grid, right,
+
+615
+00:43:59,369 --> 00:44:04,910
+divide that thing into this H-by-W grid that the downstream layers are expecting, and
+
+616
+00:44:04,909 --> 00:44:09,798
+we do max pooling within each of those grid cells. So now we have
+
+617
+00:44:09,798 --> 00:44:14,349
+this pretty simple strategy: we've taken this region proposal, we've shared
+
+618
+00:44:14,349 --> 00:44:19,430
+convolutional features, and we've extracted this fixed-size output for that region, for
+
+619
+00:44:19,429 --> 00:44:23,629
+that region proposal, right. This is basically just swapping the order of
+
+620
+00:44:23,630 --> 00:44:28,108
+convolution and warping and cropping; that's one way to think about it. And
+
+621
+00:44:28,108 --> 00:44:31,538
+also, this is a pretty nice operation, because since this thing is basically
+
+622
+00:44:31,539 --> 00:44:35,249
+just max pooling, and we know how to backpropagate through max pooling, you can
+
+623
+00:44:35,248 --> 00:44:38,368
+backpropagate through these region of interest pooling layers just fine,
+
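A minimal numpy sketch of the ROI pooling operation just described: project a proposal onto the conv feature map, carve it into a fixed grid, and max-pool each cell. Rounding details vary between real implementations:

~~~python
import numpy as np

def roi_pool(feat_map, roi, out_h=7, out_w=7):
    """Max-pool an arbitrary-size region of a conv feature map to a fixed grid.

    feat_map: (H, W, C) conv features for the whole image.
    roi: (x1, y1, x2, y2) in feature-map coordinates (the image-space
    proposal divided by the network's total stride).
    """
    x1, y1, x2, y2 = roi
    out = np.zeros((out_h, out_w, feat_map.shape[2]))
    ys = np.linspace(y1, y2, out_h + 1).astype(int)  # grid cell boundaries
    xs = np.linspace(x1, x2, out_w + 1).astype(int)
    for i in range(out_h):
        for j in range(out_w):
            cell = feat_map[ys[i]:max(ys[i + 1], ys[i] + 1),
                            xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max(axis=(0, 1))  # max over the cell's positions
    return out

pooled = roi_pool(np.random.rand(38, 50, 512), (10, 5, 30, 25))
~~~

Because the operation is just max pooling over each cell, gradients flow back to the single winning position per cell, which is what makes joint training possible.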
+624
+00:44:38,369 --> 00:44:42,269
+and that's what really allows us to train this whole thing in a joint
+
+625
+00:44:42,268 --> 00:44:46,758
+way. Right, so let's see some results, and these are actually pretty cool, pretty
+
+626
+00:44:46,759 --> 00:44:50,858
+amazing. Right, so for training time, R-CNN had this complicated pipeline:
+
+627
+00:44:50,858 --> 00:44:54,098
+we'd save all this stuff to disk, we'd do all this stuff independently, and
+
+628
+00:44:54,099 --> 00:44:57,789
+even on that pretty small PASCAL dataset it took eighty-four hours to train.
+
+629
+00:44:57,789 --> 00:45:05,229
+Fast R-CNN is much faster to train. As far as test time, plain R-CNN
+
+630
+00:45:05,228 --> 00:45:09,318
+is pretty slow, because again we're running these independent forward passes
+
+631
+00:45:09,318 --> 00:45:14,469
+of the CNN for each region proposal, whereas for Fast R-CNN we can
+
+632
+00:45:14,469 --> 00:45:17,979
+sort of share the computation between different region proposals and get this
+
+633
+00:45:17,978 --> 00:45:23,439
+gigantic speedup at test time, a hundred and forty-six x. That's great, amazing. And
+
+634
+00:45:23,440 --> 00:45:26,690
+in fact, in terms of performance, I mean, it does a little bit better. It's not a
+
+635
+00:45:26,690 --> 00:45:30,048
+drastic difference in performance, but this could probably be attributed to
+
+636
+00:45:30,048 --> 00:45:32,130
+this fine-tuning property: that in
+
+637
+00:45:32,130 --> 00:45:35,140
+Fast R-CNN you can actually fine-tune all parts of the convolutional
+
+638
+00:45:35,139 --> 00:45:38,969
+network jointly to help with these output tasks, and that's probably why you see a
+
+639
+00:45:38,969 --> 00:45:43,230
+bit of an increase here. Right, so this is great, right; what could possibly
+
+640
+00:45:43,230 --> 00:45:45,730
+be wrong with Fast R-CNN? It looks amazing.
+
+641
+00:45:45,730 --> 00:45:51,699
+The big problem is that these test-time speeds don't include region proposals,
+
+642
+00:45:51,699 --> 00:45:55,669
+right. So now Fast R-CNN is so good that actually the bottleneck is
+
+643
+00:45:55,670 --> 00:46:00,750
+computing region proposals. That's pretty cool. So once you factor in the time for
+
+644
+00:46:00,750 --> 00:46:04,789
+actually computing these region proposals on the CPU, you can see that a lot
+
+645
+00:46:04,789 --> 00:46:09,190
+of our speed benefits disappear, right: only 25x faster, and we kind of lost
+
+646
+00:46:09,190 --> 00:46:15,030
+that beautiful hundredfold speedup. Also, now, because it takes about two seconds to run,
+
+647
+00:46:15,030 --> 00:46:18,560
+you can't really use this in real time; it's still kind of an
+
+648
+00:46:18,559 --> 00:46:23,750
+offline processing thing. Right, so the solution to this should be pretty
+
+649
+00:46:23,750 --> 00:46:27,340
+obvious, right: we're already using a convolutional network for
+
+650
+00:46:27,340 --> 00:46:32,620
+regression, using it for classification; why not use it for region proposals too,
+
+651
+00:46:32,619 --> 00:46:39,569
+right? Should work, maybe; kind of crazy. So that's a paper. Anyone want to
+
+652
+00:46:39,570 --> 00:46:46,570
+guess the name? Yes, it's Faster R-CNN.
+
+653
+00:46:46,570 --> 00:46:50,789
+Yes, they were really creative here, right. But the idea is pretty simple,
+
+654
+00:46:50,789 --> 00:46:55,460
+right: the same as Fast R-CNN, we're taking our input image and we're
+
+655
+00:46:55,460 --> 00:46:59,630
+computing these big convolutional feature maps over the entire input image.
+
+656
+00:46:59,630 --> 00:47:05,170
+So then, instead of using some external method to compute region proposals, they
+
+657
+00:47:05,170 --> 00:47:09,010
+add this little thing called a region proposal network that looks directly at
+
+658
+00:47:09,010 --> 00:47:13,060
+these last-layer convolutional features and is able to
+
+659
+00:47:13,059 --> 00:47:17,599
+produce region proposals directly from that convolutional feature map. And
+
+660
+00:47:17,599 --> 00:47:21,190
+then, once you have region proposals, you just do the same thing as Fast R-CNN:
+
+661
+00:47:21,190 --> 00:47:25,880
+you use this ROI pooling, and all the upstream stuff is the same as Fast R-CNN.
+
+662
+00:47:25,880 --> 00:47:31,130
+So really, the novel bit here is this region proposal network. It's
+
+663
+00:47:31,130 --> 00:47:34,180
+really cool, right: we're doing the whole thing in one giant convolutional network,
+
+664
+00:47:34,179 --> 00:47:40,500
+right. So the way this region proposal network works is that we
+
+665
+00:47:40,500 --> 00:47:43,880
+receive as input this convolutional feature map, maybe coming out of
+
+666
+00:47:43,880 --> 00:47:47,820
+the last layer of our convolutional features, and we're going to add,
+
+667
+00:47:47,820 --> 00:47:52,570
+like most things in recent work, a network that is itself convolutional, right.
+
+668
+00:47:52,570 --> 00:47:57,570
+So actually this is a typo; this is a three-by-three convnet, right. So we have
+
+669
+00:47:57,570 --> 00:48:01,809
+sort of a sliding window approach over our convolutional feature map, but a
+
+670
+00:48:01,809 --> 00:48:06,820
+sliding window is just a convolution, right, so we just have a three-
+
+671
+00:48:06,820 --> 00:48:10,920
+by-three convolution on top of this feature map, and then we have this
+
+672
+00:48:10,920 --> 00:48:14,599
+familiar two-head structure inside the region proposal
+
+673
+00:48:14,599 --> 00:48:19,670
+network, where we're doing classification, where here we just want to say whether
+
+674
+00:48:19,670 --> 00:48:25,430
+or not it's an object, and also regression, to regress from this sort of
+
+675
+00:48:25,429 --> 00:48:29,829
+position onto an actual region proposal. So the idea is that the
+
+676
+00:48:29,829 --> 00:48:33,909
+position of the sliding window relative to the feature map sort of tells us
+
+677
+00:48:33,909 --> 00:48:38,239
+where we are in the image, and then these regression outputs sort of give us
+
+678
+00:48:38,239 --> 00:48:43,619
+corrections on top of this position in the feature map. But actually they
+
+679
+00:48:43,619 --> 00:48:46,940
+make it a little bit more complicated than that. So instead of regressing
+
+680
+00:48:46,940 --> 00:48:51,110
+directly from this position in the convolutional feature map, they have
+
+681
+00:48:51,110 --> 00:48:55,280
+this notion of these different anchor boxes. You can imagine taking these
+
+682
+00:48:55,280 --> 00:48:59,910
+different sized and shaped anchor boxes and sort of pasting them into the original
+
+683
+00:48:59,909 --> 00:49:03,538
+image at the point of the image corresponding to this point in the
+
+684
+00:49:03,539 --> 00:49:08,020
+feature map, right. Like, in Fast R-CNN we were projecting forward from the image into
+
+685
+00:49:08,019 --> 00:49:11,519
+the feature map; now we're doing the opposite: we're projecting from the feature map back into the image for these boxes.
+
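Generating the anchor boxes just described: the same k boxes are pasted at every feature-map position, mapped back into image coordinates through the network stride. The stride, scales, and aspect ratios below are illustrative choices, not necessarily the paper's exact configuration:

~~~python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchor boxes as (cx, cy, w, h),
    where k = len(scales) * len(ratios)."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # image-space center
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)  # same area, varied shape
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

boxes = generate_anchors(14, 14)  # 14 * 14 * 9 = 1764 anchors
~~~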
+686
+00:49:11,519 --> 00:49:17,288
+So then, for each of these anchor boxes: they use sort of a fixed set of
+
+687
+00:49:17,289 --> 00:49:21,640
+anchor boxes, and they use the
+
+688
+00:49:21,639 --> 00:49:27,400
+same ones at every position in the image, and for each of these anchor boxes
+
+689
+00:49:27,400 --> 00:49:32,119
+they produce a score as to whether or not that anchor box corresponds to an object,
+
+690
+00:49:32,119 --> 00:49:36,809
+and they also produce four regression coordinates that correct that anchor
+
+691
+00:49:36,809 --> 00:49:41,880
+box, in similar ways to what we saw before. And now, this region proposal network, you
+
+692
+00:49:41,880 --> 00:49:45,700
+can just train it to try to predict; it's sort of a class-agnostic object
+
+693
+00:49:45,699 --> 00:49:52,058
+detector. So Faster R-CNN, in the original paper, they train this thing in
+
+694
+00:49:52,059 --> 00:49:55,490
+kind of a funny way, where first they train the region proposal network, then they
+
+695
+00:49:55,489 --> 00:49:59,500
+train Fast R-CNN, then they do some magic to merge them together, and at the end
+
+696
+00:49:59,500 --> 00:50:03,530
+of the day they have one network that produces everything. So this is a
+
+697
+00:50:03,530 --> 00:50:07,880
+little bit messy; in the original paper they describe this thing. But since then
+
+698
+00:50:07,880 --> 00:50:10,470
+they've had some unpublished work where they actually just train the whole
+
+699
+00:50:10,469 --> 00:50:14,909
+thing jointly, where they sort of have one big network: you have an image
+
+700
+00:50:14,909 --> 00:50:19,679
+coming in; inside the region proposal network you have a
+
+701
+00:50:19,679 --> 00:50:23,538
+classification loss, to classify whether each region proposal is or is not an
+
+702
+00:50:23,539 --> 00:50:27,670
+object; you have these bounding box regressions inside the region proposal
+
+703
+00:50:27,670 --> 00:50:33,500
+network, on top of your anchors; and then
+
+704
+00:50:33,500 --> 00:50:37,190
+we do ROI pooling and do this Fast R-CNN trick, and then at the end of the
+
+705
+00:50:37,190 --> 00:50:41,200
+network we have this classification loss to say which class that is, and this
+
+706
+00:50:41,199 --> 00:50:47,659
+regression loss to produce a correction on top of the region proposal. So this,
+
+707
+00:50:47,659 --> 00:50:53,170
+this big thing, is just one big network with four losses. Yeah?
+
+708
+00:50:53,170 --> 00:51:04,019
+(Question.) So the proposal and regression coordinates are produced by a three-by-
+
+709
+00:51:04,019 --> 00:51:07,588
+three convolution and then a pair of one-by-one convolutions off the feature
+
+710
+00:51:07,588 --> 00:51:12,358
+map, right. So the idea is that we're looking at these different anchor boxes
+
+711
+00:51:12,358 --> 00:51:16,400
+at different positions and scales, but we're actually looking at the same
+
+712
+00:51:16,400 --> 00:51:20,139
+position in the feature map to classify those different anchor boxes; you
+
+713
+00:51:20,139 --> 00:51:26,179
+learn different weights for the different anchors.
+
+714
+00:51:26,179 --> 00:51:29,969
+I think it's mostly empirical, right: the idea with the three-by-three is just you want to
+
+715
+00:51:29,969 --> 00:51:33,429
+have a little bit of nonlinearity.
+
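A shape-level sketch of the two-headed proposal network described in this answer: a 3x3 convolution followed by parallel 1x1 convolutions emitting 2k objectness scores and 4k box corrections per position. The weights here are random, for shape-checking only, and the hidden width is an assumption:

~~~python
import numpy as np

def conv(x, w, b):
    """'Same' convolution: x (H, W, C), w (kh, kw, C, M) -> (H, W, M)."""
    kh, kw = w.shape[:2]
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw), (0, 0)))
    out = np.zeros(x.shape[:2] + (w.shape[3],))
    for i in range(kh):
        for j in range(kw):
            out += np.tensordot(xp[i:i + x.shape[0], j:j + x.shape[1]],
                                w[i, j], axes=1)
    return out + b

def rpn_head(feat, k=9, hidden=256):
    """Two-headed region proposal network over an (H, W, C) feature map."""
    H, W, C = feat.shape
    rng = np.random.default_rng(0)  # untrained weights, illustration only
    h = np.maximum(conv(feat, rng.normal(size=(3, 3, C, hidden)) * 0.01,
                        np.zeros(hidden)), 0)          # 3x3 conv + ReLU
    scores = conv(h, rng.normal(size=(1, 1, hidden, 2 * k)) * 0.01,
                  np.zeros(2 * k))                     # object / not-object per anchor
    coords = conv(h, rng.normal(size=(1, 1, hidden, 4 * k)) * 0.01,
                  np.zeros(4 * k))                     # 4 box corrections per anchor
    return scores, coords                              # (H, W, 2k), (H, W, 4k)

s, c = rpn_head(np.random.rand(14, 14, 512))
~~~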
+716
+00:51:33,429 --> 00:51:38,098
+You could imagine just doing sort of a direct one-by-one convolution directly off the feature maps, but I think they
+
+717
+00:51:38,099 --> 00:51:40,990
+don't discuss this in the paper; I'm guessing the three-by-three just happened to
+
+718
+00:51:40,989 --> 00:51:44,669
+work a bit better. There's no really deep reason why you do
+
+719
+00:51:44,670 --> 00:51:47,450
+that; it could be more, it could be less, it could be a bigger kernel. It's just:
+
+720
+00:51:47,449 --> 00:51:50,548
+you have this little convolutional network with two heads, that's the main
+
+721
+00:51:50,548 --> 00:51:53,710
+point. Any other questions?
+
+722
+00:51:53,710 --> 00:52:18,380
+(Question, mostly inaudible, about each position
+
+723
+00:52:18,380 --> 00:52:22,140
+corresponding to the whole image.)
+
+724
+00:52:22,139 --> 00:52:26,098
+The point is that we don't actually want to process the whole image; we want to
+
+725
+00:52:26,099 --> 00:52:29,960
+pick out some regions of the image to do more processing on, but we need to choose
+
+726
+00:52:29,960 --> 00:52:36,048
+those regions somehow.
+
+727
+00:52:36,048 --> 00:52:42,188
+Yes, that's basically this idea of using external
+
+728
+00:52:42,188 --> 00:52:46,428
+region proposals, right. So with external region proposals you're
+
+729
+00:52:46,429 --> 00:52:50,929
+sort of picking them first, before you do the convolutions, but it's just sort of a
+
+730
+00:52:50,929 --> 00:52:54,858
+nice thing if you can do it all at once. So convolutions are kind of
+
+731
+00:52:54,858 --> 00:52:58,748
+this really general processing that you can do to
+
+732
+00:52:58,748 --> 00:53:01,608
+the image; you're kinda hoping that convolutions that are good enough for
+
+733
+00:53:01,608 --> 00:53:04,869
+classification and regression, the types of information that you have in
+
+734
+00:53:04,869 --> 00:53:07,439
+those convolutions, are probably good enough for classifying regions as well.
+
+735
+00:53:07,438 --> 00:53:11,958
+So it's actually a computational savings, because at the end
+
+736
+00:53:11,958 --> 00:53:15,719
+of the day you end up using that same convolutional feature map for everything:
+
+737
+00:53:15,719 --> 00:53:18,938
+for the region proposals, for the downstream classification, for the
+
+738
+00:53:18,938 --> 00:53:23,389
+downstream regression. That's actually why you get the speedup here.
+
+739
+00:53:23,389 --> 00:53:29,788
+Question? Yes: we have this big network, we train it with four losses, and now we can do
+
+740
+00:53:29,789 --> 00:53:31,569
+object detection sort of all at once.
+
+741
+00:53:31,568 --> 00:53:37,858
+Pretty cool. So if we look at results comparing the three R-CNNs of various
+
+742
+00:53:37,858 --> 00:53:43,630
+velocities, then we have the original R-CNN: it took about 50 seconds at test time per
+
+743
+00:53:43,630 --> 00:53:47,150
+image; this is counting the region proposals, this is counting running the
+
+744
+00:53:47,150 --> 00:53:52,439
+CNN separately for each region proposal. That's pretty slow. Now, Fast R-CNN we
+
+745
+00:53:52,438 --> 00:53:56,909
+saw was sort of bottlenecked by the region proposal time, but once we move to
+
+746
+00:53:56,909 --> 00:54:01,768
+Faster R-CNN, those region proposals are basically coming for free,
+
+747
+00:54:01,768 --> 00:54:06,139
+since the way we compute region proposals is just a tiny three-by-
+
+748
+00:54:06,139 --> 00:54:09,199
+three convolution and a couple of one-by-one convolutions, so they're very
+
+749
+00:54:09,199 --> 00:54:13,229
+cheap to evaluate. So at test time, Faster R-CNN runs in about a fifth of a
+
+750
+00:54:13,228 --> 00:54:23,849
+second on a pretty high-resolution image. That's actually, yeah?
+
+751
+00:54:23,849 --> 00:54:36,739
+(Question about the zero padding.) Well, I mean, one of the ideas behind zero padding is you're hoping not
+
+752
+00:54:36,739 --> 00:54:40,699
+to throw away information from the edges, so I think you might have a
+
+753
+00:54:40,699 --> 00:54:45,299
+problem if you didn't do the zero padding, maybe more of a problem. But I
+
+754
+00:54:45,300 --> 00:54:48,430
+mean, as we sort of discussed before, the fact that you're adding that zero
+
+755
+00:54:48,429 --> 00:54:52,519
+padding might affect the statistics of those features, so it could maybe be a
+
+756
+00:54:52,519 --> 00:54:56,900
+bit of a problem, but in practice it seems to work just fine. But actually,
+
+757
+00:54:56,900 --> 00:55:00,099
+yeah, that kind of analysis of where we have failure cases, where
+
+758
+00:55:00,099 --> 00:55:02,949
+we get things wrong, is a really important process when you develop new
+
+759
+00:55:02,949 --> 00:55:08,419
+algorithms, and it can give you insight into what might make things better.
+
+760
+00:55:08,420 --> 00:55:26,940
+Yeah?
+
+761
+00:55:26,940 --> 00:55:35,858
+So maybe it might help, but it's actually kinda hard to do that
+
+762
+00:55:35,858 --> 00:55:40,108
+experiment, because the datasets are different, right: when
+
+763
+00:55:40,108 --> 00:55:43,789
+you work on a classification dataset like ImageNet, that's one thing, but then
+
+764
+00:55:43,789 --> 00:55:47,259
+when you work on detection it's this other dataset. Like, you
+
+765
+00:55:47,260 --> 00:55:51,000
+could imagine trying to classify the detection images based on what objects
+
+766
+00:55:51,000 --> 00:55:54,500
+are present, but I haven't really seen any really good comparisons that try to
+
+767
+00:55:54,500 --> 00:56:00,630
+study that. But, I mean, try that experiment in your project.
+
+768
+00:56:00,630 --> 00:56:18,088
+Yeah, that's a very good question. So then you have this problem with ROI
+
+769
+00:56:18,088 --> 00:56:22,119
+pooling, right, because of the way that the ROI pooling works, by
+
+770
+00:56:22,119 --> 00:56:25,720
+dividing that thing into a fixed grid and doing max pooling; once you do
+
+771
+00:56:25,719 --> 00:56:29,949
+rotations it's actually kind of difficult. There's this really cool paper
+
+772
+00:56:29,949 --> 00:56:33,159
+from DeepMind over the summer called spatial transformer
+
+773
+00:56:33,159 --> 00:56:39,250
+networks that actually introduces a really cool way to solve this problem, and
+
+774
+00:56:39,250 --> 00:56:42,239
+the idea is that instead of doing ROI pooling we're gonna do bilinear
+
+775
+00:56:42,239 --> 00:56:46,699
+interpolation, kinda like you might be used to for textures in graphics. So once
+
+776
+00:56:46,699 --> 00:56:50,009
+you do bilinear interpolation, then you actually can handle maybe these crazy
+
+777
+00:56:50,010 --> 00:56:53,609
+regions. So yeah, that's definitely something people are thinking about, but
+
+778
+00:56:53,608 --> 00:56:56,848
+it hasn't been incorporated into the whole pipeline yet.
+
+779
+00:56:56,849 --> 00:57:00,338
+Yeah?
+
+780
+00:57:00,338 --> 00:57:11,728
+You could, but it'd be slow; then you're back in this sort of R-CNN regime, right, and
+
+781
+00:57:11,728 --> 00:57:12,449
+look at that:
+
+782
+00:57:12,449 --> 00:57:16,828
+250 times slower. Do you really want to pay that price? I mean, I think another
+
+783
+00:57:16,829 --> 00:57:20,690
+practical concern with rotated objects is that we don't really have the ground
+
+784
+00:57:20,690 --> 00:57:25,318
+truth datasets; for most of these detection datasets, the only
+
+785
+00:57:25,318 --> 00:57:29,190
+ground truth information we have are these axis-aligned bounding boxes, so
+
+786
+00:57:29,190 --> 00:57:33,150
+it's hard: you don't have a ground truth rotated position. That's kind of a practical
+
+787
+00:57:33,150 --> 00:57:39,219
+concern; I think people haven't really explored this so much. So the end
+
+788
+00:57:39,219 --> 00:57:43,009
+story with Faster R-CNN is: it's super fast, and the performance is about the same, right?
+
+789
+00:57:43,009 --> 00:57:49,798
+That's good, it works. What's actually really interesting is, at this point you
+
+790
+00:57:49,798 --> 00:57:52,949
+can actually understand the state of the art in object detection. So this
+
+791
+00:57:52,949 --> 00:57:55,669
+is one of the best object detectors in the world: it crushed
+
+792
+00:57:55,670 --> 00:58:00,479
+everyone at the ImageNet and COCO challenges in December,
+
+793
+00:58:00,478 --> 00:58:06,710
+and, like most other things, it's this deep residual network. So the best object
+
+794
+00:58:06,710 --> 00:58:10,548
+detector in the world right now is a hundred-and-one-layer residual network plus Faster
+
+795
+00:58:10,548 --> 00:58:17,298
+R-CNN, plus a couple of other goodies here, right. So we talked about
+
+796
+00:58:17,298 --> 00:58:23,670
+Faster R-CNN, and we saw ResNets; but, as always,
+
+797
+00:58:23,670 --> 00:58:26,389
+for competitions you need to add a couple of crazy things to get a little
+
+798
+00:58:26,389 --> 00:58:30,348
+bit of a boost in performance, right. So here, this box refinement: they actually do
+
+799
+00:58:30,349 --> 00:58:33,528
+multiple steps of refining the bounding box.
+
+800
+00:58:33,528 --> 00:58:38,818
+You saw that in the Fast R-CNN framework you're doing this correction on
+
+801
+00:58:38,818 --> 00:58:41,929
+top of your region proposal; you can actually feed that back into the network
+
+802
+00:58:41,929 --> 00:58:46,298
+and reclassify and re-regress to get another prediction, so that's this box refinement
+
+803
+00:58:46,298 --> 00:58:50,929
+step; it gives you a little bit of a boost. They add context: in addition to
+
+804
+00:58:50,929 --> 00:58:55,710
+classifying just the region, they add a vector that gives you the
+
+805
+00:58:55,710 --> 00:59:00,309
+features for the entire image, which sort of gives you more context than
+
+806
+00:59:00,309 --> 00:59:03,999
+just that little crop, and gives you a little bit more performance. And they also
+
+807
+00:59:03,998 --> 00:59:08,179
+do multi-scale testing, kinda like we saw in OverFeat, so they actually run
+
+808
+00:59:08,179 --> 00:59:10,730
+the thing on images at different sizes at test time
+
+809
+00:59:10,730 --> 00:59:13,949
+and aggregate over those different sizes. And when you put all those things
+
+810
+00:59:13,949 --> 00:59:21,129
+together, you win a lot of competitions. So this thing won on COCO; actually,
+
+811
+00:59:21,130 --> 00:59:24,960
+Microsoft COCO runs a detection challenge, and they won the detection
+
+812
+00:59:24,960 --> 00:59:29,199
+challenge on COCO. We can also look at the rapid progress on the ImageNet
+
+813
+00:59:29,199 --> 00:59:32,909
+detection challenge over the last couple of years. So you can see, 2013
+
+814
+00:59:32,909 --> 00:59:38,949
+was sort of the first time that we had these deep learning detection models. So
+
+815
+00:59:38,949 --> 00:59:43,789
+OverFeat, that we saw for localization: they actually submitted a version of their
+
+816
+00:59:43,789 --> 00:59:47,949
+system that works on detection as well, by sort of changing the logic by
+
+817
+00:59:47,949 --> 00:59:51,849
+which they merge bounding boxes, and they did pretty good, but they were actually
+
+818
+00:59:51,849 --> 00:59:57,319
+outperformed by this other group, called UvA-Euvision, that was sort of
+
+819
+00:59:57,320 --> 01:00:02,289
+not a deep learning approach; it used a lot of features. Then in 2014 we
+
+820
+01:00:02,289 --> 01:00:05,840
+actually saw both of these were deep learning approaches, and Google actually
+
+821
+01:00:05,840 --> 01:00:09,740
+won that one by using GoogLeNet plus some other detection stuff on top of
+
+822
+01:00:09,739 --> 01:00:15,029
+GoogLeNet. And then in 2015 things went crazy, and these residual networks plus
+
+823
+01:00:15,030 --> 01:00:19,410
+Faster R-CNN just crushed everything. So I think detection, especially over the
+
+824
+01:00:19,409 --> 01:00:22,409
+last couple years, has been a really exciting thing, because we've seen this
+
+825
+01:00:22,409 --> 01:00:25,429
+really rapid progress in detection, like most
+
+826
+01:00:25,429 --> 01:00:29,129
+other things. And another point I think is kind of fun to make is that,
+
+827
+01:00:29,130 --> 01:00:33,800
+to win competitions, you know, Andrej said you
+
+828
+01:00:33,800 --> 01:00:37,830
+ensemble and get 2%, so you always win competitions with an ensemble. But
+
+829
+01:00:37,829 --> 01:00:42,829
+actually, sort of fun: Microsoft also submitted their best single ResNet
+
+830
+01:00:42,829 --> 01:00:47,440
+model, not an ensemble, and just a single ResNet model actually beat all
+
+831
+01:00:47,440 --> 01:00:52,400
+the other things from all the other years. That's actually pretty cool. Yeah,
+
+832
+01:00:52,400 --> 01:00:58,130
+that's the best detector out there. So this is kind of a funny thing, right:
+
+833
+01:00:58,130 --> 01:01:03,240
+so we talked about this idea of localization as
+
+834
+01:01:03,239 --> 01:01:08,439
+regression. This funny thing called YOLO, "you only look once", actually tries
+
+835
+01:01:08,440 --> 01:01:13,519
+to pose the detection problem directly as a regression problem. So the idea is
+
+836
+01:01:13,519 --> 01:01:18,389
+that we're going to take our input image and we're gonna divide it into
+
+837
+01:01:18,389 --> 01:01:22,190
+some spatial grid; they use seven by seven. And then within
+
+838
+01:01:22,190 --> 01:01:26,480
+each element of that spatial grid we're gonna make some number B of bounding box
+
+839
+01:01:26,480 --> 01:01:31,039
+predictions; they use B equal to two, I think, in most of the experiments. So then,
+
+840
+01:01:31,039 --> 01:01:36,489
+within each grid cell, you're going to predict maybe two bounding boxes, that's four
+
+841
+01:01:36,489 --> 01:01:41,229
+numbers each; you're also going to predict a single score for how much you believe
+
+842
+01:01:41,230 --> 01:01:44,969
+in that bounding box; and you're also going to predict a classification score for each
+
+843
+01:01:44,969 --> 01:01:49,659
+class in your dataset.
+
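Decoding the S x S x (5B + C) output tensor that the captions above describe; this is a rough sketch, with YOLO's post-processing (e.g. non-max suppression) omitted and the confidence threshold chosen arbitrarily:

~~~python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC)

def decode_yolo(output, conf_thresh=0.2):
    """Split the S x S x (5B + C) regression output into boxes and classes.

    Per cell: B boxes of (x, y, w, h, confidence), then C class scores.
    Coordinates here are left in the network's normalized units.
    """
    detections = []
    for i in range(S):
        for j in range(S):
            cell = output[i, j]
            class_probs = cell[5 * B:]          # shared across the cell's boxes
            cls = int(np.argmax(class_probs))
            for b in range(B):
                x, y, w, h, conf = cell[5 * b:5 * b + 5]
                if conf * class_probs[cls] > conf_thresh:
                    detections.append((i, j, x, y, w, h, cls,
                                       float(conf * class_probs[cls])))
    return detections

dets = decode_yolo(np.random.rand(S, S, 5 * B + C))
~~~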
+844
+01:01:49,659 --> 01:01:53,969
+So then you can take this detection problem, and it ends up being regression: your input
+
+845
+01:01:53,969 --> 01:01:59,529
+is an image, and your output is this maybe seven-by-seven-by-(5B plus C) tensor, right. Now it's just a regression problem, and
+
+846
+01:01:59,530 --> 01:02:04,820
+you just train it. And that's pretty cool; it's sort of a new approach, a
+
+847
+01:02:04,820 --> 01:02:07,900
+bit different than these region proposal things that we've seen before.
+
+848
+01:02:07,900 --> 01:02:12,300
+Of course, sort of a problem with this is that there's an upper bound on the
+
+849
+01:02:12,300 --> 01:02:15,930
+number of outputs that your model can have, so that might be a problem if
+
+850
+01:02:15,929 --> 01:02:20,279
+your testing data has many, many more ground truth boxes than your training data.
+
+851
+01:02:20,280 --> 01:02:27,180
+So this YOLO detector actually is really fast; it's actually faster
+
+852
+01:02:27,179 --> 01:02:32,460
+than Faster R-CNN, which is pretty crazy, but unfortunately it tends to work
+
+853
+01:02:32,460 --> 01:02:36,769
+a little bit worse. There's also this other thing called Fast YOLO that I don't
+
+854
+01:02:36,769 --> 01:02:39,460
+wanna talk about, but,
+
+855
+01:02:39,460 --> 01:02:45,170
+right, just for numbers: these are mean AP numbers on one of the
+
+856
+01:02:45,170 --> 01:02:49,619
+PASCAL datasets that we saw. You can see YOLO actually gets 64, that's pretty
+
+857
+01:02:49,619 --> 01:02:53,329
+good, and runs at forty-five frames per second. This is obviously on a
+
+858
+01:02:53,329 --> 01:02:58,840
+powerful GPU, but still, that's pretty much real time; that's amazing.
+
+859
+01:02:58,840 --> 01:03:03,960
+There are also these different versions
+
+860
+01:03:03,960 --> 01:03:09,309
+of Fast and Faster R-CNNs; you can see that these actually pretty much all beat
+
+861
+01:03:09,309 --> 01:03:14,119
+YOLO in terms of performance, but are quite a bit slower. Yeah, that's actually
+
+862
+01:03:14,119 --> 01:03:20,119
+kind of a neat twist on the detection problem. Actually, all these
+
+863
+01:03:20,119 --> 01:03:22,779
+different detection models that we
+
+864
+01:03:22,780 --> 01:03:26,780
+talked about today, they all pretty much have code released; you should
+
+865
+01:03:26,780 --> 01:03:30,800
+maybe consider using them for projects. Probably don't use R-CNN, it's too slow.
+
+866
+01:03:30,800 --> 01:03:36,090
+Fast R-CNN is pretty good but requires MATLAB. For Faster R-CNN, there is
+
+867
+01:03:36,090 --> 01:03:39,720
+actually a version of Faster R-CNN that doesn't require MATLAB; it's just Python and
+
+868
+01:03:39,719 --> 01:03:44,379
+Caffe. I haven't personally used it, but it's something you might want to try to
+
+869
+01:03:44,380 --> 01:03:48,070
+use for your projects; I'm not sure how difficult it is to get running. And
+
+870
+01:03:48,070 --> 01:03:52,050
+YOLO is actually, I think, maybe a good choice for some of your projects, because
+
+871
+01:03:52,050 --> 01:03:55,810
+it's so fast, so it might be easier to work with if you don't have really big,
+
+872
+01:03:55,809 --> 01:03:59,860
+powerful GPUs, and it actually has code up as well.
+
+873
+01:03:59,860 --> 01:04:03,480
+Yes, actually, I got through things a little bit faster than expected, so are
+there any questions on detection?
+
+875
+01:04:10,559 --> 01:04:15,880
+Yeah?
+
+876
+01:04:15,880 --> 01:04:22,630
+Yes; in terms of model size, it's pretty much about the same as a
+
+877
+01:04:22,630 --> 01:04:26,039
+classification model, because when you're running on a bigger image,
+
+878
+01:04:26,039 --> 01:04:29,109
+especially for Faster R-CNN, right, because of the convolutions you don't really
+
+879
+01:04:29,108 --> 01:04:32,558
+introduce any more parameters; the fully connected layers don't really add any more
+
+880
+01:04:32,559 --> 01:04:35,829
+parameters; you have a couple of extra parameters for the region proposal
+
+881
+01:04:35,829 --> 01:04:38,798
+network; but it's basically the same number of parameters as a classification
+
+882
+01:04:38,798 --> 01:04:45,619
+model. Right, I guess we're done a little early today.
+
diff --git a/captions/Ko/Lecture10_ko.srt b/captions/Ko/Lecture10_ko.srt
new file mode 100644
index 00000000..6881c6bb
--- /dev/null
+++ b/captions/Ko/Lecture10_ko.srt
@@ -0,0 +1,3857 @@
+1
+00:00:00,000 --> 00:00:04,129
+Mic test.
+
+2
+00:00:04,129 --> 00:00:12,109
+Today's topic is recurrent neural networks.
+
+3
+00:00:12,109 --> 00:00:15,199
+It's personally one of my favorite topics,
+
+4
+00:00:15,199 --> 00:00:18,960
+and it's a neural network model I use in many different forms. It's fun.
+
+5
+00:00:18,960 --> 00:00:23,009
+A few administrative notes about the course:
+
+6
+00:00:23,009 --> 00:00:26,089
+the midterm is on Wednesday.
+
+7
+00:00:26,089 --> 00:00:32,738
+I know you're all looking forward to the midterm. Actually, you don't look like you're looking forward to it.
+
+8
+00:00:32,738 --> 00:00:37,979
+An assignment goes out on Wednesday.
+
+9
+00:00:37,979 --> 00:00:40,429
+It's due two weeks later, on Monday.
+
+10
+00:00:40,429 --> 00:00:43,399
+We originally meant to release it on Monday, but it's running late,
+
+11
+00:00:43,399 --> 00:00:47,129
+so the due date will probably get pushed to around Wednesday.
+
+12
+00:00:47,130 --> 00:00:51,179
+Assignment 2 is due Friday, and you can use three late days; just don't use them too early.
+
+13
+00:00:51,179 --> 00:00:55,119
+Assignment 2 is due Friday, and you can use three late days; just don't use them too early.
+
+14
+00:00:55,119 --> 00:01:01,089
+How many of you have finished? 72? Almost everyone; great.
+
+15
+00:01:01,090 --> 00:01:04,549
+So, we were talking about convolutional neural networks (CNNs).
+
+16
+00:01:04,549 --> 00:01:07,820
+In the last lecture we covered visualizing and getting a basic understanding of CNNs,
+
+17
+00:01:07,819 --> 00:01:11,618
+looking through pictures and videos like these to see how CNNs work.
+
+18
+00:01:11,618 --> 00:01:14,938
+looking through pictures and videos like these to see how CNNs work.
+
+19
+00:01:14,938 --> 00:01:17,828
+looking through pictures and videos like these to see how CNNs work.
+
+20
+00:01:17,828 --> 00:01:24,188
+And as we saw in the very last image, we also did some debugging.
+
+21
+00:01:24,188 --> 00:01:27,408
+Last weekend I found some new visualizations on Twitter.
+
+22
+00:01:27,409 --> 00:01:32,569
+Neat, right?
+
+23
+00:01:32,569 --> 00:01:37,118
+There's actually no explanation, so I'm not sure exactly how these were made.
+
+24
+00:01:37,118 --> 00:01:43,099
+Still, aren't they cool? This one is a turtle, and that's a tarantula,
+
+25
+00:01:43,099 --> 00:01:47,468
+this is a chain, and those are dogs.
+
+26
+00:01:47,468 --> 00:01:50,509
+It looks to me like some optimization technique was applied to the image,
+
+27
+00:01:50,509 --> 00:01:53,679
+with some different regularization method applied.
+
+28
+00:01:53,679 --> 00:01:57,049
+Hmm, it looks like a bilateral filter was applied here.
+
+29
+00:01:57,049 --> 00:01:59,420
+Hmm, it looks like a bilateral filter was applied here.
+
+30
+00:01:59,420 --> 00:02:03,659
+But honestly, I'm not sure exactly what technique was used.
+
+31
+00:02:03,659 --> 00:02:04,549
+Today's topic is RNNs.
+
+32
+00:02:04,549 --> 00:02:10,360
+Today's topic is RNNs.
+
+33
+00:02:10,360 --> 00:02:13,520
+The strength of RNNs is the great flexibility they give you in designing network architectures.
+
+34
+00:02:13,520 --> 00:02:15,870
+The strength of RNNs is the great flexibility they give you in designing network architectures.
+ +35 +00:02:15,870 --> 00:02:18,650 +일반적으로 NN을 왼쪽 그림과 같이 구성할 때는 (역자주: Vanilla NN) + +36 +00:02:18,650 --> 00:02:22,849 +여기 빨간색으로 표시된 것처럼 고정된 크기의 input vector를 사용하고, + +37 +00:02:22,848 --> 00:02:27,639 +초록색의 hidden layer들을 통해 작동시키며, 마찬가지로 고정된 크기의 파란색 output vector를 출력합니다. + +38 +00:02:27,639 --> 00:02:30,738 +마찬가지로 고정된 크기의 이미지를 입력으로 받고, + +39 +00:02:30,739 --> 00:02:34,469 +고정된 크기의 이미지를 벡터 형태로 출력합니다. + +40 +00:02:34,469 --> 00:02:38,239 +RNN에서는 이러한 작업을 계속 반복할 수 있습니다. input, output 모두에서 가능하죠. + +41 +00:02:38,239 --> 00:02:41,319 +오늘 다룰 image captioning(이미지에 상응하는 자막/주석 생성) 을 예로 들면, + +42 +00:02:41,318 --> 00:02:44,689 +고정된 크기의 이미지를 RNN에 입력하게 됩니다. + +43 +00:02:44,689 --> 00:02:47,829 +그리고 그 RNN은 해당 이미지를 설명하는 단어/문장 들을 출력하게 되죠. + +44 +00:02:47,829 --> 00:02:52,560 +그리고 그 RNN은 해당 이미지를 설명하는 단어/문장 들을 출력하게 되죠. + +45 +00:02:52,560 --> 00:02:55,969 +Sentiment classification(감정 분류)를 예로 들면, + +46 +00:02:55,969 --> 00:02:59,759 +(어떤 문장의) 단어들과 그 순서을 입력으로 받아서, + +47 +00:02:59,759 --> 00:03:03,828 +그 문장의 느낌이 긍정적인지 또는 부정적인지를 출력하게 됩니다. + +48 +00:03:03,829 --> 00:03:07,590 +또 다른 예로 machine translation (역자주: 구글 번역과 같은 알고리즘 번역) 에서는, + +49 +00:03:07,590 --> 00:03:12,069 +어떤 영어 문장을 입력으로 받고, 프랑스어로 출력해야 합니다. + +50 +00:03:12,068 --> 00:03:17,119 +그래서 우리는 이 영어 문장을 RNN에 입력하고 (이것을 Sequence to Sequence 라 부름) + +51 +00:03:17,120 --> 00:03:20,280 +그래서 우리는 이 영어 문장을 RNN에 입력하고 (이것을 Sequence to Sequence 라 부름) + +52 +00:03:20,280 --> 00:03:25,169 +RNN은 이 영어 문장을 프랑스어 문장으로 번역합니다. + +53 +00:03:25,169 --> 00:03:28,000 +마지막 예 video classification(영상 분류) 에서는, + +54 +00:03:28,000 --> 00:03:31,699 +각 프레임 (순간 캡쳐 화면) 이 어떤 속성을 지니는지, + +55 +00:03:31,699 --> 00:03:35,429 +그리고 그 전의 모든 프레임과의 관계는 어떻게 되는지도 고려합니다. + +56 +00:03:35,430 --> 00:03:38,739 +그리고 그 전의 모든 프레임과의 관계는 어떻게 되는지도 고려합니다. + +57 +00:03:38,739 --> 00:03:41,909 +그러니까 RNN은 각각의 프레임이 어떤 속성을 지니는지 분류하고, + +58 +00:03:41,909 --> 00:03:44,680 +이전까지의 모든 프레임을 입력으로 받는 함수가 되어, + +59 +00:03:44,680 --> 00:03:48,760 +앞으로의 프레임을 예측하는 아키텍쳐를 제공합니다. + +60 +00:03:48,759 --> 00:03:52,388 +만약 맨 왼쪽 그림과 같이 입력과 출력의 순서에 관한 정보를 가지고 있지 않아도 RNN을 사용할 수 있습니다. + +61 +00:03:52,389 --> 00:03:55,250 +만약 맨 왼쪽 그림과 같이 입력과 출력의 순서에 관한 정보를 가지고 있지 않아도 RNN을 사용할 수 있습니다. + +62 +00:03:55,250 --> 00:04:01,560 +예를 들어, 제가 좋아하는 딥마인드의 한 논문에서는 + +63 +00:04:01,560 --> 00:04:05,189 +번지로 된 집 주소 이미지를 문자로 변환했습니다. + +64 +00:04:05,189 --> 00:04:09,750 +여기서는 단순히 CNN을 사용해서 이미지 자체가 몇 번지를 나타내는지를 분류하지 않고, + +65 +00:04:09,750 --> 00:04:13,530 +여기서는 단순히 CNN을 사용해서 이미지 자체가 몇 번지를 나타내는지를 분류하지 않고, + +66 +00:04:13,530 --> 00:04:16,649 +RNN을 사용해서 작은 CNN이 이미지를 돌아다니면서 읽어들였습니다. + +67 +00:04:16,649 --> 00:04:19,779 +RNN을 사용해서 작은 CNN이 이미지를 돌아다니면서 읽어들였습니다. + +68 +00:04:19,779 --> 00:04:23,969 +이렇게 RNN은 번지 주소 이미지를 왼쪽으로 오른쪽으로 순차적으로 읽는 방법을 학습했습니다. + +69 +00:04:23,970 --> 00:04:26,870 +이렇게 RNN은 번지 주소 이미지를 왼쪽으로 오른쪽으로 순차적으로 읽는 방법을 학습했습니다. + +70 +00:04:26,870 --> 00:04:32,019 +반대로 생각할 수도 있습니다. 이것은 DRAW라는 유명한 논문인데요, + +71 +00:04:32,019 --> 00:04:35,879 +여기서는 이미지 샘플 하나하나가 무엇인지 개별적으로 판단하지 않고, + +72 +00:04:35,879 --> 00:04:39,490 +여기서는 이미지 샘플 하나하나가 무엇인지 개별적으로 판단하지 않고, + +73 +00:04:39,490 --> 00:04:42,860 +RNN이 여러 이미지를 하나의 큰 캔버스의 형태로 한번에 출력합니다. + +74 +00:04:42,860 --> 00:04:47,540 +RNN이 여러 이미지를 하나의 큰 캔버스의 형태로 한번에 출력합니다. + +75 +00:04:47,540 --> 00:04:50,200 +이 방법은 한 번지수 이미지에 대한 입력 결과를 곧바로 출력하지 않고, 보다 많은 계산을 거친다는 점에서 강력합니다. + +76 +00:04:50,199 --> 00:04:53,479 +이 방법은 한 번지수 이미지에 대한 입력 결과를 곧바로 출력하지 않고, 보다 많은 계산을 거친다는 점에서 강력합니다. 질문 있나요? + +77 +00:04:53,480 --> 00:05:14,189 +(질문) 그림에서 화살표는 무엇인가요? + +78 +00:05:14,189 --> 00:05:19,310 +화살표는 functional dependence를 나타냅니다. 
조금 있다가 좀 더 자세하게 살펴 볼 거에요. + +79 +00:05:19,310 --> 00:05:23,139 +화살표는 functional dependence를 나타냅니다. 조금 있다가 좀 더 자세하게 살펴 볼 거에요. + +80 +00:05:23,139 --> 00:05:37,168 +(질문) 그림에서 나타나는 숫자들은 무엇인가요? + +81 +00:05:37,168 --> 00:05:41,219 +이것들은 실제 사진이 아니라 RNN이 학습 후 출력한 결과물입니다. + +82 +00:05:41,220 --> 00:05:44,830 +이것들은 실제 사진이 아니라 RNN이 학습 후 출력한 결과물입니다. + +83 +00:05:44,829 --> 00:05:48,219 +(질문) 그러니까 실제 사진이 아니라 만들어진 거라는 거죠? + +84 +00:05:48,220 --> 00:05:51,689 +네, 꽤 실제 사진처럼 포이기는 하지만, 이것들은 만들어진 이미지입니다. + +85 +00:05:51,689 --> 00:05:55,809 +RNN은 이런 초록색 박스처럼 생겼습니다. + +86 +00:05:55,809 --> 00:06:00,979 +RNN은 계속해서 input vector를 입력받습니다. + +87 +00:06:00,978 --> 00:06:04,859 +RNN은 계속해서 input vector를 입력받습니다. + +88 +00:06:04,860 --> 00:06:08,538 +RNN 내부에는 여러 state가 있는데, 이는 매 시간에 입력받는 input vector의 형태로 나타낼 수 있습니다. + +89 +00:06:08,538 --> 00:06:12,988 +RNN 내부에는 여러 state가 있는데, 이는 매 시간에 입력받는 input vector의 형태로 나타낼 수 있습니다. + +90 +00:06:12,988 --> 00:06:17,258 +RNN에는 또한 weight(가중치)를 설정할 수 있고, 이를 조정함으로써 RNN의 작동을 조절할 수 있습니다. + +91 +00:06:17,259 --> 00:06:20,829 +RNN에는 또한 weight(가중치)를 설정할 수 있고, 이를 조정함으로써 RNN의 작동을 조절할 수 있습니다. + +92 +00:06:20,829 --> 00:06:25,769 +우리는 물론 RNN의 출력 결과물에도 관심을 갖고 있지만, + +93 +00:06:25,769 --> 00:06:30,429 +우리는 물론 RNN의 출력 결과물에도 관심을 갖고 있지만, + +94 +00:06:30,428 --> 00:06:33,988 +RNN은 이 중간에 있는, 시간에 따라 이미지를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다. + +95 +00:06:33,988 --> 00:06:36,688 +RNN은 이 중간에 있는, 시간에 따라 이미지를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다. + +96 +00:06:36,689 --> 00:06:39,489 +RNN은 이 중간에 있는, 시간에 따라 이미지를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다. + +97 +00:06:39,488 --> 00:06:44,838 +RNN은 이 중간에 있는, 시간에 따라 이미지를 입력받고 출력하는 단계인 이 초록색 박스라는 것을 알아두셨으면 합니다. + +98 +00:06:44,838 --> 00:06:50,610 +RNN의 각 state는 vector들의 집합으로 나타낼 수 있고, 여기서는 h로 표기하겠습니다. + +99 +00:06:50,610 --> 00:06:55,399 +RNN의 각 state는 vector들의 집합으로 나타낼 수 있고, 여기서는 h로 표기하겠습니다. + +100 +00:06:55,399 --> 00:07:00,939 +각각의 state(h_t) 는 바로 전 단계의 state(h_t-1)과 input vector(x_t)들의 함수로 나타낼 수 있습니다. + +101 +00:07:00,939 --> 00:07:05,769 +각각의 state(h_t) 는 바로 전 단계의 state(h_t-1)과 input vector(x_t)들의 함수로 나타낼 수 있습니다. + +102 +00:07:05,769 --> 00:07:08,338 +여기서의 함수는 Recurrence funtion 이라고 하고 파라미터 W(가중치)를 갖습니다. + +103 +00:07:08,338 --> 00:07:13,728 +우리는 W 값을 변경함에 따라 RNN이 다른 결과를 보이는 걸 확인할 수 있습니다. + +104 +00:07:13,728 --> 00:07:16,228 +우리는 W 값을 변경함에 따라 RNN이 다른 결과를 보이는 걸 확인할 수 있습니다. + +105 +00:07:16,228 --> 00:07:19,338 +따라서 우리는 우리가 원하는 결과를 만들어낼 수 있는 적절한 W를 찾기 위해 training을 거칠 것이죠. + +106 +00:07:19,338 --> 00:07:23,639 +따라서 우리는 우리가 원하는 결과를 만들어낼 수 있는 적절한 W를 찾기 위해 training을 거칠 것이죠. + +107 +00:07:23,639 --> 00:07:28,209 +여기서 기억해야 할 것은 매 단계마다 같은 함수와 같은 W를 사용한다는 것입니다. + +108 +00:07:28,209 --> 00:07:31,778 +여기서 기억해야 할 것은 매 단계마다 같은 함수와 같은 W를 사용한다는 것입니다. + +109 +00:07:31,778 --> 00:07:35,928 +그래서 입력이나 출력 시퀀스의 길이를 고려할 필요가 없습니다. + +110 +00:07:35,928 --> 00:07:38,778 +그래서 입력이나 출력 시퀀스의 길이를 고려할 필요가 없습니다. + +111 +00:07:38,778 --> 00:07:43,528 +그래서 입력이나 출력 시퀀스의 길이를 고려할 필요가 없습니다. + +112 +00:07:43,528 --> 00:07:46,769 +RNN을 구현하는 가장 간단한 방법은 Vanilla RNN 입니다. + +113 +00:07:46,769 --> 00:07:50,309 +RNN을 구현하는 가장 간단한 방법은 Vanilla RNN 입니다. + +114 +00:07:50,309 --> 00:07:54,569 +여기서 RNN을 구성하는 것은 단 하나의 hidden state h 입니다. + +115 +00:07:54,569 --> 00:08:00,569 +여기서 RNN을 구성하는 것은 단 하나의 hidden state h 입니다. + +116 +00:08:00,569 --> 00:08:04,039 +그리고 여기 Recurrence(재귀) 식은 각 hidden state를 시간과 현재 input (x_t)로 어떻게 나타낼 수 있는지 알려줍니다. + +117 +00:08:04,038 --> 00:08:04,688 +그리고 여기 Recurrence(재귀) 식은 각 hidden state를 시간과 현재 input (x_t)로 어떻게 나타낼 수 있는지 알려줍니다. 
+ +118 +00:08:04,689 --> 00:08:08,369 +그리고 여기 Recurrence 식은 각 hidden state를 시간과 현재 input (x_t)로 어떻게 나타낼 수 있는지 알려줍니다. + +119 +00:08:08,369 --> 00:08:10,349 +가중치 행렬 W_hh와 W_xh에 직전 단계의 hidden state h 와 input vector x가 각각 곱해지고, + +120 +00:08:10,348 --> 00:08:15,238 +가중치 행렬 W_hh와 W_xh에 직전 단계의 hidden state h_t-1 와 input vector x가 각각 곱해지고, + +121 +00:08:15,238 --> 00:08:18,238 +이것이 tanh 함수에 의해 새로운 hidden state h_t로 결정되는 방식으로 업데이트 됩니다. + +122 +00:08:18,238 --> 00:08:21,978 +이것이 tanh 함수에 의해 새로운 hidden state h_t로 결정되는 방식으로 업데이트 됩니다. + +123 +00:08:21,978 --> 00:08:26,199 +이러한 재귀 식은 h가 시간과 현재 입력에 따라 업데이트되는 함수라는 것을 보여줍니다. + +124 +00:08:26,199 --> 00:08:29,769 +이러한 재귀 식은 h가 시간과 현재 입력에 따라 업데이트되는 함수라는 것을 보여줍니다. + +125 +00:08:29,769 --> 00:08:34,129 +h 바로 다음에 결과물이 행렬의 형태로 출력되는 형태가 가장 간단한 형태의 RNN입니다. + +126 +00:08:34,129 --> 00:08:37,528 +h 바로 다음에 결과물이 행렬의 형태로 출력되는 형태가 가장 간단한 형태의 RNN입니다. + +127 +00:08:37,528 --> 00:08:42,288 +이게 어떻게 작동되는지 간단히 설명드리기 위해 예를 들자면, + +128 +00:08:42,288 --> 00:08:46,639 +이게 어떻게 작동되는지 간단히 설명드리기 위해 예를 들자면, + +129 +00:08:46,639 --> 00:08:49,299 +이런 추상적인 x, h, y 등에 의미를 부여할 수 있습니다. + +130 +00:08:49,299 --> 00:08:53,059 +이런 추상적인 x, h, y 등에 의미를 부여할 수 있습니다. + +131 +00:08:53,059 --> 00:08:56,149 +예를 들어 이러한 문자 수준 언어 모델에 RNN을 적용하는 것 말이죠. + +132 +00:08:56,149 --> 00:08:59,899 +저는 이 예시를 참 좋아합니다. 직관적이고 재밌거든요. + +133 +00:08:59,899 --> 00:09:04,698 +그래서 RNN 기반 문자 수준 언어 모델에서는, RNN에 문자열의 순서를 주고, + +134 +00:09:04,698 --> 00:09:07,859 +그래서 RNN 기반 문자 수준 언어 모델에서는, RNN에 문자열의 순서를 주고, + +135 +00:09:07,860 --> 00:09:10,899 +그래서 RNN 기반 문자 수준 언어 모델에서는, RNN에 문자열의 순서를 주고, + +136 +00:09:10,899 --> 00:09:14,299 +지금까지의 관찰 결과를 바탕으로 각각의 단계에서 다음에 올 문자는 무엇인지 예측하게 합니다. + +137 +00:09:14,299 --> 00:09:16,909 +지금까지의 관찰 결과를 바탕으로 각각의 단계에서 다음에 올 문자는 무엇인지 예측하게 합니다. + +138 +00:09:16,909 --> 00:09:21,120 +간단한 예를 한번 보죠. + +139 +00:09:21,120 --> 00:09:25,610 +여기서 training 문자열 'hello'를 주면, + +140 +00:09:25,610 --> 00:09:29,870 +우리의 현재 어휘 목록에는 'h, e , l, o' 이렇게 4글자가 있겠죠 + +141 +00:09:29,870 --> 00:09:33,289 +그러니까 RNN은 우리의 training 문자열 데이터를 바탕으로 다음에 올 글자가 무엇인지 예측하게 됩니다. + +142 +00:09:33,289 --> 00:09:37,000 +구체적으로, h, e, l, o를 각각 순서대로 하나씩 RNN에 입력해 줍니다. + +143 +00:09:37,000 --> 00:09:40,509 +여기서 가로축은 시간입니다. (역자주: 오른쪽으로 갈수록 뒤) + +144 +00:09:40,509 --> 00:09:47,110 +h는 첫번째, e는 두번째, 그다음 l, 그다음 l + +145 +00:09:47,110 --> 00:09:50,629 +여기서는 'one-hot' 표기법을 사용하고 있습니다. (역자주: 0과 1로만 나타내는 것) + +146 +00:09:50,629 --> 00:09:53,889 +여기서는 'one-hot' 표기법을 사용하고 있습니다. (역자주: 0과 1로만 나타내는 것) + +147 +00:09:53,889 --> 00:09:58,129 +그리고 아까 본 재귀 식을 사용합니다. + +148 +00:09:58,129 --> 00:10:01,860 +처음에 h에는 0만 들어가 있습니다. + +149 +00:10:01,860 --> 00:10:04,720 +그래서 매 시간 단계마다 이 재귀 식을 이용해서 hidden state 벡터를 계산합니다. + +150 +00:10:04,720 --> 00:10:08,790 +hidden state에 3개의 (안들림) 가 있습니다. + +151 +00:10:08,789 --> 00:10:11,099 +각 시점에서 이전까지 입력받은 모든 문자들을 요약해서 표현합니다. + +152 +00:10:11,100 --> 00:10:13,040 +각 시점에서 이전까지 입력받은 모든 문자들을 요약해서 표현합니다. + +153 +00:10:13,039 --> 00:10:15,759 +각 시점에서 이전까지 입력받은 모든 문자들을 요약해서 표현합니다. + +154 +00:10:15,759 --> 00:10:20,159 +이런 방법으로 매 시간 단계마다 바로 다음 순서 에 올 문자를 예측할 것입니다. + +155 +00:10:20,159 --> 00:10:23,139 +이런 방법으로 매 시간 단계마다 바로 다음 순서에 올 문자를 예측할 것입니다. + +156 +00:10:23,139 --> 00:10:27,569 +우리는 이 4 개의 문자(역자주: h, e, l, o)를 가지고 있고, 매 시간 단계마다 이 4개의 문자 중 어떤 문자가 오는지 예측할 것입니다. + +157 +00:10:27,570 --> 00:10:32,100 +우리는 이 4 개의 문자(역자주: h, e, l, o)를 가지고 있고, 매 시간 단계마다 이 4개의 문자 중 어떤 문자가 오는지 예측할 것입니다. + +158 +00:10:32,100 --> 00:10:37,139 +제일 처음에는 H를 입력할 것입니다. + +159 +00:10:37,139 --> 00:10:40,799 +RNN은 현재의 weight를 바탕으로 다음에 어떤 문자가 올 지 예측합니다. 
+ +160 +00:10:40,799 --> 00:10:42,959 +RNN은 현재의 weight를 바탕으로 다음에 어떤 문자가 올 지 예측합니다. + +161 +00:10:42,960 --> 00:10:47,950 +현재 normalized 되지 않은 수치로는, (역자주: 맨 위 왼쪽 사각형 안의 숫자) h는 1.0, e는 2.2, + +162 +00:10:47,950 --> 00:10:52,640 +l은 -3.0 , o는 4.1라는 숫자의 정도로 나타날 것입니다. + +163 +00:10:52,639 --> 00:10:56,409 +물론 우리는 이 training sequence에서 h 다음에 e가 온다는 것을 알고 있습니다. + +164 +00:10:56,409 --> 00:11:00,669 +그러니까 여기 초록색으로 적혀 있는 e의 2.2라는 숫자가 정답이 되는 것이죠. + +165 +00:11:00,669 --> 00:11:04,559 +그래서 이 숫자는 커야 하고, 다른 숫자들은 작아져야 합니다. + +166 +00:11:04,559 --> 00:11:07,799 +이처럼 매 시간 단계마다 우리는 다음에 올 타겟 문자를 갖고 있습니다. + +167 +00:11:07,799 --> 00:11:12,209 +타겟에 해당하는 숫자는 커야 하고, 나머지 숫자는 작아야 합니다. + +168 +00:11:12,210 --> 00:11:15,470 +타겟에 해당하는 숫자는 커야 하고, 나머지 숫자는 작아야 합니다. + +169 +00:11:15,470 --> 00:11:19,950 +그래서 이러한 정보는 loss function(손실 함수)의 gradient signal에 포함됩니다. + +170 +00:11:19,950 --> 00:11:23,220 +그리고 그러한 loss 들은 이 연결들은 통해 back-propagation 됩니다. + +171 +00:11:23,220 --> 00:11:26,600 +매 시간 단계에 softmax classifier을 갖고 있다고 합시다. + +172 +00:11:26,600 --> 00:11:31,300 +그래서 매 시간 단계마다 softmax classifier가 다음에 어떤 문자가 와야 할 지를 예측하고, + +173 +00:11:31,299 --> 00:11:34,269 +그리고 모든 loss들은 맨 위(역자주: output layer)부터 거꾸로 그래프를 내려오면서 계산되어서 + +174 +00:11:34,269 --> 00:11:37,879 +그리고 모든 loss들은 맨 위(역자주: output layer)부터 거꾸로 그래프를 내려오면서 계산되어서 + +175 +00:11:37,879 --> 00:11:41,179 +그리고 모든 loss들은 맨 위(역자주: output layer)부터 거꾸로 그래프를 내려오면서 계산되어서 + +176 +00:11:41,179 --> 00:11:44,479 +weight 행렬에 gradient를 주어 적절한 값으로 변화시켜 RNN이 문자를 보다 정확하게 예측하게 합니다. + +177 +00:11:44,480 --> 00:11:50,039 +weight 행렬에 gradient를 주어 적절한 값으로 교정시켜 RNN이 문자를 보다 정확하게 예측하게 합니다. + +178 +00:11:50,039 --> 00:11:53,599 +그러니까 여러분이 RNN에 문자를 입력하면 RNN은 보다 정확한 행동(역자주: 여기서는 문자 예측)을 하는 것이죠. + +179 +00:11:53,600 --> 00:11:57,750 +이제 어떻게 데이터를 학습시키는지에 대해 상상이 좀 갈 거에요. + +180 +00:11:57,750 --> 00:12:02,879 +여기 그림에 대해 질문이 있나요? + +181 +00:12:02,879 --> 00:12:08,750 +(질문): W_xh와 W_hy는 항상 일정한 값을 가지나요? + +182 +00:12:08,750 --> 00:12:13,320 +(답변): W(weight) 들은 매 recurrence 단계 마다 항상 일정한 값을 가집니다. + +183 +00:12:13,320 --> 00:12:17,010 +(답변): W(weight) 들은 매 recurrence 단계 마다 항상 일정한 값을 가집니다. + +184 +00:12:17,009 --> 00:12:23,830 +여기서 우리는 W_xh, W_hh, W_yh를 각각 4번씩 사용했습니다. + +185 +00:12:23,830 --> 00:12:27,720 +여러분이 backpropagation을 할 때, 동일한 weight 행렬에 이러한 gradient 들을 계속 더한다는 것을 명심해야 합니다. + +186 +00:12:27,720 --> 00:12:30,750 +여러분이 backpropagation을 할 때, 동일한 weight 행렬에 이러한 gradient 들을 계속 더한다는 것을 명심해야 합니다. + +187 +00:12:30,750 --> 00:12:35,879 +그리고 이것은 우리가 길이가 다양한 입력값들을 사용할 수 있게 해 줍니다. + +188 +00:12:35,879 --> 00:12:38,960 +그리고 이것은 우리가 길이가 다양한 입력값들을 사용할 수 있게 해 줍니다. + +189 +00:12:38,960 --> 00:12:48,540 +그러니까 정해진 길이의 입력값들을 사용하지 않아도 된다는 것이죠. + +190 +00:12:48,539 --> 00:12:52,579 +(질문): 처음 h_0를 어떻게 초기화하나요? + +191 +00:12:52,580 --> 00:13:00,650 +(답변): 0으로 놓는 것이 가장 일반적입니다. + +192 +00:13:00,649 --> 00:13:01,289 +(질문): 입력값의 순서는 영향을 미치나요? + +193 +00:13:01,289 --> 00:13:11,299 +(질문): 입력값의 순서는 영향을 미치나요? +194 +00:13:11,299 --> 00:13:14,359 +(답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요. + +195 +00:13:14,360 --> 00:13:17,870 +(답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요. + +196 +00:13:17,870 --> 00:13:21,299 +(답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요. + +197 +00:13:21,299 --> 00:13:26,859 +(답변): 여기서는 중요하지 않습니다. hidden state는 지금까지 들어온 모든 값을 반영하거든요. + +198 +00:13:26,860 --> 00:13:31,590 +보다 구체적인 예들로 확실히 설명드리겠습니다. + +199 +00:13:31,590 --> 00:13:36,149 +문자 단위의 언어 모델 코드는 매우 간단합니다. + +200 +00:13:36,149 --> 00:13:38,980 +여러분들이 나중에 찾아볼 수 있게 GitHub에 올려 놓았어요. 
+ +201 +00:13:38,980 --> 00:13:43,350 +이것은 NumPy 기반의 100줄 길이의 문자 단위 RNN 코드입니다. + +202 +00:13:43,350 --> 00:13:47,220 +이것은 NumPy 기반의 100줄 길이의 문자 단위 RNN 코드입니다. + +203 +00:13:47,220 --> 00:13:49,840 +실제로 RNN이 어떻게 학습하는지를 알기 위해서 이 코드를 단계별로 살펴볼게요. + +204 +00:13:49,840 --> 00:13:53,220 +실제로 RNN이 어떻게 학습하는지를 알기 위해서 이 코드를 단계별로 살펴볼게요. + +205 +00:13:53,220 --> 00:13:58,250 +코드를 블록들로 나누어 하나하나 살펴보겠습니다. + +206 +00:13:58,250 --> 00:14:02,389 +처음에는 보다시피 NumPy만 사용합니다. + +207 +00:14:02,389 --> 00:14:05,569 +여기에 우리가 입력받을 것은 문자들의 대용량 순서 .txt 데이터입니다. + +208 +00:14:05,570 --> 00:14:10,090 +여기에 우리가 입력받을 것은 문자들의 대용량 순서 .txt 데이터입니다. + +209 +00:14:10,090 --> 00:14:14,810 +이 파일의 모든 문자를 읽어들이고, mapping dictionary를 생성합니다. + +210 +00:14:14,809 --> 00:14:18,179 +mapping dictionary는 문자에 index를 대응시키고, 또 반대로 index에 문자를 대응시킵니다. + +211 +00:14:18,179 --> 00:14:23,120 +그러니까 문자를 순서대로 배열하는 것입니다. + +212 +00:14:23,120 --> 00:14:27,350 +여기 보면 아주 긴 문자열이 들어 있는 큰 데이터를 읽어들이네요. + +213 +00:14:27,350 --> 00:14:30,860 +우리는 이 데이터를 배열해서 각 문자에 index를 지정할 것입니다. + +214 +00:14:30,860 --> 00:14:36,300 +그리고 여기에 보다시피 initialization(초깃값 설정)을 하게 됩니다. + +215 +00:14:36,299 --> 00:14:39,899 +hidden size(hidden state의 크기)는 hyperparameter(바뀌지 않는 값) 입니다. 여기서는 100으로 설정했습니다. + +216 +00:14:39,899 --> 00:14:43,100 +hidden size(hidden state의 크기)는 hyperparameter(바뀌지 않는 값) 입니다. 여기서는 100으로 설정했습니다. + +217 +00:14:43,100 --> 00:14:46,720 +여기 있는 건 learning rate 이고요. + +218 +00:14:46,720 --> 00:14:51,019 +25가 지정되어 있는 seq_length는 여러분이 RNN을 공부하다 보면 나오는 parameter 입니다. + +219 +00:14:51,019 --> 00:14:53,899 +많은 경우 우리의 입력 데이터는 너무 커서 RNN에 한꺼번에 넣을 수가 없습니다. + +220 +00:14:53,899 --> 00:14:56,870 +이것은 우리가 backpropagation을 하는 동안 메모리에 데이터를 저장해 두어야 하는데 여기에 한계가 있기 때문이죠 + +221 +00:14:56,870 --> 00:15:00,070 +이것은 우리가 backpropagation을 하는 동안 메모리에 데이터를 저장해 두어야 하는데 여기에 한계가 있기 때문이죠 + +222 +00:15:00,070 --> 00:15:03,540 +이것은 우리가 backpropagation을 하는 동안 메모리에 데이터를 저장해 두어야 하는데 여기에 한계가 있기 때문이죠 + +223 +00:15:03,539 --> 00:15:07,139 +그래서 우리는 입력 데이터를 몇 개의 데이터로 쪼개고, 여기서는 길이가 25인 데이터들로 쪼갰습니다. + +224 +00:15:07,139 --> 00:15:09,230 +그래서 우리는 입력 데이터를 몇 개의 데이터로 쪼개고, 여기서는 길이가 25인 데이터들로 쪼갰습니다. + +225 +00:15:09,230 --> 00:15:14,769 +그러니까 한 번에 처리할 문자의 개수가 25개인 것입니다. + +226 +00:15:14,769 --> 00:15:19,509 +다시 설명하면, 한 번에 backpropagation 하는 문자의 개수가 25인 것이고, + +227 +00:15:19,509 --> 00:15:22,149 +한 번에 모든 데이터를 기억해서 backpropagation 할 수 없기 때문에, 하나의 크기가 25개인 덩어리 데이터들로 나누어서 처리합니다. + +228 +00:15:22,149 --> 00:15:26,899 +한 번에 모든 데이터를 기억해서 backpropagation 할 수 없기 때문에, 하나의 크기가 25개인 덩어리 데이터들로 나누어서 처리합니다. + +229 +00:15:26,899 --> 00:15:30,789 +여기 보이는 행렬들은 random 함수를 이용해서 초기값이 무작위적으로 입력됩니다. + +230 +00:15:30,789 --> 00:15:34,709 +Wxh, Whh, Wxy은 모두 우리가 backpropagation을 통해 학습시킬 대상들입니다. + +231 +00:15:34,710 --> 00:15:36,790 +Wxh, Whh, Wxy은 모두 우리가 backpropagation을 통해 학습시킬 대상들입니다. + +232 +00:15:36,789 --> 00:15:40,699 +loss function은 넘어가고 맨 밑 부분을 살펴보겠습니다. + +233 +00:15:40,700 --> 00:15:44,020 +이 부분은 Main loop입니다. 이 중에서 몇 부분을 한번 살펴보죠. + +234 +00:15:44,019 --> 00:15:48,399 +이 부분에서 어떤 변수들에 0을 대입하는 초기화가 진행됩니다. + +235 +00:15:48,399 --> 00:15:50,829 +그리고 계속해서 loop을 돌리게 되죠. + +236 +00:15:50,830 --> 00:15:54,960 +우리가 지금 보고 있는 것은 전체 데이터의 한 batch 입니다. + +237 +00:15:54,960 --> 00:15:58,970 +전체 데이터 세트에서 크기 25의 문자 batch를 가지를 list input으로 넣어줍니다. + +238 +00:15:58,970 --> 00:16:03,019 +그리고 그 list input은 각 문자에 대응되는 25개의 숫자를 갖고 있습니다. + +239 +00:16:03,019 --> 00:16:06,919 +타겟들은 여기 index에 1을 더한 값이 되는데요, + +240 +00:16:06,919 --> 00:16:09,909 +이것은 타겟들이 현재 순서가 아니라 바로 다음 순서에 나올 문자들이기 때문에 그렇습니다. 
+ +241 +00:16:09,909 --> 00:16:15,269 +그러니까 list input에는 25개의 문자에 대응되는 25개의 숫자가 있고, 타겟 문자는 그 숫자들에서 1을 더한 index에 대응되는 문자들입니다. + +242 +00:16:15,269 --> 00:16:20,689 +그러니까 list input에는 25개의 문자에 대응되는 25개의 숫자가 있고, 타겟 문자는 그 숫자들에서 1을 더한 index에 대응되는 문자들입니다. + +243 +00:16:20,690 --> 00:16:26,480 +이것은 sampling 코드입니다. + +244 +00:16:26,480 --> 00:16:30,659 +매 시간 단계에서 RNN을 학습시키면서, 현재 RNN이 어떻게 사고하고 있는지 알아보기 위한 sample을 출력합니다. + +245 +00:16:30,659 --> 00:16:35,370 +매 시간 단계에서 RNN을 학습시키면서, 현재 RNN이 어떻게 사고하고 있는지 알아보기 위한 sample을 출력합니다. + +246 +00:16:35,370 --> 00:16:40,320 +우리가 문자 단위의 RNN을 사용할 때에는 + +247 +00:16:40,320 --> 00:16:43,570 +RNN이 매 시간 단계마다 바로 다음에 올 문자들의 순서를 출력합니다. + +248 +00:16:43,570 --> 00:16:46,379 +그러니까 sampling 후 그것을 다시 입력값으로 주고, 다음 sample을 또다시 입력값으로 주는 방식으로 모든 sample을 입력한 다음, + +249 +00:16:46,379 --> 00:16:49,259 +그러니까 sampling 후 그것을 다시 입력값으로 주고, 다음 sample을 또다시 입력값으로 주는 방식으로 모든 sample을 입력한 다음, + +250 +00:16:49,259 --> 00:16:52,769 +그러니까 sampling 후 그것을 다시 입력값으로 주고, 다음 sample을 또다시 입력값으로 주는 방식으로 모든 sample을 입력한 다음, + +251 +00:16:52,769 --> 00:16:56,549 +RNN에게 추상적인 문자열을 만들라고 지시할 수 있게 됩니다. + +252 +00:16:56,549 --> 00:17:00,549 +이게 이 코드의 기능이고, 이것은 조금 있다 살펴볼 sample function을 사용합니다. + +253 +00:17:00,549 --> 00:17:04,250 +여기서는 loss function을 불러옵니다. + +254 +00:17:04,250 --> 00:17:09,160 +loss function은 입력값, 타겟 문자, hprev 을 입력받습니다. + +255 +00:17:09,160 --> 00:17:13,900 +hprev는 h from previous chunk 을 뜻합니다. + +256 +00:17:13,900 --> 00:17:18,179 +우리가 크기가 25인 batch들을 사용하는데, + +257 +00:17:18,179 --> 00:17:22,400 +hidden state에서는 바로 전 batch의 마지막 문자가 무엇인지에 대한 정보가 필요하고, 이 마지막 문자를 다음 batch의 첫 h 에 입력하게 됩니다. + +258 +00:17:22,400 --> 00:17:26,140 +그러니까 h가 batch 에서 그 다음 batch 로 제대로 넘어가기 위해서 h prev을 사용하는 것입니다. + +259 +00:17:26,140 --> 00:17:30,700 +그리고 그 h prev는 backpropagation 할 때만 사용됩니다. + +260 +00:17:30,700 --> 00:17:35,558 +그 h prev을 loss fuction에 입력하면, loss, gradient, weight 행렬, 그리고 bias를 출력합니다. + +261 +00:17:35,558 --> 00:17:39,319 +그 h prev을 loss fuction에 입력하면, loss, gradient, weight 행렬, 그리고 bias를 출력합니다. + +262 +00:17:39,319 --> 00:17:44,149 +여기에서 loss를 print 하고, 여기에선 parameter들을 loss function이 하라는 대로 업데이트합니다. + +263 +00:17:44,150 --> 00:17:47,429 +실제로 업데이트가 되는 것은 여기 adagrad update 라고 적혀 있는 부분이네요. + +264 +00:17:47,429 --> 00:17:53,100 +여기 gradient 계산을 위한 변수들을 제곱한 값들을 계속 더해 줍니다. + +265 +00:17:53,099 --> 00:17:56,819 +그리고 이것들로 adagrad를 업데이트 하죠. + +266 +00:17:56,819 --> 00:18:00,639 +이제 loss funcion을 살펴보겠습니다. + +267 +00:18:00,640 --> 00:18:05,790 +이 블록이 loss fuction이고, foward와 backward 방법들로 이루어져 있습니다. + +268 +00:18:05,789 --> 00:18:08,990 +처음에는 forward pass, 나중에는 초록색으로 적혀 있는 backward pass를 수행합니다. + +269 +00:18:08,990 --> 00:18:13,130 +처음에는 forward pass, 나중에는 초록색으로 적혀 있는 backward pass를 수행합니다. + +270 +00:18:13,130 --> 00:18:18,919 +forward pass에서는 input을 target을 향하게 만듭니다. + +271 +00:18:18,919 --> 00:18:23,360 +여기서 25개의 index를 받지만, 반복문을 25번 실행하는 것이 아니라, + +272 +00:18:23,359 --> 00:18:27,500 +여기 있는 성분이 모두 0인 input vector에 one-hot 인코딩을 하게 됩니다. + +273 +00:18:27,500 --> 00:18:32,169 +그러니까 input에 대응되는 bit를 1로 지정하는 것이죠. + +274 +00:18:32,169 --> 00:18:34,110 +one hot encoding을 이용해서 input을 주고, + +275 +00:18:34,109 --> 00:18:39,229 +밑에 있는 recurrence 공식을 이용해서 계산합니다. + +276 +00:18:39,230 --> 00:18:42,210 +hs[t]는 매 시간 단계의 모든 값들을 기록합니다. + +277 +00:18:42,210 --> 00:18:46,910 +recurrence 공식과 이 두 줄의 코드를 통해 hidden state vector과 output vector 을 계산합니다. + +278 +00:18:46,910 --> 00:18:50,779 +여기서는 softmax function을 이용해서 normalization을 구현합니다. 
+ +279 +00:18:50,779 --> 00:18:54,440 +softmax function에서의 loss는 정답(역자주: 타겟 문자)이 나올 확률의 log를 취하고 거기에 -1을 곱한 값입니다.(역자주: cross entropy loss) + +280 +00:18:54,440 --> 00:18:58,190 +softmax function에서의 loss는 정답(역자주: 타겟 문자)이 나올 확률의 log를 취하고 거기에 -1을 곱한 값입니다.(역자주: cross entropy loss) + +281 +00:18:58,190 --> 00:19:02,779 +지금까지 forward pass 를 살펴보았고, 이제 그래프를 통해 backpropagation을 살펴보겠습니다. + +282 +00:19:02,779 --> 00:19:06,899 +backward pass에서는, 25번째 문자에서 첫번째 문자까지 거슬러 올라갑니다. + +283 +00:19:06,900 --> 00:19:08,530 +backward pass에서는, 25번째 문자에서 첫번째 문자까지 거슬러 올라갑니다. + +284 +00:19:08,529 --> 00:19:12,899 +backward pass에서는, 25번째 문자에서 첫번째 문자까지 거슬러 올라갑니다. + +285 +00:19:12,900 --> 00:19:16,509 +여기서는 softmax, activation 등을 통한 backpropagation이 수행됩니다. + +286 +00:19:16,509 --> 00:19:19,089 +그리고 모든 gradient와 parameter들을 더해주죠. + +287 +00:19:19,089 --> 00:19:23,379 +한 가지 짚고 넘어갈 점은, Whh를 비롯한 행렬에서의 gradient 계산에서 '+='을 사용하고 있다는 것입니다. + +288 +00:19:23,380 --> 00:19:27,210 +한 가지 짚고 넘어갈 점은, Whh를 비롯한 행렬에서의 gradient 계산에서 '+='을 사용하고 있다는 것입니다. + +289 +00:19:27,210 --> 00:19:31,210 +한 가지 짚고 넘어갈 점은, Whh를 비롯한 행렬에서의 gradient 계산에서 '+='을 사용하고 있다는 것입니다. + +290 +00:19:31,210 --> 00:19:34,590 +우리는 매 시간 단계마다 weight 행렬들이 gradient를 받고, 이 값들을 모두 더해 주어야 하기 때문에, 이 행렬을 계속 쓰게 됩니다. + +291 +00:19:34,589 --> 00:19:37,449 +우리는 매 시간 단계마다 weight 행렬들이 gradient를 받고, 이 값들을 모두 더해 주어야 하기 때문에, 이 행렬을 계속 쓰게 됩니다. + +292 +00:19:37,450 --> 00:19:43,980 +그리고 계속해서 backpropagation을 하게 되죠. + +293 +00:19:43,980 --> 00:19:48,130 +여기에서 나온 gradient는 loss function에 사용되고, 결국 parameter를 업데이트하게 됩니다. + +294 +00:19:48,130 --> 00:19:52,580 +마지막으로 sampling function입니다. + +295 +00:19:52,579 --> 00:19:55,960 +여기서 RNN을 지금까지 학습한 training 데이터를 바탕으로 실제로 새로운 문자열 데이터를 출력하게 됩니다. + +296 +00:19:55,960 --> 00:19:59,058 +여기서 RNN을 지금까지 학습한 training 데이터를 바탕으로 실제로 새로운 문자열 데이터를 출력하게 됩니다. + +297 +00:19:59,058 --> 00:20:02,048 +여기서 문자열을 초기화해주었고, + +298 +00:20:02,048 --> 00:20:06,759 +피곤해질 때까지 (역자주: 미리 설정한 recurrence가 끝날 때까지) 다음 작업들을 반복합니다. + +299 +00:20:06,759 --> 00:20:09,289 +recurrence 공식 실행, 각 문자에 대한 확률분포 계산, 샘플링, one-hot 인코딩, 그리고 그 결과물을 다음 시간 단계로 재입력 + +300 +00:20:09,289 --> 00:20:10,450 +recurrence 공식 실행, 각 문자에 대한 확률분포 계산, 샘플링, one-hot 인코딩, 그리고 그 결과물을 다음 시간 단계로 재입력 + +301 +00:20:10,450 --> 00:20:15,640 +recurrence 공식 실행, 각 문자에 대한 확률분포 계산, 샘플링, one-hot 인코딩, 그리고 그 결과물을 다음 시간 단계로 재입력 + +302 +00:20:15,640 --> 00:20:22,460 +이 작업들을 충분히 많은 문자열을 출력할 때까지 계속 수행합니다. + +303 +00:20:22,460 --> 00:20:27,190 +(질문: 안들림 => 답변) 우리는 매 batch 마다 25개의 softmax classifier를 갖고 있습니다. + +304 +00:20:27,190 --> 00:21:04,680 +(답변) 그 classifier 들은 한번에 backpropagation을 진행하고, 반대방향으로 모든 결과물들을 더해주죠. + +305 +00:21:04,680 --> 00:21:14,910 +그게 우리가 이걸 쓰는 이유죠. 다음 질문? + +306 +00:21:14,910 --> 00:21:19,259 +(질문) 여기서 regularization을 쓰나요? + +307 +00:21:19,259 --> 00:21:23,720 +(답변) 여기서는 빠져 있습니다. 일반적으로 RNN에서는 다른 알고리즘만큼 regularization이 흔하게 적용되지는 않습니다. + +308 +00:21:23,720 --> 00:21:27,269 +(답변) 여기서는 빠져 있습니다. 일반적으로 RNN에서는 다른 알고리즘만큼 regularization이 흔하게 적용되지는 않습니다. + +309 +00:21:27,269 --> 00:21:38,379 +(답변) 가끔 아주 좋지 않은 결과를 낳기도 해서, 저는 그냥 사용하지 않을 때도 있습니다. 일종의 hyperparameter이죠. 다음 질문? (질문 안들림) + +310 +00:21:38,380 --> 00:21:48,260 +(답변) 여기서의 문자들은 아주 기초적인 수준입니다. 그래서 실제로 이런 문자가 존재하는지 별로 신경쓰지는 않아요. + +311 +00:21:48,259 --> 00:21:51,839 +(답변) 여기서의 문자들은 아주 기초적인 수준입니다. 그래서 실제로 이런 문자가 존재하는지 별로 신경쓰지는 않아요. + +312 +00:21:51,839 --> 00:21:56,289 +문자들의 index와 그것들의 순서 정도만을 고려할 뿐이죠. + +313 +00:21:56,289 --> 00:21:58,569 +다음 질문? + +314 +00:21:58,569 --> 00:22:08,009 +(질문) space 대신 일정한 segment size(25)를 이용하는 이유가 있나요? 
+ +315 +00:22:08,009 --> 00:22:13,460 +(질문) space 대신 일정한 segment size(25)를 이용하는 이유가 있나요? + +316 +00:22:13,460 --> 00:22:18,630 +(답변) 크기가 25인 batch 말고 space로 구분하는 것 역시 가능할 것 같습니다. 하지만 거기에는 언어에 대한 특별한 가정이 필요해서 권장되지 않아요. + +317 +00:22:18,630 --> 00:22:22,530 +자세한 이유는 좀 있다가 살펴보도록 하겠습니다. + +318 +00:22:22,529 --> 00:22:25,359 +이 코드에는 어떤 문자열도 입력할 수 있어요. 이걸 갖고 여러 가지를 해 볼게요. + +319 +00:22:25,359 --> 00:22:31,539 +여기 우리가 출처를 모르는 어떤 문자열이 있습니다. + +320 +00:22:31,539 --> 00:22:34,889 +그리고 이 문자열을 RNN에 학습시키고, RNN이 문자열을 만들어내게 할 거에요. + +321 +00:22:34,890 --> 00:22:40,670 +예를 들어, 셰익스피어의 모든 작품을 입력할 수 있습니다. + +322 +00:22:40,670 --> 00:22:44,789 +크기가 좀 크긴 하지만, 이건 단지 문자열일 뿐이에요. + +323 +00:22:44,789 --> 00:22:48,289 +크기가 좀 크긴 하지만, 이건 단지 문자열일 뿐이에요. + +324 +00:22:48,289 --> 00:22:51,909 +RNN 셰익스피어의 작품을 학습시키고, 셰익스피어의 시에서의 다음 문자를 예측하게끔 할 수 있습니다. + +325 +00:22:51,910 --> 00:22:54,650 +처음에는 학습이 되어 있지 않기 때문에, 결과물들은 매우 무작위적인 문자열입니다. + +326 +00:22:54,650 --> 00:22:59,030 +처음에는 학습이 되어 있지 않기 때문에, 결과물들은 매우 무작위적인 문자열입니다. + +327 +00:22:59,029 --> 00:23:03,200 +하지만 학습을 통해 RNN은 이 문자열 안에는 단어들이 있고, 단어들 사이에 space가 있고, 쌍따옴표(")의 사용법을 이해하기 되죠. + +328 +00:23:03,200 --> 00:23:06,930 +하지만 학습을 통해 RNN은 이 문자열 안에는 단어들이 있고, 단어들 사이에 space가 있고, 쌍따옴표(")의 사용법을 이해하기 되죠. + +329 +00:23:06,930 --> 00:23:11,490 +하지만 학습을 통해 RNN은 이 문자열 안에는 단어들이 있고, 단어들 사이에 space가 있고, 쌍따옴표(")의 사용법을 이해하기 되죠. + +330 +00:23:11,490 --> 00:23:16,420 +그리고 'here', 'on', 'and so on' 과 같은 기본적인 표현들을 알게 됩니다. + +331 +00:23:16,420 --> 00:23:18,820 +그리고 RNN을 계속 학습시킬수록, 이러한 표현들이 점점 정제되는 것을 확인할 수 있습니다. + +332 +00:23:18,819 --> 00:23:22,609 +예를 들어 "를 한번 사용하면 "를 한번 더 사용해서 인용구를 닫아 주는 것들을 익히는 거죠. + +333 +00:23:22,609 --> 00:23:26,379 +또 문장이 마침표로 끝나는 것 역시 따로 가르치지 않고도 패턴만으로 익히게 됩니다. + +334 +00:23:26,380 --> 00:23:29,630 +또 문장이 마침표로 끝나는 것 역시 따로 가르치지 않고도 통계적 패턴만으로 익히게 됩니다. + +335 +00:23:29,630 --> 00:23:30,580 +그리고 마침내 '셰익스피어 문학' 자체를 생성할 수 있게 되죠. + +336 +00:23:30,579 --> 00:23:34,349 +여기 RNN이 만들어낸 작품을 읽어볼게요. + +337 +00:23:34,349 --> 00:23:38,740 +(읽는 중) "Alas, I think he shall come approached and the day..." + +338 +00:23:38,740 --> 00:23:42,900 +(읽는 중) "Alas, I think he shall come approached and the day..." + +339 +00:23:42,900 --> 00:23:45,460 +(읽는 중) "Alas, I think he shall come approached and the day..." + +340 +00:23:45,460 --> 00:23:56,909 +(질문) 하지만 이것들은 25개가 넘는 문자로 이루어진 문장은 기억할 수가 없기 때문에 제대로 생성할 수 없죠? + +341 +00:23:56,909 --> 00:24:02,679 +(답변) 네 맞습니다. 그거 사실 되게 알아차리기 힘든 부분이라 제가 나중에 말하려고 했었어요. + +342 +00:24:02,679 --> 00:24:05,980 +우리는 셰익스피어 작품이 아니라 다른 것들에도 이것을 활용할 수 있습니다. + +343 +00:24:05,980 --> 00:24:08,960 +이것들은 제가 Justin과 작년에 만들어본 것들입니다. + +344 +00:24:08,960 --> 00:24:12,990 +Justin은 한 대수기하학 책의 LaTeX 소스를 RNN에 학습시켰습니다. + +345 +00:24:12,990 --> 00:24:18,069 +Justin은 한 대수기하학 책의 LaTeX 소스를 RNN에 학습시켰습니다. + +346 +00:24:18,069 --> 00:24:23,398 +그리고 RNN은 수학책을 집필했죠. + +347 +00:24:23,398 --> 00:24:27,199 +물론 RNN은 LaTeX 형식으로 결과물을 출력하지 않아서 저희가 약간 손봐주긴 했지만, + +348 +00:24:27,200 --> 00:24:30,009 +물론 RNN은 LaTeX 형식으로 결과물을 출력하지 않아서 저희가 약간 손봐주긴 했지만, + +349 +00:24:30,009 --> 00:24:33,890 +어쨌든 한두 번 손보고 나니 보시는 바와 같이 수학책이 되었어요. + +350 +00:24:33,890 --> 00:24:37,200 +어쨌든 한두 번 손보고 나니 보시는 바와 같이 수학책이 되었어요. + +351 +00:24:37,200 --> 00:24:42,460 +살펴보면, RNN은 proof(정리)를 쓰는 방법을 배웠네요. 수학적 정리의 끝에는 저렇게 사각형을 쓰죠. + +352 +00:24:42,460 --> 00:24:47,090 +lemma(소정리)를 비롯한 다른 것들도 만들어 냈고요. + +353 +00:24:47,089 --> 00:24:52,428 +그림을 그리는 방법도 배웠네요. + +354 +00:24:52,429 --> 00:24:56,720 +제가 가장 좋아하는 부분은 여기 왼쪽 상단에 있는 "Proof. Omitted" 부분입니다. 
+ +355 +00:24:56,720 --> 00:24:59,650 +RNN도 귀찮았나 봐요 (웃음) + +356 +00:24:59,650 --> 00:25:05,780 +RNN도 귀찮았나 봐요 (웃음) + +357 +00:25:05,779 --> 00:25:12,480 +전반적으로 보면 RNN은 꽤 대수기하학책 같이 보이는 걸 만들어 냈어요. + +358 +00:25:12,480 --> 00:25:16,160 +뭐 세부적인 부분은 제가 대수기하를 잘 몰라서 말하기 그렇지만, 전반적으로 괜찮아요. + +359 +00:25:16,160 --> 00:25:19,529 +저는 이어서 문자 단위 RNN으로 표현할 수 있는 가장 어렵고 추상적인 것들이 무엇이 있을까 생각했고, + +360 +00:25:19,529 --> 00:25:22,769 +소스 코드에 생각이 미쳤습니다. + +361 +00:25:22,769 --> 00:25:27,879 +그래서 리누스 토발즈의 GitHub에 들어가 리눅스의 모든 C 코드를 가져왔습니다. + +362 +00:25:27,880 --> 00:25:30,850 +이 C 코드는 자그마치 700MB나 됩니다. + +363 +00:25:30,849 --> 00:25:35,079 +이 코드를 RNN에게 학습시켰고, RNN은 코드를 생성해 냈습니다. + +364 +00:25:35,079 --> 00:25:39,849 +이게 바로 RNN이 생성해낸 코드입니다. + +365 +00:25:39,849 --> 00:25:42,949 +살펴보면 함수를 생성했고, 변수를 지정하고, 문법적 오류가 거의 없습니다. + +366 +00:25:42,950 --> 00:25:47,460 +변수를 어떻게 사용하는지도 아는 것 같고, + +367 +00:25:47,460 --> 00:25:53,230 +indentation (들여쓰기)도 적절히 했고, 주석도 달았습니다. + +368 +00:25:53,230 --> 00:25:58,089 +괄호를 열고 닫지 않는 등의 실수를 찾아보기가 매우 힘들었습니다. + +369 +00:25:58,089 --> 00:26:01,808 +이런 것들은 RNN이 배우기 가장 쉬운 것들 중 하나거든요. + +370 +00:26:01,808 --> 00:26:04,058 +RNN의 실수들 중에는 쓰이지 않을 변수를 선언하거나, 선언하지도 않은 변수를 불러오기를 시도는 것들이 있었습니다. + +371 +00:26:04,058 --> 00:26:07,240 +RNN의 실수들 중에는 쓰이지 않을 변수를 선언하거나, 선언하지도 않은 변수를 불러오기를 시도는 것들이 있었습니다. + +372 +00:26:07,240 --> 00:26:09,929 +그러니까 아직 매우 높은 단계의 코딩 수준에는 도달하지 못한 거죠. + +373 +00:26:09,929 --> 00:26:12,509 +하지만 그런 것들을 제외하고 보면 꽤 코딩을 잘 했습니다. + +374 +00:26:12,509 --> 00:26:17,460 +새로운 GPU 라이센스에 관한 주석을 다는 방법도 배웠네요. + +375 +00:26:17,460 --> 00:26:22,009 +새로운 GPU 라이센스에 관한 주석을 다는 방법도 배웠네요. + +376 +00:26:22,009 --> 00:26:25,779 +GPL 라이센스 다음에는 #include, 매크로 코드 등이 오는 것도 배웠고요. + +377 +00:26:25,779 --> 00:26:33,879 +(질문) 이건 (아까 보여준) min char-rnn 으로 만들어낸 건가요? + +378 +00:26:33,880 --> 00:26:37,169 +(답변) min char-rnn은 그냥 작동 원리를 알려주기 위해 만들어낸 장난감 같은 거고, + +379 +00:26:37,169 --> 00:26:41,230 +(답변) 실제로는 min char-rnn의 확장판인 torch 기반 char-rnn을 으로 구현했고, GPU를 이용해서 처리했습니다. + +380 +00:26:41,230 --> 00:26:45,009 +(답변) 실제로는 min char-rnn의 확장판인 torch 기반 char-rnn을 으로 구현했고, GPU를 이용해서 처리했습니다. + +381 +00:26:45,009 --> 00:26:49,269 +이 부분은 수업 마지막 부분에 다룰 것인데, 3-layer LSTM 이라는 것입니다. + +382 +00:26:49,269 --> 00:26:52,289 +이건 RNN의 복잡한 버전이라고 생각하면 됩니다. + +383 +00:26:52,289 --> 00:26:58,839 +좀 더 이해가 쉽도록 예를 들어 볼게요. + +384 +00:26:58,839 --> 00:27:02,089 +이건 작년에 저희가 이런 것들을 가지고 만들어본 것들입니다. + +385 +00:27:02,089 --> 00:27:08,949 +저희는 문자 단위 RNN에 신경과학적으로 접근을 해 보았습니다. + +386 +00:27:08,950 --> 00:27:13,110 +hidden state 내부 특정 cell의 excitement(흥분) 여부에 따라 색을 칠해 봤습니다. + +387 +00:27:13,109 --> 00:27:17,119 +hidden state 내부 특정 cell의 excitement(흥분) 여부에 따라 색을 칠해 봤습니다. + +388 +00:27:17,119 --> 00:27:18,699 +hidden state 내부 특정 cell의 excitement(흥분) 여부에 따라 색을 칠해 봤습니다. + +389 +00:27:18,700 --> 00:27:23,470 +보시다시피, hidden state의 뉴런들의 상태를 해석하는 일이 쉽지가 않습니다. + +390 +00:27:23,470 --> 00:27:27,110 +보시다시피, hidden state의 뉴런들의 상태를 해석하는 일이 쉽지가 않습니다. + +391 +00:27:27,109 --> 00:27:29,829 +왜냐하면 어떤 뉴런들은 매우 낮은 단계에서의 작업을 맡거든요. + +392 +00:27:29,829 --> 00:27:33,859 +예를 들면, 'h 다음에 e가 얼마나 자주 오는가' 가 있네요. + +393 +00:27:33,859 --> 00:27:37,928 +하지만 어떤 cell 들은 해석하기가 꽤 용이했습니다. + +394 +00:27:37,929 --> 00:27:41,830 +여기 보시는 것은 인용구 검출 cell 입니다. + +395 +00:27:41,829 --> 00:27:46,460 +이 cell은 처음 따옴표가 나오면 켜지고, 따옴표가 다시 나타나면 꺼집니다. + +396 +00:27:46,460 --> 00:27:50,610 +이건 그냥 backpropagation의 결과로 나온 것입니다. + +397 +00:27:50,609 --> 00:27:54,329 +RNN은 문자열의 길이가 따옴표들의 사이에 있을때와 따옴표 바깥에 있을 때에 다르다는 것을 파악했습니다. + +398 +00:27:54,329 --> 00:27:57,639 +그래서 hidden state의 특정 부분들을 현재 문자들이 인용구 안에 있는지 파악하게 했습니다. 
+ +399 +00:27:57,640 --> 00:28:00,650 +그래서 hidden state의 특정 부분들을 현재 문자들이 인용구 안에 있는지 파악하게 했습니다. + +400 +00:28:00,650 --> 00:28:05,159 +이것이 아까 (질문했던 사람)의 질문에 답을 해줄 것 같은데요, + +401 +00:28:05,159 --> 00:28:06,500 +이 RNN의 seq_length는 100 이었습니다.(역자주: batch 크기가 100) + +402 +00:28:06,500 --> 00:28:10,269 +하지만 실제로 이 인용구들의 크기를 재어 보면 100보다 훨씬 길다는 것을 알 수 있습니다. + +403 +00:28:10,269 --> 00:28:16,220 +제가 보기에 대략 250정도 인 것 같네요. + +404 +00:28:16,220 --> 00:28:20,190 +그러니까 우리는 한 번에 크기가 100인 backpropagation만을 진행했고, RNN에게는 그때만이 유일한 학습 기회입니다. + +405 +00:28:20,190 --> 00:28:23,460 +그러니까 문자열 크기가 100이 넘어가면 그 앞뒤의 dependencies(종속성, 관계) 에 대해서는 직접적으로 학습하지를 않습니다. +406 +00:28:23,460 --> 00:28:27,809 +그러니까 문자열 크기가 100이 넘어가면 그 앞뒤의 dependencies(종속성, 관계) 에 대해서는 직접적으로 학습하지를 않습니다. + +407 +00:28:27,809 --> 00:28:31,159 +하지만 이 결과는 실제 문자열의 길이보다 작은 크기의 batch 들로 학습한다고 해도, batch 크기보다 긴 문자열에 대해서도 잘 작동할 수 있다는 것을 보여주네요. + +408 +00:28:31,160 --> 00:28:36,580 +하지만 이 결과는 실제 문자열의 길이보다 작은 크기의 batch 들로 학습한다고 해도, batch 크기보다 긴 문자열에 대해서도 잘 작동할 수 있다는 것을 보여주네요. + +409 +00:28:36,579 --> 00:28:39,859 +그러니까 batch 크기는 100이었지만, + +410 +00:28:39,859 --> 00:28:44,759 +크기가 수백이 넘는 문자열의 dependecies 도 잘 잡아낸 것이죠. + +411 +00:28:44,759 --> 00:28:48,890 +이것은 톨스토이의 <전쟁과 평화> 데이터 입니다. + +412 +00:28:48,890 --> 00:28:52,460 +이 데이터 세트는 대략 80문자마다 한 번 줄이 바뀝니다. + +413 +00:28:52,460 --> 00:28:57,819 +이 데이터 세트는 대략 80문자마다 한 번 줄이 바뀝니다. + +414 +00:28:57,819 --> 00:29:02,470 +그리고 우리는 줄 길이 tracking cell을 찾아냈습니다. + +415 +00:29:02,470 --> 00:29:06,539 +이 cell은 줄이 처음 시작하면 1로 시작해서, 문자열이 진행될수록 천천히 그 값이 감소합니다. + +416 +00:29:06,539 --> 00:29:09,019 +RNN은 현재 자신이 어느 시간 단계에 있는지 알아야 하기 때문에 이 기능은 매우 유용합니다. + +417 +00:29:09,019 --> 00:29:13,059 +RNN은 현재 자신이 어느 시간 단계에 있는지 알아야 하기 때문에 이 기능은 매우 유용합니다. + +418 +00:29:13,059 --> 00:29:15,149 +이를 통해서 언제 줄을 바꾸어야 하는지 알 수 있기 때문이죠. + +419 +00:29:15,150 --> 00:29:19,280 +이것 말고도 if 문을 감지하는 cell도 찾아냈고, + +420 +00:29:19,279 --> 00:29:23,970 +인용구과 주석을 감지하는 cell 도 찾아냈고, + +421 +00:29:23,970 --> 00:29:28,710 +상대적으로 deep한 코드를 감지하는 cell 도 찾아냈습니다. + +422 +00:29:28,710 --> 00:29:33,150 +다른 역할을 수행하는 cell 들도 찾을 수 있을 것이고, 중요한 것은 이것들이 전부 backpropagation 에서 나왔다는 겁니다. + +423 +00:29:33,150 --> 00:29:36,710 +되게 마법같은 일이죠. + +424 +00:29:36,710 --> 00:29:42,130 +(질문) 어떻게 cell 하나하나가 흥분했는지 알 수 있었죠? + +425 +00:29:42,130 --> 00:29:49,110 +(답변) 이 LSTM 에서는 대략 2100개의 cell 들이 있었습니다. 저는 그냥 하나하나 다 살펴봤어요. + +426 +00:29:49,109 --> 00:29:54,589 +(답변) 대부분은 규칙을 찾기가 어려웠지만, 약 5%에 해당하는 cell들에 대해서 살펴본 것들과 같은 규칙을 찾을 수 있었습니다. + +427 +00:29:54,589 --> 00:30:00,429 +(질문) 그러니까 어떤 cell들은 켜고, 어떤 cell들은 끄는 방식으로 찾은 건가요? + +428 +00:30:00,430 --> 00:30:05,310 +(답변) 오 제가 질문을 잘못 이해했었네요. 저희는 RNN 전체를 실행시켰고, 특정 hidden state의 흥분 상태를 관찰했습니다. + +429 +00:30:05,309 --> 00:30:09,679 +(답변) 오 제가 질문을 잘못 이해했었네요. 저희는 RNN 전체를 실행시켰고, 특정 hidden state의 흥분 상태를 관찰했습니다. + +430 +00:30:09,680 --> 00:30:14,470 +(답변) 그러니까 그냥 실행은 그대로 하되, 특정 hidden state의 상태를 기록하고 살펴본 것입니다. + +431 +00:30:14,470 --> 00:30:20,900 +이해가 되셨나요? + +432 +00:30:20,900 --> 00:30:23,940 +그러니까 저는 여기서 hidden state 단 한 부분만을 여기 슬라이드에 나타냈습니다. + +433 +00:30:23,940 --> 00:30:27,740 +물론 hidden state 에는 이 부분 말고도 다른 일들을 하는 cell들이 많이 있죠. + +434 +00:30:27,740 --> 00:30:30,349 +이것들은 모두 동시에, 다른 기능을 수행합니다. + +435 +00:30:30,349 --> 00:30:41,899 +(질문) 여기서의 hidden state의 layer은 1개인가요? + +436 +00:30:41,900 --> 00:30:50,150 +(답변) Multi-layer RNN을 말씀하시는 건가요? 그것에 대해서는 좀 있다가 설명드리겠습니다. 여기서는 Multi-layer을 썼지만, Single-layer을 썼어도 결과는 비슷했을 거에요. + +437 +00:30:50,150 --> 00:31:00,490 +(질문: 안들림) (답변): 이 hidden state 들은 -1 ~ 1의 값을 가집니다. tanh 함수의 결과물이거든요. 
+ +438 +00:31:00,490 --> 00:31:04,120 +(답변) 이건 우리가 아직 다루지 않은 LSTM에 대한 것들입니다. 한 cell에 배정된 값은 -1~1 이라는 것 정도만 알아두세요. + +439 +00:31:04,119 --> 00:31:11,869 +(답변) 이건 우리가 아직 다루지 않은 LSTM에 대한 것들입니다. 한 cell에 배정된 값은 -1~1 이라는 것 정도만 알아두세요. + +440 +00:31:11,869 --> 00:31:15,609 +RNN은 매우 잘 작동하고, 이러한 시퀀스 모델을 잘 학습할 수 있습니다. + +441 +00:31:15,609 --> 00:31:19,039 +대략 1년 전에 어떤 사람들이 이걸 컴퓨터 비전-image captioning 분야에 적용해 보았습니다. + +442 +00:31:19,039 --> 00:31:22,039 +대략 1년 전에 어떤 사람들이 이걸 컴퓨터 비전-image captioning 분야에 적용해 보았습니다. + +443 +00:31:22,039 --> 00:31:25,210 +여기서는 어떤 하나의 사진을 가지고 단어의 배열을 생성해 보았는데요, + +444 +00:31:25,210 --> 00:31:27,840 +RNN은 여기서 매우 잘 작동했습니다. + +445 +00:31:27,839 --> 00:31:32,490 +RNN은 여기서 매우 잘 작동했습니다. + +446 +00:31:32,490 --> 00:31:36,240 +여기 한 부분을 보시면, + +447 +00:31:36,240 --> 00:31:43,039 +사실 이건 제 논문이기 때문에 저 사진들은 제가 마음대로 쓸 수 있죠. + +448 +00:31:43,039 --> 00:31:46,629 +CNN에 이미지를 입력했는데요, + +449 +00:31:46,630 --> 00:31:48,990 +잘 살펴보시면 사실 이것은 CNN과 RNN의 두 부분으로 구성되어 있다는 것을 발견할 수 있습니다. + +450 +00:31:48,990 --> 00:31:51,750 +잘 살펴보시면 사실 이것은 CNN과 RNN의 두 부분으로 구성되어 있다는 것을 발견할 수 있습니다. + +451 +00:31:51,750 --> 00:31:55,460 +CNN은 이미지 처리를, RNN은 단어들의 순서 결정을 맡았습니다. + +452 +00:31:55,460 --> 00:31:58,470 +제가 강의 처음에 했던 레고 블록 비유를 기억한다면, + +453 +00:31:58,470 --> 00:32:01,039 +CNN과 RNN을 그림에 보이는 화살표와 같이 연결시킨 것을 이해할 수 잇을 것입니다. + +454 +00:32:01,039 --> 00:32:04,509 +CNN과 RNN을 그림에 보이는 화살표와 같이 연결시킨 것을 이해할 수 잇을 것입니다. + +455 +00:32:04,509 --> 00:32:07,829 +저희가 여기서 잘한 점은 여기서 RNN 단어 생성 모델의 입력값을 적절히 조절했다는 것입니다. + +456 +00:32:07,829 --> 00:32:11,349 +그러니까 아무 텍스트나 RNN에 입력한 것이 아니라, + +457 +00:32:11,349 --> 00:32:14,939 +CNN의 결과물을 RNN의 입력값으로 받아온 것이죠. + +458 +00:32:14,940 --> 00:32:21,220 +좀 더 자세히 설명드리겠습니다. forward pass 부분부터요. + +459 +00:32:21,220 --> 00:32:24,110 +여기 test image가 있습니다. + +460 +00:32:24,109 --> 00:32:27,679 +우리는 이 이미지에서 단어들의 시퀀스를 만들어보고 싶어요. + +461 +00:32:27,680 --> 00:32:31,240 +그래서 다음과 같이 이미지를 먼저 처리했습니다. + +462 +00:32:31,240 --> 00:32:35,250 +먼저 이미지를 CNN에 입력했습니다. 여기서 쓰인 CNN은 VGG net 이었습니다. + +463 +00:32:35,250 --> 00:32:37,349 +그리고 여기 conv들과 maxpool 들을 통과시켰죠. + +464 +00:32:37,349 --> 00:32:40,149 +일반적으로 마지막에는 softmax classifier가 위치합니다. + +465 +00:32:40,150 --> 00:32:44,440 +softmax는 확률분포를 출력하죠. 예를 들어 1000개의 카테고리가 있다면 각 카테고리에 대한 확률분포를요. + +466 +00:32:44,440 --> 00:32:47,420 +근데 여기서 우리는 softmax를 사용하지 않았습니다. + +467 +00:32:47,420 --> 00:32:50,750 +대신 이 끝부분을 RNN의 시작 부분과 연결시켰죠. + +468 +00:32:50,750 --> 00:32:54,880 +RNN 입력에 처음에는 특별한 벡터들을 사용했습니다. + +469 +00:32:54,880 --> 00:33:00,410 +RNN 에 입력되는 벡터들의 차원은 300이었고요, + +470 +00:33:00,410 --> 00:33:02,700 +RNN의 첫 iteration에는 무조건 이 벡터를 사용했습니다. + +471 +00:33:02,700 --> 00:33:05,750 +그럼으로써 RNN이 이것이 시퀀스의 시작임을 파악할 수 있게 했습니다. + +472 +00:33:05,750 --> 00:33:09,039 +그리고 아까 살펴본 recurrence 공식 (Vanilla NN)을 사용했습니다. + +473 +00:33:09,039 --> 00:33:13,769 +그리고 아까 살펴본 recurrence 공식 (Vanilla NN)을 사용했습니다. + +474 +00:33:13,769 --> 00:33:18,779 +아까는 (Wxh*x + Whh*h)과 0으로 초기화되는 h_0을 사용했다면, + +475 +00:33:18,779 --> 00:33:23,500 +아까는 (Wxh*x + Whh*h)과 0으로 초기화되는 h_0을 사용했다면, + +476 +00:33:23,500 --> 00:33:28,089 +아까는 (Wxh*x + Whh*h)과 0으로 초기화되는 h_0을 사용했다면, + +477 +00:33:28,089 --> 00:33:33,649 +이번에는 v를 추가해서 (Wxh*x + Whh*h + Wih*v) 를 사용했습니다. + +478 +00:33:33,650 --> 00:33:38,040 +v는 CNN의 맨 마지막 출력값이고, + +479 +00:33:38,039 --> 00:33:43,399 +Wih는 v에 들어 있는 이미지에 대한 정보를 RNN에게 전달해주기 위한 가중치 행렬입니다. + +480 +00:33:43,400 --> 00:33:46,380 +RNN에 이미지의 정보를 전달해 주는 방법은 실제로 여러 가지가 있고, + +481 +00:33:46,380 --> 00:33:48,940 +이것은 그 중 쉬운 한 방법일 뿐입니다. + +482 +00:33:48,940 --> 00:33:51,690 +이것은 그 중 쉬운 한 방법일 뿐입니다. 
+ +483 +00:33:51,690 --> 00:33:55,750 +t = 0 에서의 y_0 벡터는 시퀀스의 첫번째 단어의 확률분포입니다 + +484 +00:33:55,750 --> 00:34:00,009 +t = 0 에서의 y0 벡터는 시퀀스의 첫번째 단어의 확률분포입니다 + +485 +00:34:00,009 --> 00:34:05,490 +이것이 작동하는 방식을 설명해 볼게요. + +486 +00:34:05,490 --> 00:34:09,699 +여기 그림에 밀짚모자가 보이시죠 + +487 +00:34:09,699 --> 00:34:12,939 +이 부분은 CNN에 의해 '지푸라기 같은' 물체로 인식됩니다. + +488 +00:34:12,940 --> 00:34:17,039 +Wih는 이 부분의 hidden state의 값이 특정 state로 넘어갈 때 '지푸라기'이라는 단어가 출력되게 하는 확률을 높이는 데 영향을 미칩니다. + +489 +00:34:17,039 --> 00:34:20,519 +그래서 '지푸라기 같은' 질감을 가진 이미지가 실제로 '지푸라기'이라는 단어의 출현 확률을 높이는 것이죠. + +490 +00:34:20,519 --> 00:34:23,940 +y0의 값 중 하나가 커지게 되는 방식으로요. + +491 +00:34:23,940 --> 00:34:28,470 +y0의 값 중 하나가 커지게 되는 방식으로요. + +492 +00:34:28,469 --> 00:34:32,269 +이제 RNN은 두 가지 작업을 처리해야 합니다. + +493 +00:34:32,269 --> 00:34:36,550 +다음 순서에 어떤 단어가 올지를 예측하고, 현재 이미지 정보를 기억해야 합니다. + +494 +00:34:36,550 --> 00:34:40,629 +우리가 이 softmax로부터 샘플링을 했을때 실제로 이 부분에서 가장 출현 확률이 높은 단어가 '지푸라기'라면, + +495 +00:34:40,628 --> 00:34:44,710 +우리가 이 softmax로부터 샘플링을 했을때 실제로 이 부분에서 가장 출현 확률이 높은 단어가 '지푸라기'라면, + +496 +00:34:44,710 --> 00:34:47,519 +우리는 이 단어를 기록하고 이것을 다시 RNN에 넣어줍니다. + +497 +00:34:47,519 --> 00:34:52,190 +이 단계에서 우리는 단어 단위 embedding을 사용하고 있습니다. + +498 +00:34:52,190 --> 00:34:55,750 +'지푸라기' 라는 단어는 차원이 300인 벡터의 한 원소입니다. + +499 +00:34:55,750 --> 00:35:00,010 +현재 우리가 여기서 사용하는 단어 사전에는 지푸라기를 비롯한 각기 다른 벡터의 형태로 표시되는 300개의 단어들이 존재합니다. + +500 +00:35:00,010 --> 00:35:02,940 +이 300개의 단어를 RNN에 입력하면 그 출력값 y1은 바로 다음 순서에 올 단어를 예측합니다. + +501 +00:35:02,940 --> 00:35:07,090 +하나는 우리가 이러한 모든 특성을 우리가 얻을 왜 내 두 번째 세계와 순서 + +502 +00:35:07,090 --> 00:35:08,010 +그것에서 샘플을 다시 + +503 +00:35:08,010 --> 00:35:12,490 +워드 모자 가능성이 있다고 가정 지금 우리는 모자 400 훨씬 나이 프리젠 테이션을 + +504 +00:35:12,489 --> 00:35:18,299 +그리고 거기의 분포를 얻을 후 우리는 다시 샘플링하고 우리는 때까지 샘플 + +505 +00:35:18,300 --> 00:35:21,350 +우리는 특별한 샘플 및 진정의 끝에있는 기간 토큰 + +506 +00:35:21,349 --> 00:35:24,900 +문장하고는 arnaz 지금이에서 생성 할 것을 우리에게 알려줍니다 + +507 +00:35:24,900 --> 00:35:30,280 +군대는 그렇게 확인 밀짚 모자 기간이 이미지를 설명했을 포인트 + +508 +00:35:30,280 --> 00:35:34,010 +치수와 그의 아내 사진의 수는 단어의 숫자 당신의 + +509 +00:35:34,010 --> 00:35:39,220 +특수 토큰과 우리가 항상 먹이 산업을위한 어휘 +1 + +510 +00:35:39,219 --> 00:35:43,609 +다른 단어에 해당하는 부문과 얘기 특별한 시작과 + +511 +00:35:43,610 --> 00:35:46,250 +우리는 언제나 그 전부 단일 통해 전파 + +512 +00:35:46,250 --> 00:35:49,769 +시간은 무작위로이 국유화하거나 당신은 무료로 BG 그물을 초기화 할 수 있습니다 + +513 +00:35:49,769 --> 00:35:52,099 +다음 분을 위해 무역 + +514 +00:35:52,099 --> 00:35:56,319 +배포판은 다음 그라데이션을 인코딩 한 다음이를 통해 백업 + +515 +00:35:56,320 --> 00:35:59,700 +전체 단일 모델로 것이나 그냥 모든 공동에서 훈련하고 얻을 + +516 +00:35:59,699 --> 00:36:08,389 +캡션 또는 이미지 캡처 확인 질문을 많이하지만 네 삼백 + +517 +00:36:08,389 --> 00:36:12,609 +감정 묻어은 너무 이미지 모든 단어의 단지 독립적있어 + +518 +00:36:12,610 --> 00:36:18,430 +그렇게 우리가 그것으로 얻을 파산거야와 관련된 300 번호를 가지고 + +519 +00:36:18,429 --> 00:36:21,769 +당신은 무작위로 초기화 한 다음이 더 나은 섹스에 들어갈 백업 할 수 있습니다 + +520 +00:36:21,769 --> 00:36:25,360 +그 묻어은 주위 그냥 매개 변수를 다른 이동합니다 오른쪽 그래서 + +521 +00:36:25,360 --> 00:36:30,530 +그것에 대해 생각하는 방법은 모두를위한 하나의 홉 표현을 데입니다입니다 + +522 +00:36:30,530 --> 00:36:34,960 +단어는 당신은 거대한 W 매트릭스 곳 하나 하나가 + +523 +00:36:34,960 --> 00:36:40,130 +그 백 농장과 W 곱셈과 승 300 밖으로하지만 크기가 + +524 +00:36:40,130 --> 00:36:43,530 +효과적으로 하나가 부러 밖으로 따 버릴거야있는 뭔가 w + +525 +00:36:43,530 --> 00:36:47,560 +나는 당신이 그 마음에 들지 않는 경우 그래서 그냥 생각이 한랭 전선의 종류의 걸거야 + +526 +00:36:47,559 --> 00:36:50,279 +침대에서 단지 하나의 호퍼 프리젠 테이션으로 생각하고 수행 할 수 있습니다 + +527 +00:36:50,280 --> 00:36:58,920 +교육에 토큰 네 말에 최대 네 그것의 모델러를 그런 식으로 생각 + +528 +00:36:58,920 --> 00:37:02,769 +데이터는 우리가 예술에서 기대하는 올바른 순서는 내가 할 수있는 첫 번째 단어입니다 + +529 +00:37:02,769 
--> 00:37:07,969 +기대 때문에 매일 훈련 예 일종의 특별이 + +530 +00:37:07,969 --> 00:37:10,288 +그리고 진행 토큰 + +531 +00:37:10,289 --> 00:37:28,929 +당신이 유선 수 다르게 우리는 모든 단일 상태로 연결이 밝혀 + +532 +00:37:28,929 --> 00:37:32,999 +그것은 실제로 당신이 단지에 연결하면 실제로 잘 작동 악화 때문에 작동 + +533 +00:37:32,998 --> 00:37:36,718 +시간 단계 최초의 다음 아르 논은이이 두 작업을 저글링하는 + +534 +00:37:36,719 --> 00:37:40,829 +그것은 예술과 그것을 통해 기억 할 필요가 무엇 이미지에 대한 기억 + +535 +00:37:40,829 --> 00:37:45,179 +또한 이러한 모든 의상을 생산해야하고 어떻게 든 거기에 그렇게하고 싶어 + +536 +00:37:45,179 --> 00:38:04,209 +일부는 사실 클래스 직후 나는 당신을 줄 수있는 이유를 전진 + +537 +00:38:04,208 --> 00:38:10,208 +단일 인스턴스는 이미지와 단어의 순서와 우리가 대응합니다 + +538 +00:38:10,208 --> 00:38:16,328 +여기에 그 단어를 연결 것이고, I를 우리는 이미지를 이야기하고 우리가하여야한다 + +539 +00:38:16,329 --> 00:38:22,159 +그래서 와서 당신이 모든 사람들은 바닥에 계획되지 않은 한 기차 시간 + +540 +00:38:22,159 --> 00:38:25,528 +이미지 런던과 다음이 그래프를 풀다 당신은 당신의 손실을 + +541 +00:38:25,528 --> 00:38:29,389 +당신이 조심 있다면 배경이 다음 이미지의 배치를 할 수 있으며, + +542 +00:38:29,389 --> 00:38:33,108 +그래서 당신의 이미지를 한 경우에는 때로는 서로 다른 길이의 시퀀스가 + +543 +00:38:33,108 --> 00:38:36,199 +당신이 난 것을 확인 말을해야하기 때문에 훈련 데이터는 조심해야 + +544 +00:38:36,199 --> 00:38:41,059 +아마 다음의 몇 가지를 최대 스무 단어의 배치를 처리하고자 + +545 +00:38:41,059 --> 00:38:44,499 +코드에서 당신이 알고에 그 문장이 짧거나 더 이상 필요가있을 것입니다 + +546 +00:38:44,498 --> 00:38:48,188 +일부 일부 일부 문장은 다른 사람보다 더 오래 있기 때문에 걱정 + +547 +00:38:48,188 --> 00:38:55,368 +우리는 내가 갈 물건이 너무 많은 질문이 + +548 +00:38:55,369 --> 00:39:03,450 +그 완전히 공동으로이 모든 것을 전파하도록 네 감사합니다 + +549 +00:39:03,449 --> 00:39:07,538 +훈련은 인터넷으로 기차를 미리 할 수​​ 있도록 한 다음 그 단어를 넣어 + +550 +00:39:07,539 --> 00:39:10,190 +이하지만 당신은 공동으로 모든 훈련을 원하고 그 큰이야 + +551 +00:39:10,190 --> 00:39:15,429 +우리는 우리가 검색 기능을 알아낼 수 있기 때문에 실제로 이점 + +552 +00:39:15,429 --> 00:39:20,368 +더 좋은 말은 그래서 당신은이 훈련하는 이미지를 설명하기 위해 + +553 +00:39:20,369 --> 00:39:23,890 +실제로 우리가 인구 조사 자료에이 시도는 일반적인 욕구 중 하나를 설정합니다 + +554 +00:39:23,889 --> 00:39:27,368 +마이크로 소프트 코코라고하는 것은, 그래서 그냥 당신이처럼 보이는 무엇의 아이디어를 제공합니다 + +555 +00:39:27,369 --> 00:39:31,499 +대략 각 이미지 80 이미지와 다섯 문장의 설명이 있었다 + +556 +00:39:31,498 --> 00:39:35,288 +그래서 당신은 단지 사람들에게 아마존 기계 터크를 사용하여 얻은 것은 우리에게주세요 + +557 +00:39:35,289 --> 00:39:39,710 +문장 이미지에 대한 설명과 기록 및 데이터 세트를 종료하고 + +558 +00:39:39,710 --> 00:39:43,249 +그래서 당신은 당신이 예상 할 수있는이 모델에게 결과의 종류를 훈련 할 때 또는 + +559 +00:39:43,248 --> 00:39:49,078 +약 좀이 같은이 너무 이러한 이미지를 설명하는 우리의 무엇이다 + +560 +00:39:49,079 --> 00:39:52,329 +이 이것이 검은 셔츠 연주 기타 또는 건설 사람이다라고 말한다 + +561 +00:39:52,329 --> 00:39:55,710 +도로 또는 두 젊은 여자에 작업 오렌지 시티 웨스트에서 노동자 재생 + +562 +00:39:55,710 --> 00:40:00,528 +레고 장난감이나 소년 그건 아니에요 웨이크 보드에 물론 공중제비를하고있다 + +563 +00:40:00,528 --> 00:40:04,650 +웨이크 보드는하지만 매우 재미 실패 사례도 있습니다 가까이있는 + +564 +00:40:04,650 --> 00:40:07,680 +또한이 야구 방망이를 들고 어린 소년입니다 보여주고 싶은 + +565 +00:40:07,679 --> 00:40:12,338 +이 고양이는 여자의 원격 제어와 함께 소파에 앉아있다 + +566 +00:40:12,338 --> 00:40:15,710 +거울 앞의 테디 베어를 들고 + +567 +00:40:15,710 --> 00:40:22,400 +여기 질감은 아마 무슨 일이 것은 그것을 만든 것입니다 확신 해요 + +568 +00:40:22,400 --> 00:40:26,289 +이 테디 베어가 있다고 생각하고 마지막은 서 창녀입니다 + +569 +00:40:26,289 --> 00:40:30,409 +거리 도로의 중간 그래서 분명히 일부 확실하지 아무 말 없다 무엇 + +570 +00:40:30,409 --> 00:40:34,858 +이 나온 모델의 단지 간단한 종류 그래서 거기에 무슨 일이 있었 + +571 +00:40:34,858 --> 00:40:37,619 +작년 모델의 이러한 종류의 상단에 작업하려고 많은 사람들이 있었다 + +572 +00:40:37,619 --> 00:40:41,559 +난 그냥 당신에게 11 레벨의 아이디어를 제공하고자 그들을 더 복잡하게 + +573 +00:40:41,559 --> 00:40:44,929 +흥미로운 단지 사람들이 기본 아키텍처를 연주하는 방법에 대한 아이디어를 얻을 수 + +574 +00:40:44,929 --> 00:40:51,329 +그래서 이것은 현재 모델에서 발견 경우 지난해 종이는 우리 + +575 +00:40:51,329 --> 00:40:55,608 +단지 처음에 시간을 이미지로 한 시간을 공급 한 경우를 + +576 +00:40:55,608 --> 00:40:59,480 +이 놀 수있는 것은 실제로 다시 볼 수있는 난폭 한 재발 성 신경 네트워크입니다 
+ +577 +00:40:59,480 --> 00:41:03,130 +무선 않는 작동 기술 화상의 화상 및 참조 부 + +578 +00:41:03,130 --> 00:41:07,180 +당신이 허용 등이 모든 단어를 생성하는 등의 단어가 없습니다 + +579 +00:41:07,179 --> 00:41:10,460 +실제로 이미지 옆 모습을하고 다른 기능을 찾아 + +580 +00:41:10,460 --> 00:41:13,470 +그것은 다음에 설명 할 수 있습니다 당신은 실제로 완전히에서이 작업을 수행 할 수있는 작업 + +581 +00:41:13,469 --> 00:41:17,899 +그들은 단지이 말뿐만 아니라 측면을 생성하지 않도록 학습 가능한 방법 + +582 +00:41:17,900 --> 00:41:21,289 +여기서 이미지에 다음보고하는 등이 작동하는 방식 만을 수행하지 않습니다 + +583 +00:41:21,289 --> 00:41:24,259 +아웃 아르 논하지만 당신은 아마 다음 하나의 시퀀스에 대한 분배있어 + +584 +00:41:24,260 --> 00:41:29,250 +하지만 제공이 오는 당신은 발륨은 우리가 전달이 경우 말을 않는 + +585 +00:41:29,250 --> 00:41:37,389 +512 활성화 부피 (512) (14)에 의해 14를 얻었고에서 모든 및 주석 + +586 +00:41:37,389 --> 00:41:40,179 +우리는 단지 그 분포를 인정하지 않습니다하지만 당신은 또한을 방출 한 시간 + +587 +00:41:40,179 --> 00:41:44,358 +모양까지 키처럼 좀입니다 오백열둘 차원 사진 + +588 +00:41:44,358 --> 00:41:48,019 +당신은 이미지 옆에 그래서 실제로 나는이 생각하지 않습니다 찾기 위해 원하는 것을 + +589 +00:41:48,019 --> 00:41:51,210 +그들은이 특별한 종이에 무슨 짓을하지만, 이것은 당신이 연결할 수 있습니다 한 방법입니다 + +590 +00:41:51,210 --> 00:41:54,510 +이 위로이 사진을보고 뭔가는 아르 논에서 방출되는 단지 + +591 +00:41:54,510 --> 00:41:58,430 +그냥 약간의 무게와 다음이 그림은 점 수를 사용하여 예측처럼 + +592 +00:41:58,429 --> 00:42:03,618 +제품이 모든 (14) (14)에 의해 위치가 그래서 우리는 이러한 모든 점 제품을 함께 + +593 +00:42:03,619 --> 00:42:09,108 +우리는 우리가 지금 우리가 다음 우리 (14)의 호환성에 의해 기본적으로 14 계산 달성 + +594 +00:42:09,108 --> 00:42:13,949 +그것은 모두 당신의 있도록 그래서 기본적으로 우리는이 모든 것을 정상화 이것에 부드러운 최대를 넣어 + +595 +00:42:13,949 --> 00:42:17,149 +이 14 (14)에 의해, 그래서 우리는 이미지를 통해 긴장 부르는이를 얻을 수 + +596 +00:42:17,150 --> 00:42:21,230 +아마 이미지에 지금 아르 논에 대한 흥미로운 내용을 통해지도, + +597 +00:42:21,230 --> 00:42:25,889 +우리는이와이 사람의 가중 합을 수행하라는 메시지가이 문제를 사용 + +598 +00:42:25,889 --> 00:42:27,239 +현출 + +599 +00:42:27,239 --> 00:42:30,929 +그래서 오늘 아침은 기본적으로는 어떻게 생각하는지의 신화는 현재 수 + +600 +00:42:30,929 --> 00:42:36,089 +그것에 대한 흥미가 돌아갑니다 당신은의 가중 합을하고 결국 + +601 +00:42:36,090 --> 00:42:39,850 +엘리스 팀이 시점에서보고 싶은 기능의 종류 + +602 +00:42:39,849 --> 00:42:44,809 +시간 등 섬의 생성 물건, 예를 들어 그것을 결정할 수 있습니다 + +603 +00:42:44,809 --> 00:42:49,400 +지금과 같은 객체에 대한보고 싶은 그 확인은 벡터 파일을 인정 + +604 +00:42:49,400 --> 00:42:53,220 +물건 같은 개체의 숫자는이 때의 정액과 상호 작용 + +605 +00:42:53,219 --> 00:42:57,379 +위원회 주석 어쩌면 그 지역 같은 개체의 일부는 오는 + +606 +00:42:57,380 --> 00:43:01,700 +점등 및 천장처럼 떨어지는 정품 인증에서이지도를 참조 + +607 +00:43:01,699 --> 00:43:05,949 +4514 화나게하고 당신은 그 부분에 관심을 집중 결국 + +608 +00:43:05,949 --> 00:43:10,059 +이 상호 작용을 통해 그래서 당신은 기본적으로 그냥 할 수있는 조회 이미지 + +609 +00:43:10,059 --> 00:43:14,130 +이미지에 당신은 문장을 설명하고 그래서이 뭔가 우리 동안 + +610 +00:43:14,130 --> 00:43:17,360 +부드러운 구금으로 참조 실제로 몇 강연이가는 것 + +611 +00:43:17,360 --> 00:43:21,050 +그래서 우리는 군대가 실제로하지 않은 수있는이 같은 일을 다루려고 + +612 +00:43:21,050 --> 00:43:26,880 +선택적 입력을 처리하는 등의 수입을 통해 관심과 그 그래서 I + +613 +00:43:26,880 --> 00:43:30,030 +그냥 당신에게 그 무엇의 미리보기를 제공하기 위해 약 한 시간 그것을 가지고 싶어 + +614 +00:43:30,030 --> 00:43:34,490 +우리가 중 한 가지 방법으로 우리의 삶을 더 복잡하게하려면 이제 괜찮아 보이는 + +615 +00:43:34,489 --> 00:43:39,259 +이 당신을 제공합니다, 그래서 우리가 그 층을 쌓아하는 것입니다 할 수있는 당신이 더 많은 것을 알고 + +616 +00:43:39,260 --> 00:43:43,570 +깊은 물건은 일반적으로 더 나은 우리가에 가지 방법 중 하나를이를 시작하는 방법을 작동 + +617 +00:43:43,570 --> 00:43:46,809 +적어도 당신은 재발 성 신경 네트워크를 쌓을 수 많은 방법이있다 그러나이 + +618 +00:43:46,809 --> 00:43:49,409 +사람들이 당신이 할 수 실제로 사용하는 것이 바로 그 중 하나입니다 + +619 +00:43:49,409 --> 00:43:53,339 +똑바로 그냥 서로 그렇게 한 아르 논에 대한 자극이에 하네스를 연결 + +620 +00:43:53,340 --> 00:43:59,170 +우리가 이전에 주 사진의 디렉터 등이 이미지 + +621 +00:43:59,170 --> 00:44:02,750 +시간 축이 수평으로 이동 한 다음 우리가 다른이 위쪽으로가는 + +622 +00:44:02,750 --> 00:44:05,960 +이 특정 이미지의 의식 등 세 가지 별도의 재발이 있습니다 + +623 +00:44:05,960 --> 00:44:09,858 +신경 네트워크는 무게의 자신의 세트와 
각각이 대령이다 그 + +624 +00:44:09,858 --> 00:44:16,299 +난 그냥 서로 먹이를하지 그래서이 항상 공동으로 더 거기에 훈련되어 작동합니다 + +625 +00:44:16,300 --> 00:44:19,119 +기차는 먼저 모든 단지 하나의 경쟁 성장의 두 번째 임기 하나 원 + +626 +00:44:19,119 --> 00:44:22,700 +배경으로는 상단이 재발 식을 통해 얻을 수 + +627 +00:44:22,699 --> 00:44:25,980 +상아 영국은 여전히​​ 우리는 여전히있어 더 일반적인 규칙을 만들 가능성이 높습니다 + +628 +00:44:25,980 --> 00:44:29,280 +똑같은 일을하면 우리는 우리가 복용하고있는 같은 공식을하지 않았다된다 + +629 +00:44:29,280 --> 00:44:35,390 +우린 시간 전에에서 아래 아래 깊이와 효과에서 강의 + +630 +00:44:35,389 --> 00:44:39,469 +를 절단하고 퍼팅이 w 변환과를 통해 지원 + +631 +00:44:39,469 --> 00:44:40,519 +스매싱 10 각 + +632 +00:44:40,519 --> 00:44:44,509 +당신이 이것에 대해 약간 혼란스러워하는 경우에 당신이 기억한다면, 그래서 거기있다 + +633 +00:44:44,510 --> 00:44:51,760 +WRX H 시간의 X 플러스 당신이 다시 작성할 수 있습니다 whah 시간의 H는 엑손의 연결입니다 + +634 +00:44:51,760 --> 00:44:56,260 +하나의 행렬 곱 H 바로 그래서 난에 침을 국가 스틱 것처럼 + +635 +00:44:56,260 --> 00:45:03,680 +기본적으로 무슨 일이 끝나는 다음 하나의 열 벡터와 나는이 w 행렬이 + +636 +00:45:03,679 --> 00:45:07,690 +최대 일어나고 당신의 WX 연령이 행렬과 WH의 첫 번째 부분 + +637 +00:45:07,690 --> 00:45:12,700 +미국에서 두 번째로 당신의 매트릭스의 일부 등 식의이 종류는 기록 될 수있다 + +638 +00:45:12,699 --> 00:45:16,099 +식으로 당신은 당신의 입력을 쌓아 단일 W가 어디 + +639 +00:45:16,099 --> 00:45:24,759 +변환은 같은 식 있도록 그래서 우리가이는 중지 할 수 있습니다 방법 + +640 +00:45:24,760 --> 00:45:29,780 +두 시간 색인되는 이후로 지금 다음이 발표하고 + +641 +00:45:29,780 --> 00:45:33,510 +우리는 또한이 더 복잡한이 적층 공유되지 수 있습니다 지금은 한 방향으로 발생 + +642 +00:45:33,510 --> 00:45:37,030 +그들을 실제로 그렇게 지금 약간 더 반복 공식을 사용하여 + +643 +00:45:37,030 --> 00:45:40,300 +지금까지 우리는 복귀에 대한 매우 간단한 재발 수식으로 보았다 + +644 +00:45:40,300 --> 00:45:44,480 +실제로 작품은 실제로 거의 지금과 같은 공식을 사용하고 + +645 +00:45:44,480 --> 00:45:48,170 +기본 네트워크는 매우 드물게 우리가 그것에게 부르는 사용합니다 대신 사용되지 않습니다 + +646 +00:45:48,170 --> 00:45:52,059 +LSD와 오랜 단기 기억은 그래서 이것은 기본적으로 모든 서류에 사용된다 + +647 +00:45:52,059 --> 00:45:56,500 +지금이 공식은 당신이 인 경우도 프로젝트를 사용하는 것입니다 + +648 +00:45:56,500 --> 00:46:00,989 +사용이 현재 작동하지만 나는이 시점에서 주목하고 싶은 모든입니다 + +649 +00:46:00,989 --> 00:46:04,729 +동일은 알렌과 마찬가지로이 재발 수식은이 단지의 + +650 +00:46:04,730 --> 00:46:09,050 +약간 더 복잡한 기능을 확인 우리는 여전히 낮은에서 사진을 촬영하고 + +651 +00:46:09,050 --> 00:46:13,789 +그리고 이전의 시간에 입력 같은 깊이 이전 재산이었다 + +652 +00:46:13,789 --> 00:46:18,309 +연락 그들 앗 전송을 통해 이르렀 그러나 지금 우리는이 더이 + +653 +00:46:18,309 --> 00:46:21,869 +복잡성과 방법을 우리가 실제로이 지점에서 뉴 헤이븐 상태를 달성 + +654 +00:46:21,869 --> 00:46:25,539 +시간은 그래서 우리는 단지 약간 더 복잡한되고있어 방법에서 북한 이탈 주민을 결합 + +655 +00:46:25,539 --> 00:46:28,900 +아래 실제로 단지 더 상태를 제목에 업데이트를 수행하기 전에 + +656 +00:46:28,900 --> 00:46:33,050 +이 동기를 부여 정확히 복잡한 공식은 그래서 우리는 몇 가지 세부 사항에 갈거야 + +657 +00:46:33,050 --> 00:46:41,609 +공식 이유는 실제로 오스틴에서 사용할 수있는 더 좋은 생각이 될 수 있습니다 + +658 +00:46:41,608 --> 00:46:49,909 +그리고 우리가 지금 당장 그것을 통해 갈거야 의미가 나를 신뢰하게 그렇다면 당신 + +659 +00:46:49,909 --> 00:46:56,480 +오후 4시 일부 온라인 비디오를 차단하거나 Google 이미지는 다이어그램을 찾을 수 있습니다로 이동 + +660 +00:46:56,480 --> 00:47:00,989 +정말 도움이되지 않는이처럼 사람에게 내가 그를 처음봤을 때 생각 + +661 +00:47:00,989 --> 00:47:04,048 +이 사람이 정말 그가 무슨 일이 일어나고 있는지 정말 확신했다 겁처럼 정말 무서워되고 + +662 +00:47:04,048 --> 00:47:08,170 +나는 엘리스 팀을 이해하고 난 여전히이 두 다이어그램이 무엇인지 모르는에 + +663 +00:47:08,170 --> 00:47:14,289 +나는 목록을 파괴하려고하는거야하고 ​​까다로운 물건의 종류, 그래서 그렇게 확인 + +664 +00:47:14,289 --> 00:47:18,329 +그것을 통해 단계의 종류 당신이 정말로이 도면에 강의 있도록 넣어 + +665 +00:47:18,329 --> 00:47:24,220 +형식은 우리가 미국의 방정식이 있고 난 그래서 여기에없는 스팀 확인을 위해 완벽하다 + +666 +00:47:24,219 --> 00:47:28,238 +우리는이 두 벡터를 가지고 위치를 상단에 여기에 첫 번째 부분에 초점을 맞출 것 + +667 +00:47:28,239 --> 00:47:32,720 +아래로부터의 상태에서 이렇게 X와 HHS 이전 전에 사고 있지만, + +668 +00:47:32,719 --> 00:47:37,848 +우리는 변환 W를 통해 지금 모두 잭슨 href가 크기 경우를 만났다 + +669 +00:47:37,849 --> 00:47:40,950 +그래서 우리는 어떤을 위해 생산 끝날거야 숫자를 보낼있다 + +670 
+00:47:40,949 --> 00:47:46,068 +(21)에 의해 제시되었다이 w 매트릭스를 통해 확인 번호는 그래서 우리는 이러한이 + +671 +00:47:46,068 --> 00:47:51,108 +그들이 입력 짧은 것 OMG 경우 사 및 차원 벡터 나가뿐만 + +672 +00:47:51,108 --> 00:47:57,328 +그리고 G는 나는 당신과 그렇게 ISI없이 신호를 통과 단지를 무엇 확실하지 않다 + +673 +00:47:57,329 --> 00:48:05,859 +게이트 및 G는 방법에게 지금이 실제로 작동이 길을 똑바로 세입자 게이트로 이동 + +674 +00:48:05,858 --> 00:48:09,420 +그것에 대해 생각하는 가장 좋은 방법은 내가 깜빡 한 가지가 실제로 언급하는 것입니다 + +675 +00:48:09,420 --> 00:48:15,028 +이전 슬라이드는 일반적으로 하나의 HVAC 시도 말합니다 할 네트워크를 필요로하지 않습니다 + +676 +00:48:15,028 --> 00:48:18,018 +매번 중지하고 그에게 물었다 실제로 두 벡터 모든이 + +677 +00:48:18,018 --> 00:48:23,618 +한 시간 때문에 우리는 세포 상태 벡터를 참조 전화를 매도록 + +678 +00:48:23,619 --> 00:48:29,470 +시간 단계는 우리가 위험에 두 기관이 있고 그리고에서와 같이 여기 벡터를 참조하십시오 + +679 +00:48:29,469 --> 00:48:33,558 +노란색 그래서 우리는 기본적으로 두 벡터 여기 공간에있는 모든 단일 지점을 가지고 + +680 +00:48:33,559 --> 00:48:37,849 +그들이하는 일은 그들이 기본적 그래서이 셀 상태에서 작동하고있다 + +681 +00:48:37,849 --> 00:48:41,680 +전에 당신 아래의 내용에 따라 해당 사용자 컨텍스트 당신은 결국 + +682 +00:48:41,679 --> 00:48:45,199 +이들과 함께 세포 상태에서 작동 + +683 +00:48:45,199 --> 00:48:50,509 +그리고 옹 요소와 그것에 대해 생각하는 새로운 방법 내가 통해 갈거야된다 + +684 +00:48:50,510 --> 00:48:58,290 +이 0 또는 1 우리가 원하는 I NO처럼 이진 않습니다에 대해이 방법을 많이 생각합니다 + +685 +00:48:58,289 --> 00:49:01,199 +그들에게 우리가 그들을 게이트의 해석이 생각하고 싶다 갖고 싶어 할 수 + +686 +00:49:01,199 --> 00:49:05,449 +영웅이 그들이다 그것의로 우리는 물론 우리가 원하기 때문에 그들에게 이상 신호를 만들 + +687 +00:49:05,449 --> 00:49:08,348 +우리는하지만, 모든 것을 통해 전파 백업 할 수 있도록이 미분 될 수 있습니다 + +688 +00:49:08,349 --> 00:49:11,960 +우리의 상황에 기반을 계산 한 바로 진 것들로 이노 생각 + +689 +00:49:11,960 --> 00:49:17,740 +항상 여기서 뭘에서이 참조 다음 당신은 무엇을 기준으로 그를 볼 수있는 + +690 +00:49:17,739 --> 00:49:22,250 +이 문은 다음과 디아즈 우리는이 페이지의 값을 데이트 끝날거야 무슨 + +691 +00:49:22,250 --> 00:49:29,289 +특히이 에피소드는 TUS을 종료하는 데 사용됩니다 게이트를 잊지 + +692 +00:49:29,289 --> 00:49:34,869 +(20) 태양 전지 등의 보호소 가장 생각되는 세포들을 재설정 + +693 +00:49:34,869 --> 00:49:38,700 +우리와 함께 (20)이 상호 작용보다 기본적으로 우리가 할 수있는 하나 최근 이러한 카운터 + +694 +00:49:38,699 --> 00:49:42,368 +이것은 자신의 레이저 포인터가 부족합니다 곱셈의 요소입니다 + +695 +00:49:42,369 --> 00:49:45,530 +배터리 때문에 + +696 +00:49:45,530 --> 00:49:50,140 +상호 작용 0 당신은 우리가를 재설정 할 수 있도록 그 셀을 제로 것이다 볼 수 있습니다 + +697 +00:49:50,139 --> 00:49:53,969 +카운터 그리고 우리는 또한 우리는이를 통해 추가 할 수있는 카운터에 추가 할 수 있습니다 + +698 +00:49:53,969 --> 00:50:00,459 +상호 작용 I 번 G와 11 사이와 G는 부정적 일 사이이기 때문에 + +699 +00:50:00,460 --> 00:50:05,900 +(10)에 기본적으로 한 12 매 있도록 모든 세포 사이의 숫자를 추가 + +700 +00:50:05,900 --> 00:50:09,338 +우리는이를 재설정 할 수있는 모든 세포에서 이러한 카운터를 하나의 시간 단계 + +701 +00:50:09,338 --> 00:50:13,588 +국가 2012 케이트를 잊어 버렸거나 우리는 하나 사이의 숫자를 추가 할 수 있습니다 + +702 +00:50:13,588 --> 00:50:18,039 +12 그래서 확인을 하나 하나 셀은 우리가 다음 셀 업데이트 및 수행 방법 + +703 +00:50:18,039 --> 00:50:24,029 +업데이트가 찌그러 세포 그렇게 10 HFC는 셀을 숙청되고 끝 머리 + +704 +00:50:24,030 --> 00:50:28,760 +그렇게 만 셀 상태의 일부와 위로로 누출이 업데이트에 의해 변조 + +705 +00:50:28,760 --> 00:50:33,500 +숨겨진 상태가이 벡터에 의해 변조 오 그래서 우리는 단지의 일부를 공개 선택 + +706 +00:50:33,500 --> 00:50:39,530 +암탉 상태와 학습 가능 방법으로 세포는 몇 가지가있다 + +707 +00:50:39,530 --> 00:50:43,910 +에 하이라이트의 종류 여기에 아마 여기에 가장 혼란스러운 부분에 우리가 걸이다 + +708 +00:50:43,909 --> 00:50:47,500 +여기에 D I 배 하나 하나 사이의 숫자를 추가하지만 가지의 + +709 +00:50:47,500 --> 00:50:51,809 +우리는 단지 거기 G가 있다면 대신 다음 이미 사이에 이름 : Jeez 때문에 혼란 + +710 +00:50:51,809 --> 00:50:56,679 +8 11 왜 우리는 내가 여러 번 G 무엇을하지 실제로 우리가 제공하는 모든 필요합니까 우리 + +711 +00:50:56,679 --> 00:50:58,279 +원하는에 의해 바다를 구현하는 것입니다 + +712 +00:50:58,280 --> 00:51:02,330 +하나 하나 사이의 숫자는 그래서는 대한 내 성 부품의 종류의 + +713 +00:51:02,329 --> 00:51:08,989 +마지막으로 내가 한 대답은 당신이 G에 대해 생각하면 그것의 기능 있다고 생각합니다 + +714 +00:51:08,989 --> 00:51:16,159 +당신의 문맥의 선형 함수는 하나의 기회가 오른쪽으로 레이저 프린터가 없습니다 + +715 
+00:51:16,159 --> 00:51:26,649
+OK, so g is a linear function of the input and the previous hidden
+state, squashed through a tanh.

+716
+00:51:26,650 --> 00:51:42,710
+If we only ever added g to the cell state, tanh after tanh, that would
+be a fairly simple function; having the multiplicative interaction
+with i actually lets us express much richer functions.

+720
+00:51:42,710 --> 00:52:14,059
+Another way to think about it is that this decouples two concepts: how
+much we want to add to the cell state (that's g), and whether we want
+to add it at all (that's i). Decoupling the two turns out to give the
+dynamics some nice properties in how this all trains; we just ended up
+with this formula, and I'll actually go through it in more detail.

+728
+00:52:14,059 --> 00:52:55,380
+OK, the first interaction: think of the cell c flowing through. The
+sigmoid forget gate f gates the cell with a multiplicative
+interaction, so if f is zero you shut the cell off and reset the
+counter; then i times g gets added on top, so you're basically adding
+numbers between minus one and one to the cell state.

+735
+00:52:55,380 --> 00:53:26,090
+Then the cell leaks into the hidden state: it goes through a tanh and
+is gated by o, so only some parts of the cell state end up in the
+hidden state - and the hidden state not only goes to the next
+iteration of the LSTM, it also goes up to the higher layers, to the
+LSTM above us or to the prediction. So when you unroll this it looks
+like kind of a mess: you get an input vector from below and your own
+hidden state from before...

+743
+00:53:26,090 --> 00:53:52,269
+...and together they determine the four gates i, f, o, g - these are
+all n-dimensional vectors - which operate on the cell state; the cell
+state leaks out into the hidden state in a learnable way, and that
+either goes up to a prediction or into the next iteration of the LSTM
+in the future. So that's why this looks so ugly.

+749
+00:53:52,269 --> 00:54:20,359
+The question probably on your mind is why we do it in this particular
+way. There are many variants; people have played with these equations
+a lot, and we've kind of converged on this as something reasonable,
+but there are many little tweaks that don't degrade performance much -
+you can remove some of the gates, for example.

+757
+00:54:20,360 --> 00:54:46,039
+It turns out this form works well in general, and some variants were
+sometimes slightly better. I don't think we have very good reasons for
+why we ended up with exactly this, but it actually makes sense in
+terms of counters: you can reset them to zero, or add small numbers
+between zero and one, so it's actually relatively...

+763
+00:54:46,039 --> 00:55:22,869
+...simple to understand. And it's much better than the vanilla RNN -
+we should go into exactly why. To draw the distinction with a
+different picture: in a vanilla recurrent network you have a state
+vector that you operate over and completely transform through the
+recurrence formula, changing it at every time step. In the LSTM,
+instead, the cell states flow through: we only leak parts of the cell
+into the hidden state, the hidden state decides how to operate on the
+cell, and the forget and input gates basically just adjust the cell.

+772
+00:55:22,869 --> 00:55:48,839
+So looking at it as a function, we end up changing the cell state
+additively - instead of transforming it right away, it's an additive
+rather than a transformative interaction. Now, this should actually
+remind you of something we've already covered in class... yeah,
+that's right.

+778
+00:55:48,840 --> 00:56:18,440
+This is basically the same thing as a ResNet: normally we transform a
+representation, but a residual block has this skip connection, and you
+can see the residual has an additive interaction - x comes in, we do
+some computation based on x, and then we have the additive
+interaction. So the basic ResNet block and what happens in the LSTM
+are actually the same nice thing.

+785
+00:56:18,440 --> 00:56:49,779
+The cell goes through some functions and we can only add to the cell
+state; unlike a ResNet, the LSTM also has the forget gates, which can
+choose to shut off parts of the signal, but otherwise it looks very
+much like a residual block. It's funny that the two kind of converged
+to very similar architectures - in recurrent networks, as in very deep
+nets, it somehow ends up being much nicer to have these additive
+interactions that let you backpropagate much more effectively.

+794
+00:56:49,780 --> 00:57:20,809
+Think about the backpropagation dynamics, especially if I inject some
+gradient into the hidden state at the end: it's very clear that these
+plus interactions are just a gradient highway - the gradient flows
+right back through all the additive interactions; of course gradients
+also flow through the gates and contribute their part to the flow.

+802
+00:57:20,809 --> 00:57:57,339
+But you won't end up with what we refer to as the vanishing gradient
+problem, where the gradients just die down to zero as you
+backpropagate through - I'll show you an example of why this happens
+in a vanilla RNN in a bit. In an LSTM, the gradients we inject at
+every time step from above just pass through the cells on the highway
+and don't end up vanishing. Maybe I'll take some questions here if
+anything was confusing.

+811
+00:57:57,338 --> 00:58:03,059
+Here - but that's the last one, and after that I'll go into why
+vanilla RNNs have such terrible gradient flow.

+813
+00:58:03,059 --> 00:58:29,538
+[Student question about whether that particular gate wiring matters.]
+Yes - it turns out no single choice is especially crucial. There's a
+paper I'll show, 'LSTM: A Search Space Odyssey', where they really
+take things out and play with the equations; people have also played
+with things like connections where you additionally feed the cell
+state back in as an input to the gates.

+819
+00:58:29,539 --> 00:58:59,858
+People have really played with these architectures - they tried many
+iterations of the equations, and pretty much all of them work about
+the same. There's a paper where they treated the update equations as
+trees and did random mutations, trying all kinds of different updates:
+most of them work about the same, some of them break, but nothing
+really works substantially better than the LSTM.

+827
+00:58:59,858 --> 00:59:22,000
+OK, so the question is why vanilla recurrent neural networks have such
+terrible backward flow - there's also a video showing the vanishing
+gradient problem in recurrent neural networks...

+830
+00:59:22,000 --> 01:00:00,799
+What we're showing here: we take a recurrent network over many time
+steps, inject a gradient at, say, the 128th time step, backprop it
+through the network, and look at what the gradient is for the
+input-to-hidden weight matrix at every one of those time steps -
+remember, to actually get the full update you add up all of those
+gradients.

+839
+01:00:00,798 --> 01:00:29,720
+As that backpropagation happens, what you're seeing is the gradient
+across the time steps, and the information flowing through just dies:
+the gradient vanishes, it just becomes tiny numbers. Here I think it
+dies after about ten time steps, so anything we injected from further
+back never flows through the network - you can't learn very long
+dependencies, because the correlation structure has died down. We'll
+see why this happens dynamically in a bit.

+847
+01:00:29,719 --> 01:00:44,779
+It's funny, this video had some comments, like on YouTube or
+something... yeah.

+850
+01:00:44,780 --> 01:01:20,570
+OK, so let's look at a very simple example of a recurrent neural
+network that I'm unrolling for you here. I'm not showing the inputs -
+all the state updates are just Whh times the previous hidden state
+followed by the nonlinearity - so I'm basically forward-propagating a
+recurrent neural network for, say, fifty time steps, ignoring all the
+incoming input vectors.

+857
+01:01:20,570 --> 01:01:46,170
+So the forward pass is just Whh times h, threshold; Whh times h,
+threshold; and so on. Then in the backward pass, which I'm showing
+here, I inject a random gradient at the very last, fiftieth time step
+and go backwards: when you backprop through this, you go through the
+nonlinearity and then backprop the multiplication by Whh, over and
+over.

+863
+01:01:46,170 --> 01:02:09,570
+So the thing to note here: as I backprop, wherever the activation was
+below zero the thresholding kills the gradient, and we also backprop
+through the multiply by the Whh matrix before each nonlinearity - and
+when you actually look at what happens, there's something very funky
+going on.

+868
+01:02:09,570 --> 01:02:33,409
+Look at the gradients on the hidden states, dh, as you go backwards
+through time: there's a very particular kind of structure, which is
+quite worrying, in how these multiplications chain up in a loop from
+one time step to the next.

+872
+01:02:33,409 --> 01:02:55,500
+[Student question about the thresholding zeroing things out.] Yeah - I
+think sometimes maybe all of the outputs were dead, and that could
+kill the gradient too, but that's not really the issue here. The more
+worrying problem, which people can see more easily, is that...

+876
+01:02:55,500 --> 01:03:32,019
+...we multiply by this Whh matrix over and over again: in the forward
+pass we multiply by Whh at every iteration, so when we backprop
+through all the hidden states, the backprop formula means the gradient
+signal keeps getting multiplied by Whh - it arrives, gets multiplied
+by Whh, then again, and again - so we end up multiplying by the matrix
+Whh fifty times.

+883
+01:03:32,019 --> 01:03:55,990
+The problem with that: think about the scalar case, with numbers
+instead of matrices. If I take a random number and keep multiplying it
+by a second number, again and again, what does that sequence do?

+888
+01:03:55,989 --> 01:04:25,220
+Either it dies - if the second number is less than one in magnitude -
+or, if it's more than one, it actually explodes; so only really bad
+things can happen: it either dies or we explode. Here we don't have a
+single number but a matrix, and the same thing happens - the
+generalization is the spectral radius of Whh, its largest eigenvalue:
+if it's greater than one the gradient signal explodes, and if it's
+less than one it completely dies.

+895
+01:04:25,219 --> 01:04:45,720
+So basically, because this recurrence has this very odd formula, we
+end up with very terrible, very unstable dynamics - it just explodes
+or dies. The way this was handled in practice: you can control
+exploding gradients with one simple hack - when the gradient is
+exploding, you clip it.

+900
+01:04:45,719 --> 01:05:08,310
+So people actually do this in practice - it's a very patchy solution:
+if the norm of your gradient is above, say, 25, you clip it
+elementwise. Doing that clipping takes care of the exploding gradient
+problem, so you don't get explosions any more - but gradients can
+still vanish, and that's where the LSTM is very good: it handles the
+vanishing gradient problem...

+906
+01:05:08,309 --> 01:05:33,400
+...because of these highways of cells that are only changed through
+additive interactions: the gradients just flow, they don't die down.
+They can still blow up, since you're multiplying by the forget gate
+and so on, but that's roughly why the dynamics are much nicer. We
+always still do gradient clipping with LSTMs, because LSTM gradients
+can potentially explode, but they generally don't vanish.

+912
+01:05:33,400 --> 01:06:18,539
+[Student question: could you add a cell like this to a vanilla RNN as
+well?] For a vanilla recurrent network it's not clear where you would
+plug this in - it's not obvious in this formula exactly where it would
+go; the raw recurrence would maybe just grow in one direction, or you
+could end up making it smaller, which isn't actually nice. So
+basically, I think there's no clean way to wire it in.

+918
+01:06:18,539 --> 01:06:50,000
+One thing to notice: in terms of these super-highways, this picture of
+the gradients actually breaks down when the forget gates turn on,
+because the forget gate is a multiplicative interaction - whenever I
+forget, it kills the gradient, and of course the backward flow through
+these super-highways gets interrupted. So the highway picture really
+holds if you don't forget; if you do forget, the gradients can get
+killed.

+925
+01:06:50,000 --> 01:06:58,769
+So one trick people sometimes use when we play with LSTMs is to
+initialize the forget gates with a positive bias, so that initially
+the forgetting is switched...

+928
+01:06:58,769 --> 01:07:17,530
+...off - I think the forgetting always starts out disabled - so at
+the beginning the gradients flow very well, and the LSTM can learn how
+to shut them off later once it wants to; people sometimes play with
+that bias over time. The last thing I wanted to mention here...

+932
+01:07:17,530 --> 01:07:42,300
+...is this search over the space of variants. Many people have
+basically played with this quite a bit: there's the 'Space Odyssey'
+paper, where they tried various changes to the architecture, and a
+paper here that tried to do a search over a huge number of potential
+mutations of the LSTM equations - they searched a lot and found
+nothing that works substantially better than what is basically just
+the LSTM.

+937
+01:07:42,300 --> 01:08:12,190
+The GRU is also relatively popular in practice, and I'd actually
+recommend it: you can use the GRU, which is a variation on the LSTM.
+It was also designed to have these nice additive interactions; its
+upside is that it's a shorter, smaller formula, and it doesn't have a
+cell - it only has the hidden vector h - so implementation-wise it's
+nicer to keep track of just one vector, and it's a smaller, simpler
+thing in the forward and backward passes.

+943
+01:08:12,190 --> 01:08:29,130
+It seems to have most of the benefits of the LSTM. So that's called
+the GRU, and in my experience it works about the same, almost always,
+so you might want to use it - or you can use the LSTM; they're both
+about the same. So, summary: someone might say raw RNNs are very nice,
+but they actually don't work...

+948
+01:08:29,130 --> 01:08:46,670
+...very well, so use LSTMs (or GRUs) instead. What's nice about them
+is these additive interactions, which let the gradients flow very well
+- you don't get the vanishing gradient problem. We still worry a bit
+about the exploding gradient problem, so it's common to see people
+clip these gradients. And what I would say is to really try to...

+953
+01:08:46,670 --> 01:09:10,979
+...understand where this comes from and what's deep about the
+connection: there's something deep about these additive interactions
+between ResNets and LSTMs, and I don't think we've fully understood
+exactly why they work so well and which parts are crucial. So I think
+we need to understand this space both theoretically and empirically,
+and it's very much a wide-open area of research.

+959
+01:09:10,979 --> 01:09:33,960
+[Student question at the end of class about why we still clip LSTM
+gradients.] I can't say for sure it explodes - it's not clear to me
+exactly why it would - but you keep injecting gradients into the cell
+state, so maybe they can sometimes get large; so it's common to clip
+them, and it can probably matter once in a while.

+963
+01:09:33,960 --> 01:09:47,569
+I'm not a hundred percent sure about that point, but that's my basic
+intuition for it - yeah, that's what I think is going on. I don't
+think we can keep going here, but I'll take questions up here.
diff --git a/captions/Ko/Lecture11_ko.srt b/captions/Ko/Lecture11_ko.srt
new file mode 100644
index 00000000..e0d1c5ca
--- /dev/null
+++ b/captions/Ko/Lecture11_ko.srt
@@ -0,0 +1,3900 @@
+1
+00:00:00,000 --> 00:00:15,980
+ Right, so we have a lot of stuff to get through today, so I'd like to
+ start. Today we're going to talk about CNNs in practice, and a lot of
+ it is the really low-level kinds of implementation details that get
+ mentioned - the things you need to actually make these work when you
+ train them.

+5
+00:00:15,980 --> 00:00:34,920
+ But first, as usual, some administrative stuff to go through. Number
+ one: through a really heroic effort by the TAs, all the midterms have
+ been graded - everyone should definitely be grateful to them for that
+ - and you can pick yours up after class today, or in any of the
+ office hours from here on.

+9
+00:00:34,920 --> 00:01:04,000
+ Keep in mind the project milestone is due tonight at midnight, so I
+ hope you've made some really interesting progress on your project
+ over the last week or couple of weeks. Make sure to write that up and
+ submit it - not to Dropbox, but to the Assignments tab on Coursework;
+ I know, sorry, that's really confusing, but just the Assignments tab,
+ like an assignment.

+16
+00:01:04,000 --> 00:01:26,379
+ Hopefully we'll have the milestone grading done sometime this week.
+ Also remember assignment three is out - how is that going? Who's
+ done? OK, good; those of you who are done are in good shape, and the
+ rest of you should get started, since it's due in about a week.

+20
+00:01:26,379 --> 00:01:49,500
+ We have some fun statistics from the midterm, so you're not surprised
+ when you see your grades. It actually came out as this really nice,
+ beautiful Gaussian with a beautiful standard deviation, so we don't
+ have to normalize anything - it's already perfect. I should also
+ point out that the max someone got was 103, meaning they got all of
+ the bonus right - so it was probably not hard enough.

+25
+00:01:49,500 --> 00:02:04,729
+ We also have a per-question breakdown with the mean score on every
+ question of the midterm, so if you got something wrong you can go
+ check, in your own time, where everyone else went wrong too; we have
+ statistics for the true/false and the multiple choice.

+30
+00:02:04,730 --> 00:02:19,810
+ One true/false question we actually decided during grading was a
+ little unfair, so we threw it out and gave everyone credit - that's
+ why a couple of those are at one hundred percent. So those are the
+ statistics for all the individual questions; go have fun with them.

+33
+00:02:19,810 --> 00:02:43,208
+ OK, moving on. I know it's been a while - we had the midterm and we
+ had a holiday - but if you can remember back to a week or so ago, we
+ talked about recurrent networks: how recurrent networks can be used
+ to model sequences, whereas the feed-forward networks you know
+ generally model one-shot functions.

+39
+00:02:43,209 --> 00:03:09,590
+ We talked about particular implementations of recurrence - the
+ vanilla RNN and the LSTM, both of which you implement on the
+ assignment, so you should know what those are; we talked about how
+ recurrent networks can be used for language modeling, which was fun -
+ showing sampled generated text of Shakespeare and algebraic geometry
+ - and about combining convolutional and recurrent networks to do
+ image captioning.

+45
+00:03:09,590 --> 00:03:28,890
+ We played this game of being an RNN neuroscientist, diving into the
+ cells of the RNN and trying to interpret what they're doing; we saw
+ that some of the cells have pretty cool interpretable activations - a
+ quote-detection cell, for example.

+50
+00:03:28,889 --> 00:03:56,350
+ Today we're going to talk about something totally different - a lot
+ of really low-level things. There are three main themes you need to
+ know to actually get CNNs working; it's a bit of a potpourri, but
+ we'll try to tie it together. The first is really squeezing all the
+ juice out of your data - you may not have much for your projects, and
+ you don't need a big dataset: we'll talk about data augmentation and
+ transfer learning, which are really powerful, useful techniques,
+ especially when you're working with small datasets.

+58
+00:03:56,349 --> 00:04:26,069
+ Then we'll take a deeper dive into convolutions: both how you can
+ design efficient architectures using convolutions, and how
+ convolutions are actually implemented efficiently in practice.
+ Finally we'll talk about implementation details that usually don't
+ even make it into papers: things like CPU versus GPU, what kinds of
+ bottlenecks you experience in training, and distributing training
+ across multiple devices - lots of stuff like that.

+65
+00:04:26,069 --> 00:05:00,970
+ So first, let's talk about data augmentation. I think we've sort of
+ mentioned it in these lectures but never really talked about it
+ properly. Normally, when you're training a CNN, you'll be familiar
+ with this type of pipeline: during training you load an image and its
+ label off disk, feed the image through the CNN, use the image
+ together with the label to compute some loss, backpropagate to update
+ the CNN, and repeat. You should be really familiar with that by now.

+73
+00:05:00,970 --> 00:05:27,679
+ With data augmentation we add just one small step to this pipeline:
+ after we load the image off disk, we transform it in some way before
+ passing it to the CNN, and this transformation should preserve the
+ label. The CNN still gets backpropagated the same way - and the kind
+ of transforms you should be thinking of are really simple tricks.

+79
+00:05:27,680 --> 00:06:04,959
+ Data augmentation is really simple: it's a way to artificially expand
+ your training set through clever use of different transformations.
+ Remember the computer really sees these images as grids of pixels, so
+ there are transformations that preserve the label but change all the
+ pixels: if you imagine shifting that cat one pixel to the left, it's
+ still a cat, but every pixel has changed. So you can imagine you're
+ expanding your training set - the new synthetic training samples are
+ correlated, but it still helps you train bigger models and prevent
+ overfitting, and it's used very widely in practice.

+89
+00:06:04,959 --> 00:06:27,159
+ Pretty much every CNN you see winning competitions or doing well on
+ benchmarks is using some form of data augmentation. The easiest kind
+ is horizontal flips: when we see the mirror image of this cat, we
+ think it should still be a cat. It's really, really simple to
+ implement - a single call in numpy, and just as easy in torch and
+ other frameworks, with a single line of code.

+95
+00:06:27,160 --> 00:07:13,990
+ Another very widely used transformation is taking random crops from
+ the training images: at training time we load the image, take a patch
+ at a random scale and position, resize it to whatever fixed size the
+ CNN expects, and use that as our training example. Again, very widely
+ used; to give you a flavor of exactly how, I looked up the details
+ for ResNet: at training time they resize each training image so its
+ short side equals a random number, then sample a random 224 by 224
+ crop from the resized image and use that as the training sample. So
+ it's generally very easy to implement and helps quite a bit.

+106
+00:07:13,990 --> 00:07:52,019
+ When you use this form of data augmentation at training time, you
+ typically use a slightly different form at test time: the network was
+ never really trained on full images - it was trained on crops - so it
+ doesn't really seem fair, and doesn't tend to work as well, to force
+ the network to see the whole image at test time. So in practice, when
+ you do this random cropping for data augmentation, at test time you
+ use some fixed set of crops instead.

+114
+00:07:52,019 --> 00:08:20,649
+ Very commonly you'll see ten crops: you take the four corners - the
+ upper-left corner, the upper-right corner, the bottom corners - and
+ the center, and those five together with their horizontal flips give
+ you ten crops. At test time you pass those ten crops through the
+ network and average the scores. ResNet actually takes that a step
+ further and also does multi-scale testing - evaluating at multiple
+ scales at test time tends to help performance; again very easy to
+ implement and widely used.

+121
+00:08:20,649 --> 00:08:55,759
+ Another thing we'll commonly want to augment is color. If you'd taken
+ this picture of the cat on a day that was a little cloudier, or a
+ little sunnier, the colors would have come out quite a bit different,
+ so it's common to jitter the colors of our training images a little
+ before we feed them to the CNN. A very simple way is just contrast
+ jittering - very simple, easy to implement - but in practice you
+ actually see something a little more involved fairly often.

+129
+00:08:55,759 --> 00:09:46,719
+ Instead you'll see this slightly more complex pipeline using
+ principal component analysis over all the pixels of the training
+ data. The idea: each pixel of the training data is an RGB vector of
+ length 3, and if you collect those pixels over the whole training set
+ you get a sense of the kinds of colors generally present in the data;
+ applying PCA gives us three principal component directions in color
+ space - the directions along which colors tend to vary in this
+ dataset. Then at training time, for color augmentation, we sample
+ along those principal components to choose exactly how much to jitter
+ the colors of each training image.

+140
+00:09:46,720 --> 00:10:15,229
+ So it's a bit more complex, but this type of PCA-driven color
+ augmentation is used fairly widely - I think it was introduced by the
+ AlexNet paper in 2012 and is also used in ResNet, for example. So
+ this is a very common thing: the idea is to think about what kinds of
+ transformations of your dataset your classes should be invariant to,
+ and then introduce those types of variation into the training data at
+ training time.

+146
+00:10:15,230 --> 00:10:43,840
+ And you can really go crazy here and get really creative: think about
+ your data and what kinds of variation make sense for it. You might
+ try random rotations - rotating by a few degrees might make sense
+ depending on the data; you could try stretches and different kinds of
+ shearing to simulate affine transforms of the data; you can really
+ think creatively about interesting things to do to your data.

+153
+00:10:43,840 --> 00:11:22,730
+ One thing I want to point out is that this idea of data augmentation
+ fits into a bigger theme we've now seen repeated several times
+ through the course: it's one form of regularization that's really
+ useful in practice for preventing overfitting. During the forward
+ pass at training time we add some kind of stochastic noise that sort
+ of perturbs the network: with data augmentation we actually modify
+ the training data we put into the network; with things like dropout
+ and DropConnect we take random parts of the network and drop
+ connections, setting activations or weights randomly to zero.

+162
+00:11:22,730 --> 00:12:08,960
+ This also shows up in batch normalization, where the normalization
+ depends on the other things in the batch: during training the same
+ image can end up in many batches with many different other images,
+ which actually introduces this type of noise at training time. Then
+ at test time we average the noise out: for data augmentation we
+ average over multiple samples of the transformed data; for dropout
+ and DropConnect we marginalize the noise out sort of analytically;
+ and for batch norm we keep running averages instead. So I think
+ that's a nice way to unify a lot of these ideas: regularization is
+ when you add noise to the forward pass and then marginalize it out at
+ test time.

+173
+00:12:08,960 --> 00:12:45,840
+ So the main takeaways for data augmentation: it's really simple to
+ implement, so you should almost always use it - there's really no
+ excuse not to; it's very, very useful especially for small datasets,
+ which I think a lot of you will have for your projects; and it fits
+ nicely into this framework of noise at training time and
+ marginalization at test time. So I think that's pretty much all there
+ is to say about data augmentation - any questions about it?

+182
+00:12:45,840 --> 00:13:16,799
+ [Student question about training time and storage.] Yeah - training
+ can take many times longer, and if you try dumping these things to
+ disk it takes a lot of disk space, so sometimes people get creative
+ and do the augmentation on the fly, in a background thread that feeds
+ the data. Right, so I think that's clear.

+186
+00:13:16,799 --> 00:13:51,360
+ The next idea: there's this myth floating around that when you work
+ with CNNs you really need a huge amount of data, but it turns out
+ that myth is busted by transfer learning. There's a really simple
+ recipe you can use for transfer learning: first, take whatever your
+ favorite CNN architecture is - AlexNet, VGG, whatever - and either
+ train it on ImageNet yourself or, more commonly, just download a
+ pretrained model from the internet - that's an easy twenty-minute
+ download versus many hours of training; usually you don't do that
+ part yourself.

+193
+00:13:51,360 --> 00:14:21,810
+ Next, there are generally two regimes. If your dataset is really
+ small - you really don't have many images at all - you can just treat
+ the network as a fixed feature extractor. One way of looking at this:
+ you take away the last layer of the network - the softmax, or
+ whatever the model's final classification layer was - replace it with
+ some linear classifier for the task you actually care about, freeze
+ the rest of the network, and retrain only that top layer.

+200
+00:14:21,809 --> 00:14:58,599
+ So this is equivalent to just training a linear classifier directly
+ on top of features extracted from the network. What you'll see a lot
+ in practice for this case: as a preprocessing step, you dump the
+ features for all your training and test images to disk and work
+ entirely on top of those cached features, which helps speed things up
+ quite a bit. It's very easy to do, and it usually gives a very strong
+ baseline for many problems. If you have a bit more data, you might
+ actually be able to afford to train more of the model.

+209
+00:14:58,600 --> 00:15:47,490
+ So depending on the size of your dataset, you typically freeze some
+ part of the lower layers of the network, and instead of retraining
+ only the very last layer, you retrain some number of the final layers
+ - you choose how many depending on how much data you have; with a
+ bigger dataset you can generally afford to train more of those final
+ layers. And again, similar to the trick you saw before, it's very
+ common that instead of explicitly recomputing the frozen part, you
+ dump the features below the fine-tuned layers to disk and work on
+ just this top part in memory, which makes things quite a bit faster.
+ [Student question.]

+218
+00:15:47,490 --> 00:16:28,879
+ You can basically just try it and see, but this type of approach
+ should work especially well with small datasets. And if you just want
+ to do something like image retrieval, using L2 distance on top of CNN
+ features tends to be a pretty strong baseline. [Question about how
+ many samples you'd expect to need before training.] I mean, it
+ depends, but if you expect you'd need more data than you actually
+ have for these classes, then a good thing is to try this.

+225
+00:16:28,879 --> 00:16:59,729
+ Oh, sorry - yes, sometimes you do still depend on running the forward
+ pass, but often you just run the forward pass once, dump the features
+ to disk, and that's pretty common - it actually saves computation.

+229
+00:16:59,730 --> 00:17:56,589
+ [Question about the intermediate layers.] You'll probably have
+ different classes for your problem or whatever, but these other
+ intermediate layers you actually initialize from whatever was in the
+ pretrained model. And there's a good tip you'll find in practice: you
+ can think of three types of layers here. There are frozen layers,
+ which you can think of as having a learning rate of zero; there's the
+ new layer that's initialized from scratch, where people usually use a
+ higher learning rate - though maybe a tenth of whatever the network
+ was originally trained with; and there are these middle layers that
+ you initialize from the pretrained network but plan to modify during
+ the optimization - when fine-tuning, you tend to make the learning
+ rates of these intermediate layers very small, maybe a hundredth of
+ the original learning rate. Yeah.

+241
+00:17:56,589 --> 00:18:29,009
+ Something people have generally found when trying to investigate it
+ is that this type of transfer learning with fine-tuning works better
+ when the network was originally trained on a similar type of data.
+ That said, the very low-level features - edges, colors, Gabor filters
+ and the like - are probably applicable to just about any type of
+ visual data, so those low-level features in particular tend, I think,
+ to be pretty applicable to almost everything. And by the way,
+ another...

+248
+00:18:29,009 --> 00:19:08,658
+ ...tip you'll sometimes see in practice for fine-tuning is a
+ multi-stage approach: you first freeze the whole network and train
+ only this last layer, and then, after that last layer seems to
+ converge, go back and actually fine-tune more of the network. You can
+ otherwise have this problem where, because the last layer is
+ initialized randomly, you get very large gradients that sort of mess
+ up the pretrained initialization - so you get around it either by
+ freezing at first until the new layer converges, or by having these
+ varying learning rates.

+257
+00:19:08,659 --> 00:19:45,298
+ This idea of transfer learning actually works really well - there
+ were a couple of pretty early papers, from 2013-2014, when CNNs were
+ first getting popular. This one in particular, 'CNN Features
+ off-the-shelf: an Astounding Baseline', was very straightforward:
+ they took one of the best CNNs available at the time, just extracted
+ features from it, and applied those features to a bunch of standard
+ datasets and standard computer vision problems; and this was compared
+ against what at the time were very specialized...

+266
+00:19:45,298 --> 00:20:19,949
+ ...pipelines and very specialized architectures for each individual
+ problem and dataset. For each problem they just replaced the very
+ specialized pipeline with a very simple linear model on top of the
+ CNN features, and they found that across all of these different
+ datasets, this off-the-shelf approach was generally a very, very
+ strong baseline: on some problems it was actually better than the
+ existing methods, and on some problems it was a bit worse but still
+ very competitive. So this was a really cool paper that demonstrated
+ these are really strong features that can be used for a lot of
+ different tasks.

+275
+00:20:19,950 --> 00:20:37,388
+ Another paper along those lines was the DeCAF paper, which was from
+ Berkeley - DeCAF later became Caffe, so there's a bit of a lineage
+ there.

+278
+00:20:37,388 --> 00:21:15,868
+ So the recipe for transfer learning you should think about is this
+ two-by-two matrix: how similar is your dataset to what the pretrained
+ model was trained on, and how much data do you have. Generally, if
+ you have a very similar dataset and very little data, just use the
+ network as a fixed feature extractor and train a simple linear model
+ on top of those features - that tends to work very well. If you have
+ a bit more data, try fine-tuning: initialize the network from the
+ pretrained weights and run the optimization from there.

+286
+00:21:15,868 --> 00:21:57,928
+ This other box over here - very different data and very little of it
+ - can be trouble, and you can try to get creative: maybe instead of
+ extracting features from the very last layer, you can try extracting
+ features from different convolutional layers, and that can sometimes
+ help. There's an intuition for it: maybe for something like MRI data,
+ the very top-level features are very specific to the image categories
+ the network was trained on, but the very low-level features - things
+ like edges - may be more transferable from an ImageNet-style dataset
+ to your data. And in this last box you're in better shape again: you
+ can initialize and fine-tune.

+295
+00:21:57,929 --> 00:22:38,839
+ The other thing I want to point out is that initializing from a
+ pretrained model and fine-tuning is actually not the exception - it's
+ almost standard practice in nearly every large computer vision system
+ you see these days, and we've actually already seen two examples of
+ it. If you remember from a few lectures ago, when we talked about
+ object detection, we had a CNN looking at the image, region
+ proposals, these different heads, all these crazy things - but part
+ of it was a CNN; and in image captioning we had a CNN looking at the
+ image. In both of those cases the CNN was initialized from an
+ ImageNet-pretrained model, and that really helps solve these other,
+ more specialized problems...

+305
+00:22:38,839 --> 00:23:15,490
+ ...even without a huge dataset. Also, the image captioning model
+ includes a specific part - the word embeddings - which the homework
+ asks you to train yourself; but you could actually initialize those
+ from something like word vectors pretrained on a bunch of text, and
+ that can sometimes help, in some situations where you may not have a
+ lot of caption data available. [Question.] Yeah - it sometimes helps;
+ it depends on the problem...

+311
+00:23:15,490 --> 00:23:45,970
+ ...and on the network, but it's definitely something you can try, and
+ it might especially help when you're in that box - so it's a nice
+ trick. The takeaway for fine-tuning is that it's a really, really
+ good idea: you should probably almost always use it; it actually
+ works really well, and you generally don't want to be training these
+ things from scratch unless you have a really, really big dataset -
+ it's much more convenient to use an existing model in almost every
+ situation.

+318
+00:23:45,970 --> 00:24:26,889
+ And by the way, Caffe has this Model Zoo of existing models you can
+ download - models for many of the famous ImageNet architectures;
+ actually the official residual network models were released quite
+ recently, and you can download and play with those Caffe models too,
+ which is really cool. The Model Zoo has become a bit of a standard in
+ the community - you can even load Caffe models in other frameworks
+ like Torch - so the Caffe Model Zoo is something to be aware of; I've
+ found it quite useful. Any additional questions on fine-tuning or
+ transfer learning?

+327
+00:24:26,890 --> 00:25:03,640
+ [Question about very high-dimensional features.] Yeah - that's very
+ big... so you could try regularizing a linear model highly on top of
+ it, or try putting a small layer on top that maybe reduces the
+ dimensionality; you can get creative here. I think there are things
+ you could try to make that work, depending on the data.

+332
+00:25:03,640 --> 00:25:28,789
+ OK, so I think next we should talk more about convolutions. For all
+ these networks, we've said that convolutions are the computational
+ workhorse doing most of the work, so we should talk about two things
+ about convolutions: first, how to stack them - how we can design
+ efficient network architectures by combining many layers of
+ convolution to achieve nice results.

+337
+00:25:28,789 --> 00:26:01,298
+ So here's a question: suppose we have a network with two layers of
+ three-by-three convolutions - this would be the input, these the
+ activations after the first layer, these the activations after the
+ second layer of convolution. The question is: how big a region of the
+ input does a neuron on this second layer see? I hope - this was on
+ your midterm, so I hope you guys all know the answer to this one.

+344
+00:26:01,298 --> 00:26:32,669
+ OK... maybe that was a tough exam question. But it's five by five,
+ and it's pretty easy to see from this picture why: this neuron up on
+ the second layer is looking at some particular three-by-three volume
+ in the middle, and each of those three intermediate values is looking
+ at a three-by-three region of the input; so when you take them all
+ together, this neuron on the second hidden layer is in fact looking
+ at a five-by-five volume of the input. OK.

+351
+00:26:32,669 --> 00:27:09,940
+ Now the question is: three three-by-three convolutions stacked in a
+ row - how big a region of the input do they see? Seven by seven, by
+ the same kind of reasoning, with the receptive field just building up
+ over successive convolutions. So the point I want to make here is
+ that a stack of three three-by-three convolutions actually gives you
+ very similar representational power - that's my claim - to a single
+ seven-by-seven convolution. You could have a debate about the exact
+ meaning of this, and try to clean it up and prove things about it,
+ but in an intuitive sense they can represent similar types of...

+359
+00:27:09,940 --> 00:27:38,638
+ ...functions as a single seven-by-seven convolution, because they're
+ looking at the same input region of the input. So now we can actually
+ push further on this idea and compare more concretely between a
+ single seven-by-seven convolution and a stack of three three-by-three
+ convolutions. Suppose we have an input image of H by W by C, and we
+ want our convolutions to...

+364
+00:27:38,638 --> 00:28:09,519
+ ...preserve the depth, so we have C filters in each layer, and say we
+ pad appropriately so we also preserve H and W. To compare the single
+ seven-by-seven against the stack of three three-by-three convolutions
+ specifically: first question - does anyone know how many weights the
+ single seven-by-seven convolution has? (Let's forget about the
+ biases; they're confusing.)

+370
+00:28:09,519 --> 00:28:43,980
+ I heard some murmurs... my answer - I hope I got it right - is 49 C
+ squared: each seven-by-seven filter looks at the full depth C of the
+ volume, and you have C of these filters, so you get 49 C squared. For
+ the stack of three three-by-three convolutions, we have three layers
+ of convolution, each with C filters, where each filter is three by
+ three by C; multiplying it all out, the three three-by-three
+ convolutions have only 27 C squared...

+376
+00:28:43,980 --> 00:29:11,559
+ ...parameters. And if we assume we have ReLUs between each of these
+ convolutions, we see that the stack of three three-by-three
+ convolutions actually has fewer parameters and more nonlinearity,
+ which is kind of nice - that gives you some intuition for why a stack
+ of multiple three-by-three convolutions might actually be preferable
+ to a single seven-by-seven convolution.

+382
+00:29:11,559 --> 00:29:47,789
+ We can actually take this one step further and think not just about
+ the number of learnable parameters but about the number of
+ floating-point operations these things take. So, anyone want to guess
+ how much compute these take? This actually sounds hard to do in your
+ head, but it's almost easy: each of these filters gets placed at
+ essentially every position of the image, so the number of
+ multiply-adds is just going to be H times W times the number of
+ filter weights.

+389
+00:29:47,789 --> 00:30:12,300
+ So you can actually see it here again: comparing these two, the
+ seven-by-seven operation not only has more learnable parameters, it
+ actually costs more compute. So a stack of three-by-threes, again
+ with ReLUs in between, gives us more nonlinearity for less compute
+ and fewer parameters - that gives you some intuition for why having
+ multiple layers of three-by-three convolutions can actually be
+ preferable to bigger filters.

+394
+00:30:12,299 --> 00:30:33,539
+ Another question you might think of: we've been pushing toward
+ smaller and smaller filters, but why stop at three by three - why
+ can't we actually go smaller? The same logic would seem to extend.
+ You're shaking your head like you don't believe it, but it's true -
+ the catch is the receptive field. So what we'll actually do here is
+ compare a single...

+399
+00:30:33,539 --> 00:31:03,480
+ ...three-by-three convolution against a slightly fancier 'bottleneck'
+ architecture. Here we assume we're looking at an H by W by C input,
+ and we can do this nice trick: a single one-by-one convolution with
+ C/2 filters, which actually reduces the dimensionality of the volume
+ - now the output has the same spatial extent but half as many feature
+ channels, so it's half as...

+405
+00:31:03,480 --> 00:31:31,669
+ ...deep. Now, after this bottleneck, we do a three-by-three
+ convolution in this reduced dimensionality - from C/2 input features
+ to C/2 output features - and then we restore the dimensionality with
+ another one-by-one convolution, going from C/2 back up to C. This
+ kind of funky-looking architecture - this idea of using one-by-one
+ convolutions - is everywhere; it's sometimes called 'network in
+ network', because it has this intuition...

+411
+00:31:31,669 --> 00:32:02,720
+ ...that a one-by-one convolution is somewhat similar to sliding a
+ fully-connected network over each position of your input volume. This
+ idea also appears in GoogLeNet and, I think, ResNet, which use these
+ one-by-one bottleneck convolutions. So we can compare this bottleneck
+ sandwich against a single three-by-three convolution with C filters,
+ running through the same logic - I won't force you to compute it in
+ your head, but you can trust me on this:

+418
+00:32:02,720 --> 00:32:39,788
+ This bottleneck stack has 3.25 C squared parameters, where this
+ single three-by-three guy has 9 C squared parameters again. And if we
+ stick ReLUs between each of the convolutions in this bottleneck
+ sandwich, it's giving us more nonlinearity with fewer parameters; and
+ as we saw with the three-by-three versus seven-by-seven comparison,
+ the number of parameters ties directly to the compute, so this
+ bottleneck sandwich is also faster to compute. This one-by-one
+ bottleneck idea has...

+426
+00:32:39,788 --> 00:33:18,918
+ ...been getting used quite a bit recently, especially in GoogLeNet.
+ [Question.] Yeah - so you can sometimes think of it as projecting
+ down into lower-dimensional features and then projecting back up into
+ the higher-dimensional space. And if you think about stacking many of
+ these things on top of each other - in ResNet one of these would be
+ coming right after another - you end up with one-by-one convolutions
+ stacked on top of each other, which is a bit like sliding a...

+434
+00:33:18,919 --> 00:33:46,089
+ ...multi-layer fully-connected network over each channel fiber, which
+ might seem a bit weird when you think about it. But it turns out you
+ don't really need the spatial extent: even comparing the sandwich
+ against a single three-by-three conv with the same sort of input and
+ output volume sizes, it's cheaper to compute, has fewer parameters,
+ and has more nonlinearity - all kinds of nice properties. There's
+ only one problem here...

+440
+00:33:46,088 --> 00:34:14,428
+ ...which is that we're still using a three-by-three convolution
+ somewhere in there, and you might wonder whether we really need even
+ that. It turns out the answer is no: one crazy thing I've seen
+ recently is that you can factorize a three-by-three convolution into
+ a one-by-three followed by a three-by-one convolution - and compared
+ with a single three-by-three convolution, that ends up saving some
+ parameters as well.

+446
+00:34:14,429 --> 00:34:42,699
+ And if you really want to go crazy, you can combine this with the
+ bottleneck idea - one-by-one bottlenecks together with one-by-three
+ and three-by-one convolutions - and things just get really cheap.
+ That's basically what Google did in the latest version of Inception:
+ there's this kind of crazy paper, rethinking the Inception
+ architecture for computer vision, where they play a lot with these
+ crazy tricks - taking convolutions apart in weird ways, with lots of
+ one-by-one bottlenecks in different places and projecting back up in
+ dimension. If you thought the original GoogLeNet...

+454
+00:34:42,699 --> 00:35:14,610
+ ...inception module was crazy - this thing is the inception module
+ that Google is now using in their latest Inception. The interesting
+ features here are that it has these one-by-one bottlenecks
+ everywhere, and it has these asymmetric filters for the computation.
+ So this stuff isn't super widely used yet, but it's out there, it's
+ kind of crazy, and it's something cool that Google has mentioned.

+461
+00:35:14,610 --> 00:35:46,340
+ So to recap quickly on stacking convolutions: instead of having one
+ big convolution with a large filter size, it's usually better to
+ break it up into multiple smaller filters - which maybe helps explain
+ the difference between something like VGG, with its many, many
+ three-by-three filters, and something like AlexNet. The other thing
+ is this idea of one-by-one bottlenecking, which I think has become
+ pretty common - you see it in both versions of GoogLeNet and also in
+ ResNet - and it actually saves you a lot of parameters; I...

+469
+00:35:46,340 --> 00:36:21,300
+ ...think it's a useful trick to keep in mind. And this idea of
+ factorizing convolutions into asymmetric filters is maybe not so
+ widespread right now, but it may well be used more commonly in the
+ future. The dominant theme running through all these tricks is that
+ you can have fewer learnable parameters, less compute, and more
+ nonlinearity - all kinds of nice properties for your architectures.
+ Any questions about designing these convolutional architectures?...
+ She got it, so - clear, OK.

+476
+00:36:21,300 --> 00:36:47,720
+ So the next thing is that once you've decided how you want to tie
+ together your stack of convolutions, you actually have to compute
+ them, and there's been a lot of work on different ways of actually
+ implementing convolutions. On the assignment we asked for an
+ implementation using loops, and as you can probably guess, that
+ doesn't scale too well - but it's pretty easy to implement.

+482
+00:36:47,719 --> 00:37:22,930
+ One pretty easy way is this idea called im2col. The intuition here is
+ that we know matrix multiplication is really fast: for pretty much
+ every computing architecture out there, someone has written a really,
+ really well-optimized matrix-multiply library. So the idea behind
+ im2col is: well, given that matrix multiplication is really fast,
+ there are a few ways we can recast this convolution operation as a
+ matrix multiplication - and it turns out this is fairly easy.

+490
+00:37:22,929 --> 00:37:54,019
+ So think about it: we have our input volume, C by H by W, and we have
+ a filter bank of convolutional filters, each of which is going to be
+ C by K by K - each of these filters sees a C-by-K-by-K receptive
+ field whose depth matches the input. We need to slide these filters
+ over the input, and the idea is that we turn this into a
+ matrix-multiplication problem: first, we're going to take a receptive
+ field...

+497
+00:37:54,019 --> 00:38:09,359
+ ...of the image - one of these K by K by C regions - and reshape it
+ into a column with K squared times C elements; and we're going to
+ repeat this for every possible receptive field of the image: we take
+ this little guy and slide him over every possible position in the
+ image, and here I'm just saying that...

+501
+00:38:09,360 --> 00:38:36,829
+ 될 것이 아마 지역과 다른 수용 필드를 종료하는 것이 + +502 +00:38:12,679 --> 00:38:18,389 + 위치는 지금 우리는 우리의 이미지를 촬영했습니다 우리는이 거대한으로 재편 촬영했습니다 + +503 +00:38:18,389 --> 00:38:25,139 + 매트릭스 오 누구나 볼 볼 수있다 내 말 및 내 경우 어떤 가능성을 + +504 +00:38:25,139 --> 00:38:28,139 + 이 아마와 문제 + +505 +00:38:28,139 --> 00:38:36,829 + 그래, 그게 사실은 그래서 최선이 바로 많은 메모리를 많이 사용하는 경향이 + +506 +00:38:36,829 --> 00:38:41,380 + 이 책의 요소가 나타나는 경우 여러 수용 필드 다음이다 + +507 +00:38:41,380 --> 00:38:45,010 + 가고 그래서 이러한 열 여러 중복되는 및이에 가고 + +508 +00:38:45,010 --> 00:38:49,220 + 당신의 수용 필드하지만 사이가 overlap은 더 더를 얻을 수 + +509 +00:38:49,219 --> 00:38:52,839 + 실제로이 실제로 거래의 너무 큰 아니에요 및 밝혀 그 + +510 +00:38:52,840 --> 00:38:57,910 + 우리는거야이 길쌈에 유사한 검사를 잘 실행하고 작품 + +511 +00:38:57,909 --> 00:39:01,699 + 필터는 그래서 만약 당신이 회선 우리가 먹고 싶어 무엇을하고 있는지 기억 + +512 +00:39:01,699 --> 00:39:06,039 + 이러한 길쌈 무게의 각 각으로 우리의 제품을 + +513 +00:39:06,039 --> 00:39:10,889 + 이미지 때문에 각 수용 필드 위치에 대한 길쌈 무게 + +514 +00:39:10,889 --> 00:39:16,420 + 이 길쌈 무게의 각이 케이에 의해이 케이 것은 좌석이 그렇게 대답 구입 + +515 +00:39:16,420 --> 00:39:21,059 + 우리는 치로로 이제 우리는 D를 경우로 그 각각을 바꿀거야 + +516 +00:39:21,059 --> 00:39:26,420 + 좌석 행렬은 지금이 좋은해진다 필터 그래서 우리는 경우에 의해 계약을 얻었다 + +517 +00:39:26,420 --> 00:39:31,750 + 지금이 가이드는 수용 필드로 각 열의 모든 레셉 각을 포함 우리 + +518 +00:39:31,750 --> 00:39:37,039 + 이미지에 하나의 컬럼 수용 필드가 지금이 행렬은 하나가있다 + +519 +00:39:37,039 --> 00:39:42,679 + 하나의 각 행은 우리가 쉽게이 모든 계산할 수 해주기 때문에 다른 무게 + +520 +00:39:42,679 --> 00:39:49,069 + 내부의 제품은 한 번에 하나의 행렬 곱 나는 사과 + +521 +00:39:49,070 --> 00:39:52,809 + 이러한 차원의 아마 교체해야 밖으로 작동하지 않는 것은 더 만드는 것입니다 + +522 +00:39:52,809 --> 00:39:59,219 + 분명하지만 난 당신이 아이디어를 얻을 생각 때문에이이 최종 결과에 의해 싶게를 제공하는 + +523 +00:39:59,219 --> 00:40:03,659 + 그 D 출력 필터의 우리의 수를하고, n은 모두 받아 들일입니다 + +524 +00:40:03,659 --> 00:40:07,469 + 이미지 필드의 위치는 다음이 걸릴 비슷한 여행을 재생 + +525 +00:40:07,469 --> 00:40:13,000 + 실제로이 참을 수있는 실내 3D 전채로 모양을 변경 + +526 +00:40:13,000 --> 00:40:16,219 + 아주 쉽게 당신이 이들의 미니 배치가있는 경우 너무 많은 배치 + +527 +00:40:16,219 --> 00:40:24,099 + 요소는 당신은 행의 한 세트 나 다시 요소이 당 더 많은 행과 방법을 추가 + +528 +00:40:24,099 --> 00:40:28,589 + 실제로 그래 그렇게 구현하는 매우 간단합니다 + +529 +00:40:28,590 --> 00:40:35,090 + 그 것을 - 그 다음 구현 오른쪽에 달려 있지만, 다음의 따라 달라집니다 + +530 +00:40:35,090 --> 00:40:39,910 + 그 때로는 같은 메모리 레이아웃 및 물건 같은 것들에 대해 걱정할 필요가 + +531 +00:40:39,909 --> 00:40:45,099 + 당신도 당신이 병렬로 그것을 할 수있는 GPU에서 그 모양 변경 작업을 수행하지만, + +532 +00:40:45,099 --> 00:40:50,089 + 그래서 이것은 정말 쉬운 사례 연구로는 너무 많이 구현하는 등의 경우 경우 경우 + +533 +00:40:50,090 --> 00:40:53,470 + 사용 가능한 컨벌루션 기술이 없어 하나를 구현해야 + +534 +00:40:53,469 --> 00:40:57,869 + 이것은 아마도 선택할 수있는 하나 통과하면 실제 카페를 보면 + +535 +00:40:57,869 --> 00:41:01,119 + 카페의 이전 버전이 그들이 무엇에 사용되는 방법이다 + +536 +00:41:01,119 --> 00:41:07,730 + 기부금이는 GPU 충돌에 대한 회선 앞으로 코드가 있도록 + +537 +00:41:07,730 --> 00:41:12,630 + 당신의 기본 GPU의 컨볼 루션들은으로 전화하는거야이 붉은 덩어리를 볼 수 있습니다 + +538 +00:41:12,630 --> 00:41:18,070 + 전화를 같은 방법이 복용 그들의 입력 영상 권한을 가지고 그들의 + +539 +00:41:18,070 --> 00:41:22,900 + 입력 영상 어딘가에이 그래서 이것은 그들의 의도 된 후 그들은거야 + +540 +00:41:22,900 --> 00:41:27,050 + 이것은이이 방법으로 호출하고이를 저장하는 동일한 전화 재정비 + +541 +00:41:27,050 --> 00:41:33,519 + 그들이 호출 곱 행렬 행렬에 거 가지고있어보다 열 GPU의 tenser + +542 +00:41:33,519 --> 00:41:37,980 + 즉, 그래서 그것은 곱셈 후 바이어스 행렬을 지속 할 수 + +543 +00:41:37,980 --> 00:41:42,840 + 그 그 내가 이러한 일들이 실제로 아주 잘 작동하는 경향이 의미의 방법과 + +544 +00:41:42,840 --> 00:41:45,850 + 당신은 우리가 당신 하나를 준 빠른 레이어를 기억한다면 또 다른 사례 연구가 + +545 +00:41:45,849 --> 00:41:51,500 + 과제 실제로 우리가 실제로 나노 수행 그래서 여기에이 동일한 전략을 사용 + +546 +00:41:51,500 --> 00:41:55,940 + 작업을 호출하는 지금 우리가 실제로 할 수있는 다음 어떤 미친 NumPy와 트릭이었고, + +547 +00:41:55,940 --> 00:42:00,230 + NumPy와 매트릭스에 단일 통화와 FAST 층 내부의 회선 + +548 +00:42:00,230 --> 
00:42:03,900 + 곱셈 당신과이 보통 나에게 몇 가지를 제공합니다 숙제에 서명 + +549 +00:42:03,900 --> 00:42:07,740 + 이 꽤 잘 작동 루프를 사용하는 것보다 더 빨리 백 번 + +550 +00:42:07,739 --> 00:42:18,209 + 그리고는 전화 그에 대해 질문을 구현하는 데 아주 쉽게이다 + +551 +00:42:18,210 --> 00:42:24,949 + 그것에 대해 조금 생각하지만, 당신이 생각하는 경우에 당신이 정말 열심히 생각하면 당신은거야 + +552 +00:42:24,949 --> 00:42:28,219 + 컨볼 루션의 뒤로 패스도 실제로 실현 + +553 +00:42:28,219 --> 00:42:33,358 + 당신이 그것에 대해 생각한다면 당신은 몇 가지 알아 낸 수 컨볼 루션 + +554 +00:42:33,358 --> 00:42:37,269 + 당신의 숙제하지만 이전 버전과는 회선도 실제로의 유형입니다 통과 + +555 +00:42:37,269 --> 00:42:41,070 + 상류 구배를 통해 실제로 비슷한을 사용할 수있는 이상 컨볼 루션 + +556 +00:42:41,070 --> 00:42:45,789 + 담배가 아니라 유일한 트릭을 전달하기위한 이미지의 유형은 메소드를 호출합니다 + +557 +00:42:45,789 --> 00:42:51,259 + 당신이 뒤로 패스 할 후에는 일부 그라디언트에 필요 한 것입니다 + +558 +00:42:51,260 --> 00:42:54,940 + 상류에서 수용 필드를 중복 통해 당신은주의해야하므로 + +559 +00:42:54,940 --> 00:43:02,889 + 통화 팀에 대해 당신이 뒤로 패스에서 호출 팀을 소환 필요 + +560 +00:43:02,889 --> 00:43:06,150 + 숙제는 것을 구현에 실제로 빠른 차선에서 확인하실 수 있습니다 것은 + +561 +00:43:06,150 --> 00:43:11,050 + 너무 실제로 비록 더 숙제를 호출 팀에 빠른 레이어 + +562 +00:43:11,050 --> 00:43:18,910 + 실제로 거기에 내가 충분히 빨리 그것을 얻을 수있는 방법을 찾을 수 없습니다에 눈에 + +563 +00:43:18,909 --> 00:43:22,710 + 때로는 사람들이 회선 사용하고는이 아이디어 또 다른 방법 + +564 +00:43:22,710 --> 00:43:27,400 + 당신이 신호 등으로부터 추억이있는 경우 고속 푸리에 그래서 변환 + +565 +00:43:27,400 --> 00:43:30,700 + 처리 클래스 또는 호출이 일을 기억 수도 같은 + +566 +00:43:30,699 --> 00:43:34,639 + 충족 회선 정리는 두 개의 신호를 가지고 있다면 당신은 당신이 원하는 것을 말한다 + +567 +00:43:34,639 --> 00:43:38,779 + 하나 신중하게 한 후 다른 여자와 계속되어 그들에게 전화 + +568 +00:43:38,780 --> 00:43:44,130 + 이들 두 신호의 콘볼 루션을 복용하면 오히려와 동일 + +569 +00:43:44,130 --> 00:43:47,820 + 회선의 푸리에 변환은 요소 제품과 동일 + +570 +00:43:47,820 --> 00:43:51,859 + 당신은 당신이 밖으로 압축을 푼 가지고 기호를 응시 그래서 만약 푸리에 변환 I + +571 +00:43:51,858 --> 00:43:56,779 + 이 의미가있을 거라 생각하고 또한 경우 다시 신호에서 기억하고 있습니다 + +572 +00:43:56,780 --> 00:44:00,240 + 처리 클래스 또는 알고리즘 클래스 호출이 놀라운 일이있다 + +573 +00:44:00,239 --> 00:44:04,299 + 고속 푸리에 실제로 좋아 푸리에 변환을 계산하기 위해 우리가 할 수 변환 + +574 +00:44:04,300 --> 00:44:08,080 + 역 푸리에 정말 정말 빠른 변환 변환 + +575 +00:44:08,079 --> 00:44:11,679 + 당신은 2D에서 하루에이 버전의 곰 볼 수 있으므로 그들은 모든 것 + +576 +00:44:11,679 --> 00:44:17,129 + 정말 빠른 그래서 우리는 실제로 엄격한 회선을 적용 할 수있는 방법이 너무 + +577 +00:44:17,130 --> 00:44:20,660 + 작동 처음 우리가 고속 푸리에 변환을 사용하여 계산하는거야 것입니다 + +578 +00:44:20,659 --> 00:44:24,899 + 푸리에 푸리에을 계산에도 가중치를 계산하는 변환 + +579 +00:44:24,900 --> 00:44:30,320 + 우리의 활성화지도의 변환 지금 푸리에 공간에서 우리는 단지 요소를 할 + +580 +00:44:30,320 --> 00:44:35,050 + 정말 정말 빠르고 효율적이며 다음 우리가 올 곱셈 + +581 +00:44:35,050 --> 00:44:40,269 + 다시의 패스를 사용하여 상기 역 출력을 변환 할 변환 + +582 +00:44:40,269 --> 00:44:44,420 + 그 요소 제품의이에 우리를 위해 회선을 구현 + +583 +00:44:44,420 --> 00:44:52,550 + 멋진 영리한 방법 좀 시원하고이 실제로 사용하고 몇몇 사람에 직면하고있다 + +584 +00:44:52,550 --> 00:44:55,940 + 페이스 북이 작년에 관한 논문을했다 그리고 그들은 실제로 출시 것을 + +585 +00:44:55,940 --> 00:44:57,650 + GPU 라이브러리는이 작업을 수행하는 + +586 +00:44:57,650 --> 00:45:03,329 + 이 일을 계산하지만, 이러한 푸리에에 대한 슬픈 일이 변환이 + +587 +00:45:03,329 --> 00:45:07,819 + 그들은 실제로 당신에게 정말 다른 방법하지만 통해 정말 큰 속도 향상을 제공 + +588 +00:45:07,820 --> 00:45:11,970 + 당신은이 작은 3 × 3에 최선을 다하고 네 개의 큰 바위와 때 + +589 +00:45:11,969 --> 00:45:15,829 + 푸리에 변환을 계산하는 오버 헤드를 향해 바로 변환 필터 + +590 +00:45:15,829 --> 00:45:20,449 + 입력 화소 공간에서 직접 연산을하는 연산 + +591 +00:45:20,449 --> 00:45:25,579 + 우리가 강의에 앞서 이야기로 작은 기여는 + +592 +00:45:25,579 --> 00:45:30,389 + 그것은을, 그래서 많은 이유에 대해 정말 정말 멋지고 매력과 큰 + +593 +00:45:30,389 --> 00:45:33,489 + 트릭이 너무 잘 영향을 작동하지 않습니다 수치의 조금 + +594 +00:45:33,489 --> 00:45:38,439 + 우리하지만 어떤 이유로 경우 당신은 정말 큰 기여를 계산하고 싶어 + +595 +00:45:38,440 --> 00:45:46,019 + 이 그래 당신이 시도 할 수있는 일입니다 + +596 +00:45:46,019 --> 
+596
+00:45:46,019 --> 00:46:04,639
+ [responding to a question] I imagine that gets quite involved, but
+ yes - if you think about it, that is probably the problem.
+
+598
+00:46:04,639 --> 00:46:17,430
+ One more thing to point out, which kind of balances out these
+ Fourier transform convolutions: they don't handle strided
+ convolutions very well.
+
+600
+00:46:17,429 --> 00:46:28,489
+ When you compute a strided convolution the normal way, in the input
+ space, you only compute a small subset of those dot products, so
+ striding actually saves you a lot of computation
+
+603
+00:46:28,489 --> 00:46:43,180
+ when you do the convolution directly in the input space; but the way
+ you tend to implement strided convolutions in Fourier space, you
+ just compute the whole thing and throw away part of the data, so it
+ ends up not being very efficient.
+
+606
+00:46:43,179 --> 00:46:55,909
+ There's another trick that isn't really widely known yet, but I
+ think it's really cool, so I wanted to talk about it. You may
+ remember something from an algorithms class called Strassen's
+ algorithm.
+
+609
+00:46:55,909 --> 00:47:08,630
+ The idea is that when you do a naive matrix multiplication of N by N
+ matrices, if you count up the multiplies and adds you have to do, it
+ takes on the order of
+
+612
+00:47:08,630 --> 00:47:22,289
+ N cubed operations; and Strassen's algorithm is this really crazy
+ thing where we compute all these crazy intermediates, and somehow it
+ magically works out to compute the output asymptotically faster than
+ the naive method.
+
+615
+00:47:22,289 --> 00:47:35,110
+ And we just saw that convolution can be implemented as matrix
+ multiplication, so intuitively you'd expect that similar kinds of
+ tricks could theoretically be applied to
+
+618
+00:47:35,110 --> 00:47:46,370
+ convolution. It turns out they can. There's a really cool paper that
+ came out over the summer where these two guys worked out, very
+ explicitly,
+
+620
+00:47:46,369 --> 00:47:58,539
+ something of this flavor for the special case of 3 by 3
+ convolutions. I'm obviously not going to go into the details here,
+ but it has a similar flavor to Strassen:
+
+623
+00:47:58,539 --> 00:48:08,220
+ very clever computation of intermediates that ends up saving a lot
+ of the work. And these guys are really intense - they're not just
+ mathematicians:
+
+625
+00:48:08,219 --> 00:48:17,570
+ they also wrote very highly optimized CUDA kernels for computing
+ this stuff, and they were able to cut VGG's time by a factor of two,
+ which is really, really
+
+627
+00:48:17,570 --> 00:48:26,019
+ impressive. So I think these types of tricks could become quite
+ popular in the future; right now I don't think they're very widely
+
+629
+00:48:26,019 --> 00:48:35,010
+ used, but the numbers are crazy, especially at small batch sizes,
+ where they're getting up to a six times speedup on VGG, which I
+ find really impressive.
+
+631
+00:48:35,010 --> 00:48:45,850
+ I think it's a really cool approach. The downside is that you have
+ to work out these explicit special cases for each different
+ convolution size; but if we mostly care about 3 by 3 convolutions,
+ maybe that's not a big deal.
+
+634
+00:48:45,849 --> 00:48:58,579
+ So, to wrap up computing convolutions: the quick, easy,
+ fast-and-dirty way to implement these things is to call into
+
+636
+00:48:58,579 --> 00:49:06,609
+ matrix multiplication; it's not too hard to implement, so if for
+ some reason you really need to implement convolution
+
+638
+00:49:06,610 --> 00:49:15,230
+ yourself, I'd really recommend that one. FFT is something from
+ signal processing that you'd think is really cool and really
+ useful, but
+
+640
+00:49:15,230 --> 00:49:25,440
+ it turns out it only gives speedups for big filters, not the small
+ ones you'd hope; still, it's useful to know about. And these fast,
+ Strassen-like algorithms are really cool;
+
+643
+00:49:25,440 --> 00:49:29,650
+ hopefully code for them will soon just exist somewhere in the world
+ for the filters we care about.
+
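For a feel of the Strassen-flavored convolution algorithms mentioned above, here is the smallest 1D case of that family, often written F(2,3): two outputs of a 3-tap filter from four inputs using 4 multiplies instead of 6. This is a sketch of the general construction, not the paper's optimized CUDA code:

~~~python
import numpy as np

d = np.random.randn(4)  # four input values
g = np.random.randn(3)  # 3-tap filter

# Four multiplies of cleverly chosen intermediates...
m1 = (d[0] - d[2]) * g[0]
m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
m4 = (d[1] - d[3]) * g[2]

# ...recombine into the two outputs of a stride-1 "valid" filtering
y = np.array([m1 + m2 + m3, m2 - m3 - m4])

ref = np.array([d[0:3] @ g, d[1:4] @ g])  # naive version: 6 multiplies
print(np.allclose(y, ref))                # True
~~~

The filter-dependent factors can be precomputed once per filter, so in a conv layer the per-position saving in multiplies is what drives the speedup.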
+644
+00:49:29,650 --> 00:49:41,529
+ So hopefully these things will catch on and become more widely
+ used. Are there any questions about computing convolutions?
+
+646
+00:49:41,530 --> 00:49:55,710
+ Okay. So next we're going to talk about some implementation
+ details. First question: how many of you guys have ever built your
+ own computer?
+
+648
+00:49:55,710 --> 00:50:07,869
+ Okay, so many of you can probably answer this - and no peeking at
+ the next slide: can anyone spot the CPU in this picture?
+
+650
+00:50:07,869 --> 00:50:22,179
+ The CPU is this little guy right here - and actually a lot of that
+ thing is cooler; the CPU itself is the tiny part inside of it.
+
+652
+00:50:22,179 --> 00:50:38,320
+ And then can anyone spot the GPU? Yeah - it's this GeForce thing
+ over here, under its cooling heatsink.
+
+654
+00:50:38,320 --> 00:50:48,679
+ It's much, much bigger. You might say the CPU is the more powerful
+ one, I know, but the GPU is taking up a lot more space, at least in
+ this case,
+
+656
+00:50:48,679 --> 00:50:57,029
+ which is a sign that something interesting is going on. I have
+ another question: how many of you play video games?
+
+658
+00:50:57,030 --> 00:51:09,809
+ Okay, so you probably have opinions about this. It turns out a lot
+ of people in machine learning and deep learning have really strong
+ opinions too, and most
+
+660
+00:51:09,809 --> 00:51:21,179
+ people are on the NVIDIA side: NVIDIA is actually much more widely
+ used than AMD, and the reason is that
+
+662
+00:51:21,179 --> 00:51:34,769
+ NVIDIA has been doing a lot over the last few years to dive really
+ deep into deep learning and make it a core part of their focus. As
+ a cool example: at GTC last year -
+
+665
+00:51:34,769 --> 00:51:44,230
+ GTC is NVIDIA's big annual conference where new products get
+ announced - NVIDIA's CEO, Jensen Huang, who is actually also a
+ Stanford alum,
+
+667
+00:51:44,230 --> 00:51:56,800
+ introduced their latest, most amazing new GPU, the Titan X, their
+ flagship thing, and the benchmark he used to sell it was how fast
+ it trains AlexNet. This was crazy to me:
+
+670
+00:51:56,800 --> 00:52:07,890
+ it was a huge room with hundreds and hundreds of people, this giant
+ highly polished presentation with journalists, and the CEO of
+
+672
+00:52:07,889 --> 00:52:15,300
+ NVIDIA is up there talking about AlexNet and convolutions. I
+ thought that was really exciting, and it kind of shows how much
+ NVIDIA cares about getting this stuff working and how much effort
+ they're pushing into it.
+
+675
+00:52:15,300 --> 00:52:26,900
+ So, to give you the idea: CPUs are really good at fast, sequential
+ processing. They tend to have a small
+
+677
+00:52:26,900 --> 00:52:36,920
+ number of cores - your laptop probably has somewhere between one
+ and four, and the big server things can have up to sixteen -
+
+679
+00:52:36,920 --> 00:52:49,759
+ and they're really good at computing things really, really fast, in
+ order. GPUs, on the other hand, tend to have many, many, many more
+ cores - the big ones can go up to a few thousand - but each
+
+682
+00:52:49,760 --> 00:53:05,230
+ core runs at a lower clock speed and can do less per instruction
+ cycle. GPUs were originally developed for processing graphics -
+ graphics processing units - so they're really good at
+
+685
+00:53:05,230 --> 00:53:19,590
+ these kinds of highly parallel tasks where you want to do many,
+ many things independently, in parallel. They were designed for
+ computer graphics, but since then they've evolved into more general
+ computing platforms,
+
+688
+00:53:19,590 --> 00:53:33,509
+ and there are different frameworks that let you write general code
+ that runs directly on the GPU. From NVIDIA we have this framework
+ CUDA, which lets you write a variant of C whose code runs directly
+
+691
+00:53:33,510 --> 00:53:43,569
+ on the GPU. There's a similar framework called OpenCL that works on
+ pretty much every computing platform; it's meant to be an open
+ standard, which is
+
+693
+00:53:43,570 --> 00:53:52,559
+ nice - it's great that OpenCL runs everywhere - but in practice it
+ tends to be a bit less performant and to have fewer nice libraries
+ and less
+
+695
+00:53:52,559 --> 00:54:01,309
+ support, so at least for deep learning, most people use CUDA
+ instead. If you're actually interested in
+
+696
+00:54:01,309 --> 00:54:09,409
+ learning how to write GPU code yourself, there's a really fun
+ Udacity course on it - I think it's pretty cool; you write code
+ that runs things on the GPU for all the assignments. But if
+
+699
+00:54:09,409 --> 00:54:20,139
+ what you want is to come in, train nets, do research, and that kind
+ of thing, you typically end up not writing this code yourself and
+ just relying on external libraries.
+
+702
+00:54:20,139 --> 00:54:38,599
+ Right - I like this one, it's so cute. So, one thing that GPUs are
+ really, really good at is
+
+704
+00:54:38,599 --> 00:54:49,550
+ matrix multiplication. Here's a benchmark - I mean, it's from
+ NVIDIA's website, so it's a little biased - showing matrix
+ multiplication
+
+706
+00:54:49,550 --> 00:55:04,000
+ time as a function of matrix size: a pretty beefy CPU - a very
+ healthy server-grade CPU - running the same
+
+709
+00:55:04,000 --> 00:55:15,119
+ multiplication as a pretty beefy GPU, something like a Tesla K40,
+ and the GPU is much, much faster - I mean, that's no big surprise.
+
+710
+00:55:15,119 --> 00:55:26,139
+ GPUs are also really good at convolutions: NVIDIA released cuDNN, a
+ library of kernels specially optimized for convolution, and
+ compared to the CPU it is WAY faster.
+
+713
+00:55:26,139 --> 00:55:41,030
+ This also compares calling cuDNN convolutions against Caffe's own
+ GPU convolutions. I think this graph is actually for the first
+ cuDNN version - a new version came out just a few weeks ago - but
+ this was
+
+716
+00:55:41,030 --> 00:55:54,769
+ the benchmark that was around, against an older version; it's
+ gotten a lot faster since then, maybe another factor of two or
+ something.
+
+719
+00:55:54,769 --> 00:56:09,429
+ cuDNN is a C library that abstracts the GPU away: if you have a
+ tensor sitting in memory, you can just pass pointers into the
+ library, and it will
+
+722
+00:56:09,429 --> 00:56:19,440
+ go off, run the convolution on the GPU - probably asynchronously -
+ and return the results. Frameworks like Caffe and Torch have all
+ integrated cuDNN
+
+724
+00:56:19,440 --> 00:56:30,340
+ into their stacks now, so you can take advantage of these efficient
+ routines from any of these frameworks. But the problem is that even
+ once we have these
+
+726
+00:56:30,340 --> 00:56:39,409
+ really powerful GPUs, training really big models is still kind of
+ slow: VGG famously took something like two to three weeks to train
+
+728
+00:56:39,409 --> 00:56:47,280
+ on a Titan Black, which is not a cheap card. And recently -
+
+729
+00:56:47,280 --> 00:56:56,400
+ this is really cool - there's a really nice blog post describing
+ actually retraining ResNet, the hundred-and-one-layer model, and
+ that also took about two weeks of training.
+
+732
+00:56:56,400 --> 00:57:08,269
+ So that's not great, and one way people deal with it - the easy way
+ - is to split training across multiple GPUs.
+
+734
+00:57:08,269 --> 00:57:17,679
+ Normally, especially for something like VGG, which takes a lot of
+ memory, you can't fit a very large mini-batch size on
+
+736
+00:57:17,679 --> 00:57:24,700
+ a single GPU, so your batch might be something like 128 images,
+ something like that. So what you do is
+
+738
+00:57:24,699 --> 00:57:35,190
+ split it into equal pieces, say four; each GPU computes the forward
+ and backward pass over its part of the mini-batch and computes the
+ parameter gradients;
+
+740
+00:57:35,190 --> 00:57:44,548
+ then the gradients are summed across the GPUs, and you make a
+ single update to all the weights of the model. That's the really
+ simple way, and it's
+
+742
+00:57:44,548 --> 00:57:53,599
+ how data parallelism across GPUs tends to be implemented. Yeah?
+
+743
+00:57:53,599 --> 00:58:03,039
+ [question] Yeah - they claim they can automate this process and
+ distribute really, really efficiently, which I think is really
+ exciting, but
+
+745
+00:58:03,039 --> 00:58:14,070
+ I haven't played with it much myself. Torch at least has data
+ parallelism that you can just drop in, and it can use this kind of
+ parallelism automatically, very easily.
+
+748
+00:58:14,070 --> 00:58:21,279
+ A slightly more complex idea for multi-GPU training actually comes
+ from Alex - of AlexNet fame.
+
+750
+00:58:21,280 --> 00:58:31,409
+ It's a funny-titled paper - kind of a cool thing, I think. The idea
+ is that we actually do data parallelism on the lower layers:
+
+752
+00:58:31,409 --> 00:58:42,059
+ the lower layers take our mini-batch of images, split it across the
+ two GPUs, and GPU one computes the convolutions for the first part
+ of
+
+754
+00:58:42,059 --> 00:58:50,760
+ the mini-batch; the convolutional part is just distributed evenly
+ across the GPUs. But once you get to the fully connected
+
+757
+00:58:50,760 --> 00:59:02,869
+ layers, he found it's actually more efficient, for these really big
+ matrix multiplies, to have the GPUs work on each multiplication
+ together. It's a cool kind of trick that isn't
+
+760
+00:59:02,869 --> 00:59:13,800
+ commonly used, but I thought it was fun to mention. A different
+ idea comes from Google: before TensorFlow, they had this thing
+ called
+
+762
+00:59:13,800 --> 00:59:22,630
+ DistBelief, their previous system, which was entirely CPU-based.
+ From the benchmarks a few slides back you might imagine that would
+ be
+
+764
+00:59:22,630 --> 00:59:30,800
+ really slow, but the first version of GoogLeNet was actually
+ trained entirely on CPUs; so they did an enormous amount of work to
+ get
+
+766
+00:59:30,800 --> 00:59:39,530
+ training distributed across CPUs. There's this cool paper from Jeff
+ Dean and others from a few years ago that describes it in more
+ detail.
+
+768
+00:59:39,530 --> 00:59:48,710
+ You can use data parallelism, where each machine has an independent
+ copy of the model and computes a forward and backward pass on its
+ own chunk of data,
+
+770
+00:59:48,710 --> 01:00:01,209
+ but now there's this parameter server that actually stores the
+ parameters of the model, and these independent workers communicate
+ with the parameter server to make updates to the model.
+
+773
+01:00:01,210 --> 01:00:09,879
+ You can contrast this with model parallelism, where you take one
+ model and different workers compute different parts of it.
+
+775
+01:00:09,880 --> 01:00:18,110
+ With DistBelief they did a really, really good job of optimizing
+ this to work well across many, many CPUs and many, many machines;
+
+777
+01:00:18,110 --> 01:00:28,639
+ now, hopefully, TensorFlow should do these kinds of things more
+ automatically. Once you're making these distributed updates,
+ there's
+
+779
+01:00:28,639 --> 01:00:39,299
+ a distinction between asynchronous SGD and synchronous SGD.
+ Synchronous SGD is the naive thing:
+
+781
+01:00:39,300 --> 01:00:55,029
+ you take a mini-batch, split it across a number of workers, each
+ worker runs forward and backward and computes a gradient, then you
+ add up all the gradients and make a single model update. This
+ exactly simulates just computing a bigger mini-batch on one
+ machine.
+
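The synchronous data-parallel scheme just described can be sketched in a few lines; here each "worker" is simulated by computing gradients on its shard of the mini-batch, with a hypothetical `loss_grad` standing in for the forward/backward pass:

~~~python
import numpy as np

def loss_grad(w, x, y):
    # hypothetical per-shard gradient of a linear regression loss
    return 2 * x.T @ (x @ w - y) / len(y)

w = np.zeros(10)
x, y = np.random.randn(128, 10), np.random.randn(128)

# Synchronous data parallelism: split the batch across 4 workers,
# compute gradients independently, then average for ONE model update.
shards = zip(np.split(x, 4), np.split(y, 4))
grads = [loss_grad(w, xs, ys) for xs, ys in shards]
w -= 1e-2 * np.mean(grads, axis=0)  # identical to one big-batch update
~~~

Because the shards are equal-sized, averaging the shard gradients reproduces the full-batch gradient exactly; the only cost is waiting for the slowest worker at each step, which is the synchronization overhead discussed next.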
+785
+01:00:55,030 --> 01:01:03,610
+ But it tends to be kind of slow, because you have to synchronize
+ across the machines. That's not too big a deal with multiple GPUs
+ in a single node, but when
+
+787
+01:01:03,610 --> 01:01:12,569
+ you're distributed across many, many machines, that synchronization
+ can actually get pretty expensive. So instead they have this
+
+789
+01:01:12,570 --> 01:01:21,599
+ notion of asynchronous SGD, where each model just kind of makes
+ updates to its own copy of the parameters, and there's some notion
+ of
+
+791
+01:01:21,599 --> 01:01:29,530
+ eventual consistency: they synchronize with each other
+ periodically, sometimes. It seems really complicated and hard to
+ debug, but they got it
+
+793
+01:01:29,530 --> 01:01:39,430
+ to work, which is pretty cool. It's one of the really cool figures
+ - both of these figures are from the TensorFlow paper - and one
+
+795
+01:01:39,429 --> 01:01:46,510
+ of the pitches of TensorFlow is that this kind of distribution
+ should be much more transparent to the user:
+
+797
+01:01:46,510 --> 01:01:54,840
+ if you happen to have access to a big cluster of GPUs and CPUs and
+ whatnot, TensorFlow should be able to figure out the best way to do
+ this kind of
+
+799
+01:01:54,840 --> 01:02:03,399
+ distribution, combining data and model parallelism, and do it all
+ for you. I think that's a really exciting part of
+
+801
+01:02:03,400 --> 01:02:11,050
+ TensorFlow. Any questions about distributed training? Yeah -
+
+802
+01:02:11,050 --> 01:02:16,120
+ - and CNTK, I haven't taken a look at it yet.
+
+803
+01:02:16,119 --> 01:02:27,500
+ Okay, so next: there are a few bottlenecks you should be aware of
+ in practice. Generally, when training these things, the
+
+805
+01:02:27,500 --> 01:02:34,840
+ distributed stuff is nice and big, but you can actually go a really
+ long way with just a single GPU in one single machine, and there
+ are a lot of bottlenecks that
+
+807
+01:02:34,840 --> 01:02:44,759
+ can get in the way of that. One is the communication between the
+ GPU and the CPU. In many cases, actually, the
+
+809
+01:02:44,760 --> 01:02:55,719
+ most expensive part of the pipeline is copying data onto the GPU
+ and then copying it back off: once things are on the GPU you can
+ compute really fast and efficiently, but the copying is the
+
+812
+01:02:55,719 --> 01:03:06,570
+ really slow part. So one idea is to make sure you avoid memory
+ copies. Something you'll sometimes see is code that, at every layer
+ of the network,
+
+814
+01:03:06,570 --> 01:03:17,159
+ copies back and forth between GPU and CPU, which is really
+ inefficient and can slow you down; ideally you do the entire
+ forward and backward pass in one go, all running down on the GPU.
+
+817
+01:03:17,159 --> 01:03:28,690
+ Another thing you'll sometimes see is a multi-threaded approach,
+ where a CPU thread prefetches data from disk or memory in the
+
+820
+01:03:28,690 --> 01:03:37,470
+ background, maybe does augmentation on the fly; this background CPU
+ thread prepares mini-batches for you, while the main thread
+ dispatches work
+
+822
+01:03:37,469 --> 01:03:48,940
+ to the GPU, so you can overlap the loading and preprocessing of
+ data, and the shipping of mini-batches to the GPU, with actually
+ doing the computation. Getting
+
+825
+01:03:48,940 --> 01:04:04,199
+ all of that to work in a multi-threaded way can involve a bit of
+ subtlety, but it can give you a nice speedup. Caffe in particular
+ already implements this kind of data prefetching for certain types
+ of data storage, I think; in other frameworks you may have to roll
+ your own.
+
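The background-prefetching pattern just described, sketched with a Python thread and a bounded queue (schematic only; `load_and_augment_batch` and `train_step` are hypothetical stand-ins for the disk/augmentation work and the GPU work):

~~~python
import threading
import queue

batches = queue.Queue(maxsize=4)  # bounded buffer of ready mini-batches

def prefetch():
    while True:
        # hypothetical: sequential disk read + decode + augmentation
        batches.put(load_and_augment_batch())

threading.Thread(target=prefetch, daemon=True).start()

for step in range(1000):
    batch = batches.get()  # usually ready immediately: loading overlapped
    train_step(batch)      # hypothetical GPU forward/backward/update
~~~

The bounded queue is the key design choice: it lets the loader run ahead of the GPU by a few batches without buffering the whole dataset in memory.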
+829
+01:04:04,199 --> 01:04:17,820
+ Another bottleneck to watch is the disk. Hard disks are kind of
+ slow: they're cheap and they're big, but they're not that fast;
+
+831
+01:04:17,820 --> 01:04:30,590
+ these days solid state drives are a lot more common than hard disks
+ - the catch is that they're smaller and cost more - but they really
+ are a lot faster. What
+
+834
+01:04:30,590 --> 01:04:44,108
+ hard disks and solid state drives have in common, though, is that
+ they work best when you read data off the disk sequentially; doing
+ lots of seeks is really bad. So one
+
+837
+01:04:44,108 --> 01:04:56,619
+ thing you might have is a big folder full of JPEG images, and each
+ of those images can be located on a different part of the disk, so
+ reading each individual JPEG involves a
+
+840
+01:04:56,619 --> 01:05:05,079
+ random seek, and once you've read it you still have to decompress
+ the JPEG into pixels - all very inefficient. So what
+
+842
+01:05:05,079 --> 01:05:15,940
+ you'll see a lot in practice is to preprocess the data: take the
+ raw decompressed pixels and write the whole dataset into one
+
+844
+01:05:15,940 --> 01:05:27,400
+ giant contiguous file on disk. This takes a lot of disk space, but
+ we do it anyway, because then everything is nice and sequential. In
+ Caffe the commonly used format for this is LMDB, or LevelDB;
+
+846
+01:05:27,400 --> 01:05:39,280
+ I've also used HDF5 for this. But the idea is that you want your
+ data already laid out sequentially on disk,
+
+848
+01:05:39,280 --> 01:05:50,679
+ as raw pixels. Then at training time, if you can't store all the
+ data in memory, you have to read from disk, and you want those
+ reads to be fast - again with some clever amount of prefetching and
+ multi-threaded stuff -
+
+851
+01:05:50,679 --> 01:05:57,460
+ so that you can be pulling from disk while the rest of the
+ computation happens in the background.
+
+853
+01:05:57,460 --> 01:06:10,559
+ Another thing to keep in mind is the GPU memory bottleneck. The big
+ GPUs have a lot of memory, but only so much: the biggest GPU you
+ can
+
+855
+01:06:10,559 --> 01:06:18,139
+ buy right now, the Titan X, has 12 gigabytes of memory, and that's
+ pretty much as big as you're going to get right now;
+
+857
+01:06:18,139 --> 01:06:26,989
+ the next generation should be bigger. But you can actually crash
+ into this limit without too much trouble, especially when you're
+ training something like VGG, or
+
+859
+01:06:26,989 --> 01:06:35,598
+ if you have a recurrent network running over very, very, very long
+ sequences - it's actually not too hard to crash into this memory
+ limit, so it's something to keep
+
+862
+01:06:35,599 --> 01:06:47,068
+ in mind when you're training these things. This is part of the
+ appeal of those efficient convolution architectures: making the
+ architecture clever actually helps with memory, in addition to
+ letting you have bigger, more powerful models -
+
+865
+01:06:47,068 --> 01:06:58,129
+ if you use less memory, you can train things faster, use bigger
+ mini-batches, and everything is nice. Just
+
+867
+01:06:58,130 --> 01:07:09,469
+ for a sense of scale: AlexNet is very small compared to a lot of
+ models that are state of the art now, but AlexNet with batches of
+ 256 already takes about 3 gigabytes of memory, so
+
+870
+01:07:09,469 --> 01:07:20,978
+ once you have these bigger networks, it's actually not too hard to
+ hit that 12 gigabyte limit. The other thing we should talk about is
+ floating point precision. When I'm writing
+
+872
+01:07:20,978 --> 01:07:32,889
+ code, a lot of the time I like to imagine that these things are
+ just real numbers and they just work; but in practice that's not
+ true - you have to think about things like how many bits your
+ floating point types use.
+
+875
+01:07:32,889 --> 01:07:52,710
+ A lot of the numeric code you might write uses double precision -
+ 64 bits - by default; but what's generally used in deep learning is
+ this idea of single precision: 32 bits.
+
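The precision point is easy to see in NumPy, where float64 is the default: the same matrix multiply in float32 uses half the memory and is typically noticeably faster (a small illustration; sizes are arbitrary):

~~~python
import numpy as np

a64 = np.random.randn(2048, 2048)   # float64 by default: 32 MiB
a32 = a64.astype(np.float32)        # same values in 16 MiB

print(a64.nbytes // 2**20, a32.nbytes // 2**20)  # 32 16

b64 = a64 @ a64   # dispatched to double-precision BLAS (dgemm)
b32 = a32 @ a32   # single-precision BLAS (sgemm), usually ~2x faster
~~~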
+879
+01:07:52,710 --> 01:08:05,210
+ The idea is that if each number takes fewer bits, you can store
+ more numbers in the same amount of memory, which is nice; and with
+ fewer bits, you also need less computation to operate on those
+ numbers. So generally we
+
+882
+01:08:05,210 --> 01:08:16,960
+ like smaller data types, because they're faster to compute with and
+ don't waste memory. As a case study, this actually even came up in
+ the homework, and you may have noticed it:
+
+885
+01:08:16,960 --> 01:08:28,670
+ the default NumPy data type is 64-bit double precision, but in all
+ of the models we gave you in the homework, we had things cast to
+ 32-bit floating point numbers. You can
+
+888
+01:08:28,670 --> 01:08:39,670
+ actually go back in the homework and try switching between the two;
+ you'll see that switching to 32-bit actually gives you a decent
+ speedup.
+
+890
+01:08:39,670 --> 01:08:52,199
+ So the obvious next question: if 32 bits is better than 64 bits,
+ maybe we can use even less.
+
+892
+01:08:52,199 --> 01:09:05,420
+ Right - 16 bits. It turns out this can work okay. There is in fact
+ a standard for 16-bit
+
+894
+01:09:05,420 --> 01:09:17,199
+ floating point, sometimes called half precision, and cuDNN -
+ actually its latest version - supports computing things in half
+ precision, which is cool. There are
+
+896
+01:09:17,199 --> 01:09:28,350
+ existing implementations from a company called Nervana - these are
+ their 16-bit kernels - that are the fastest convolutions out there
+ right now. There's a nice
+
+899
+01:09:28,350 --> 01:09:38,319
+ GitHub project with benchmarks surveying the different kinds of
+ convolution kernels and frameworks and everything, and pretty much
+ all of the winners of all those benchmarks right now are these
+ 16-bit floating point
+
+902
+01:09:38,319 --> 01:09:47,479
+ Nervana kernels, which is not surprising, because with fewer bits
+ you can compute faster. But it's not yet the case that
+
+904
+01:09:47,479 --> 01:09:57,299
+ frameworks like Caffe or Torch have support for doing their
+ computation in 16 bits; that should be coming soon, though. But the
+ problem is, even if we
+
+906
+01:09:57,300 --> 01:10:05,880
+ can compute - and it's pretty obvious that if you only have 16-bit
+ numbers you can compute with them very fast - once you get down to
+ 16 bits you might
+
+908
+01:10:05,880 --> 01:10:13,550
+ actually worry about numeric precision, because two to the sixteen
+ is not that big a number anymore: there aren't actually that many
+ distinct values you
+
+910
+01:10:13,550 --> 01:10:20,360
+ can even represent. So there's a paper from a couple of years ago
+ that did
+
+911
+01:10:20,359 --> 01:10:28,710
+ some experiments with low-precision floating point - actually, in
+ their experiments they used a fixed point
+
+913
+01:10:28,710 --> 01:10:38,659
+ implementation - and they found that a kind of naive implementation
+ of these low-precision networks
+
+915
+01:10:38,659 --> 01:10:46,710
+ had a hard time converging, probably because with these
+ low-precision numbers, errors accumulate over many rounds of
+ multiplication and
+
+917
+01:10:46,710 --> 01:10:55,200
+ whatnot. But they found a simple trick that fixed it: this idea of
+ stochastic rounding. Their parameters
+
+920
+01:10:55,199 --> 01:11:03,269
+ and activations are stored in 16 bits, but when they do a
+ multiplication or an addition, they convert to slightly higher
+ precision floating point values, do the operation,
+
+922
+01:11:03,270 --> 01:11:17,549
+ and then, when casting back down to low precision, they round
+ stochastically: not rounding to the nearest number, but rounding up
+ or down at random, with probabilities depending on how close you
+ are.
+
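A sketch of the stochastic rounding trick just described, on a toy fixed-point grid (my own minimal illustration, not the paper's code; the step size is an arbitrary assumption):

~~~python
import numpy as np

def stochastic_round(x, step=2.0**-8):
    """Round x to a multiple of `step`, up or down with probability
    proportional to proximity, so the rounding is unbiased on average."""
    scaled = x / step
    lo = np.floor(scaled)
    p_up = scaled - lo                               # distance past lower grid point
    return (lo + (np.random.rand(*x.shape) < p_up)) * step

x = np.full(10000, 0.30)
print(stochastic_round(x).mean())   # ~0.30: unbiased, unlike round-to-nearest
~~~

Round-to-nearest would map every one of these values to the same grid point, introducing a systematic bias that accumulates over many updates; the stochastic version keeps the expected value correct, which is what rescues convergence.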
+926
+01:11:17,550 --> 01:11:26,710
+ They found that in practice this tends to work better. For example,
+ when they used these 16-bit fixed point numbers - a couple of bits
+ for the integer part
+
+928
+01:11:26,710 --> 01:11:35,239
+ and the rest for the fractional part - the networks would not
+ converge when always rounding to nearest,
+
+930
+01:11:35,239 --> 01:11:43,599
+ but when they used this stochastic rounding technique, the networks
+ actually converged quite well,
+
+932
+01:11:43,600 --> 01:11:52,859
+ even with this very low precision arithmetic. So 16 bits is great -
+ but can we go lower?
+
+934
+01:11:52,859 --> 01:12:04,560
+ There's another paper, from 2015, where they got down lower than
+ that: to 10 and
+
+936
+01:12:04,560 --> 01:12:11,359
+ 12 bits. From the previous paper we already had the intuition that
+ maybe, when you use very low precision floating point
+
+938
+01:12:11,359 --> 01:12:22,149
+ numbers, you actually need more precision in some parts of the
+ network and can use lower precision in other parts. In this paper
+ they managed to get away with using 10-bit
+
+940
+01:12:22,149 --> 01:12:34,800
+ values for the activations, and 12 bits for computing the
+ gradients, and they got this thing to work, which is pretty great.
+ But you might think that's the limit - can we go lower?
+
+942
+01:12:34,800 --> 01:12:44,180
+ Yes. So there's a paper that actually came out just last week -
+
+944
+01:12:44,180 --> 01:12:53,539
+ it actually builds on an earlier paper from the same group - and
+ this one is crazy; I was surprised by it. The
+
+946
+01:12:53,539 --> 01:13:02,429
+ idea is to use just one bit for all the activations and weights in
+ the network: everything is either plus one or minus one. That's
+ pretty fast to compute - you don't even really need multiplications
+ anymore, which
+
+948
+01:13:02,430 --> 01:13:11,199
+ is pretty cool. The trick is that in the forward pass, where the
+ weights and activations are all plus or minus one, everything is
+ super,
+
+950
+01:13:11,199 --> 01:13:20,179
+ super fast and efficient; but in the backward pass, they actually
+ compute the gradients using higher precision, and then those
+ higher-
+
+952
+01:13:20,180 --> 01:13:28,059
+ precision gradients are used to make the updates to these
+ single-bit parameters. It's actually a really cool paper, and I'd
+ encourage you to
+
+954
+01:13:28,060 --> 01:13:36,600
+ check it out. The pitch is that at training time you can maybe
+ afford floating point precision, but at test time you want your
+
+956
+01:13:36,600 --> 01:13:45,010
+ network to be all binary and super, super fast. I think it's a
+ really cool idea - I mean, the paper came out two weeks ago, so I
+
+958
+01:13:45,010 --> 01:13:52,199
+ don't think it has caught on yet, but I think it's a really cool
+ thing. So, to summarize the
+
+960
+01:13:52,199 --> 01:14:00,739
+ implementation details: GPUs are much, much faster than CPUs;
+ sometimes people use distributed training - across multiple GPUs in
+ one machine is pretty
+
+962
+01:14:00,739 --> 01:14:10,239
+ common, and if you're Google, you use TensorFlow and distribute
+ across multiple nodes too, maybe; you should be aware of the
+ potential bottlenecks -
+
+964
+01:14:10,239 --> 01:14:19,510
+ between CPU and GPU, between disk and GPU - and also pay attention
+ to GPU memory; and floating point precision may not be the most
+ glamorous thing,
+
+966
+01:14:19,510 --> 01:14:28,639
+ but I think it actually makes a big difference in practice, and
+ maybe binarization will be the next big thing - it's pretty
+ interesting. So, yeah, just to wrap up
+
+968
+01:14:28,640 --> 01:14:37,449
+ everything we talked about today: we talked about data augmentation
+ as a trick to improve training when you have small datasets, and to
+ prevent overfitting;
+
+970
+01:14:37,449 --> 01:14:44,399
+ we talked about transfer learning - initializing from existing
+ models to help training; and we talked about a lot of details on
+
+972
+01:14:44,399 --> 01:14:52,840
+ convolutions - both how to combine them to make models efficient
+ and how to compute them - and we talked about all the
+ implementation details. So I think that's
+
+974
+01:14:52,840 --> 01:15:02,840
+ everything. Any last-minute questions? ... Okay, then I guess we
+ finish a couple of minutes early.
diff --git a/captions/Ko/Lecture12_ko.srt b/captions/Ko/Lecture12_ko.srt
new file mode 100644
index 00000000..70b0b27d
--- /dev/null
+++ b/captions/Ko/Lecture12_ko.srt
@@ -0,0 +1,4344 @@
+1
+00:00:00,000 --> 00:00:10,919
+ Today we're going to go through the four major software packages
+ that people commonly use. But first, a few administrative things.
+
+3
+00:00:10,919 --> 00:00:23,160
+ Hopefully you'll get your milestones back this week. Also remember
+ that the final assignment, assignment 3, is due on Wednesday. How
+ many of you guys are not done yet?
+
+6
+00:00:23,160 --> 00:00:30,870
+ Okay - well then, some of you may have some late nights ahead. Good
+ luck.
+
+7
+00:00:30,870 --> 00:00:42,049
+ One other thing I should point out: if you're actually planning to
+ use Terminal for your project - and I think a lot of you are - make
+ sure that you back up your code and data off the instance,
+
+10
+00:00:42,049 --> 00:00:50,529
+ because every once in a while we've had problems where instances
+ crash randomly; in most cases the Terminal people have been
+
+12
+00:00:50,530 --> 00:00:57,570
+ able to get the data back, but sometimes it takes a few days, and
+
+13
+00:00:57,570 --> 00:01:04,569
+ there have been a few actual cases where people lost data because
+ of these crashes. So if you want to use Terminal, make sure you
+ have some other backup strategy for your
+
+16
+00:01:04,569 --> 00:01:16,049
+ code and data. As I said, today we'll talk about these four
+ software packages commonly used for deep learning: Caffe, Torch,
+ Theano, and
+
+18
+00:01:16,049 --> 00:01:24,179
+ TensorFlow. As a bit of a disclaimer up front: personally, I've
+ worked mostly with Caffe and Torch, so those are the ones I know
+
+21
+00:01:24,180 --> 00:01:35,939
+ best; I'll do my best to give you a good taste of the others as
+ well. So, with that disclaimer thrown out there - first up is
+ Caffe. We saw
+
+23
+00:01:35,939 --> 00:01:47,550
+ Caffe last lecture; it really popped out of this paper from
+ Berkeley - it was originally for re-implementing AlexNet, and doing
+ feature extraction with AlexNet and things like that - and since
+ then it has grown into a really popular, really widely used
+
+26
+00:01:47,549 --> 00:01:56,859
+ package, especially for convolutional neural networks. So Caffe is
+ from Berkeley, and I think many of you have seen it;
+
+28
+00:01:56,859 --> 00:02:04,939
+ one thing about Caffe is that it's mostly written in C++, but you
+ can access the nets from Python and MATLAB and whatnot, which is
+ very useful.
+
+29
+00:02:04,939 --> 00:02:15,289
+ In general, Caffe is really widely used, and it's really, really
+ good if you just want to train a kind of standard feed-forward
+ convolutional network.
+
+32
+00:02:15,289 --> 00:02:26,150
+ Caffe is actually a little different from the other frameworks in
+ this respect: you can train big, powerful models -
+
+34
+00:02:26,150 --> 00:02:33,189
+ for example a ResNet image classification model on ImageNet -
+ without writing any code yourself. Last year you could
+
+35
+00:02:33,189 --> 00:02:41,860
+ actually train a ResNet using Caffe without writing pretty much any
+ code at all, which is amazing. But the most important tip for
+ working with
+
+38
+00:02:41,860 --> 00:02:52,359
+ Caffe is that the documentation is not always up to date, and
+ sometimes it's not complete, so you shouldn't be afraid to just
+ dive in there and read the source
+
+40
+00:02:52,360 --> 00:03:00,270
+ code. It's C++, so hopefully you can read and understand it
+ yourself; it's plain C++ code, pretty well organized, and quite
+ easy to understand.
+
+42
+00:03:00,270 --> 00:03:11,229
+ So if you have any doubts about how things work in Caffe, your best
+ bet is to just go and read the source. The Caffe source code is
+ this huge, big project, with maybe
+
+44
+00:03:11,229 --> 00:03:18,730
+ tens of thousands of lines of code, and it's a little scary to
+ figure out how everything fits together; but there are really four
+ main classes in Caffe that
+
+46
+00:03:18,729 --> 00:03:27,939
+ you need to know about. The first one is the Blob. Blobs store all
+ the data, weights and activations in the network:
+
+48
+00:03:27,939 --> 00:03:43,189
+ your weights are stored in blobs; your data - the pixel values, the
+ labels - is stored in blobs;
+
+51
+00:03:43,189 --> 00:03:51,069
+ and your intermediate activations are stored in blobs. A blob is an
+ N-dimensional tensor, kind of like the arrays you've seen,
+
+53
+00:03:51,069 --> 00:04:02,450
+ and inside, it actually holds four copies of its data. There's the
+ data version of the tensor, which stores the actual raw values,
+
+56
+00:04:02,449 --> 00:04:12,459
+ and there's a parallel thing Caffe calls the diff, which stores the
+ gradients for that data. So that gives you two things, and it's
+ actually four,
+
+58
+00:04:12,459 --> 00:04:21,228
+ because there are CPU and GPU versions of each of those: you have
+ CPU and GPU data, and CPU and GPU diffs - four N-dimensional
+
+60
+00:04:21,228 --> 00:04:30,930
+ tensors in there. The next important class you need to know about
+ is the Layer. A Layer in Caffe is a kind of function,
+
+62
+00:04:30,930 --> 00:04:41,269
+ similar to the layers you wrote: it receives some input blobs -
+ Caffe calls the inputs "bottom" blobs - and produces output blobs,
+ called "top" blobs.
+
+64
+00:04:41,269 --> 00:04:53,759
+ The layer receives pointers to the bottom blobs, which arrive
+ filled with data, and pointers to the top blobs, and on the forward
+ pass it's expected to fill in the values of the data elements of
+ the top blobs.
+
+67
+00:04:53,759 --> 00:05:03,649
+ On the backward pass, the layer computes gradients: it receives the
+ gradients with respect to the tops - the diffs of the top blobs -
+
+69
+00:05:03,649 --> 00:05:12,650
+ and it's expected to fill the diffs of the bottom blobs with the
+ gradients with respect to the bottoms. This is a pretty
+ well-organized abstract
+
+71
+00:05:12,649 --> 00:05:21,139
+ class - you can go look at it; I put a link to the source file here
+ - and there are a lot of classes implementing the different types
+ of layers. But
+
+73
+00:05:21,139 --> 00:05:30,490
+ like I said, Caffe has this problem that there isn't really a good
+ list of all the layer types anywhere; you pretty much just have to
+ look at the code and see what kinds of
+
+75
+00:05:30,490 --> 00:05:40,859
+ .cpp files are in there. The next thing you need to know about is
+ the Net. The Net just combines a number of layers - it's basically
+ a directed acyclic graph of
+
+77
+00:05:40,860 --> 00:05:49,519
+ layers - and it's responsible for running the forward and backward
+ methods of the layers in the correct order. You probably won't have
+ to touch this
+
+79
+00:05:49,519 --> 00:05:56,139
+ yourself, but it's kind of nice to look at the Net class to get a
+ flavor of how everything fits together.
+
+80
+00:05:56,139 --> 00:06:05,288
+ The last class you need to know about is the Solver. We had a thing
+ called a Solver in the homework - that was actually pretty much
+ inspired by the Caffe one.
+
+84
+00:06:05,288 --> 00:06:15,520
+ The Solver's job is to dip into the net: it actually runs the
+ forward and backward passes, makes the parameter updates, and
+ handles checkpointing and resuming from
+
+86
+00:06:15,519 --> 00:06:24,598
+ checkpoints, and all that kind of stuff. In Caffe the Solver is an
+ abstract class, and different update rules are implemented by
+ different subclasses: for
+
+88
+00:06:24,598 --> 00:06:32,438
+ example, there's one for stochastic gradient descent, there's Adam
+ - all those kinds of things again - and you can see what kinds of
+
+90
+00:06:32,439 --> 00:06:40,069
+ options are available. You should look through this source code; it
+ gives a nice overview of how all these pieces fit together. In this
+ figure,
+
+92
+00:06:40,069 --> 00:06:51,038
+ each green box is a blob - the blobs contain the data - and the red
+ boxes are layers, connected together through blobs, and the whole
+ thing gets optimized by the solver.
+
+95
+00:06:51,038 --> 00:07:00,938
+ Caffe also uses this funny thing called protocol buffers, which a
+ lot of you may know - anyone who's been at Google knows about
+ protobuf. A protocol
+
+97
+00:07:00,939 --> 00:07:08,550
+ buffer is a binary, strongly typed format - I think of it kind of
+ like a strongly typed JSON - used very widely inside Google for
+ serializing data and sending it
+
+99
+00:07:08,550 --> 00:07:14,750
+ over the network. So, there are .proto files:
+
+100
+00:07:14,750 --> 00:07:22,819
+ a .proto file defines the schema of your different object types -
+ in this example, a Person has a name and an ID and an email - and
+ this lives
+
+102
+00:07:22,819 --> 00:07:31,490
+ in the .proto file. Given the type definition, instances are
+ actually
+
+104
+00:07:31,490 --> 00:07:40,968
+ human-readable: for example, this is a .prototxt text file giving a
+ name, an ID and an email - it's an instance of this Person type,
+
+106
+00:07:40,968 --> 00:07:49,579
+ saved as a text file. And then there's a protobuf compiler that can
+ generate classes in various programming languages for actually
+ accessing
+
+108
+00:07:49,579 --> 00:08:01,038
+ these data types: you run the protobuf compiler on the .proto file
+ and it generates classes you can import in Java and C++ and Python
+ and
+
+110
+00:08:01,038 --> 00:08:08,270
+ everything. So why am I talking about protobuf? Because Caffe
+ stores pretty much everything using these protocol buffers.
+
+112
+00:08:08,269 --> 00:08:16,008
+ So, like I said, to understand Caffe you need to read the code:
+
+113
+00:08:16,009 --> 00:08:24,470
+ there's this one giant file in Caffe, called caffe.proto, where
+ they define all of the protocol buffer types used in
+
+115
+00:08:24,470 --> 00:08:32,200
+ Caffe. It's a huge file - I think it's a few thousand lines long -
+ but it's actually pretty well documented, and I think it's the most
+ up-to-date
+
+117
+00:08:32,200 --> 00:08:39,629
+ documentation of what the layer types are, what the options for
+ those layers are, and how you specify all the options for solvers
+ and layers and everything.
+
+119
+00:08:39,629 --> 00:08:48,019
+ It's really not bad at all, so I really encourage you to check this
+ file out and read it through; it will answer a lot of questions
+ about how things in Caffe just work.
+
+121
+00:08:48,019 --> 00:08:58,519
+ To give you a taste: on the left, this shows the kind of protocol
+ buffer Caffe uses to represent a
+
+123
+00:08:58,519 --> 00:09:03,970
+ layer, and on the right, the one used to represent a solver - this
+ SolverParameter.
+
+124
+00:09:03,970 --> 00:09:12,409
+ So, for example, the solver parameter contains a reference to the
+ net, and it also includes things like the learning rate, and how
+ often to checkpoint, and
+
+126
+00:09:12,409 --> 00:09:19,549
+ other things like that. So, when you're actually working with
+ Caffe:
+
+127
+00:09:19,549 --> 00:09:27,889
+ it's really nice that you don't need to write code to train models.
+ When working with Caffe you typically follow this four-step
+ process. First, you
+
+129
+00:09:27,889 --> 00:09:40,240
+ convert your data - especially if you just have an image
+ classification problem, you can use one of the existing tools for
+ this and don't need to write any code; the bundled binary will just
+ do it. Second, you define your net
+
+132
+00:09:40,240 --> 00:09:49,509
+ by writing or editing a prototxt file. Third, you define the solver
+ - again a prototxt file that you can work with in a text
+
+134
+00:09:49,509 --> 00:09:57,990
+ editor. Fourth, you pass all of these things to the existing train
+ binary, and it keeps spitting out
+
+135
+00:09:57,990 --> 00:10:02,820
+ checkpointed model files as it trains, which you can then test and
+
+136
+00:10:02,820 --> 00:10:12,110
+ use for other things. So even if you want to train a ResNet on
+ ImageNet, you can train that huge network without writing code -
+ you just follow this simple procedure, which is really cool.
+ Usually the only step that needs code is converting the data.
+
+139
+00:10:12,110 --> 00:10:21,460
+ So: I know we talked a little about HDF5 as a format for storing
+ pixels contiguously on disk and reading them efficiently, but
+
+141
+00:10:21,460 --> 00:10:30,570
+ by default Caffe uses this other file format called LMDB. If you
+ have a bunch of images and a label for each image, you can
+
+143
+00:10:30,570 --> 00:10:42,169
+ call a script that Caffe ships with to convert that entire dataset
+ into one huge LMDB, which it can then use for training. To give you
+ an idea of how it works:
+
+146
+00:10:42,169 --> 00:10:49,959
+ it's really easy - you write a text file with the paths to your
+ images, each followed by its label, you pass it to the script, you
+ wait a few
+
+148
+00:10:49,958 --> 00:10:56,018
+ hours, and your data is sitting in a big, huge LMDB file on disk.
+ If you're working with
+
+149
+00:10:56,019 --> 00:11:06,060
+ something else, like HDF5 or whatnot, you'll probably have to build
+ it yourself. Caffe actually does have a few options for reading
+ data:
+
+151
+00:11:06,059 --> 00:11:14,350
+ it has its own window data layer, it can actually read from HDF5
+ directly, and there's an option for reading things directly from
+ memory, which is especially
+
+153
+00:11:14,350 --> 00:11:22,339
+ useful with the Python interface - at least from my point of view.
+ There are a few other ways of reading data into Caffe, but
+
+155
+00:11:22,339 --> 00:11:30,669
+ LMDB is really the first-class citizen of the Caffe ecosystem, and
+ it's really the easiest way to work, so if you can, you should
+ probably try converting
+
+157
+00:11:30,669 --> 00:11:40,179
+ your data to LMDB. So that was Caffe step one; step two is to
+ define your net, and
+
+158
+00:11:40,179 --> 00:11:48,818
+ like I said, that means writing a big prototxt. Here I've defined
+ this simple model - it's just logistic regression - and you can
+ notice that I didn't
+
+160
+00:11:48,818 --> 00:11:59,278
+ follow my own advice: I'm reading the data from an HDF5 file here.
+ Then there's an inner product layer - that's what Caffe calls
+
+162
+00:11:59,278 --> 00:12:10,399
+ a fully connected layer - whose options tell you the number of
+ outputs and how to initialize the values, and then a softmax loss
+ function, which reads
+
+164
+00:12:10,399 --> 00:12:20,009
+ the labels and the predicted scores and produces the loss. There
+ are a few things to point out about this file. In general, every
+ layer
+
+166
+00:12:20,009 --> 00:12:28,680
+ contains little blobs storing the data and gradients of its
+ weights, and the blobs and the layer itself usually have the same
+ name, which can be a
+
+168
+00:12:28,679 --> 00:12:39,250
+ bit confusing. Right here, this network actually has a weight blob
+ and a bias blob in the fully connected layer, and you'll
+
+170
+00:12:39,250 --> 00:12:44,769
+ find two learning rates - one for the weight and one for the bias -
+ and a regularization setting for each as well.
+
+172
+00:12:44,769 --> 00:12:57,378
+ Another thing to note: you specify the number of output classes as
+ just a number, the outputs of this fully connected layer. And
+ finally, the quick-and-
+
+175
+00:12:57,379 --> 00:13:08,048
+ dirty way to freeze a layer in Caffe is to set its learning rates
+ to zero, for the weight and bias blobs of that layer. Another thing
+ to point
+
+177
+00:13:08,048 --> 00:13:17,110
+ out is that things like ResNet and GoogLeNet can get really, really
+ big really fast, and Caffe doesn't really let you define any kind
+ of
+
+179
+00:13:17,110 --> 00:13:31,219
+ compositionality: for ResNet they just repeat the same pattern over
+ and over again, so the prototxt for ResNet is nearly 7,000 lines
+ long. You could write that by hand, but it's long, so in practice
+ people tend to write little Python scripts that generate these
+ things automatically.
+
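Since prototxt has no compositionality, the scripted-generation workaround just mentioned can be as simple as string templating (a toy sketch, not an official Caffe tool; the layer options shown are abbreviated):

~~~python
# Emit a repeated conv/relu block N times instead of hand-writing
# thousands of prototxt lines.
block = """layer {{
  name: "conv{i}"  type: "Convolution"
  bottom: "{bottom}"  top: "conv{i}"
  convolution_param {{ num_output: 64 kernel_size: 3 pad: 1 }}
}}
layer {{
  name: "relu{i}"  type: "ReLU"
  bottom: "conv{i}"  top: "conv{i}"
}}
"""

bottom = "data"
with open("net.prototxt", "w") as f:
    for i in range(1, 11):
        f.write(block.format(i=i, bottom=bottom))
        bottom = "conv%d" % i   # chain each block onto the previous one
~~~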
+183
+00:13:35,470 --> 00:13:46,509
+ Rather than starting a network totally from scratch, you'll also
+ usually download some existing prototxt and some existing weights
+ file, and work from there. The way to think
+
+186
+00:13:46,509 --> 00:13:54,139
+ about it: the prototxt files we've seen define the architecture of
+ the network, and the trained weights live in this
+
+188
+00:13:54,139 --> 00:14:03,230
+ binary .caffemodel file, which you can't really inspect. The way it
+ basically works is as key-value pairs matched by name:
+
+189
+00:14:03,230 --> 00:14:19,389
+ the weights inside the .caffemodel are stored under the names of
+ the layers they belong to - say, fc7 weights that correspond to a
+ fully connected layer in AlexNet. So when you
+
+192
+00:14:19,389 --> 00:14:29,600
+ want to train on your own data, you start up Caffe, and when you
+ load the model and the prototxt, it
+
+194
+00:14:29,600 --> 00:14:35,008
+ just tries to match up key-value pairs, by name, between the
+ .caffemodel and the prototxt.
+
+196
+00:14:35,009 --> 00:14:43,008
+ Where the names are the same, the new network gets initialized from
+ the values in the
+
+197
+00:14:43,009 --> 00:14:49,230
+ .caffemodel, which is really, really useful and convenient for
+ fine-tuning; but if a layer name doesn't match a layer in the
+ .caffemodel,
+
+199
+00:14:49,230 --> 00:14:57,810
+ that layer gets initialized from scratch, randomly. So, to be a
+ little more concrete about how to read this:
+
+200
+00:14:57,809 --> 00:15:06,289
+ say you've downloaded a model trained on ImageNet; its last fully
+ connected layer has a thousand outputs, for the thousand classes.
+ But
+
+203
+00:15:06,289 --> 00:15:13,149
+ now maybe you want ten outputs, for some problem you care about,
+ and you realize you need to
+
+204
+00:15:13,149 --> 00:15:22,088
+ reinitialize the last layer randomly and fine-tune the network. The
+ way you do that is you
+
+206
+00:15:22,089 --> 00:15:26,890
+ actually change the name of that layer in the prototxt file, to
+ make sure it's
+
+207
+00:15:26,889 --> 00:15:35,419
+ randomly initialized rather than read from the .caffemodel. If you
+ forget to do this, it will actually crash: it gives you a weird
+ error
+
+209
+00:15:35,419 --> 00:15:46,129
+ message about shapes not matching, because it tries to stuff the
+ thousand-dimensional weight matrix into the ten-dimensional thing
+ in the new file, and that doesn't work.
+
+211
+00:15:46,129 --> 00:15:56,620
+ The next step when working with Caffe is to define the solver -
+ again just a prototxt file, and you can see all the options for it.
+ I gave a link to one:
+
+214
+00:15:56,620 --> 00:16:04,809
+ something small, shaped like this, is the solver prototxt for
+ AlexNet. It defines the learning rate, the learning rate decay, the
+ regularization, how often to
+
+216
+00:16:04,809 --> 00:16:15,069
+ checkpoint, all of that; but these end up much less complex than
+ the prototxt for the network - for AlexNet it's maybe just fourteen
+ lines. What you
+
+218
+00:16:15,070 --> 00:16:22,299
+ do actually see a few times is that people who want complex
+ training pipelines - where they first train
+
+219
+00:16:22,299 --> 00:16:28,389
+ part of the network with one learning rate, and certain parts of
+ the network with different learning rates -
+
+221
+00:16:28,389 --> 00:16:38,070
+ can end up with a cascade of different solver files, which they
+ actually run kind of independently, fine-tuning the
+
+223
+00:16:38,070 --> 00:16:43,550
+ model in separate stages using different solvers. So that's all the
+ pieces;
+
+224
+00:16:43,549 --> 00:16:55,569
+ then you just train the model. If you followed my advice and just
+ used LMDB, everything goes through this existing binary: you just
+ pass it your solver
+
+227
+00:16:55,570 --> 00:16:59,540
+ prototxt and, if you're fine-tuning, the weights file to retrain
+ from, and hit run - it will probably run
+
+228
+00:16:59,539 --> 00:17:08,549
+ for a long time, checkpointing to disk as it goes. One thing to
+ point out here: you specify which GPU to run on with this flag at
+ the
+
+230
+00:17:08,549 --> 00:17:17,288
+ end, but you can actually run on the CPU by setting this flag to
+ minus one. And actually, recently - sometime this
+
+232
+00:17:17,288 --> 00:17:26,318
+ year - Caffe added data parallelism, which can split mini-batches
+ across multiple GPUs in your system: you can actually list multiple
+ GPUs in this flag,
+
+234
+00:17:26,318 --> 00:17:33,600
+ or just tell Caffe "all", and it will automatically split your
+ mini-batches across all the GPUs in the computer. This is really
+ nice: you've done multi-GPU
+
+236
+00:17:33,599 --> 00:17:51,689
+ training without writing a single line of code. Really cool. Yeah?
+
+237
+00:17:51,690 --> 00:18:00,778
+ [question] Yeah - so the question is how you'd go about something
+ more complex: maybe a complicated initialization strategy, where
+ you want to initialize weights
+
+239
+00:18:00,778 --> 00:18:07,710
+ from pretrained models in several parts of the network, things like
+ that. The answer is that you probably can't do that with the simple
+
+241
+00:18:07,710 --> 00:18:17,669
+ mechanism here; that kind of weight surgery you'd probably do in
+ Python. So that's roughly how to go about that. I think we
+ mentioned before that
+
+243
+00:18:17,669 --> 00:18:25,919
+ Caffe has this really nice Model Zoo where you can download lots of
+ different kinds of pretrained models, for ImageNet and other
+ datasets. The Model Zoo is
+
+245
+00:18:25,919 --> 00:18:33,840
+ really great: you've got AlexNet and VGG in there, you've got
+ ResNet in there - there are already quite a lot of really good
+ models in it.
+
+247
+00:18:33,839 --> 00:18:42,350
+ That's a really, really big advantage of Caffe: it's really easy to
+ download someone else's model and run it on your own data.
+
+250
+00:18:42,349 --> 00:18:49,069
+ As I mentioned, Caffe has a Python interface; I don't think I can
+ dive into the details, because there's too much to cover
+
+252
+00:18:49,069 --> 00:18:58,690
+ here, and - par for the course with Caffe - there isn't really good
+ documentation for the Python interface either, so you kind of need
+ to read the code.
+
+255
+00:18:58,690 --> 00:19:08,399
+ The Python interface to Caffe is mostly defined in these two files:
+ there's a .cpp file that uses Boost Python, if you've ever used
+ that before,
+
+257
+00:19:08,398 --> 00:19:17,648
+ to wrap some of the C++ classes and expose them to Python, and then
+ a .py file that attaches additional methods and provides a more
+ Pythonic
+
+259
+00:19:17,648 --> 00:19:27,000
+ interface. So if you want to know what kinds of methods and data
+ types are available in the Caffe Python interface, the best
+ documentation is just to read through
+
+261
+00:19:27,000 --> 00:19:37,038
+ those two files - and they're not too long, so it's quite easy to
+ do. In general, the Python interface is pretty useful: you can do
+
+263
+00:19:37,038 --> 00:19:44,960
+ crazy weight initialization strategies, if you need to do something
+ more complex than just copying a model over. It also makes it
+ really easy to just take a
+
+265
+00:19:44,960 --> 00:19:53,129
+ network and run it forward and backward with NumPy arrays. So, for
+ example, you could implement DeepDream and class
+
+267
+00:19:53,128 --> 00:20:01,349
+ visualizations, similar to what you did in the homework - you could
+ do that
+
+268
+00:20:01,349 --> 00:20:08,720
+ very easily using the Python interface to Caffe: you just need to
+ take data and run it forward and backward through different parts
+ of the network.
+
+270
+00:20:08,720 --> 00:20:15,610
+ The Python interface is also very nice if you just want to extract
+ features: if you have some pretrained model and some data, and you
+ want to pull
+
+272
+00:20:15,609 --> 00:20:25,660
+ features from some part of the network and then maybe save them to
+ disk - maybe in an HDF5 file - for a bit of downstream processing,
+ that's very easy to do through the Python interface.
+
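Feature extraction through the Python interface, as described, looks roughly like this (a sketch: the file names are placeholders, and `fc7` assumes an AlexNet-style prototxt):

~~~python
import numpy as np
import caffe

caffe.set_mode_gpu()
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

# fill the input blob (real code would load preprocessed images here)
net.blobs['data'].data[...] = np.random.randn(*net.blobs['data'].data.shape)
net.forward()                         # run the whole network forward
feats = net.blobs['fc7'].data.copy()  # grab features from an intermediate blob
np.save('feats.npy', feats)           # stash them for downstream processing
~~~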
+274
+00:20:25,660 --> 00:20:33,600
+ Caffe actually also has this kind of new feature where you can
+ define layers entirely in Python. I haven't tried it myself, but it
+ seems cool.
+
+277
+00:20:33,599 --> 00:20:41,809
+ The downside is that those layers will be CPU-only, and we talked
+ about the communication bottleneck between the CPU and GPU:
+
+279
+00:20:41,809 --> 00:20:51,289
+ if you write a layer in Python, then every forward and backward
+ pass will have a fixed overhead for transferring data back and
+ forth. But one place where Python layers can
+
+281
+00:20:51,289 --> 00:20:58,450
+ help is custom loss functions - so maybe that's a thing to keep in
+ mind.
+
+282
+00:20:58,450 --> 00:21:06,049
+ So, a quick overview of Caffe pros and cons, from my point of view.
+ If all you want is to train a kind of simple, basic feed-forward
+ network -
+
+284
+00:21:06,049 --> 00:21:12,880
+ especially for classification - Caffe makes it really easy to get
+ up and running: you don't have to write code yourself, you just use
+ all these
+
+286
+00:21:12,880 --> 00:21:21,259
+ pre-built tools, and it's very easy to run. The Python interface
+ works pretty well for slightly more complex use cases.
+
+288
+00:21:21,259 --> 00:21:29,299
+ But it can get annoying when things get really crazy, when you have
+ these big networks with repeated module patterns, like ResNet;
+
+290
+00:21:29,299 --> 00:21:37,519
+ and for things like recurrent networks, where you want weights
+ shared between different parts of the network - that kind of thing
+ can be kind of
+
+292
+00:21:37,519 --> 00:21:46,250
+ annoying in Caffe: it's possible, but it's probably not the best
+ tool to use, from my point of view. The other big downside - and
+ another con
+
+294
+00:21:46,250 --> 00:21:55,440
+ from my point of view - is that when you end up having to define
+ your own type of layer in Caffe, you're writing C++ code, and that
+ doesn't give you a very fast development cycle,
+
+296
+00:21:55,440 --> 00:22:04,750
+ so that's kind of a pain. So that was the whirlwind tour of Caffe -
+ any quick questions? Yeah -
+
+299
+00:22:06,669 --> 00:22:20,269
+ [asked about cross-validation] In Caffe you'll find that the train
+ prototxt has a training phase and a testing phase; so typically the
+ train
+
+302
+00:22:20,269 --> 00:22:33,409
+ prototxt is used for training, the deploy prototxt is used when
+ applying the net at test time, and the test phase of the train
+ prototxt is used for validation. That's about it.
+
+304
+00:22:33,409 --> 00:22:42,980
+ The next one is Torch. So - I know Torch really well and I
+ personally like it, so I'm a little biased here; just to get that
+ out in
+
+306
+00:22:42,980 --> 00:22:51,749
+ the open: I've used Torch pretty much exclusively in my own work
+ for the last year or so. So, Torch is a project that came out of
+ NYU;
+
+308
+00:22:51,749 --> 00:23:02,409
+ it's written in C and Lua, and it's used a lot at Facebook and
+ DeepMind in particular, and I think a lot of people at Twitter use
+ Torch too. So one of the big
+
+310
+00:23:02,409 --> 00:23:11,038
+ scary things for people is that you have to write in Lua instead. I
+ had never used Lua before I started working with Torch,
+
+312
+00:23:11,038 --> 00:23:20,999
+ but it's actually not too bad. Lua is a high-level scripting
+ language that was really designed for embedded devices, so it can
+ run
+
+314
+00:23:20,999 --> 00:23:29,749
+ very efficiently, and in a lot of ways it's very similar to
+ JavaScript. Another nice thing about Lua, because it's meant to run
+ on embedded
+
+316
+00:23:29,749 --> 00:23:37,149
+ devices, is that for loops are really fast. You know how in Python,
+ if you write a for loop, it's going to be really slow?
+
+318
+00:23:37,148 --> 00:23:46,249
+ Doing that in Lua is actually completely fine, because it uses
+ just-in-time compilation to make these things really fast.
+
+320
+00:23:46,249 --> 00:23:54,058
+ Like JavaScript, it's functional: functions are first-class
+ citizens, and it's very common to pass callbacks around to
+ different pieces
+
+322
+00:23:54,058 --> 00:24:01,200
+ of code. It also has this idea of prototypal inheritance:
+
+323
+00:24:01,200 --> 00:24:09,558
+ there's kind of one data structure in Lua, the table, which you can
+ think of as being very similar to a JavaScript object, and you can
+ implement
+
+325
+00:24:09,558 --> 00:24:18,428
+ object-oriented-style programming using prototypal inheritance on
+ tables, similar to the way it's done
+
+326
+00:24:18,429 --> 00:24:28,999
+ in JavaScript. One of the downsides of Lua is the standard library:
+ handling strings and whatnot can sometimes be kind of annoying. And
+ probably the most annoying gotcha is that everything is
+ one-indexed, so
+
+330
+00:24:28,999 --> 00:24:37,528
+ your intuition about for loops is a little bit off for a while; but
+ other than that, it's very easy to pick up. I put a link here to a
+ website
+
+331
+00:24:37,528 --> 00:24:45,209
+ that claims you can learn Lua in 15 minutes - that may be slightly
+ over-selling it on their part, but I think it's quite easy
+
+333
+00:24:45,210 --> 00:24:55,398
+ to pick it up pretty fast and start writing code. So, the main idea
+ behind Torch is the Tensor class. You guys have worked a lot with
+ NumPy in your
+
+335
+00:24:55,398 --> 00:25:03,329
+ assignments, and the way the assignments are kind of structured,
+ NumPy arrays give you this really easy way to manipulate your data
+ any way you want;
+
+338
+00:25:03,329 --> 00:25:10,720
+ on top of them you can build a number of different abstractions -
+ the layers and whatnot - but really it's the NumPy arrays that let
+ you
+
+340
+00:25:10,720 --> 00:25:16,909
+ manipulate numeric data any way you want, with total flexibility.
+
+341
+00:25:16,909 --> 00:25:31,990
+ So here's an example of some NumPy code that you should know pretty
+ well by now: we're just computing a simple forward pass for a
+ little two-layer ReLU network - maybe the variable names weren't
+ the best choice here, but we're
+
+345
+00:25:31,990 --> 00:25:40,408
+ defining some constants, getting some weights and some random data,
+ and we're doing a matrix
+
+346
+00:25:40,409 --> 00:25:49,538
+ multiply, and another matrix multiply. So writing that in NumPy is
+ very simple, and this actually has an almost line-for-line
+ translation into
+
+348
+00:25:49,538 --> 00:26:02,929
+ Torch. Here on the right is the same code, but using Torch tensors:
+ we define our input sizes, we define our
+
+351
+00:26:02,929 --> 00:26:09,179
+ weights - which are just Torch tensors - we get a random input
+ vector, and we do the forward pass: this matrix
+
+353
+00:26:09,179 --> 00:26:18,689
+ multiply, an elementwise maximum to compute the ReLU, and another
+ matrix multiply. So pretty much any generic kind of
+
+355
+00:26:18,690 --> 00:26:25,400
+ numeric code is very simple to port: there's pretty much a
+ line-by-line translation into using Torch tensors instead.
+
+358
+00:26:25,400 --> 00:26:33,690
+ Tensors also make it really easy to swap in and use different data
+ types - remember, we talked about these
+
+360
+00:26:33,690 --> 00:26:43,049
+ data types a bit in the last lecture - where, say, you want to
+ switch to 32-bit floating point: in NumPy, all you do is cast
+
+361
+00:26:43,049 --> 00:26:52,990
+ your data to the other data type, and it turns out it's very, very
+ easy in Torch as well: Torch also has these data types, and we can
+ easily cast our data to a different one. But here's
+
+364
+00:26:52,990 --> 00:27:02,020
+ the real reason, on the next slide, why Torch is nicer than NumPy:
+ the GPU is just another data type.
+
+366
+00:27:02,019 --> 00:27:11,630
+ When you want to run code on the GPU in Torch, you just import this
+ other package, cutorch, and it gives you another data type,
+
+368
+00:27:11,630 --> 00:27:21,819
+ the CUDA tensor; you cast your tensors to this other data type, and
+ now any numeric operations you run on those tensors just run
+
+370
+00:27:21,819 --> 00:27:34,220
+ on the GPU. It's really, really simple: in Torch you can write
+ generic tensor-based scientific computing code, and it runs on the
+ GPU and is really fast.
+
+372
+00:27:34,220 --> 00:27:41,689
+ So you should really think of these tensors as NumPy arrays, plus
+ GPUs. There's a lot of documentation on the kinds of methods
+
+374
+00:27:41,690 --> 00:27:53,950
+ that work on tensors, which you can get up here; the documentation
+ is not super complete, but it's not bad, so you should take a look
+ at it. But then, in
+
+376
+00:27:53,950 --> 00:28:02,880
+ practice, you actually don't end up using raw tensors too much in
+ Torch; instead you use this other package, called nn - neural
+ networks - and this is
+
+378
+00:28:02,880 --> 00:28:10,930
+ actually just a very thin wrapper that defines a neural network
+ package in terms of these tensor objects. You should think of it as
+
+380
+00:28:10,930 --> 00:28:20,240
+ a more industrial-strength version of the homework layer code,
+ where you have this tensor, this N-dimensional array abstraction,
+ and
+
+382
+00:28:20,240 --> 00:28:30,410
+ then implement a modular layer library on top of it, with a nice,
+ clean interface. So here's the same little two-layer network, using
+ the nn package:
+
+384
+00:28:30,410 --> 00:28:41,759
+ our network is a Sequential, which is going to be a stack of
+ sequential operations: first a Linear - a fully connected layer
+ from our inputs - then a ReLU, then
+
+387
+00:28:41,759 --> 00:28:52,070
+ another Linear. Now we can get the weights and the gradients: using
+ the getParameters method, we get a
+
+389
+00:28:52,069 --> 00:29:00,490
+ single Torch tensor holding all the weights of the network, and a
+ single Torch tensor holding all the gradients.
+
+391
+00:29:00,490 --> 00:29:11,599
+ We generate some random data, and in the forward pass we just call
+ the forward method of this object on our data, which gives us our
+ scores. To compute the loss, we have
+
+393
+00:29:11,599 --> 00:29:21,289
+ a separate criterion object, which is our loss function, so we
+ compute the loss by calling the forward method of the criterion.
+ Now we've made our predictions
+
+395
+00:29:21,289 --> 00:29:31,609
+ and computed the loss; for the backward pass, we first call
+ backward on the loss function - on the criterion - and then
+ backward on the network. This has now updated all the
+
+397
+00:29:31,609 --> 00:29:40,419
+ gradients of the network, sitting in gradParams, and we can make
+ the gradient step very easily: we multiply the gradients by the
+ negative
+
+399
+00:29:40,420 --> 00:29:50,400
+ learning rate and add them to the weights - a simple gradient
+ descent update.
+
+400
+00:29:50,400 --> 00:30:00,730
+ Maybe that would have been a little clearer another way, but: we
+ made our weights and our loss function, we got random data, we ran
+ forward and backward, and we made an update.
+
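In spirit, the Torch training step just described does the following - here transliterated to NumPy to show the flow (shapes and learning rate are arbitrary; this mirrors the forward/backward/update structure, not Torch's internals):

~~~python
import numpy as np

N, D, H, C = 32, 100, 50, 10
w1, w2 = np.random.randn(D, H), np.random.randn(H, C)

x = np.random.randn(N, D)
y = np.random.randint(C, size=N)

# forward: linear -> ReLU -> linear -> softmax loss (the "criterion")
h = np.maximum(0, x @ w1)
scores = h @ w2
p = np.exp(scores - scores.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
loss = -np.log(p[np.arange(N), y]).mean()

# backward: fills the gradients, like net:backward after criterion:backward
dscores = p.copy()
dscores[np.arange(N), y] -= 1
dscores /= N
dw2 = h.T @ dscores
dh = dscores @ w2.T
dh[h <= 0] = 0
dw1 = x.T @ dh

# vanilla gradient descent step, like params:add(-lr, gradParams)
lr = 1e-2
w1 -= lr * dw1
w2 -= lr * dw2
~~~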
+the backward pass.
+
+361
+00:29:27,279 --> 00:29:44,130
+We call backward on the loss function (the criterion) and then backward on the network, which fills in all the gradients; grad params now holds the gradient for every parameter, so we can make a gradient step very easily: multiply the grads by the negative of the learning rate and add them to the weights, a simple gradient descent update.
+
+362
+00:29:44,130 --> 00:30:03,930
+Maybe that could have been a little clearer, but: we have weights, grads, and a loss function; we generate random data, run forward and backward, and make updates; it's very easy to run, as you'd expect from looking at that code.
+
+363
+00:30:03,930 --> 00:30:17,930
+To run these networks on the GPU, we import a couple of new packages, the CUDA versions of the tensor and nn libraries, and we just cast our network and our loss function to this other data type;
+
+364
+00:30:17,930 --> 00:30:31,320
+we also need to cast our data and labels, and now the whole network runs and trains on the GPU. So in about forty lines of code we've written a fully connected network that we can train on the GPU.
+
+365
+00:30:31,319 --> 00:30:45,329
+One problem, though, is that we're just using vanilla gradient descent, which, as you saw on the assignment, is not that great; other update rules turn out to work much better in practice. To fix that,
+
+366
+00:30:45,329 --> 00:30:57,799
+Torch gives us the optim package, which is again very easy to use: we import the new package right here,
+
+367
+00:30:57,799 --> 00:31:10,750
+and what changes is that we now need to define this callback function: before, we called forward and backward explicitly; instead, this callback function runs the network forward and backward on the data and returns the loss and the gradient.
+
+368
+00:31:10,750 --> 00:31:26,940
+To actually make an update step on our network, we pass this callback to the adam method from the optim package. This is maybe a little awkward, but it means you can use any kind of update rule with just
+
+369
+00:31:26,940 --> 00:31:38,900
+a couple of lines changed from what we had before, and again it's very simple to run on the GPU just by casting everything like we saw.
+
+370
+00:31:38,900 --> 00:31:52,400
+Remember Caffe draws this really hard distinction between layers and nets, and implements everything in those terms; in Torch they don't really draw that distinction at all: the whole model is just a module, and each layer is also a module.
+
+371
+00:31:52,400 --> 00:32:08,880
+Modules are just classes, defined in Lua rather than baked in, written against the tensor API, so these modules are pretty easy to read and understand from the source. Here is the fully connected (Linear) layer:
+
+372
+00:32:08,880 --> 00:32:23,210
+in the constructor you can see it just sets up tensors for the weight and the bias, and because of this tensor API,
+
+373
+00:32:23,210 --> 00:32:37,529
+Torch can run the same code on GPU and CPU easily: all of these layers are written purely in terms of the tensor API and run happily on both devices. Modules have to implement forward and backward;
+
+374
+00:32:37,529 --> 00:32:50,480
+the forward one, in the example here, is called updateOutput; there are actually a few cases it has to handle in there, but other than that it should be pretty easy to follow.
+
+375
+00:32:50,480 --> 00:33:00,830
+For the backward pass there is a pair of methods: updateGradInput, which takes
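The callback pattern described above can be sketched in Python; torch's real optim API is in Lua, so this is only an illustration of the closure idea, with a plain SGD update standing in for Adam:

~~~python
import numpy as np

def optim_step(closure, w, lr=1e-2):
    """Mimic torch-optim's shape: call closure(w) -> (loss, grad), then update.

    The update below is ordinary gradient descent, not real Adam."""
    loss, grad = closure(w)
    w -= lr * grad                     # in-place parameter update
    return loss

# Toy objective: loss = 0.5 * ||w||^2, whose gradient is w itself.
def closure(w):
    return 0.5 * float(np.sum(w * w)), w.copy()

w = np.random.randn(10)
for _ in range(100):
    loss = optim_step(closure, w)
~~~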
+the upstream gradient and computes the gradient with respect to the inputs,
+
+376
+00:33:03,970 --> 00:33:21,480
+again implemented just with the tensor API, so it's very easy to understand; the same type of thing you saw on the homework. Modules also implement accGradParameters, which computes and accumulates the gradients
+
+377
+00:33:21,480 --> 00:33:34,940
+with respect to the weights of the network; as you saw in the constructor, the weights and biases are instance variables held by the module, and accGradParameters takes the upstream gradient and accumulates the parameter gradients,
+
+378
+00:33:34,940 --> 00:33:44,200
+again very simple using just the tensor API.
+
+379
+00:33:44,200 --> 00:33:55,930
+Torch has documentation for the ton of other modules available; it can be a little out of date, but you can just go read all the source files, which actually gets you quite far. Just to point it out, a few updates were added
+
+380
+00:33:55,930 --> 00:34:06,390
+just last week; Torch is always adding new modules you can drop into your networks, which is pretty fun. But when the existing modules are not enough,
+
+381
+00:34:06,390 --> 00:34:17,259
+it's actually very easy to write your own, because you just implement them with the tensor API: you use tensors to implement forward and backward, and it's not much harder than implementing a layer on the homework.
+
+382
+00:34:17,260 --> 00:34:28,210
+Here is a small example: a silly module that just takes its input and multiplies it by two; you can see we implement updateOutput and updateGradInput
+
+383
+00:34:28,210 --> 00:34:40,710
+from the template, and now we've implemented a new layer in Torch in about twenty lines of code. It's really easy to use afterwards: just require it, add it to your network, and so on.
+
+384
+00:34:40,710 --> 00:34:52,730
+The really cool thing is that because it's just the tensor API, you can do arbitrary things inside these forward and backward passes if you need to: imperative code, loops, something complicated,
+
+385
+00:34:52,730 --> 00:35:03,500
+or stochastic things like dropout or stochastic depth; any kind of code, as long as the backward pass computes what you'd expect. It's usually very straightforward to implement directly inside these modules
+
+386
+00:35:03,500 --> 00:35:16,960
+in Torch. Of course, easily implementing your own layer types isn't very useful on its own; we need something to stitch them together into bigger networks.
+
+387
+00:35:16,960 --> 00:35:29,950
+For that, Torch uses containers; we already saw one in the previous example: the sequential container, just a stack of modules, each one receiving the output of the previous one.
+
+388
+00:35:29,949 --> 00:35:44,289
+Another one you might see, probably the most commonly used, is the concat table: if you have one input and you want to apply several different modules to that same input,
+
+389
+00:35:44,289 --> 00:35:57,500
+a concat table gives you the list of the different outputs; and the parallel table is for when you have a list of inputs and want to apply a different module to each element of the list.
+
+390
+00:35:57,500 --> 00:36:13,480
+You use parallel tables and concat tables for those kinds of constructions, but it can get really complicated: with those containers it's in theory possible to implement just about anything you want,
+
+391
+00:36:13,480 --> 00:36:23,230
+but in practice wiring up really complicated things with them can get really hairy. So Torch provides another package called nngraph that lets you hook up
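The "multiply by two" module described above is written in Lua in the lecture; here is a minimal Python analogue of the same forward/backward shape, using numpy in place of Torch tensors:

~~~python
import numpy as np

class MulConstant:
    """Python sketch of the 'multiply by two' Torch module described above.

    In Torch this would be a Lua class implementing updateOutput and
    updateGradInput against the tensor API; the math is identical here."""
    def __init__(self, scale=2.0):
        self.scale = scale

    def update_output(self, x):
        # forward: y = scale * x
        self.output = self.scale * x
        return self.output

    def update_grad_input(self, x, grad_output):
        # backward: dL/dx = scale * dL/dy
        self.grad_input = self.scale * grad_output
        return self.grad_input

layer = MulConstant(2.0)
y = layer.update_output(np.ones((3, 4)))
dx = layer.update_grad_input(np.ones((3, 4)), np.ones((3, 4)))
~~~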
+containers into more complex topologies very easily.
+
+392
+00:36:28,210 --> 00:36:44,869
+Here's an example: say we have three inputs and we want to produce one output with this fairly simple update rule; this corresponds to the kind of computational graph we've seen many times in lecture for different types of problems.
+
+393
+00:36:44,869 --> 00:36:58,470
+You could actually implement this with just parallel and sequential containers and concat tables, but it would be kind of a mess, so when you want to do something like this it's common to use nngraph instead. So here, this
+
+394
+00:36:58,469 --> 00:37:14,329
+function builds and returns a module using nngraph: we import the graph package, and then inside here is this weird bit of syntax; these are not actually tensors, they're symbolic variables.
+
+395
+00:37:14,329 --> 00:37:26,840
+This says we're going to receive x, y, and z as Torch tensor objects as input, and the operations we now do on those inputs are actually symbolic. Here we're saying
+
+396
+00:37:26,840 --> 00:37:42,159
+we want an element-wise combination of x and y, then a multiplication bringing in z, and so on; a and b here are again symbolic objects, not actual tensors,
+
+397
+00:37:42,159 --> 00:37:55,159
+the kind of symbolic references used to build up a computational graph in the background. Now we can actually return a module: we say our module takes inputs x, y, and z and produces this output,
+
+398
+00:37:55,159 --> 00:38:10,619
+and the gModule call gives us an object conforming to the module API that implements that computation. After we've built it, we can construct concrete Torch tensors, feed them in, and it
+
+399
+00:38:10,619 --> 00:38:22,670
+actually computes the function. Torch is also actually pretty good about pretrained models: there's a package called loadcaffe that can
+
+400
+00:38:22,670 --> 00:38:35,539
+load many different types of pretrained models from Caffe and convert them into their Torch equivalents: you load a Caffe prototxt and a caffemodel file, and it turns into a big stack of
+
+401
+00:38:35,539 --> 00:38:49,660
+sequential modules. Now, loadcaffe is not super general: it works for specific networks, not arbitrary ones, but it will load AlexNet and VGG, which are probably the most commonly used;
+
+402
+00:38:49,659 --> 00:39:01,869
+there are also a few other implementations, so you can load GoogLeNet into Torch as well. And actually, quite recently Facebook went and reimplemented residual networks
+
+403
+00:39:01,869 --> 00:39:17,869
+directly in Torch and released pretrained models for them, so between AlexNet, VGG, GoogLeNet, and ResNet, that's probably all the pretrained models most people will want to use. One other point:
+
+404
+00:39:17,869 --> 00:39:29,650
+because Torch uses Lua, we can't use pip to install packages, but there's a very similar idea called luarocks that lets you easily install new packages, and it's very easy to use.
+
+405
+00:39:29,650 --> 00:39:44,640
+This is just a list of a few packages I've found very useful: Torch can read and write HDF5 files with this one, and you can read and write JSON; there's a fun one from Twitter, autograd, which is
+
+406
+00:39:44,639 --> 00:39:57,849
+a little like the thing I'll talk about in a bit; I haven't used it, but it's kind of fun to look at; and Facebook actually has a pretty useful library for Torch that, as well as implementing FFT-based
+convolutions, implements data parallelism and model parallelism.
+
+407
+00:40:01,548 --> 00:40:20,528
+So there's a very common workflow with Torch that works pretty well: you often have a preprocessing script, in Python, that preprocesses your data and dumps it to disk in some nice format, usually HDF5
+
+408
+00:40:20,528 --> 00:40:35,019
+for big things and JSON for small things; then you typically write a training script in Lua that reads everything from HDF5, trains and optimizes the model, and saves model checkpoints to disk; and usually an evaluation
+
+409
+00:40:35,019 --> 00:40:48,239
+script that loads a trained checkpoint and does something useful with it. As a case study for this type of workflow, there's this project I put up on GitHub a week ago that implements character-level language models in Torch:
+
+410
+00:40:48,239 --> 00:41:03,720
+a preprocessing script converts a text file to HDF5; a training script loads the HDF5, trains a recurrent network, and generates checkpoints; and a sampling script loads a checkpoint and generates text. So that's my typical Torch workflow. Quick pros and cons:
+
+411
+00:41:03,719 --> 00:41:15,760
+Lua is a big dividing point for people, but I wouldn't say it's a big deal about Torch. It's definitely less plug-and-play than Caffe: you'll typically end up writing a lot of your own code, a bit
+
+412
+00:41:15,760 --> 00:41:26,880
+more overhead, but that buys you more flexibility; you have a lot of modules that are easy to plug together, and the pieces of the standard library are very simple and easy to
+
+413
+00:41:26,880 --> 00:41:38,640
+read and understand, because it's all written in Lua. There are also a lot of very nice pretrained models. Unfortunately, it's a little awkward for recurrent networks:
+
+414
+00:41:38,639 --> 00:41:49,649
+in general, when you want multiple modules that share weights with each other, it's actually possible in Torch, but it's kind of brittle, so you can run into subtle bugs; that's probably the biggest caveat: recurrent networks can be tricky.
+
+415
+00:41:49,650 --> 00:42:15,800
+Any questions about Torch? [student question]
+
+416
+00:42:15,800 --> 00:42:31,960
+The question was about for loops and why Python interprets them so badly: loops are really bad in Python because it's interpreted, doing memory allocation and quite a lot of other work behind the scenes; but if you've used JavaScript, loops in
+
+417
+00:42:31,960 --> 00:42:44,520
+the JavaScript runtime tend to be pretty fast, because it actually JIT-compiles the code on the fly down to native code; and LuaJIT has a similar mechanism where
+
+418
+00:42:44,519 --> 00:43:01,619
+your Lua code gets automagically JIT-compiled, so loops are really fast; though I'd still say that writing custom vectorized code can give you a lot of speedups. All right, we have probably half an hour left
+
+419
+00:43:01,619 --> 00:43:12,000
+and two more frameworks to cover, so we're running short on time. Next is Theano, which comes from Yoshua Bengio's group at the University of Montreal, and
+
+420
+00:43:12,000 --> 00:43:24,139
+it's really all about computational graphs. We saw graphs a little bit with nngraph in Torch, which is a pretty nice way to stitch computational graphs into big complex architectures; Theano takes this idea of computational
+
+421
+00:43:24,139 --> 00:43:33,940
+graphs and runs with it to the extreme, and it also has higher-level libraries on top of it that we'll touch on as well; Lasagne is one of them. So here is the same computational
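The preprocess-to-HDF5 workflow described above is easy to sketch with h5py and json; the filenames and array shapes here are invented for illustration:

~~~python
# preprocess step: dump big arrays to HDF5, small metadata to JSON
import json
import h5py
import numpy as np

data = np.random.randn(1000, 3, 32, 32).astype(np.float32)  # fake dataset
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('images', data=data)
with open('meta.json', 'w') as f:
    json.dump({'num_images': 1000}, f)

# the training script would then read it back:
with h5py.File('data.h5', 'r') as f:
    images = f['images'][:]
~~~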
+422
+00:43:33,940 --> 00:43:49,440
+graph that we saw before in the context of nngraph, and we can actually walk through an implementation of it in Theano. You can see here that we import theano and its tensor module, and we define x, y, and z as
+
+423
+00:43:49,440 --> 00:43:59,500
+symbolic variables, very similar to the symbolic variables in the nngraph example we saw just a few slides ago. These are symbolic objects of that kind, not actual numpy arrays;
+
+424
+00:43:59,500 --> 00:44:15,769
+then we can actually compute the outputs a, b, and c symbolically, just using these overloaded operators, and this builds up a computational graph in the background.
+
+425
+00:44:15,769 --> 00:44:29,240
+Once we've built our computational graph, we want to actually run parts of it on real data, so we call this weird-looking function thing: we're saying that our function
+
+426
+00:44:29,239 --> 00:44:38,329
+will take inputs x, y, and z and produce this output c, and it returns an actual Python function that we can evaluate
+
+427
+00:44:38,329 --> 00:44:54,199
+on real data. I really want to point out that all the magic in Theano happens when you call function; crazy, crazy things can happen here: it can simplify the computational graph to make it more efficient, it can differentiate symbolically, and other things,
+
+428
+00:44:54,199 --> 00:45:06,389
+and it can actually generate native code; so when you call function, it can target the GPU and actually sometimes compiles code in flight. All the magic of Theano is really coming from this
+
+429
+00:45:06,389 --> 00:45:14,710
+innocent-looking statement in Python, but there's a lot happening under the hood. Now, once we've got this magic function with all that crazy stuff,
+
+430
+00:45:14,710 --> 00:45:30,639
+we can just run it on real numpy arrays: here we instantiate xx and yy as actual numpy arrays, and then we can call our function, passing in these real numbers, and get values out.
+
+431
+00:45:30,639 --> 00:45:42,840
+This does the same computation as doing it explicitly in Python, except the final version can be much more efficient thanks to all the magic under the hood, and the Theano version can actually run on the GPU if
+
+432
+00:45:42,840 --> 00:45:51,659
+you've configured that. But unfortunately we don't really care about computing things like this; what we want to know about is
+
+433
+00:45:51,659 --> 00:46:06,490
+neural nets, so here's an example of a simple two-layer ReLU network in Theano. The idea is the same: we declare our inputs, but now instead of x, y, and z our inputs are x, the data, and y, the labels,
+
+434
+00:46:06,489 --> 00:46:17,540
+plus the weight matrices w1 and w2; we just set up these symbolic weight variables that will be elements of our computational graph. Now, for the forward pass,
+
+435
+00:46:17,540 --> 00:46:28,909
+this is a bizarre operation that looks kind of like numpy but isn't: it's building the computational graph in the background; a matrix multiplication and an activation, but we need symbolic
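A minimal sketch of the symbolic-function idea just described, using Theano's classic API (Theano is no longer actively developed, so treat this as historical illustration):

~~~python
import numpy as np
import theano
import theano.tensor as T

# Symbolic inputs: these are graph nodes, not numpy arrays.
x = T.dvector('x')
y = T.dvector('y')
z = T.dvector('z')

a = x * y            # element-wise product, recorded symbolically
b = a + z
c = b.sum()

# All the "magic" happens here: graph optimization, code generation, etc.
f = theano.function(inputs=[x, y, z], outputs=c)

xx = np.ones(4)
print(f(xx, xx, xx))  # runs the compiled graph on real data
~~~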
+objects, so we do the actual operations using these library functions.
+
+436
+00:46:28,909 --> 00:46:44,349
+We do another matrix multiplication, and then we can actually compute the loss, the probabilities and the loss, again using a few of these other library functions; these are all operations on symbolic objects, building up
+
+437
+00:46:44,349 --> 00:46:58,890
+the computational graph. So now we can compile this function: our function will take our data, our labels, and the two weight matrices as inputs, and as outputs it returns the loss, which is a scalar, and our
+
+438
+00:46:58,889 --> 00:47:13,759
+vector of classification scores. Now we can run this thing on real data: just like we saw on the previous slide, we instantiate some actual numpy arrays and pass them into the function. So this is great, but this is only
+
+439
+00:47:13,760 --> 00:47:29,510
+the forward pass; to actually train this network and compute gradients, we only need to add a couple lines of code. This is the same as before: we're defining our symbolic variables for our inputs and our weights, and so on, and we're
+
+440
+00:47:29,510 --> 00:47:37,920
+running the same forward pass as before, computing the loss with the library calls. The difference is that we know Theano can actually do
+
+441
+00:47:37,920 --> 00:47:52,280
+symbolic differentiation: in this dw1, dw2 line I'm saying that I want the gradient of the loss with respect to w1 and with respect to w2, those other symbolic variables,
+
+442
+00:47:52,280 --> 00:48:05,190
+which is really cool: Theano lets you take arbitrary gradients of any part of the graph with respect to any other part, introducing them as new symbolic variables in the graph, so you can really go crazy with that; but
+
+443
+00:48:05,190 --> 00:48:19,510
+in this case here we're just going to return those gradients as outputs. We compile a new function again: it takes our input pixels x and our labels y, along with the two weight matrices, and
+
+444
+00:48:19,510 --> 00:48:28,250
+now it returns the classification scores, the loss, and the two gradients.
+
+445
+00:48:28,250 --> 00:48:38,990
+With those ingredients we can actually set up very simple training for a neural network: we can use gradient descent, implemented in just a few lines using this computational
+
+446
+00:48:38,989 --> 00:48:50,519
+graph. So here we instantiate actual numpy arrays for the dataset and, for the weights, some random matrices;
+
+447
+00:48:50,519 --> 00:49:01,970
+every time we call this function it returns numpy arrays containing the loss and scores and the gradients, and given the gradients we just make a simple gradient update to our weights in a loop, and this will train our network. Hooray! But there's actually
+
+448
+00:49:01,969 --> 00:49:15,599
+a big problem with this, especially if you're running on a GPU, where you can completely lose your speedup:
+
+449
+00:49:15,599 --> 00:49:29,720
+it actually incurs a lot of communication overhead between the CPU and the GPU, because every time we call our function and get these gradients back, we're copying the gradients from the GPU back to the CPU, which can be a costly operation; and
+
+450
+00:49:29,719 --> 00:49:45,389
+then we're actually making our gradient step in numpy, which is a CPU computation. It would be really nice if we could make those gradient updates to our parameters directly on the GPU, and the way that Theano does that is
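A sketch of the two-layer network with symbolic gradients, in the spirit of the Theano code described above; shapes and hyperparameters are made up, and it shows the CPU-side update loop criticized just above:

~~~python
import numpy as np
import theano
import theano.tensor as T

# Symbolic inputs: data x, one-hot labels y, and the two weight matrices.
x, y = T.dmatrix('x'), T.dmatrix('y')
w1, w2 = T.dmatrix('w1'), T.dmatrix('w2')

h = T.maximum(0, x.dot(w1))                   # ReLU hidden layer
scores = h.dot(w2)
probs = T.nnet.softmax(scores)
loss = T.nnet.categorical_crossentropy(probs, y).mean()
dw1, dw2 = T.grad(loss, [w1, w2])             # symbolic gradients

f = theano.function([x, y, w1, w2], [loss, scores, dw1, dw2])

# Training loop: gradients come back as numpy arrays, so the update is on CPU.
N, D, H, C = 64, 100, 50, 10
xx = np.random.randn(N, D)
yy = np.eye(C)[np.random.randint(C, size=N)]  # one-hot labels
ww1, ww2 = np.random.randn(D, H), np.random.randn(H, C)
for t in range(10):
    l, s, g1, g2 = f(xx, yy, ww1, ww2)
    ww1 -= 1e-2 * g1
    ww2 -= 1e-2 * g2
~~~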
+451
+00:49:45,389 --> 00:49:59,340
+with this cool thing called shared variables. A shared variable is different: it's a value that actually lives inside the computational graph and persists across calls to the function.
+
+452
+00:49:59,340 --> 00:50:13,809
+So here this is very similar to before: we define the same symbolic variables x and y for the data and the labels, and now we define a couple of these new funky shared variables for our weight matrices, initializing
+
+453
+00:50:13,809 --> 00:50:24,980
+those weight matrices with numpy arrays; and now this is the same as before, the same code computing the forward pass using these library functions, with the symbolic
+
+454
+00:50:24,980 --> 00:50:36,780
+gradients. But the difference now is in how we define our function: the compiled function
+
+455
+00:50:36,780 --> 00:50:49,639
+doesn't receive the weights at all; those actually live inside the computational graph; instead we only pass in the data and the labels, and rather than putting the loss and the
+
+456
+00:50:49,639 --> 00:51:02,010
+gradients out explicitly as outputs, we instead provide these update rules that should be run every time the function is called. An update rule is just a little expression on symbolic variables:
+
+457
+00:51:02,010 --> 00:51:12,880
+it says that w1 should be updated to w1 minus the learning rate times the gradient, and that weight update gets recorded into the graph. So now, every time we run this computational graph,
+
+458
+00:51:12,880 --> 00:51:23,769
+all we need to do to train this network is call this function at every iteration: when we call the function, it will make those gradient steps on the weights, so
+
+459
+00:51:23,769 --> 00:51:34,609
+we can train this network just by calling this thing over and over in a loop. In practice, when you do this kind of thing, you'll often define a training function that updates the weights, and another function that just evaluates the scores and doesn't do the updates;
+
+460
+00:51:34,610 --> 00:51:47,220
+you can actually have several of these compiled functions over different parts of the same graph. Yeah; so the question is how we compute the gradients:
+
+461
+00:51:47,219 --> 00:51:58,769
+well, it is sort of symbolic; every time you make these calls it's actually building up this computation in a
+
+462
+00:51:58,769 --> 00:52:09,360
+graph object, and you can compute the gradients just by adding nodes to that computational graph object. Yeah,
+
+463
+00:52:09,360 --> 00:52:21,299
+so it needs to know, for each of these basic operators, what its derivative is, and this is still normal backpropagation, like you've seen; but the pieces it works with
+
+464
+00:52:21,300 --> 00:52:32,210
+are very, very low-level primitive operations, element-wise things and matrix multiplications, and hopefully it can fuse and simplify those to compile efficient code; symbolically, I'm not sure exactly how that works, but at least that's what
+
+465
+00:52:32,210 --> 00:52:44,809
+they claim. So there are a lot of other advanced things that I know we don't have time to talk about: you can actually have conditionals directly inside the computational graph, with ifelse
+
+466
+00:52:44,809 --> 00:52:57,409
+and switch statements, and you can actually include loops inside the computational graph using this funny scan function, which I don't really understand, but which in theory lets you implement recurrent networks very
+
+467
+00:52:57,409 --> 00:53:05,539
+easily: you can imagine for a moment that a recurrent network is just one operation where the same weight matrix is passed into
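A sketch of shared variables with an updates list, as described above; again classic Theano API, invented shapes:

~~~python
import numpy as np
import theano
import theano.tensor as T

x, y = T.dmatrix('x'), T.dmatrix('y')

# Shared variables live inside the graph (on the GPU if so configured)
# and persist between calls, so the update never leaves the device.
w1 = theano.shared(np.random.randn(100, 50), name='w1')
w2 = theano.shared(np.random.randn(50, 10), name='w2')

h = T.maximum(0, x.dot(w1))
probs = T.nnet.softmax(h.dot(w2))
loss = T.nnet.categorical_crossentropy(probs, y).mean()
dw1, dw2 = T.grad(loss, [w1, w2])

lr = 1e-2
train = theano.function(
    inputs=[x, y], outputs=loss,
    updates=[(w1, w1 - lr * dw1), (w2, w2 - lr * dw2)])

xx = np.random.randn(64, 100)
yy = np.eye(10)[np.random.randint(10, size=64)]
for t in range(10):
    print(train(xx, yy))   # each call also applies the weight updates
~~~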
+multiple nodes of the computational graph,
+
+468
+00:53:10,110 --> 00:53:24,300
+and scan sort of lets that loop actually be an explicit part of the graph. And we can really go crazy with derivatives: we can compute derivatives of any part of the graph with respect to any other part; we can also compute Jacobians, and we end up computing derivatives of derivatives;
+
+469
+00:53:24,300 --> 00:53:36,610
+we can use the R and L operators to form efficient Jacobian-vector and vector-Jacobian products; you can do a lot of really cool derivative tricks, probably above and beyond the other frameworks. Theano also has some stock
+
+470
+00:53:36,610 --> 00:53:50,599
+support for sparse matrices, it tries to optimize code on the fly, and other cool things. It has multi-GPU support: there's a package, which I haven't used, that claims to give you data parallelism,
+
+471
+00:53:50,599 --> 00:54:01,320
+meaning splitting across multiple GPUs, and there's experimental support for model parallelism, where the computational graph is split between different devices; but the documentation
+
+472
+00:54:01,320 --> 00:54:11,730
+itself calls that experimental, so it's probably really experimental. So you've seen that when working with Theano the API is a bit low level: we need
+
+473
+00:54:11,730 --> 00:54:24,660
+to sort of implement our own update rules and everything ourselves. Lasagne is a higher-level wrapper around Theano that abstracts some of those details away for you: so again we define our symbolic matrices, and
+
+474
+00:54:24,660 --> 00:54:38,469
+Lasagne has these layer functions that automatically set up the shared variables and that kind of thing; we can compute the probabilities and the loss using these handy things from the library, and Lasagne can actually
+
+475
+00:54:38,469 --> 00:54:51,840
+write these update rules for us; it implements Nesterov momentum and other fancy things; and now when we compile our function, we just pass in these update rules that Lasagne has written for us.
+
+476
+00:54:51,840 --> 00:55:04,599
+So Lasagne has taken care of those details for us, and at the end of the day we end up with these compiled Theano functions, which we use in the same way as before.
+
+477
+00:55:04,599 --> 00:55:15,730
+There's another quite popular wrapper over Theano called Keras, which is even a bit more high level:
+
+478
+00:55:15,730 --> 00:55:29,759
+so here we're creating a sequential container and adding a stack of layers to it, kind of like Torch; then we make this optimizer object that's actually going to do our updates, and now we can
+
+479
+00:55:29,760 --> 00:55:40,289
+train our network just using the model's fit method. It's so high level that you can't even tell it's actually using Theano in the background to do the glorious work for us. But there's
+
+480
+00:55:40,289 --> 00:55:54,750
+actually one big problem with this piece of code, and I don't know if it's just my lack of experience, but it can actually crash, and it crashes in a really bad way: the error message we get is this giant stack trace, none
+
+481
+00:55:54,750 --> 00:56:07,039
+of it going through code that we wrote; we get this giant ValueError that makes no sense to me; I'm not really a Theano expert, so this was really confusing: we wrote this simple-looking piece of code,
+
+482
+00:56:07,039 --> 00:56:18,730
+but because we're using Theano as the backend, it barfed it out and gave us this really confusing error message. I think that's one of the common pain points: when anything fails, debugging the backend can be
+
+483
+00:56:18,730 --> 00:56:24,949
+kind of hard. I googled the error and found that some good developer had hit it
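A sketch in the spirit of the Keras snippet described above; exact names vary across Keras versions (this matches the early, Theano-backend era), and the one-hot conversion shown is the kind of label-format fix the lecturer alludes to:

~~~python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(50, input_dim=100))
model.add(Activation('relu'))
model.add(Dense(10))
model.add(Activation('softmax'))

# Keras drives Theano (or another backend) behind the scenes.
model.compile(loss='categorical_crossentropy', optimizer='sgd')

X = np.random.randn(64, 100)
y = np.eye(10)[np.random.randint(10, size=64)]  # one-hot labels, not integers
model.fit(X, y)
~~~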
+484
+00:56:24,949 --> 00:56:41,139
+too: I was passing my y variable in the wrong format, and I found that converting the y variable with this other utility function made the problem go away; but that was not at all obvious from the error message, so it's something to be aware of when you use Theano.
+
+485
+00:56:41,139 --> 00:56:56,190
+Since we're talking about Lasagne: it actually has a pretty good pretrained model zoo with a lot of the popular model architectures, so with Lasagne you can use AlexNet and
+
+486
+00:56:56,190 --> 00:57:07,030
+GoogLeNet and VGG; I don't think they have ResNet yet, but there's a lot of useful stuff in there. There are a few other packages I found that seem nice, but I can't really speak to them clearly, except that this one was clearly awesome, because it
+
+487
+00:57:07,030 --> 00:57:16,139
+was a CS231n project from last year. If you're going to pick one, though, I think the Lasagne model zoo is probably really good; that's my experience from a day of playing around.
+
+488
+00:57:16,139 --> 00:57:28,760
+So, pros and cons: Python and the pipeline around it are great; this computational graph idea seems like a really powerful one, especially computing gradients symbolically with all these optimizations, and especially for RNNs:
+
+489
+00:57:28,760 --> 00:57:41,470
+I think recurrent nets are much easier to implement using these computational graphs. Raw Theano is kind of ugly and gross, but Lasagne in particular looks quite nice to me and takes away some of the pain. The error messages can be quite a
+
+490
+00:57:41,469 --> 00:57:54,579
+pain, as we saw, and from what I've heard, big models can have really long compile times: all of our simple examples compile
+
+491
+00:57:54,579 --> 00:58:06,239
+almost instantaneously, but for big complex things like neural Turing machines I've heard stories that compiling could actually take maybe half an hour, which is not so great for iterating quickly on your models. Another
+
+492
+00:58:06,239 --> 00:58:17,969
+pain point is that the API is a lot fatter than Torch's: there's this complex stuff in the background that's kind of hard to understand and debug, whereas in Torch what actually happens is literally the code. And the pretrained models are maybe
+
+493
+00:58:17,969 --> 00:58:30,320
+not as good as Caffe's or Torch's, but Lasagne's are pretty good, so it seems OK. All right, we talked about Theano first; we're at fifteen minutes now, so
+
+494
+00:58:30,320 --> 00:58:42,809
+if there are questions about Theano I can try to take them. OK, so next is TensorFlow. TensorFlow is really cool and shiny and new and from Google, and everyone is
+
+495
+00:58:42,809 --> 00:58:55,650
+excited about it; and it's actually very similar to Theano in a lot of ways: they really take this idea of computational graphs and build everything on it, so TensorFlow and Theano are actually very, very closely linked
+
+496
+00:58:55,650 --> 00:59:07,200
+in my mind; it's the kind of thing where you could sort of get away with using either one as the backend. One distinguishing thing about TensorFlow is that it's maybe the first of these frameworks that was designed from the
+
+497
+00:59:07,199 --> 00:59:17,320
+ground up by professional engineers. A lot of the other frameworks spun out of academic research labs; they're really good, and they let you do things really well, but they were sort of
+
+498
+00:59:17,320 --> 00:59:30,070
+maintained by grad students; Torch especially is maintained by some engineers at Twitter and Facebook now, but it was originally an academic project; and of all of these I think TensorFlow was the first
+
+499
+00:59:30,070 --> 00:59:35,000
+that was industrial from scratch, so in theory
+that could lead to better code quality or test coverage or something; I'm not sure.
+
+500
+00:59:37,989 --> 01:00:04,519
+[student comment] So there's some pretty scary stuff lying around in there; I'm not sure.
+
+501
+01:00:04,519 --> 01:00:12,769
+OK, we'll leave it there; we've done all the other frameworks with the same intent, so let's see how TensorFlow is really similar:
+
+502
+01:00:12,769 --> 01:00:26,380
+remember that in Theano these matrices and vectors were symbolic variables; in TensorFlow they're called placeholders, but it's the same idea: we're creating input nodes for our computational graph.
+
+503
+01:00:26,380 --> 01:00:40,359
+We also define weight matrices: in Theano we had these shared things that lived inside the computational graph, and the same idea in TensorFlow is called a Variable. And we do the forward pass just like in Theano:
+
+504
+01:00:40,360 --> 01:00:52,210
+these library methods operate symbolically on these things, that is, they build up the computational graph, so we can easily compute the
+
+505
+01:00:52,210 --> 01:01:04,669
+probabilities, the loss, and everything like this symbolically. Actually, I think this looks a little nicer; to me it looks more like Lasagne than like raw Theano. But then we use this gradient descent
+
+506
+01:01:04,670 --> 01:01:19,250
+optimizer, and we're telling it to minimize the loss: so here we haven't explicitly spit out the gradients, and we're not explicitly writing out the update rule; instead we use this optimizer thing, and it just sort of adds
+
+507
+01:01:19,250 --> 01:01:29,470
+whatever it needs to do to the graph in order to minimize the loss. Now, like in Theano, we can instantiate actual numpy arrays,
+
+508
+01:01:29,469 --> 01:01:42,599
+some small dataset, and then to actually run code in TensorFlow you need to wrap it in a session. I didn't totally understand what that was doing, but what's actually going on is that
+
+509
+01:01:42,599 --> 01:01:58,110
+everything up to that point was just setting up your computational graph, and the session is what actually does whatever optimization it wants and actually runs it. [student question]
+
+510
+01:01:58,110 --> 01:02:11,420
+Yeah, so the question is about one-hot: if you remember, in the assignments the softmax loss function always took integer labels; but in some of these frameworks, instead of integers, the labels
+
+511
+01:02:11,420 --> 01:02:28,710
+are supposed to be one-hot vectors: zero everywhere except a one at the correct class. That was actually the bug that bit me before; the difference between one-hot and not one-hot, it turns out, matters.
+
+512
+01:02:28,710 --> 01:02:41,940
+When we actually want to train this network: remember in Theano we compiled this function object and then called the function over and over; the analogue in TensorFlow is that we call the run method of the
+
+513
+01:02:41,940 --> 01:02:54,769
+session object. We tell it which outputs we want it to compute; here we're saying we want to compute the train step and the loss; and we feed in these numpy arrays for the inputs with a feed dict; so it's
+
+514
+01:02:54,769 --> 01:03:10,690
+the same kind of idea as Theano, except that we call the run method in a session rather than explicitly compiling a function; in the process of evaluating this train-step object, it runs the gradient descent step on the weights. So we just run this thing in a loop, the loss goes down, and
+
+515
+01:03:10,690 --> 01:03:20,519
+everything is beautiful. So one of the really cool things about TensorFlow
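A sketch of the placeholder/Variable/Session flow just described, using the TensorFlow 1.x API (shapes invented for illustration; this predates the eager-execution style of TF 2):

~~~python
import numpy as np
import tensorflow as tf   # assumes the 1.x graph-mode API

x = tf.placeholder(tf.float32, shape=[None, 100])   # input nodes of the graph
y = tf.placeholder(tf.float32, shape=[None, 10])    # one-hot labels
w1 = tf.Variable(tf.random_normal([100, 50]))       # weights live in the graph
w2 = tf.Variable(tf.random_normal([50, 10]))

h = tf.nn.relu(tf.matmul(x, w1))
scores = tf.matmul(h, w2)
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=scores, labels=y))

# The optimizer adds whatever update ops it needs to the graph.
train_step = tf.train.GradientDescentOptimizer(1e-2).minimize(loss)

xx = np.random.randn(64, 100).astype(np.float32)
yy = np.eye(10)[np.random.randint(10, size=64)].astype(np.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for t in range(10):
        _, l = sess.run([train_step, loss], feed_dict={x: xx, y: yy})
~~~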
+is this thing called TensorBoard, which makes it easy to visualize what's going on in your network.
+
+516
+01:03:20,519 --> 01:03:34,280
+So here is almost exactly the same code we had before, except you can see we've added these three little lines (you'll have to trust me here) where we compute a scalar summary of
+
+517
+01:03:34,280 --> 01:03:46,049
+the loss, which gives us a new symbolic variable, and summaries that compute histograms of the weight matrices w1 and w2, again giving us new symbolic variables;
+
+518
+01:03:46,050 --> 01:03:58,929
+and now we get another symbolic variable where all of those are merged together into one summaries object; don't ask me to understand all that magic; and we get this summary writer object that we can use
+
+519
+01:03:58,929 --> 01:04:10,460
+to actually dump those summaries to disk. Now in our loop, when we actually run the network, we tell it to evaluate this merged summaries object as well as the training step and the loss, like before;
+
+520
+01:04:10,460 --> 01:04:22,019
+in the process of evaluating that summaries object, it will compute the histograms of the weights and the gradients and dump those summaries to disk through our writer; I guess that's where that happens;
+
+521
+01:04:22,019 --> 01:04:33,069
+and then when you run this thing, as it runs it's continuously streaming all this information about what's happening in the network to disk; then you
+
+522
+01:04:33,070 --> 01:04:42,539
+start this web server that comes with TensorFlow, TensorBoard, and we get these beautiful, beautiful visualizations of what's going on in your network. Here on the left,
+
+523
+01:04:42,539 --> 01:04:52,760
+remember we said we were getting a scalar summary of the loss: this actually shows the loss going down; I mean, it was a small network
+
+524
+01:04:52,760 --> 01:05:04,579
+and a small dataset, but it works, so that all makes sense; and shown here on the right are histograms over time showing the distribution of values in your weight matrices. So this stuff is
+
+525
+01:05:04,579 --> 01:05:14,900
+really, really cool; I think it's a really beautiful debugging tool. When I've worked on projects with Torch, I've written this kind of stuff by hand myself, just dumping JSON out of Torch and writing my own custom
+
+526
+01:05:14,900 --> 01:05:27,489
+visualizations, because looking at these kinds of statistics is really useful; and with TensorFlow you don't have to write any of that yourself: you just add a couple lines of code to your training script and
+
+527
+01:05:27,489 --> 01:05:39,820
+you get all of these beautiful visualizations to help with debugging. TensorBoard also lets you visualize the network itself: these variables we had annotated with names, and there's structure, so here,
+
+528
+01:05:39,820 --> 01:05:52,519
+when we're doing the forward pass, we can actually do some scoping that groups parts of the computation together under namespaces, saying which things should belong together;
+
+529
+01:05:52,519 --> 01:06:04,789
+and, same as we saw before, if we run this network and load it up in TensorBoard, we actually get this beautiful visualization of what our network actually looks like, and we can actually click around and browse
+
+530
+01:06:04,789 --> 01:06:15,030
+to really help debug what's going on inside this network on the screen: in this network you see the loss and the scores,
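A sketch of the three summary lines and the writer loop described above, continuing the TF 1.x training sketch from earlier (older TF releases spelled these tf.scalar_summary / tf.histogram_summary; the names below are the 1.x forms):

~~~python
loss_summ = tf.summary.scalar('loss', loss)
w1_summ = tf.summary.histogram('w1', w1)
w2_summ = tf.summary.histogram('w2', w2)
merged = tf.summary.merge_all()

writer = tf.summary.FileWriter('/tmp/tf_logs')   # dumps summaries to disk
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for t in range(10):
        summ, _, l = sess.run([merged, train_step, loss],
                              feed_dict={x: xx, y: yy})
        writer.add_summary(summ, t)
# Then `tensorboard --logdir /tmp/tf_logs` serves the visualizations.
~~~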
+531
+01:06:15,030 --> 01:06:28,108
+which are meaningful names that we defined during the forward pass; and if we click on scores, for example, it opens up and we can see all of the operations happening inside that node of the computational
+
+532
+01:06:28,108 --> 01:06:39,300
+graph. So I thought this was really cool: it makes it really easy to debug what's going on inside your network while it's running, without writing any of this visualization code yourself.
+
+533
+01:06:39,300 --> 01:06:53,338
+TensorFlow also has multi-GPU support, data parallelism as you'd expect, and I want to point out that the distributed part is maybe one of the other major selling points of TensorFlow: it can actually
+
+534
+01:06:53,338 --> 01:07:05,559
+distribute the computational graph across different devices in different ways, and place that distributed graph cleverly to minimize communication overhead and so on. So one thing you can do is data parallelism, where you
+
+535
+01:07:05,559 --> 01:07:16,730
+just put copies of the model on different devices, run each one forward and backward, and then do either synchronous or asynchronous distributed training updates of your
+
+536
+01:07:16,730 --> 01:07:25,000
+parameters; the white paper claims you can do both of these things in TensorFlow, but I haven't tried it out. You can
+
+537
+01:07:25,000 --> 01:07:36,510
+also do model parallelism in TensorFlow: computing different parts of the same model, splitting the same model across different devices. Here is an example of one place where this can be useful: with a multi-layer recurrent network it can
+
+538
+01:07:36,510 --> 01:07:47,599
+actually be nice to run different layers of the network on different GPUs, because those things can take a lot of memory; and you can actually do that kind of thing in TensorFlow without too much pain.
+
+539
+01:07:47,599 --> 01:07:58,309
+TensorFlow is also the only one of these frameworks that can run in a distributed mode: not just multiple GPUs in a single machine, but actually distributing model training across many machines.
+
+540
+01:07:58,309 --> 01:08:13,890
+The caveat here is that, as of today, that part is not open source yet: the open-source version of TensorFlow can do single-machine multi-GPU training, but I think, hopefully, that distributed part really will be released soon.
+
+541
+01:08:13,889 --> 01:08:26,489
+That's the cool idea here: it ends up being aware of all the communication costs, both between different machines on the network and also between the GPU and the CPU, so
+
+542
+01:08:26,488 --> 01:08:37,649
+it can try to distribute the computational graph wisely across the different machines, and across the different CPUs in those machines, to compute everything as efficiently as possible; so I think that's really cool, and
+
+543
+01:08:37,649 --> 01:08:46,409
+it's something the other frameworks can't do right now. One thing to point against TensorFlow is pretrained models: I did a thorough
+
+544
+01:08:46,408 --> 01:08:59,569
+Google search, and the only thing I could come up with was that there's a pretrained Inception model, but it's accessible only through this Android demo, which is something I would have expected to be
+
+545
+01:08:59,569 --> 01:09:13,230
+more clearly documented; but at least you have one pretrained model; other than that one, I'm not really aware of other pretrained models for TensorFlow, but maybe, maybe, maybe they're out there and I just don't know.
+
+546
+01:09:13,229 --> 01:09:27,929
+So, TensorFlow pros and cons: again it's Python, so my quick day-to-day experimentation pipeline is really nice;
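A minimal sketch of manual device placement, the mechanism behind the model-parallelism idea above (TF 1.x; assumes two GPUs exist on the machine):

~~~python
import tensorflow as tf

with tf.device('/gpu:0'):
    a = tf.random_normal([1000, 1000])
    b = tf.matmul(a, a)          # one layer's work pinned to GPU 0
with tf.device('/gpu:1'):        # assumes a second GPU is present
    c = tf.matmul(b, b)          # the next layer's work pinned to GPU 1

with tf.Session() as sess:
    sess.run(c)                  # the runtime handles the cross-device copy
~~~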
+and it has this computational graph idea, which I think is super powerful;
+
+547
+01:09:27,929 --> 01:09:46,380
+it actually takes computational graphs even further than Theano, so things like checkpointing and distributing across devices all end up inside the computational graph, which is really cool. It claims
+
+548
+01:09:46,380 --> 01:09:59,219
+to compile things faster; I've heard horror stories about neural Turing machines taking maybe half an hour of compile time, so maybe this is faster as planned; I've heard it should be.
+
+549
+01:09:59,219 --> 01:10:07,340
+TensorBoard looks amazing; it looks awesome and I want to use it everywhere. The data parallelism and model parallelism I think are much more advanced
+
+550
+01:10:07,340 --> 01:10:15,689
+than in the other frameworks; the distributed part is still secret sauce inside Google, but hopefully I think it will eventually come out for the rest of us. But,
+
+551
+01:10:15,689 --> 01:10:29,010
+as Bob also said, it's probably actually scary digging into the underlying code base to understand what's working under the hood; so at least my fear with TensorFlow
+
+552
+01:10:29,010 --> 01:10:40,659
+is that if you want to do some kind of crazy, weird imperative code that doesn't fit easily into the computational graph abstraction, it seems like you could be in a lot of trouble; whereas in Torch you can
+
+553
+01:10:40,659 --> 01:10:52,959
+write whatever imperative code you want inside your own custom forward and backward passes. That would be the biggest worry point for me; for a ton of normal workloads it's fine, but other kinds of awkward things could be
+
+554
+01:10:52,959 --> 01:11:22,019
+kind of a mess. [student question about installing Theano]
+
+555
+01:11:22,019 --> 01:11:36,759
+Even in 2012, installing it was a bit of a pain: they claim it's all Python and you can just download it and pip install, but it was broken, and they had renamed a file, so to get it installed I had to manually fix a broken dependency and download some random
+
+556
+01:11:36,759 --> 01:11:47,400
+zip file, unpack it, and copy some random files around; it eventually worked, but installing it, even with sudo on my own machine, was hard in 2012, so they should get their act together on that.
+
+557
+01:11:47,399 --> 01:11:56,210
+So I put this together quickly: an overview table that I think kind of covers the main points people care about between the frameworks: the languages, what pretrained models are available, and so on.
+
+558
+01:11:56,210 --> 01:12:11,769
+[student question] The question is whether these support Windows; sorry, but I don't know;
+
+559
+01:12:11,770 --> 01:12:29,359
+I think you're on your own there; whoa, OK: you can use AWS from Windows.
+
+560
+01:12:29,359 --> 01:12:47,029
+OK, so this quick comparison chart between the frameworks covers some of the main bullet points I think people are interested in: which language, whether it has pretrained models, whether the source code is readable, what kinds of parallelism you can get; and I have a few use cases, so let's get our hands
+
+561
+01:12:47,029 --> 01:12:58,619
+dirty: we got through 250 slides and we still have two minutes left, so let's play a little game. Suppose all you wanted to do was extract AlexNet or VGG features: which framework would you choose?
+
+562
+01:12:58,619 --> 01:13:19,189
+Yeah, let's say all we want to do is run AlexNet on some new data... OK, so suppose we want to do image captioning with fine-tuning.
+my thought process;
+
+563
+01:13:22,889 --> 01:13:30,969
+I'm not saying this is the right answer, but the way I'd approach this problem: we need pretrained models, which we saw live in Caffe or Torch or Lasagne;
+
+564
+01:13:30,969 --> 01:13:44,869
+we also need the recurrent part, and even though people have done it, implementing that stuff there is kind of a pain, so I would probably use Torch, or maybe Lasagne. Next: semantic segmentation, where we want to classify every pixel, meaning
+
+565
+01:13:44,869 --> 01:13:57,900
+we read an input image, and instead of giving one label per image, we want to label the entire output image, labeling every pixel independently.
+
+566
+01:13:57,899 --> 01:14:06,800
+Nice; so again my thought process is that we most likely need a pretrained model here, and, as you've heard us talk about, for these kinds of weird use cases you
+
+567
+01:14:06,800 --> 01:14:14,738
+may need to define some layers of your own for parts of the project; so if the layers you need happen to exist in Caffe, that would fit well; otherwise,
+
+568
+01:14:14,738 --> 01:14:24,329
+writing that yourself there looks painful, at least a ten [on the pain scale]... what would you pick for object detection?
+
+569
+01:14:24,329 --> 01:14:38,069
+Yeah, OK, Caffe is an idea. My thought process again is that we're looking at pretrained models, so we need Caffe,
+
+570
+01:14:38,069 --> 01:14:52,529
+Torch, or Lasagne; and we might actually need a lot of custom stuff; it may need funky imperative code that doesn't fit into the computational graph; so one choice is Caffe plus Python layers, but parts of that seem scary to me; and people actually went
+
+571
+01:14:52,529 --> 01:15:06,270
+this route in the spring version we talked about. I've actually worked on a similar project like this, and I chose Torch, and it worked out well for me. Now, if you want to do language modeling, and you want funky intense things, and you like
+
+572
+01:15:06,270 --> 01:15:17,109
+playing around with how recurrent relationships work... yeah, in fact here we wouldn't use Torch at all: if we want to do language modeling with funky kinds of recurrent relationships, and
+
+573
+01:15:17,109 --> 01:15:25,430
+this is all just text, no images, then we don't need any pretrained model, and we really want to play with recurrent relationships easily;
+
+574
+01:15:25,430 --> 01:15:32,570
+probably Theano, since I think it works well for recurrent networks; Theano could be a good choice there. And if you want to implement batch norm:
+
+575
+01:15:32,569 --> 01:15:46,899
+OK, sorry, that slide is right here: if you don't want to derive the gradients yourself, you can rely on these
+
+576
+01:15:46,899 --> 01:15:57,590
+computational graph things like Theano, because of the way those things work; as you saw in the homework, the batch norm gradient can actually simplify quite a bit,
+
+577
+01:15:57,590 --> 01:16:09,489
+though I'm not sure whether these computational graph frameworks will simplify the gradient properly down to that efficient form. [student question]
+
+578
+01:16:09,488 --> 01:16:18,860
+I think the question is how easy it is to convert, say, a Theano model into a Torch model; I think that seems painful, but in
+
+579
+01:16:18,859 --> 01:16:31,748
+Theano at least you can use pretrained models via access to the Lasagne model zoo, and going from a Lasagne model to something else should, I think, theoretically be easy. So here: if you have some really, really
+
+580
+01:16:31,748 --> 01:16:43,300
+good knowledge of exactly how the backward pass should be computed, and you want to implement it more efficiently yourself, then you'd probably use
+
+581
+01:16:43,300 --> 01:16:51,248
+Torch, where you can implement that backward pass yourself. So, asking which framework to recommend: if you want to do feature extraction, or fine-tuning of an existing model, or simple vanilla transfer
+582
+01:16:51,248 --> 01:16:58,649
+learning on a standard task, then Caffe is probably the right way to go; it's really easy to use, and you don't have to write any code to work with pretrained models. But
+
+583
+01:16:58,649 --> 01:17:07,210
+if you're doing weird stuff around pretrained models, then maybe Lasagne or Torch would work better; it's kind of easier there to
+
+584
+01:17:07,210 --> 01:17:14,788
+mess with the structure of a pretrained model. If you really, really want to write your own layers for some reason, and you don't think you
+
+585
+01:17:14,788 --> 01:17:22,948
+can easily fit them into these computational graphs, then you should probably use Torch. If you want to use RNNs really intensely, and maybe other types of things
+
+586
+01:17:22,948 --> 01:17:30,090
+that depend on the computational graph, then maybe talk to Theano. And if you have a giant model, and you need
+
+587
+01:17:30,090 --> 01:17:36,170
+to distribute it across a whole cluster, and you have access to Google's internal codebase, then you should use TensorFlow.
+
+588
+01:17:36,170 --> 01:17:48,810
+As I said, hopefully that part will be released for the rest of us soon too; and also if you're bored waiting around and don't mind things being a bit slow. So that's pretty much all of my overview, my quick whirlwind tour of
+
+589
+01:17:48,810 --> 01:17:58,210
+the frameworks; so any, any last-minute questions? [question about speed]
+
+590
+01:17:58,210 --> 01:18:10,488
+There's actually a really good page out there that benchmarks the speed of all the different frameworks, and right now the one that wins is none of these: it's this thing from Nervana;
+
+591
+01:18:10,488 --> 01:18:22,448
+these guys are crazy: the people who wrote this actually wrote their own custom assembler for the NVIDIA hardware, because they weren't happy with the NVIDIA
+
+592
+01:18:22,448 --> 01:18:35,859
+toolchain; they sort of reverse-engineered the hardware and implemented all the kernels themselves in assembly. So these crazy people actually exist, and their stuff is really, really fast;
+
+593
+01:18:35,859 --> 01:18:47,010
+their stuff is actually the fastest right now; but I've never really used their framework myself, and I think it's a little less common for people to use; with cuDNN, though, the others are all roughly
+
+594
+01:18:47,010 --> 01:18:58,729
+the same speed now; I think TensorFlow is quite a bit slower than the others right now, for what I think are silly reasons that should get cleaned up in subsequent releases; at least there's no fundamental reason why it
+
+595
+01:18:58,729 --> 01:19:07,319
+should be slower than the others. [inaudible] All right.
+
+596
+01:19:07,319 --> 01:19:27,198
+[student question] That's actually not crazy: quite a few teams last year
+
+597
+01:19:27,198 --> 01:19:34,658
+actually used a setup like that for their project and it worked well. Yeah, I should also mention that there are other frameworks out there; I just covered, I think, the most common ones. [question]
+
+598
+01:19:34,658 --> 01:19:52,300
+So the question is about Torch and Python: there's actually a Python bridge for Torch, which is actually kind of nice;
+
+599
+01:19:52,300 --> 01:20:04,899
+you can use Torch from an IPython notebook and actually do some simple things with it in a notebook. My usual practice is that I run my Torch model and it dumps data,
+
+600
+01:20:04,899 --> 01:20:19,359
+as JSON or HDF5, and then I do my visualization in Python; it's a little bit painful, but you can get the job done.
+
+601
+01:20:19,359 --> 01:20:23,310
+The question is whether, for TensorBoard, you can dump the raw data yourself: actually, they're dumping all that stuff into some log
+file in a temporary directory;
+
+602
+01:20:28,300 --> 01:20:45,900
+I'm not sure how easy that format is for people to write themselves, but you could probably try it easily. I'm not sure; so the question is whether there are
+
+603
+01:20:45,899 --> 01:20:58,159
+other third-party tools like TensorBoard for monitoring networks: there might be some out there, but I've never really used them; I've just rolled my own in the past. Other questions?
+
+604
+01:20:58,158 --> 01:21:00,319
+OK, I think that's it.
+
diff --git a/captions/Ko/Lecture13_ko.srt b/captions/Ko/Lecture13_ko.srt
new file mode 100644
index 00000000..8022bfca
--- /dev/null
+++ b/captions/Ko/Lecture13_ko.srt
@@ -0,0 +1,3672 @@
+1
+00:00:00,000 --> 00:00:14,399
+So our administrative point today is that assignment 3 is due tonight, so hopefully that is done; it was hopefully a bit easier than assignment 2,
+
+2
+00:00:14,400 --> 00:00:22,500
+which was meant to give you more time to work on your projects. Remember your project milestones: they were turned in last week, and we haven't returned them yet;
+
+3
+00:00:22,500 --> 00:00:32,289
+we're in the process of looking over the milestones to make sure people are on track, and we should have that done sometime this week or early next week.
+
+4
+00:00:32,289 --> 00:00:43,468
+Last time we had our whirlwind tour of all the common software packages that people use for deep learning, and we saw a lot of code on
+
+5
+00:00:43,469 --> 00:00:53,308
+slides and stepped through code; hopefully you found it useful for your projects. Today we're going to talk about different topics: we'll talk about segmentation, which has two
+
+6
+00:00:53,308 --> 00:01:07,069
+sub-problems, semantic segmentation and instance segmentation, and we'll also talk about soft attention, which again has kind of two buckets that we'll divide things into. But first, before we get into this,
+
+7
+00:01:07,069 --> 00:01:16,769
+there's something different I want to briefly bring up in a bit of detail. This image classification error figure: I think at this point in the class you've seen
+
+8
+00:01:16,769 --> 00:01:29,118
+this kind of figure many times, right: 2012 AlexNet, 2013 ZFNet, crushing it; more recently GoogLeNet and then ResNet helped. That's kind of the classification
+
+9
+00:01:29,118 --> 00:01:41,140
+story on ImageNet through the 2015 challenge, but it turns out that, as of today... this paper came out last night:
+
+10
+00:01:41,140 --> 00:01:55,560
+Google actually now has the state of the art on ImageNet, 3.08% top-5 error, which is crazy; and the way they do this is with this thing called
+
+11
+00:01:55,560 --> 00:02:05,280
+Inception-v4. Here's a bit of this monster; I don't want to go into too much detail, but you can see that this is this really deep network with
+
+12
+00:02:05,280 --> 00:02:14,789
+repeated modules: there's a stem down here, and then these guys above it. A few interesting things to point out about this architecture: they actually
+
+13
+00:02:14,789 --> 00:02:22,229
+use some valid convolutions, meaning there's no padding, which makes all the math more complicated, but they're smart and figured it out;
+
+14
+00:02:22,229 --> 00:02:34,900
+another interesting feature here is that they actually have, in parallel, strided convolutions and also max pooling: they kind of do these two downsampling operations on the image in parallel and then concatenate;
+
+15
+00:02:34,900 --> 00:02:47,518
+another thing is that they're really going all out on efficient convolutions, which we talked about a couple of lectures ago: you can see these asymmetric filters, like seven-by-one and
+
+16
+00:02:47,519 --> 00:02:56,449
+one-by-seven convolutions; they also use a lot of one-by-one convolutions as bottlenecks to reduce the computational cost. And this is just the stem of
+
+17
+00:02:56,449 --> 00:03:07,769
+the network; each of these parts is actually a different flavor of Inception module: seven of one kind of module, then a downsampling module,
+
+18
+00:03:07,769 --> 00:03:16,889
+then these other guys, another downsampling module, then three of these, and finally dropout and a fully connected layer for the class labels. Another thing to point out:
+
+19
+00:03:16,889 --> 00:03:29,320
+there isn't really any fully connected hidden layer here; at the end it just computes a global average of the final feature map. Another cool thing they did in this paper is
+
+20
+00:03:29,319 --> 00:03:43,950
+that they also propose a residual version of Inception: again, the stem is a pretty big and scary architecture, same as before, and now these repeated Inception blocks, repeated
+
+21
+00:03:43,949 --> 00:03:55,609
+through the network, actually have these residual connections: that kind of jump is that residual idea. So again they improved the state of the art on ImageNet; there are many repeated
+
+22
+00:03:55,610 --> 00:04:07,939
+modules, and when you add this whole thing up it's about 75 layers, if I did the math right. They also showed last night, between their new
+
+23
+00:04:07,939 --> 00:04:17,079
+Inception-v4, the new version of GoogLeNet, and the residual version, that both actually perform about the same; so
+
+24
+00:04:17,079 --> 00:04:28,070
+now in this plot you can see the top-5 error on ImageNet as a function of epoch; the residual version actually converges a bit faster, but
+
+25
+00:04:28,069 --> 00:04:38,340
+they all sort of converge to about the same value over time, which is kind of interesting. Another interesting thing to
+
+26
+00:04:38,339 --> 00:04:55,470
+point out in this paper: on the x-axis here, these things are being trained for one hundred sixty epochs on ImageNet, and that's a lot of training time. OK, that's enough current events; back to our regularly scheduled
+
+27
+00:04:55,470 --> 00:05:11,790
+programming. So today... oh yeah, question. [question] I don't know; it may be in the paper, but I didn't read it carefully. Another question, about dropout:
+
+28
+00:05:11,790 --> 00:05:21,620
+again, I'm not sure whether it's only at the last layer; I haven't read the paper carefully yet, but the link is right here, so you should check it out.
+
+29
+00:05:21,620 --> 00:05:37,490
+OK, so today we're going to talk about two kinds of topics: segmentation, which is this kind of classic computer vision topic that's common in research these days, and this idea of
+
+30
+00:05:37,490 --> 00:05:50,889
+attention, which I think has become a really popular thing to work on over the last year, especially with deep learning. So first we're going to talk about segmentation. You might remember this slide from a few
+
+31
+00:05:50,889 --> 00:06:03,750
+lectures ago, when we talked about object detection: there are different tasks people work on in computer vision; we spent a lot of class time back in lecture talking about classification, and we talked about different models for
+
+32
+00:06:03,750 --> 00:06:12,239
+localization and object detection; but today we're actually going to focus on this idea of segmentation that we previously skipped over.
+
+33
+00:06:12,240 --> 00:06:21,870
+Within segmentation there are kind of two different sub-tasks that people actually work on, and we need to define them
+
+34
+00:06:21,870 --> 00:06:32,370
+a little bit separately. The first task is this idea called semantic segmentation. Here we have an input image, some picture with
+
+35
+00:06:32,370 --> 00:06:42,629
+some kind of building and trees and ground and a cow, that kind of thing; and you generally want semantic labels from some small fixed number of classes,
+
+36
+00:06:42,629 --> 00:06:51,360
+usually with some background class for things that don't fit into those classes. Then the task is: we take the input image,
+
+37
+00:06:51,360 --> 00:06:59,850
+and we want to label every pixel of that image using one of these semantic classes. So here we've taken this input image of these three cows in a field,
+
+38
+00:06:59,850 --> 00:07:11,228
+and the ideal output, instead of an image of RGB values, is an image where we actually have one class label per pixel; and maybe for this and other images we can segment
+
+39
+00:07:11,228 --> 00:07:19,950
+the grass from the trees and the sky and the road. So this type of task is pretty cool; I think it gives you a much higher-level understanding of what's
+
+40
+00:07:19,949 --> 00:07:28,668
+going on in the image, compared to just putting a single label on the whole image. This is actually a very old problem in computer vision,
+
+41
+00:07:28,668 --> 00:07:37,259
+and this figure kind of predates the deep learning revolution in computer vision: it actually comes from a paper from 2007 that didn't use any deep learning at
+
+42
+00:07:37,259 --> 00:07:48,949
+all; people had other methods that worked on this task a few years ago. One thing to point out here is that this task
+
+43
+00:07:48,949 --> 00:07:58,329
+has no notion of instances: this image actually has, right here, some cows standing and one cow lying
+
+44
+00:07:58,329 --> 00:08:07,560
+down just taking a nap in the grass, and in this output it's really not clear how many cows there are: different cows actually have overlapping pixels here,
+
+45
+00:08:07,560 --> 00:08:15,480
+and in the output there's no notion that they are different cows. So by just labeling every pixel, the output misses some information
+
+46
+00:08:15,480 --> 00:08:23,409
+that you might like, and that can actually cause some problems for some downstream applications. To overcome this,
+
+47
+00:08:23,410 --> 00:08:32,039
+people have worked separately on this problem called instance segmentation, which is also sometimes called simultaneous detection and segmentation.
+
+48
+00:08:32,039 --> 00:08:43,370
+So here the problem is: we again have some fixed set of classes from beforehand that we're trying to recognize; we receive an input image, and we want to output,
+
+49
+00:08:43,370 --> 00:08:52,970
+for each instance of each of these classes, a segmentation of the pixels of the input image that belong to that instance. So in this input image
+
+50
+00:08:52,970 --> 00:09:05,279
+there are two parents and a child, so actually three different people, and in the output we actually distinguish the different people from the input image: those three people are now shown in different colors
+
+51
+00:09:05,279 --> 00:09:14,009
+marking the different instances, and again for each of those instances we label every pixel of the input image that belongs to that instance. So
+
+52
+00:09:14,009 --> 00:09:22,409
+those are semantic segmentation and instance segmentation, and people have worked on these two tasks a little bit separately. So first we'll talk
+
+53
+00:09:22,409 --> 00:09:30,399
+about some models for semantic segmentation; remember, this is the task where you want to label every pixel of the image and you don't care
+
+54
+00:09:30,399 --> 00:09:43,269
+about instances. So the idea here is actually very simple: given some input image (this one has a cow), we'll take some little patch of
+
+55
+00:09:43,269 --> 00:09:53,340
+the input image and extract this patch, which gives kind of the local information in the image; we'll take this patch and feed it through some
+
+56
+00:09:53,340 --> 00:10:04,890
+convolutional neural network, one of these architectures we've talked about in class by now, and this convolutional network will classify the center pixel of the
+
+57
+00:10:04,889 --> 00:10:14,379
+patch. So this neural network is just doing classification, something we know how to do: it would just say that the center pixel of this patch
+
+58
+00:10:14,379 --> 00:10:26,019
+is actually a cow. Then you can imagine taking this network that operates on patches and labels the center pixel, and just running it over the entire image;
+
+59
+00:10:26,019 --> 00:10:33,269
+that actually gives us a label for every pixel of the image, but this would be very
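The patch-classification idea just described can be sketched in a few lines of Python; `classify_patch` below is a hypothetical stand-in for a trained CNN classifier, and the nested loop makes the cost of the naive approach easy to see:

~~~python
import numpy as np

def classify_patch(patch):
    # Hypothetical: a real system would run a trained CNN here and
    # return a class index for the patch's center pixel.
    return int(patch.mean() > 0)

def segment(image, k=11):
    """Label every pixel by classifying the k x k patch around it."""
    H, W = image.shape
    pad = k // 2
    padded = np.pad(image, pad, mode='reflect')
    labels = np.zeros((H, W), dtype=np.int64)
    for i in range(H):          # one classifier call per pixel:
        for j in range(W):      # correct, but very expensive
            labels[i, j] = classify_patch(padded[i:i + k, j:j + k])
    return labels

labels = segment(np.random.randn(32, 32))
~~~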
샘플링 + +40 +00:03:07,769 --> 00:03:11,599 + 이 사람하고 다른 다운 샘플링 모듈과의 다음 3 개의 + +41 +00:03:11,599 --> 00:03:16,889 + 다음이 사람과 마침내 그들은 중퇴하고 완전히 은신처를 연결 + +42 +00:03:16,889 --> 00:03:20,919 + 지적하는 또 다른 것은 그래서 클래스 레이블 다시의 어떤 종류가 없습니다 + +43 +00:03:20,919 --> 00:03:24,859 + 완전히 연결 히틀러의 이곳은 단지 계산이 세계 평균이 + +44 +00:03:24,860 --> 00:03:29,320 + 마지막 특징 벡터 그들이 본 논문에서 한 또 다른 멋진 일이었다 + +45 +00:03:29,319 --> 00:03:34,900 + 처음 거주자들은 셉션이 잔류 버전을 제안 할 수 있도록 + +46 +00:03:34,900 --> 00:03:39,579 + 또한 줄기 꽤 크고 무서운 아키텍처는 이전과 동일 + +47 +00:03:39,579 --> 00:03:43,950 + 이제 이러한 잔기는 반복 개시 블록 반복 + +48 +00:03:43,949 --> 00:03:48,289 + 네트워크를 통해 그들은 실제로 이러한 잔류 연결 그래서이 + +49 +00:03:48,289 --> 00:03:51,409 + 즉, 그 종류의이 잔류 생각에 점프의 그 종류를 냉각 것입니다 + +50 +00:03:51,409 --> 00:03:55,609 + 지금하지 그래서 다시 그들은 많은 반복이있는 아트 이미지의 상태를 개선 + +51 +00:03:55,610 --> 00:04:00,880 + 모듈 및 모든 업이 일을 추가 할 때 내가했던 가정에 대한 7875 층의 + +52 +00:04:00,879 --> 00:04:07,939 + 수학은 바로 그렇게 그들은 또한 마지막 밤을 표시하는 새로운 자신의 종류의 사이 + +53 +00:04:07,939 --> 00:04:12,680 + 처음하지만 인 셉션 Google지도 및 잔류의 새로운 버전 4 + +54 +00:04:12,680 --> 00:04:17,079 + 실제로 모두 동일한 대해 수행 구글지도 버전이므로 + +55 +00:04:17,079 --> 00:04:22,909 + 지금 당신이 볼 수있는이 이미지에 신 (新) 시대의 함수로 진정한 상위 5 공기를하다 + +56 +00:04:22,910 --> 00:04:28,070 + 발단 네트워크와 실제로 빠른 비트를 수렴 읽을 수는 있지만 + +57 +00:04:28,069 --> 00:04:33,180 + 같은 값에 대한 일종의 대화 그들 모두에게 시간에 너무 + +58 +00:04:33,180 --> 00:04:38,340 + 그 가지 종류의 다른 일을 멋진 흥미로운의 종류 있다는 + +59 +00:04:38,339 --> 00:04:42,369 + 지적 흥미로운이 문서는 여기에서 x 축에 원료 수있다 + +60 +00:04:42,370 --> 00:04:46,030 + 이들은 지금 이러한 일들이 백을 위해 훈련중인 이미지에 수두 있습니다 + +61 +00:04:46,029 --> 00:04:52,089 + 그리고 그건 육십 파키스탄 이미지 그물 그래서 그 훈련에 많은 시간이다하지만 그건 + +62 +00:04:52,089 --> 00:04:55,469 + 즉, 현재 이벤트의 충분과의 정기적으로 예약 된 우리로 돌아 가자 + +63 +00:04:55,470 --> 00:05:02,710 + 프로그래밍은 그래서 오늘 오, 그래 질문에 나는 그것이 될 것 같아요 모른다 + +64 +00:05:02,709 --> 00:05:11,789 + 종이 그러나 나는주의 깊게 읽어하지 않았고 러시아의 다른 질문에 드롭 수 + +65 +00:05:11,790 --> 00:05:16,600 + 이 마지막 층에서만인지 잘 모르겠어요에서 다시 난을 읽어 보지 않았 + +66 +00:05:16,600 --> 00:05:21,620 + 종이에 조심스럽게 아직하지만 링크가 당신이 그것을 확인해야합니다 여기 있어요 + +67 +00:05:21,620 --> 00:05:29,600 + 확인 그래서 오늘 우리는 두 개의 다른 주제의 종류에 대해 얘기하는거야 + +68 +00:05:29,600 --> 00:05:33,970 + 그래서 그 생각 일반적인 것들과 연구 요즘 분할이다 + +69 +00:05:33,970 --> 00:05:37,490 + 이는 고전적인 컴퓨터 비전 주제의이 종류도 이런 생각입니다 + +70 +00:05:37,490 --> 00:05:41,550 + 내가 생각하는주의는 정말로 일하기 정말 인기있는 일이있다한다 + +71 +00:05:41,550 --> 00:05:46,060 + 우리가 거​​ 얘기있어 특히 첫 있도록 지난 한 해 동안 깊은 애도의 + +72 +00:05:46,060 --> 00:05:50,889 + 분할에 대해 당신은 몇에서이 슬라이드를 기억했을 수 있도록 + +73 +00:05:50,889 --> 00:05:53,649 + 강의 전에 우리에 대해 얘기했다 물체 검출에 대해 이야기 + +74 +00:05:53,649 --> 00:05:58,000 + 사람들이 컴퓨터 비전에서 일을하고 우리가 많이 소비 다른 작업 + +75 +00:05:58,000 --> 00:06:02,259 + 수업 시간 강의에서 다시 분류에 대해 이야기하고 우리 + +76 +00:06:02,259 --> 00:06:03,750 + 다른 모델에 대해 이야기 + +77 +00:06:03,750 --> 00:06:08,339 + 현지화 및 물체 감지하지만 오늘 우리는에 실제로거야 초점을거야 + +78 +00:06:08,339 --> 00:06:12,239 + 우리가 이전에 마지막으로 시간이 지남에 따라 생략 분할이 생각 + +79 +00:06:12,240 --> 00:06:18,189 + 두 개의 서로 다른 일부 작업의 종류 거기에 분할 이내에 강의하는 우리 + +80 +00:06:18,189 --> 00:06:21,870 + 우리가 정의 할 필요가 사람들이 실제로 이런 일에 작동하는지 확인 필요 + +81 +00:06:21,870 --> 00:06:26,389 + 조금 별도로 첫 번째 작업은 의미 분할이라는 생각이다 + +82 +00:06:26,389 --> 00:06:32,370 + 그래서 여기에 우리는 우리가 입력 영상이 끝을 가지고 우리의 일부 사진 번호가 + +83 +00:06:32,370 --> 00:06:38,000 + 어떤 종류의 건물과 나무와 지상 및 암소와 같은 종류의 것 + +84 +00:06:38,000 --> 00:06:42,629 + 당신은 일반적으로 원하는 의미 라벨은 또한 클래스의 몇 가지 작은 고정 번호가 + +85 +00:06:42,629 --> 00:06:46,199 + 일반적으로는 적합하지 않다 최초의 것들에 대한 몇 가지 배경 클래스를해야합니다 + +86 +00:06:46,199 --> 00:06:51,360 + 이러한 클래스에 다음 작업은 우리가 입력 인치로 수행 할 것입니다 + +87 +00:06:51,360 --> 00:06:55,240 + 그리고, 우리는 
이러한 의미 중 하나를 사용하여 해당 이미지의 모든 픽셀에 라벨을 할 + +88 +00:06:55,240 --> 00:06:59,850 + 우리가 현장에서 이러한 세 소의이, 입력 화상을 촬영 한 여기에서 매우 클래스 + +89 +00:06:59,850 --> 00:07:05,490 + 그리고 이상적인 출력 대신 인 RGB 값이 이미지는 우리가 실제로 + +90 +00:07:05,490 --> 00:07:11,228 + 우리는 이것과 다른 이미지와 아마 세그먼트를 할 수있는 픽셀 당 하나의 클래스 레이블이 + +91 +00:07:11,228 --> 00:07:16,789 + 나무와 하늘과 도로에서 잔디 때문에 작업이 유형의 예쁜 + +92 +00:07:16,790 --> 00:07:19,950 + 멋진 나는 당신에게 무엇의 이해의 높은 수준의 종류를 제공합니다 생각 + +93 +00:07:19,949 --> 00:07:23,029 + 단지 전체에 단일 레이블을 넣어 비교 이미지에서 진행 + +94 +00:07:23,029 --> 00:07:28,668 + 영상이 너무 이것 실제로 컴퓨터 비전에서 아주 오래 된 문제 + +95 +00:07:28,668 --> 00:07:32,649 + 이 그림은 실제로 온다, 그래서 그 깊은 학습 혁명의 종류를 선행 + +96 +00:07:32,649 --> 00:07:37,259 + 컴퓨터 비전은 2007 년 문고에서 그 어떤 깊은 학습을 사용하지 않은 + +97 +00:07:37,259 --> 00:07:43,728 + 모든 사람에서 다른에게 몇 년 전에 작업을이 다른 방법을 가지고 그 + +98 +00:07:43,728 --> 00:07:48,949 + 사람들은이 일을 여기서 지적하는 것은이 때문이다 바로 그래서입니다 작업 + +99 +00:07:48,949 --> 00:07:54,310 + 일이이 이미지는 실제로이 집 또는 네 그래서 여기에 인스턴스를 인식하지 못합니다 + +100 +00:07:54,310 --> 00:07:58,329 + 실제로 좀 누워 서 세 소와 한 소 소 거기 + +101 +00:07:58,329 --> 00:08:02,300 + 이 출력 여기에 낮잠 만 복용 잔디는 정말 분명하지 않다 얼마나 많은 + +102 +00:08:02,300 --> 00:08:07,560 + 소는 서로 다른 소가 픽셀 그래서 여기에 겹쳐 실제로있다 + +103 +00:08:07,560 --> 00:08:11,540 + 가 없다는 출력이 다른 소 있다는 것을 + +104 +00:08:11,540 --> 00:08:15,480 + 그것은 어쩌면 같은 정보를하지 그래서 우리가 모든 픽셀을 라벨링하고 출력을 그리워 + +105 +00:08:15,480 --> 00:08:20,009 + 당신이 좋아할 것하고 실제로 일부에 대한 몇 가지 문제가 발생할 수 있으므로 + +106 +00:08:20,009 --> 00:08:23,409 + 그것은 이것을 극복 그래서 다운 스트림 응용 프로그램 + +107 +00:08:23,410 --> 00:08:28,080 + 사람들은 개별적으로 인스턴트 분할이라는이나 문제에 작동 한 + +108 +00:08:28,079 --> 00:08:32,039 + 이것은 또한 때때로 동시 탐지 및 분할 호출되는 + +109 +00:08:32,039 --> 00:08:37,879 + 그래서 여기에 문제는 우리가 클래스의 일부 세트가 이전에 어딘가에서 다시가요 + +110 +00:08:37,879 --> 00:08:43,370 + 그 인식하려고 우리가 모든 출력을 원하는 입력 영상을 받았다 + +111 +00:08:43,370 --> 00:08:48,370 + 이러한 클래스의 각 인스턴스에 대한 인스턴스 우리는 픽셀 밖으로 세그먼트를 원하는 + +112 +00:08:48,370 --> 00:08:52,970 + 그가이 입력 이미지 때문에 여기에 해당 인스턴스에 속하는 + +113 +00:08:52,970 --> 00:08:57,509 + 지금의 두 부모와 아이 실제로 세 가지 다른 사람들이있다 + +114 +00:08:57,509 --> 00:09:00,860 + 우리가 실제로에서와 다른 사람들을 구분 출력 + +115 +00:09:00,860 --> 00:09:05,279 + 그 세 사람들이 지금 서로 다른 색으로 표시되는 입력 영상 + +116 +00:09:05,279 --> 00:09:09,360 + 다른 인스턴스를 표시하고 다시 해당 인스턴스 각각에 대해 우리는거야 + +117 +00:09:09,360 --> 00:09:14,009 + 해당 인스턴스에 속하는 입력 이미지의 모든 픽셀을 레이블을 너무 + +118 +00:09:14,009 --> 00:09:18,639 + 의미 분할 실제로 인스턴트 분할 사람들이 두 작업 + +119 +00:09:18,639 --> 00:09:22,409 + 조금 별도로에서 일한 우리가 거​​ 얘기 야 처음 그래서 + +120 +00:09:22,409 --> 00:09:27,269 + 의미 론적 분할에 대한 일부 모델에 대해 그래서 이것은 기억 + +121 +00:09:27,269 --> 00:09:30,399 + 당신을위한 작업은 이미지의 모든 픽셀에 레이블을 원하는 당신은 상관 없어 + +122 +00:09:30,399 --> 00:09:38,230 + 그래서 여기에 인스턴스에 대한 생각은 일부 입력 주어진 실제로 매우 간단하다 + +123 +00:09:38,230 --> 00:09:43,269 + 이미지 이것은 우리가거야 소와의 마지막은 일부 작은 패치를 가지고있다 + +124 +00:09:43,269 --> 00:09:48,720 + 입력 이미지와는 종류의 현지 정보를 제공하는이 패치를 추출 + +125 +00:09:48,720 --> 00:09:53,340 + 이미지는 우리는거야이 패치를 가지고 우리는 몇 가지를 통해거야 공급을거야 + +126 +00:09:53,340 --> 00:09:57,230 + 콘벌 루션 신경망이 아키텍처 중 우리가된다는 것입니다 수 + +127 +00:09:57,230 --> 00:10:01,070 + 지금까지 클래스 이제이 이야기 + +128 +00:10:01,070 --> 00:10:04,890 + 콘벌 루션 신경망 실제로 중심 화소 A를 분류 할 + +129 +00:10:04,889 --> 00:10:10,080 + 패치 그래서이 신경 네트워크는 우리가 알고있는 뭔가 저스틴 분류입니다 + +130 +00:10:10,080 --> 00:10:14,379 + 그래서이 일을 그냥 말 것입니다 수행하는 방법이 파견의 중심 픽셀 + +131 +00:10:14,379 --> 00:10:19,769 + 실제로 우리가 작동이 네트워크를 복용 상상할 수있는 것보다 소입니다 + +132 +00:10:19,769 --> 00:10:20,710 + 패치 + +133 +00:10:20,710 --> 00:10:26,019 + 및 중앙 픽셀 레이블 그리고 우리는 단지 전체 이미지에 걸쳐 것을 실행 + +134 +00:10:26,019 --> 00:10:33,269 + 이 실제로 매우 그래서 우리에게 이미지의 각 픽셀에 대한 레이블을 제공합니다 
+ +135 +00:10:33,269 --> 00:10:36,699 + 이제 이미지의 많은 많은 패치를 거기에 바로 있기 때문에 비용이 많이 드는 작업 + +136 +00:10:36,700 --> 00:10:40,120 + 그리고 그것은 모두 독립적으로이 네트워크를 실행하는 슈퍼 슈퍼 비싼 것 + +137 +00:10:40,120 --> 00:10:44,139 + 그들의 연습 사람들이 우리가 물체를 보았다 같은 트릭을 사용 그렇게 + +138 +00:10:44,139 --> 00:10:48,639 + 완전히 길쌈 II를이 일을 실행하고거야 검출 모든 + +139 +00:10:48,639 --> 00:10:54,220 + 한 번에 전체 이미지 그러나 여기에서 문제에 대한 출력은 당신이 인 경우이다 + +140 +00:10:54,220 --> 00:10:58,879 + 컨볼 루션 네트워크 풀링에 또는 어느 샘플링 다운의 종류를 포함 + +141 +00:10:58,879 --> 00:11:02,899 + 다음 이제 출력 출력 이미지 것 스트라이커 회선을 통해 + +142 +00:11:02,899 --> 00:11:07,139 + 그 그건 그래서 실제로 작은 공간 크기와 입력 이미지가 + +143 +00:11:07,139 --> 00:11:09,929 + 그들은 이러한 유형의를 사용할 때 사람들이 주위에 일을해야 할 일 + +144 +00:11:09,929 --> 00:11:14,629 + 접근의 의미에 대한 기본 설정의이 종류에 해당하므로, 어떠한 질문 + +145 +00:11:14,629 --> 00:11:28,208 + 분할 그래 + +146 +00:11:28,208 --> 00:11:32,979 + 문제는 팻 팻 팻 옳은 일을 그냥 충분히 제공하지 않습니다 여부 + +147 +00:11:32,980 --> 00:11:37,800 + 어떤 경우에는 정보와 그 다음 이러한 위해 이렇게 때때로 진실 + +148 +00:11:37,799 --> 00:11:41,688 + 네트워크 사람들은 실제로 그들이 가지고 별도의 오프라인 정제 단계를 + +149 +00:11:41,688 --> 00:11:44,980 + 다음이 출력은 최대 청소 그래픽 모델의 일종으로 공급 + +150 +00:11:44,980 --> 00:11:48,028 + 밀어 도움이 될 수 있도록 때때로 출력 조금 침대를 정리하여 + +151 +00:11:48,028 --> 00:11:52,838 + 입력 - 출력 모델에 대한 좀 더 나은 성능을하지만 공의에 텐트를 설정 + +152 +00:11:52,839 --> 00:12:09,600 + 그냥 그래 난 당신이 아니에요 필요 해요 구현하기 쉬운 무언가로 꽤 잘 작동 + +153 +00:12:09,600 --> 00:12:13,019 + 확실히 나는 정확히 아마 꽤 큰 어쩌면 몇 백 200 모르겠어요 + +154 +00:12:13,019 --> 00:12:19,919 + 크기 때문에 하나의 확장의 순서로하는 사람들이 사용했다고하는 것이 픽셀 + +155 +00:12:19,919 --> 00:12:23,289 + 기본적인 접근 방식은 실제로 때때로 다중 스케일 시험이 좋습니다 + +156 +00:12:23,289 --> 00:12:28,230 + 단일 규모는 우리가 우리의 입력 이미지를 데리고가는 것하고 그래서 여기 충분하지 않습니다 + +157 +00:12:28,230 --> 00:12:33,009 + 이 공통 트릭의 일종이다, 그래서 실제로는 여러 다른 크기로 크기를 조정 + +158 +00:12:33,009 --> 00:12:36,688 + 사람들이 컴퓨터 비전에 사용하는 것이 많은 이미지 피라미드 방금 걸릴라고 + +159 +00:12:36,688 --> 00:12:41,458 + 같은 차원과 당신은 지금 많은 다른 스케일을 만든 및 크기 조정 + +160 +00:12:41,458 --> 00:12:44,528 + 이러한 비늘 각각은 컨볼 루션 신경망을 통해 실행 된 거 + +161 +00:12:44,528 --> 00:12:49,568 + 그 다음 사진을 보호하는 것입니다 서로 다른 이미지의 현명한 라벨은 + +162 +00:12:49,568 --> 00:12:52,969 + 이러한 서로 다른 해상도의 너무 다른 점은 여기에 따라 지적합니다 + +163 +00:12:52,970 --> 00:12:56,249 + 질문의 라인이 각각의 네트워크가 실제로 동일한 경우 해당 + +164 +00:12:56,249 --> 00:12:59,639 + 아키텍처는 이러한 출력은 각각 다른 영향을 미칠 것입니다 + +165 +00:12:59,639 --> 00:13:04,490 + 그래서 지금 우리가 입수 한 것을 입력 2222 이미지 피라미드 수용 필드 + +166 +00:13:04,490 --> 00:13:08,720 + 우리는 모두를 취할 수있는 것보다 의도에 대해이 다른 크기의 픽셀 라벨 + +167 +00:13:08,720 --> 00:13:13,660 + 그들과 리사와 그냥 그 응답을 샘플링하는 일부 오프라인 업 샘플링을 + +168 +00:13:13,659 --> 00:13:18,129 + 그래서 지금 우리는 우리의 3 개의 출력을 쪘 입력 이미지와 동일한 크기로 + +169 +00:13:18,129 --> 00:13:24,319 + 다른 샘플을 크기 및 그들과이 글이 실제로 종이를 쌓아 + +170 +00:13:24,318 --> 00:13:29,139 + 다시 2013 년 검둥이에서 그래서 그들은 실제로이 별도의이 + +171 +00:13:29,139 --> 00:13:33,709 + 오프라인 처리 STAP 그들은 상향식 (bottom-up) 분할이 생각하는 곳 + +172 +00:13:33,708 --> 00:13:39,119 + 이러한 이상이 일종의 그래서 이러한 슈퍼 픽셀 방법을 사용하여 권한을 사용하여 + +173 +00:13:39,120 --> 00:13:41,370 + 고전 컴퓨터 비전 화상 처리 유형 + +174 +00:13:41,370 --> 00:13:45,470 + 실제로 인접 픽셀 사이의 차이를 보면 방법 + +175 +00:13:45,470 --> 00:13:48,589 + 다음 둘을 병합하려고 이미지는 당신이 일관성을 제공합니다 + +176 +00:13:48,589 --> 00:13:52,900 + 그럼 실제로이 방법 이미지에별로 변화가있는 지역 + +177 +00:13:52,899 --> 00:13:56,519 + 소요 종류의 이러한 다른 전통적인을 통해 이미지를 오프라인으로 실행 + +178 +00:13:56,519 --> 00:14:02,230 + 이미지 처리 기술은 슈퍼 픽셀 또는 나무의 세트를 얻을 수 + +179 +00:14:02,230 --> 00:14:06,629 + 화소 말은 이미지에 병합되어야한다 그리고 그들은이 있습니다 + +180 +00:14:06,629 --> 00:14:09,519 + 이러한 모든 다른 일을 병합하는 다소 복잡한 과정 + +181 +00:14:09,519 --> 00:14:13,028 + 이제 우리는 낮은 수준의 정보 말 이런 종류의 쪘 유발하는 + +182 
+00:14:13,028 --> 00:14:14,110 + 이미지의 픽셀 + +183 +00:14:14,110 --> 00:14:18,909 + 실제로 색상과 좋은 정보의 종류에 따라 서로 유사하다 + +184 +00:14:18,909 --> 00:14:22,439 + 우리는 길쌈에서 서로 다른 해상도의 이러한 출력을 가지고있어 + +185 +00:14:22,440 --> 00:14:25,810 + 신경망 레이블이 다른 지점에있다 의미 무엇인지 우리에게 이야기 + +186 +00:14:25,809 --> 00:14:29,929 + 이미지에 그들은 실제로에 대한 몇 가지 아이디어를 탐구 할 수 사용 + +187 +00:14:29,929 --> 00:14:33,870 + 함께이 일을 합병하는 것은 당신에게 당신의 마지막을 어떻게이 사실은 알아주는 + +188 +00:14:33,870 --> 00:14:38,419 + 나는 갈등에 대한 이전 질문 중 하나가 해결 될 때 또한 응답 + +189 +00:14:38,419 --> 00:14:43,809 + 그래서 이러한 외부 읽기 슈퍼 픽셀의 방법 또는를 사용하여 자체적으로 충분히있는 + +190 +00:14:43,809 --> 00:14:47,729 + 분할 나무 종류의 당신에게 추가 정보를 제공하는 또 다른 일이 + +191 +00:14:47,730 --> 00:14:55,649 + 입력 이미지에 대한 어쩌면 더 큰 문맥이 약하므로, 어떠한 질문 + +192 +00:14:55,649 --> 00:15:03,879 + 확인 모델을 다른 사람들이 의미에 사용되는 것을 멋진 아이디어를 다른 종류의 이렇게 + +193 +00:15:03,879 --> 00:15:08,299 + 이에 분할은 반복 정제의이 아이디어는 우리가 실제로보고있다 + +194 +00:15:08,299 --> 00:15:12,809 + 우리가 얘기 할 때이 몇 강의 전에 언급 한 많은 추정을 제기하지만, + +195 +00:15:12,809 --> 00:15:17,149 + 아이디어는 우리가거야 그들이 밖으로 분리 여기에 입력 된 이미지를 가지고 있다는 것입니다 + +196 +00:15:17,149 --> 00:15:20,929 + 세 가지 채널 그리고 우리는 우리의 마음에 드는 종류에 대한 그 일 거 실행있어 + +197 +00:15:20,929 --> 00:15:24,929 + 콘벌 루션 신경망이 저해상도 패치를 예측할 + +198 +00:15:24,929 --> 00:15:30,309 + 오히려 이미지의이없는 해상도 분할을 예측하고 지금 우리는있어 + +199 +00:15:30,309 --> 00:15:34,899 + 거야의 다운 샘플링 된 버전과 함께 CNN에서 해당 출력을 + +200 +00:15:34,899 --> 00:15:38,829 + 우리가이 과정을 반복합니다 및 원본 이미지를 다시 때문에이 허용 + +201 +00:15:38,830 --> 00:15:43,990 + 네트워크는 출력의 유효 수용 필드를 증가 포도주를 정렬하려면 + +202 +00:15:43,990 --> 00:15:48,399 + 또한 수행하거나 처리를의 입력 이미지를 그리고, 우리는 할 수있는 + +203 +00:15:48,399 --> 00:15:54,009 + 이 좀 멋진 그래서 다시이 과정을 반복이 세 가지를 그렇다면 + +204 +00:15:54,009 --> 00:15:54,769 + 길쌈 + +205 +00:15:54,769 --> 00:15:58,249 + 네트워크는 실제로 다음이 재발 길쌈가된다 가중치를 공유 + +206 +00:15:58,249 --> 00:16:03,489 + 네트워크 어디 종류의 전복 시간에 동일한 입력으로 작동하지만, + +207 +00:16:03,489 --> 00:16:07,528 + 실제로 이러한 업데이트 단계는 각각의 전체 컨벌루션 네트워크는 + +208 +00:16:07,528 --> 00:16:10,139 + 실제로 매우 유사한 아이디어는 우리가 본 네트워크를 재발하기 + +209 +00:16:10,139 --> 00:16:18,789 + 이전에 2014 년에 있었던이 논문의 뒤에 아이디어는 것입니다 당신이 경우 + +210 +00:16:18,789 --> 00:16:22,558 + 실제로는 수 잘하면 것은 동일한 유형의 더 많은 반복을 수행 + +211 +00:16:22,558 --> 00:16:28,219 + 네트워크는 일종의 반복적 그래서 여기에 우리가있는 경우 그 출력을 수정하기 + +212 +00:16:28,220 --> 00:16:32,220 + 이 원시 입력 이미지는 한 세대 후에는 실제로 볼 수 있습니다 + +213 +00:16:32,220 --> 00:16:35,959 + 특히 객체의 경계에 있지만 같은 소음이 꽤가있다 + +214 +00:16:35,958 --> 00:16:39,359 + 우리는이 재발 길쌈을 통해 둘, 셋, 반복에 대해 실행 + +215 +00:16:39,360 --> 00:16:42,769 + 네트워크 실제로는 그런 종류의 많은 정리하기 위해 네트워크를 할 수 있습니다 + +216 +00:16:42,769 --> 00:16:46,989 + 낮은 수준의 불쾌 및 생산 훨씬 청소기 훨씬 깨끗하고 더 멋진 결과 + +217 +00:16:46,989 --> 00:16:51,119 + 그래서 나는 함께 이러한 병합의 종류 아주 아주 멋진 아이디어라고 생각했다 + +218 +00:16:51,119 --> 00:16:55,199 + 재발 네트워크의 아이디어와이 아이디어를 시간이 지남에 따라 가중치를 공유 + +219 +00:16:55,198 --> 00:17:03,479 + 컨볼 루션 네트워크는 아주 잘 매우 광범위하게 다른 그래서 이미지를 다른를 처리하는 + +220 +00:17:03,480 --> 00:17:07,470 + 의미 론적 분할에 매우 매우 잘 알려진 논문은 버클리에서이 하나입니다 + +221 +00:17:07,470 --> 00:17:12,419 + 그는 CBP에 게시 된 우리의 지난해 그래서 여기에 그것은 매우 유사한 모델이다 + +222 +00:17:12,419 --> 00:17:16,850 + 우리는 입력 영상을 가지고 어떤 수를 통해 실행하는거야 전에 + +223 +00:17:16,849 --> 00:17:22,259 + 회선은 결국 화소하지만 일부 일부 기능지도를 추출 + +224 +00:17:22,259 --> 00:17:26,638 + 모든 하드 코딩 된 업 샘플링 이런 종류의에 의존하는 기존 방법 대비 + +225 +00:17:26,638 --> 00:17:31,138 + 실제로 에너지하지만이에 최종 분할을 생산하는 + +226 +00:17:31,138 --> 00:17:34,668 + 종이 그들은 잘 우리는 우리가 우리가 원하는 깊은 학습 사람들이야이야 제안 + +227 +00:17:34,669 --> 00:17:39,149 + 우리가 거​​ 네트워크의 일환으로 업 샘플링을 배울 수있는, 그래서 모든 것을 배울 그래서 + +228 +00:17:39,148 --> 00:17:43,298 + 그들은없는 일이 마지막 층이 최대 학습 가능 샘플링에서이 
포함되어있어 + +229 +00:17:43,298 --> 00:17:50,798 + 이 실제로 최대 샘플 때문에 학습 가능 방식으로 기능지도 예 + +230 +00:17:50,798 --> 00:17:55,179 + 그들은 마지막에 샘플링지도와 길을 자신의 모델 종류까지왔다 + +231 +00:17:55,179 --> 00:17:59,940 + 그들은 그것을하지 그래서 그들은이 알렉스 된 시간에이를 것으로되어 보이는 그들의 + +232 +00:17:59,940 --> 00:18:04,090 + 회선 및 당기 및 여러 단계의 실행 입력 영상 + +233 +00:18:04,089 --> 00:18:08,028 + 결국 그들은 그들이 가지고있는이 풀 (5) 출력에서​​ 생산 꽤 + +234 +00:18:08,028 --> 00:18:12,048 + 샘플 이미지 아래 특별한 크기 샘플링 상당히 아래로 입력 화상과 비교하여 + +235 +00:18:12,048 --> 00:18:16,999 + 다음까지 학습 가능 샘플링이 다시에 샘플을 그들을 reup있다 + +236 +00:18:16,999 --> 00:18:19,460 + 입력 화상의 원래 크기 + +237 +00:18:19,460 --> 00:18:25,909 + 이 논문의 또 다른 멋진 기능은 스킵 연결이 아이디어들은 그렇게 + +238 +00:18:25,909 --> 00:18:30,489 + 그들은 실제로을 사용하여 실제로 단지이 가난한 오 기능을 사용하지 않는 + +239 +00:18:30,489 --> 00:18:34,598 + 다른 레이어와 네트워크에서 길쌈 기능하는 일종의 + +240 +00:18:34,598 --> 00:18:39,200 + 당신이 상상할 수 있도록 다양한 규모에서 존재하는 당신의 수영장에있어 한 번 + +241 +00:18:39,200 --> 00:18:42,649 + 알렉스는 지금 실제로의 레이 업 후 더 큰 기능지도 풀 오 + +242 +00:18:42,648 --> 00:18:48,069 + 수영장 3 그래서 직관이 낮은 것입니다에 대한 풀보다 더 크다 + +243 +00:18:48,069 --> 00:18:52,148 + 길쌈 방송 해 실제로 당신이에 미세한 입자 구조를 캡처 도움이 될 수 있습니다 + +244 +00:18:52,148 --> 00:18:56,408 + 그들은 작은 수용 필드가 있기 때문에 입력 이미지가 그래서 실제로 우리에게 영향을 + +245 +00:18:56,409 --> 00:18:59,889 + 서로 다른 길쌈 기능 맵을 별도 적용 + +246 +00:18:59,888 --> 00:19:03,428 + 최대 배운 그들 모두를 결합 한 후이 기능 맵의 각을 샘플링하고 + +247 +00:19:03,429 --> 00:19:09,070 + 최종 출력을 생성하고 그 결과에 그들은 실제로 그가를 추가하는 것을 보여 + +248 +00:19:09,069 --> 00:19:15,408 + 스킵 연결을 통해 때문에 이러한 낮은 수준의 세부 사항에 많은 도움이 경향 + +249 +00:19:15,409 --> 00:19:19,979 + 여기 왼쪽에있는 이들 만이 가난한 오 출력을 사용하는 결과는 + +250 +00:19:19,979 --> 00:19:24,919 + 당신은이 종류의 자전거에 사람의 거친 생각이라도 것을 알 수 있습니다 + +251 +00:19:24,919 --> 00:19:29,330 + 하지만 좀 blobby 그리고 미세한 세부 사항을 많이 누락 가장자리에 있지만 + +252 +00:19:29,329 --> 00:19:31,819 + 당신은 이러한 낮은에서 다음 단계 연결에 추가 할 때 + +253 +00:19:31,819 --> 00:19:35,468 + 당신에 대한 더 많은 세부적인 정보를 제공 길쌈 오류 + +254 +00:19:35,469 --> 00:19:39,940 + 그 작업이 그렇게 사람들을 추가 있도록 이미지에서 사물의 공간 위치 + +255 +00:19:39,940 --> 00:19:43,919 + 하위 계층에서 연결을 이동하는 것은 정말 경계를 정리하는 데 도움이 + +256 +00:19:43,919 --> 00:19:51,159 + 이 이러한 출력 질문에 대한 몇 가지 경우에 문제는 방법이다 있도록 + +257 +00:19:51,159 --> 00:19:55,070 + 정확성을 분류 나는 사람들이 일반적으로이 사용되는 두 메트릭 생각 + +258 +00:19:55,069 --> 00:19:58,829 + 당신이 모든 픽셀 분류를 분류하고 같이 다만 분류입니다 + +259 +00:19:58,829 --> 00:20:03,968 + 내 트랙은 또한 때때로 사람들은 각각 그래서 노동 조합의 교차를 사용 + +260 +00:20:03,969 --> 00:20:09,058 + 클래스 당신은 내가 우리에게 그 클래스를 예측 이미지의 영역이 무엇인지 계산 + +261 +00:20:09,058 --> 00:20:12,368 + 이었다 무엇 클래스를 가지고 이미지의 지상군 영역에 도달 + +262 +00:20:12,368 --> 00:20:17,158 + 다음 두 사이에 노동 조합의 교차점을 계산 나는 확실하지 않다 + +263 +00:20:17,159 --> 00:20:20,510 + 이는 측정하는 특히 사용이 논문 + +264 +00:20:20,509 --> 00:20:26,609 + 이렇게까지 ​​학습 가능 샘플링이 아이디어는 실제로 정말 멋진 이후부터입니다 + +265 +00:20:26,609 --> 00:20:30,839 + 이 문서가 적용되었습니다 및 기타 연락처를 많이는 우리가 본 적이 우리가 알고있는 사촌 + +266 +00:20:30,839 --> 00:20:35,839 + 우리는 아래로 우리의 기능지도를 다양한 방법으로뿐만되는 샘플 수 있음 + +267 +00:20:35,839 --> 00:20:39,689 + 최대 네트워크 내부를 샘플링 할 수 실제로 매우 유용 할 수있다 + +268 +00:20:39,690 --> 00:20:44,750 + 매우 가치있는 일이 그래서이 때로는 호출되는 디컨 볼 루션을 수행하는 + +269 +00:20:44,750 --> 00:20:48,980 + 즉 우리 모두가 몇 분 안에 그 이야기 때문에 매우 좋은 조건은 아니지만 + +270 +00:20:48,980 --> 00:20:54,130 + 당신이 정상적인 일을 할 때 그냥 일종의 요약하자면, 그래서 그것은 매우 일반적인 용어이다 + +271 +00:20:54,130 --> 00:20:59,870 + 보폭 보폭 1353 컨볼 루션 우리의 종류 우리는 우리가이 사진을 가지고이 있습니다 + +272 +00:20:59,869 --> 00:21:04,489 + 즉 우리의 4 × 4 입력 우리를 부여하는 것이 지금까지 꽤 잘 알고 있어야합니다 + +273 +00:21:04,490 --> 00:21:08,710 + 세 가지 필터에 의해 약 3를 가지고 우리는 이상이 셋으로 세 가지 필터를 플롯 + +274 +00:21:08,710 --> 00:21:10,059 + 입력의 일부 + +275 +00:21:10,059 --> 
00:21:14,539 + 제품이 우리에게 지금 고통 때문에 하나의 출력 요소 및 제공 + +276 +00:21:14,539 --> 00:21:19,240 + 아스테로이드 하나는 필터를 이동 출력 우리의 다음 요소를 계산 + +277 +00:21:19,240 --> 00:21:22,599 + 하나의 입력을 다시 컴퓨터 내적의 슬롯을 통해 그 우리를 제공합니다 + +278 +00:21:22,599 --> 00:21:29,409 + 출력에서 하나의 원소이며, 현재 걸음에 해당 회선에 대한 그것의 A의 + +279 +00:21:29,410 --> 00:21:32,360 + 아이디어의 매우 유사한 유형 어디 + +280 +00:21:32,359 --> 00:21:36,099 + 출력은 두 개의 출력으로 버전이 샘플링 다운 될 것입니다 + +281 +00:21:36,099 --> 00:21:40,459 + 4 × 4 곳에서 다시는 우리가 우리의 필터를 같은 생각 가지고있어 우리는 풍덩 + +282 +00:21:40,460 --> 00:21:44,279 + 화상 컴퓨터 내적 아래로 우리에게 출력의 한 요소를 준다 + +283 +00:21:44,279 --> 00:21:48,450 + 유일한 차이점은 지금 우리가 두 개의 슬롯을 통해 컨볼 루션 필터를 밀어 것입니다 + +284 +00:21:48,450 --> 00:21:53,610 + 입력은 출력에 디컨 볼 루션 elaire 하나를 계산 + +285 +00:21:53,609 --> 00:21:57,439 + 실제로 우리가 저를 먹고 싶어 그래서 여기에 조금 다른 무언가를 + +286 +00:21:57,440 --> 00:22:02,490 + 해상도 입력하고 그래서 이것은 아마 것보다 높은 해상도 출력을 생성 + +287 +00:22:02,490 --> 00:22:08,309 + 바로 그래서 여기에 하나에서 APPT까지의 무료 디컨 볼 루션에 의해 몇 + +288 +00:22:08,309 --> 00:22:12,659 + 이것은 당신이 정상적인 회선 알고있는 이상한 조금 당신을 상상하다 + +289 +00:22:12,660 --> 00:22:16,750 + 당신의 세 가지로 세 개의 필터가 있고 여기에 도트 제품과 입력을하지만, + +290 +00:22:16,750 --> 00:22:21,000 + 당신은 당신의 세 가지로 세 가지 필터를 복용 상상 할 단지에 이상 복사 + +291 +00:22:21,000 --> 00:22:26,230 + 출력은 유일한 차이점은이 하나의 스칼라 값을 같은 무게 + +292 +00:22:26,230 --> 00:22:27,579 + 무게 및 입력 + +293 +00:22:27,579 --> 00:22:31,788 + 당신에 대한 대기를 제공합니다 당신은 당신이 설 때 필터를 관련 거라고 + +294 +00:22:31,788 --> 00:22:38,298 + 우리가 거​​의 1 단계 발짝을 따라 우리가이 일을 시작했을 때 출력에 지금 + +295 +00:22:38,298 --> 00:22:43,298 + 모든 출력을 통해 이상 입력과 두 단계 이제 우리는 같은 걸릴거야 + +296 +00:22:43,298 --> 00:22:47,798 + 우리는 동일한 학습 컨볼 루션 필터 아래로 풍덩거야 + +297 +00:22:47,798 --> 00:22:53,378 + 같은 컨볼 루션 필터를 주셔서 지금은 가슴으로 출력하지만, 지금 + +298 +00:22:53,378 --> 00:22:56,928 + 우리는 차이가 있다는 것을 출력에 두 번을 보여주고있어 + +299 +00:22:56,929 --> 00:23:02,139 + 레드 박스는 컨볼 루션 필터는이 스칼라 값에 의해 가중된다 + +300 +00:23:02,138 --> 00:23:06,148 + 입력과 파란색 상자에 대한 컨볼 루션 필터에 의해 가중된다 + +301 +00:23:06,148 --> 00:23:10,978 + 블루 스칼라 입력의 값과 위치를이 곳이 지역이 당신을 중복 + +302 +00:23:10,979 --> 00:23:16,590 + 바로이 종류의 당신이 배울 수 있으며 네트워크 내부 샘플링까지 이렇게 추가 + +303 +00:23:16,589 --> 00:23:23,118 + 그래서 당신은에 구현 회선에서에서 기억한다면 + +304 +00:23:23,118 --> 00:23:27,999 + 할당 일종의 특히 눈에 띄는과 추가의이 아이디어 + +305 +00:23:27,999 --> 00:23:31,348 + 보통의 뒤로 패스 당신을 생각 나게한다 겹치는 영역 + +306 +00:23:31,348 --> 00:23:36,729 + 회선 및 이들이 그 완전히 동일하다는 것을 밝혀 + +307 +00:23:36,729 --> 00:23:40,440 + 컨벌루션 포워드 패스 정확하게 정상 컨벌루션와 동일 + +308 +00:23:40,440 --> 00:23:44,840 + 역방향 패스 정상 및 컨벌루션 후방 패스는 동일 + +309 +00:23:44,839 --> 00:23:50,238 + 일반 회선 앞으로 때문에 실제로 용어는 그렇게 통과 + +310 +00:23:50,239 --> 00:23:54,989 + 디컨 볼 루션 어쩌면 그렇게 크지 않다 당신은 신호 처리가있는 경우 + +311 +00:23:54,989 --> 00:23:58,700 + 당신이 디컨 볼 루션 이미 본 적이있다 배경은 매우가 + +312 +00:23:58,700 --> 00:24:03,308 + 의미는 잘 정의하고 그 때문에 회선의 역입니다 + +313 +00:24:03,308 --> 00:24:07,470 + 디콘 볼 루션은 상당히 다르다 컨벌루션 연산을 취소해야 + +314 +00:24:07,470 --> 00:24:11,909 + 어떤이 실제로 그렇게하는 대신 이에 대한 아마 더 좋은 이름을하고있다 + +315 +00:24:11,909 --> 00:24:17,609 + 가끔 볼 수 디컨 볼 루션은 컨볼 루션 전치 또는 우리가 될 것 + +316 +00:24:17,608 --> 00:24:22,148 + 뒤로 귀에 거슬리는 회선 또는 단편적으로 귀에 거슬리는 회선 또는 또는 + +317 +00:24:22,148 --> 00:24:27,148 + 나는 이러한 종류의 이상한 이름이라고 생각하므로 회선까지 나는대로 디컨 볼 루션을 생각한다 + +318 +00:24:27,148 --> 00:24:30,988 + 인기가 덜 기술 될 수있다하더라도 말을 쉬운 그냥 사촌 + +319 +00:24:30,989 --> 00:24:35,369 + 당신이 논문을 읽고 실제로 경우 그 일부를 볼 수 있지만 기술적으로 정확 + +320 +00:24:35,368 --> 00:24:38,699 + 사람들은 그래서 이것에 대해 화 + +321 +00:24:38,700 --> 00:24:43,539 + 이 회선 대신 디컨 볼 루션의 트랜스 말을 더 적절한이고 + +322 +00:24:43,539 --> 00:24:47,529 + 이 다른 논문은 정말 분별 보폭 회선을 착색하고 싶어 + +323 +00:24:47,529 --> 00:24:51,750 + 그래서 
내가 사회는 여전히 올바른 용어를 결정할 생각 생각 + +324 +00:24:51,750 --> 00:24:55,240 + 여기 그러나 나는 이러한 종류의 디컨 볼 루션은 아마 매우 아니다 그들에 동의 + +325 +00:24:55,240 --> 00:25:00,309 + 기술적으로 정확하고 매우 느낌이 특히이 논문은이 알코올 + +326 +00:25:00,309 --> 00:25:04,139 + 강하게이 문제에 대해 그들은 용지에 한 페이지 인덱스 부록했다 + +327 +00:25:04,140 --> 00:25:09,230 + 당신이있어 그래서 만약 내가 적절한 용어를 트랜스 회선 왜 실제로 설명 + +328 +00:25:09,230 --> 00:25:11,849 + 관심은 정말 꽤 좋은 것을 확인하는 것이 좋습니다 것입니다 + +329 +00:25:11,849 --> 00:25:16,289 + 이에 대한 설명은 실제로 해당하므로, 어떠한 질문 + +330 +00:25:16,289 --> 00:25:26,299 + 그래, 정말 문제는 패치이 상대를 기반으로 얼마나 빨리 생각 + +331 +00:25:26,299 --> 00:25:29,930 + 일이 대답은 연습 아무도에도 덕분에이 일을 실행하는 것입니다 + +332 +00:25:29,930 --> 00:25:34,820 + 단지 방법이 너무 느린 그래서 실제로 모든 것 희망 패치 짐승 모드 + +333 +00:25:34,819 --> 00:25:36,000 + 내가 본 논문의 + +334 +00:25:36,000 --> 00:25:39,109 + 이럭저럭 폴리 길쌈 것은 어떤 종류의 작업을 수행 + +335 +00:25:39,109 --> 00:25:44,729 + 실제로 다른 트릭의 대신 샘플링까지 종류가있다 그 사람 + +336 +00:25:44,730 --> 00:25:49,309 + 때때로 사용하고 그 때문에 네트워크가 실제로 있다고 가정이다 + +337 +00:25:49,309 --> 00:25:52,599 + 4 배에 의해 거 아래 샘플은 당신이 할 수있는 한 가지 가지고 당신의 + +338 +00:25:52,599 --> 00:25:57,199 + 입력 이미지는 하나의 픽셀로 배송 지금은 다시 내가 네트워크를 통해 실행 + +339 +00:25:57,200 --> 00:26:00,710 + 다른 출력을 얻을 당신은 네 가지 일의 종류에 대해이 작업을 반복 + +340 +00:26:00,710 --> 00:26:04,870 + 픽셀 입력의 선박 지금은 출력지도를받은 적이 당신은 정렬 할 수 있습니다 + +341 +00:26:04,869 --> 00:26:08,339 + 그 그건 그래서의 원래의 입력 맵을 재구성하는 인터리브 + +342 +00:26:08,339 --> 00:26:12,279 + 사람들이 가끔 사용되는 또 다른 트릭은 그 문제를 해결 얻을 수 있지만, 나는 생각한다 + +343 +00:26:12,279 --> 00:26:19,740 + 오늘 아침 샘플링이 꽤 청소기입니다 + +344 +00:26:19,740 --> 00:26:28,440 + 그래서 내가 정말 좋은 나는 다시 한 번 I 시도라고 생각 혀를 롤 생각 + +345 +00:26:28,440 --> 00:26:33,799 + 단편적으로 귀에 거슬리는 회선 실제로 내가 생각하는 정말 멋진 권리라고 생각합니다 + +346 +00:26:33,799 --> 00:26:36,928 + 그것은 가장 긴 이름이다하지만 천체 정상 바로 정말 설명이다 + +347 +00:26:36,929 --> 00:26:40,910 + 일반적으로 우리와 그가 이동 결국 요소의 역할을 이동할 회선에 시도 + +348 +00:26:40,910 --> 00:26:45,808 + 같은 당신은 어떤 입력과 출력을하지 않으 여기에 당신은 무슨 일이 있었는지 이동하고 + +349 +00:26:45,808 --> 00:26:48,940 + 무슨 일이 있었는지 움직이는 해당하는 입력하면 입력 싶어 있습니다 + +350 +00:26:48,940 --> 00:26:55,140 + 출력은 그래서 내가 내가 전화 할게 무엇인지 확실하지 않다, 그래서 아주 능숙하게 아이디어를 캡처 + +351 +00:26:55,140 --> 00:27:02,790 + 나는 신문에서 그것을 사용할 때 우리는 그것에 대해 그러나 지금에도 불구하고 참조해야 할 때 + +352 +00:27:02,789 --> 00:27:06,440 + 사람들이 같은 디컨 볼 루션을 호출하는 사람들에 대한 우려에도 불구하고 단지 + +353 +00:27:06,440 --> 00:27:10,980 + ICC에서이 논문은 수 있었다 어쨌든 그래서이 아이디어를 소요 호출 + +354 +00:27:10,980 --> 00:27:16,319 + 이 길쌈의 금주의 / 단편적으로 생각을 마련하려고 + +355 +00:27:16,319 --> 00:27:21,428 + 그리고 일종의 그래서 여기에 극단적으로 푸시하고 ​​그들이 금액 무엇을했다 + +356 +00:27:21,429 --> 00:27:28,170 + 이 전체 BGG 네트워크 I 입력 싶어하기 전에이는 동일한 모델 그래서 + +357 +00:27:28,170 --> 00:27:33,720 + 의미 분할 작업하지만 여기 출력 픽셀 현명한 예측 + +358 +00:27:33,720 --> 00:27:40,220 + 우리는 BGG를 초기화 이상 여기 BGG 거꾸로이며, 그것은 육에 대한 훈련 + +359 +00:27:40,220 --> 00:27:44,509 + 세금에 일 때문에이 일을 꽤 느린 실제로 정말 정말 좋은있어 + +360 +00:27:44,509 --> 00:27:51,160 + 결과와 나는 그 꽤 있어요 그래서 그것도 아주 아름다운 그림이라고 생각 + +361 +00:27:51,160 --> 00:27:54,308 + 많은 모든 내가 어떤 존재인지 의미 론적 분할에 대해 말할 필요가 있음 + +362 +00:27:54,308 --> 00:27:59,799 + 그에 대한 질문이 그래 + +363 +00:27:59,799 --> 00:28:04,909 + 문제는 내가에서 스크린 샷을했다이 메인 답변입니다 방법입니다 + +364 +00:28:04,910 --> 00:28:09,090 + 자신의 종이 그래서 난 몰라하지만 당신은 우리가 마지막에서 본 흐름 답변을 시도 할 수 있습니다 + +365 +00:28:09,089 --> 00:28:15,069 + 그래 당신이 그림을 만들하지만이만큼 좋은 아니에요 수 있습니다 강연 + +366 +00:28:15,069 --> 00:28:22,579 + 훈련 데이터와 같은 질문은 예이 것은 이런 종류의 데이터 세트를 존재 + +367 +00:28:22,579 --> 00:28:28,449 + 어디 파스칼 분할 데이터가 그렇게 설정되어 일반적인 일이 있다고 생각 + +368 +00:28:28,450 --> 00:28:31,380 + 당신이 이미지가 접지 진실은 당신은 이미지를 그들은 가지고있다 + +369 +00:28:31,380 --> 00:28:37,780 + 표시된 모든 픽셀 그래 그것은이 해당 데이터를 가져 
가지 비싼이다 + +370 +00:28:37,779 --> 00:28:43,049 + 데이터 세트는 약간 작은 경향이 있지만, 실제로 유명한 인터페이스가있다 + +371 +00:28:43,049 --> 00:28:46,299 + 이미지를 업로드 할 수있는 곳 나 라벨을 불러 투어 다음 종류의 술 + +372 +00:28:46,299 --> 00:28:49,240 + 본 발명의 다른 지역의 주위에 당신이 본 발명의 주위에 + +373 +00:28:49,240 --> 00:28:54,140 + 이 세그먼트의 일종으로 그 윤곽을 변환 할 수 있습니다 묻는 그 방법을의 + +374 +00:28:54,140 --> 00:29:02,130 + 당신은 우리가 거​​ 생각 질문이있는 경우 방식으로이 일에 라벨을하는 경향이 + +375 +00:29:02,130 --> 00:29:07,290 + 다만 인스턴스 분할을 정리해 할 수 있도록이가 즉시 분할로 이동 + +376 +00:29:07,289 --> 00:29:11,089 + 일반화 또는 우리뿐만 아니라 이미지의 픽셀에 라벨을 원하는 위치에 있지만 + +377 +00:29:11,089 --> 00:29:15,089 + 또한 즉시 우리가가는거야 그래서 인스턴스를 구별 구별 할 + +378 +00:29:15,089 --> 00:29:18,419 + 우리 클래스의 다른 인스턴스를 감지하고 각각에 대해 우리가 원하는 + +379 +00:29:18,420 --> 00:29:25,320 + 그래서이이 실제로이 모델 최대 인스턴스의 픽셀을 라벨 + +380 +00:29:25,319 --> 00:29:28,419 + 우리가 전에 몇 강의에 대해 이야기 검출 모델처럼 많이 찾고 + +381 +00:29:28,420 --> 00:29:34,150 + 그래서이 실제로 나는 또한해야 것을 알고 최초의 논문 중 하나 + +382 +00:29:34,150 --> 00:29:38,040 + 이 내가 메신저 훨씬 더 최근이의이 아이디어 부탁 생각이라고 지적 + +383 +00:29:38,039 --> 00:29:42,319 + 의미 론적 분할 긴 장시간 컴퓨터 비전에 사용되었지만 + +384 +00:29:42,319 --> 00:29:45,409 + 나는 즉시 분할이 생각보다 많이받은 것 같아요 + +385 +00:29:45,410 --> 00:29:50,970 + 특히 2014 종류의에서이 논문 그래서 지난 몇 년에 인기 + +386 +00:29:50,970 --> 00:29:53,890 + 이 걸렸다 나는 그들이 그것을 동시 탐지 및 세분화를 호출 생각 + +387 +00:29:53,890 --> 00:29:59,600 + 또는 SDS는 좋은 이름의 종류 그리고이 사실은 우리의 CNN과 매우 유사 + +388 +00:29:59,599 --> 00:30:03,839 + 우리가 여기에 보호를 보았다 모델은 우리가 입력 치매를 거 가지고있어 + +389 +00:30:03,839 --> 00:30:09,399 + 당신이 우리의 CNN에서 기억한다면 우리는 이러한 외부 지역의 제안에 의존하는 + +390 +00:30:09,400 --> 00:30:12,269 + 오프라인 컴퓨터 비전의 이러한 종류입니다 수 있습니다 + +391 +00:30:12,269 --> 00:30:16,538 + 이 이미지의 개체를 생각하는 위치에 예측을 계산 글로벌 일 수도 + +392 +00:30:16,538 --> 00:30:17,658 + 위치 + +393 +00:30:17,659 --> 00:30:21,419 + 잘은 대신 세그먼트를 제안하기위한 다른 방법이 있다고 밝혀 + +394 +00:30:21,419 --> 00:30:25,419 + 상자의 우리는 이러한 기존 세그먼트 제안 방법 중 하나를 다운로드 + +395 +00:30:25,419 --> 00:30:30,879 + 이들 각각에 대해이 세그먼트 우리가 할 수있는 각 대신 지금 사용 + +396 +00:30:30,878 --> 00:30:35,398 + 제안 된 세그먼트 우리는 단지의 상자에 앉아하여 경계 상자를 추출 할 수 있습니다 + +397 +00:30:35,398 --> 00:30:40,298 + 다음 세그먼트는 입력 영상의 덩어리에서 작물을 실행하고 실행 + +398 +00:30:40,298 --> 00:30:47,108 + 상자를 통해 CNN이 실행됩니다 병렬보다 그 상자 기능을 추출 + +399 +00:30:47,108 --> 00:30:52,358 + 지역 CNN을 통해 그래서 그녀는 우리가 취할 수 입력에서 관련 청크 + +400 +00:30:52,358 --> 00:30:57,168 + 발명 작물이 밖으로 그러나 여기에서 우리는 실제로이 제안을 가지고 있기 때문에 + +401 +00:30:57,169 --> 00:31:01,320 + 다음 세그먼트에 대해 우리는 평균을 사용하여 배경 영역을 마스크거야 + +402 +00:31:01,319 --> 00:31:05,700 + 그래서 이것은 당신이이 종류를 취할 수있는 해킹의 일종이다 데이터의 색 + +403 +00:31:05,700 --> 00:31:09,838 + 이상한 모양의 입력과 CNN로 먹이를 그냥 배경을 마스크 + +404 +00:31:09,838 --> 00:31:14,479 + 검은 색으로 우리와 부분 그래서이 마스크 입력을하고 실행할 수 있습니다 + +405 +00:31:14,479 --> 00:31:18,769 + 별도의 영역을 통해 CNN은 지금 우리가 입수 한 두 개의 서로 다른 특징 벡터 하나 + +406 +00:31:18,769 --> 00:31:22,739 + 전체 상자를 통합의 종류 만 기업에서 하나의 + +407 +00:31:22,739 --> 00:31:26,328 + 제안 된 전경 픽셀 우리는 이러한 것들과 연결하여 바로 + +408 +00:31:26,328 --> 00:31:30,638 + 우리의 CNN에 우리가 구분을 같은 결정하는 어떤 클래스 실제로해야 + +409 +00:31:30,638 --> 00:31:37,128 + 이 세그먼트 B는 다음 그들은 또한이 지역의 정제 단계가 어디 + +410 +00:31:37,128 --> 00:31:42,108 + 당신이 모르는, 그래서 만약 잘하는 방법을 제안 영역에게 조금 수정하려면 + +411 +00:31:42,108 --> 00:31:45,218 + 당신은 우리의 CNN 프레임 워크 기억하지만 실제로 우리의 CNN과 매우 유사 + +412 +00:31:45,219 --> 00:31:52,909 + 다만이 경우 동시 검출 및 분할 작업 때문에이를 적용 + +413 +00:31:52,909 --> 00:31:56,950 + 이 지역의 정제 단계에 대한 아이디어는 실제로 후속 종이있다 그 + +414 +00:31:56,950 --> 00:32:03,288 + 같은 사람의 논문에서 이렇게 여기에서 그것을 할 수있는 아주 좋은 방법입니다 제안 + +415 +00:32:03,288 --> 00:32:07,578 + 우리는이 입력을 할 버클리 다음 회의 있지만 여기 + +416 +00:32:07,578 --> 00:32:12,940 + 이 세그먼트에 제안 된 
세그먼트를 제안하고 그것을 정리할되는 + +417 +00:32:12,940 --> 00:32:17,778 + 어떻게 든 우리는 실제로 매우 유사한 유형 매우 유사한 접근 방식을거야 + +418 +00:32:17,778 --> 00:32:20,230 + 우리가에서 본 다중 스케일 방법 + +419 +00:32:20,230 --> 00:32:24,839 + 그래서 여기에 얼마 전 의미 분할 모델에서 우리는 걸릴거야 + +420 +00:32:24,839 --> 00:32:30,139 + 우리의 우리의 이미지 자르기 아웃이 해당 세그먼트에 해당하는 상자를 지탱하고 + +421 +00:32:30,140 --> 00:32:34,350 + 다음과 알렉스 그물을 통해 그것을 통과하고 우리는 길쌈을 추출 할거야 + +422 +00:32:34,349 --> 00:32:37,849 + 그 각각에 대해 그 알렉스 NAT의 여러 레이어의 기능 + +423 +00:32:37,849 --> 00:32:42,139 + 기능 맵은 최대 난과 함께 결합 지금 것 샘플링합니다 + +424 +00:32:42,140 --> 00:32:48,370 + 이이 그림이 제안 그림 지상 분할을 생성하므로이이 + +425 +00:32:48,369 --> 00:32:52,308 + 가지 재미 출력 사실이지만 그것은 예측을 정말 쉽게있어 + +426 +00:32:52,308 --> 00:32:55,910 + 아이디어는 우리가 단지거야이 출력 이미지가 물류를 수행 투자입니다 + +427 +00:32:55,910 --> 00:33:00,990 + 각각의 독립적 인 픽셀 내부 분류 그래서 우리는 단지이 이러한 기능을 제공 + +428 +00:33:00,990 --> 00:33:04,410 + 독립 물류의 전체 무리를 어떻게 예측하고 머리카락을 분류 + +429 +00:33:04,410 --> 00:33:08,250 + 출력이 많은 화소가 전경 될 가능성이있다 + +430 +00:33:08,250 --> 00:33:13,390 + 배경과 그들이 보여이 다중 스케일 정제 단계의이 유형이 + +431 +00:33:13,390 --> 00:33:16,610 + 실제로 이전 시스템의 다른 부분을 정리 아주 준다 + +432 +00:33:16,609 --> 00:33:27,899 + 아주 좋은 결과 질문 + +433 +00:33:27,900 --> 00:33:34,390 + 단편적으로 보폭 및 회선 나는 그것이 어떤 종류의 대신 생각 + +434 +00:33:34,390 --> 00:33:37,870 + 그 또는 같은 선형 보간 또는 뭔가처럼 샘플링을 수정 + +435 +00:33:37,869 --> 00:33:41,449 + 어쩌면 가장 가까운 이웃 뭔가 고정 및 가변 그러나 나는 할 수 + +436 +00:33:41,450 --> 00:33:44,170 + 잘못하지만 당신은 확실히 교환 및 일부 학습 가능 상상할 수 + +437 +00:33:44,170 --> 00:33:46,250 + 너무 것 같아요 + +438 +00:33:46,250 --> 00:33:52,980 + 확인 그래서 실제로이이 우리의 CNN뿐만 검출에 매우 유사합니다 + +439 +00:33:52,980 --> 00:33:57,049 + 우리는 우리의 CNN이 이야기의 시작에 불과했다 모든이 있다고 보았다 강의 + +440 +00:33:57,049 --> 00:34:03,329 + 그것이 나오는 바로 있도록 빠른 버전이 빠르게에서 유사한 직관 우리 + +441 +00:34:03,329 --> 00:34:08,090 + CNN은 실제로도 있으므로이 경우 세그멘테이션 문제에 적용된 + +442 +00:34:08,090 --> 00:34:12,050 + 이 해당 작업이 모델은 실제로 코코아 우승 Microsoft에서 작품입니다 + +443 +00:34:12,050 --> 00:34:16,860 + 예를 세분화 도전 그들의 거대한했다 그들은 그래서 올해 + +444 +00:34:16,860 --> 00:34:20,000 + 공명 그들은 그 위에이 모델을 고집하고 그들은 호감 + +445 +00:34:20,000 --> 00:34:25,489 + 코코 인스턴스 분할 도전에 다른 사람 때문에이이 + +446 +00:34:25,489 --> 00:34:28,668 + 실제로 우리가 걸릴 거 야에 우리의 독립 과거와 매우 유사하다 우리 + +447 +00:34:28,668 --> 00:34:34,148 + 입력 영상 단지 빠른처럼 빠르게 우리의 CNN 우리의 입력 영상은하지 않습니다 + +448 +00:34:34,148 --> 00:34:37,730 + 꽤 높은 해상도를하고 우리는이 거대한 코미디 쇼가지도 기능을합니다거야 + +449 +00:34:37,730 --> 00:34:44,260 + 우리의 높은 해상도를 통해 다음이 고해상도에서 우리는 실제로있어 + +450 +00:34:44,260 --> 00:34:48,700 + 이전의 방법은 우리가 우리 자신의 영역 제안을 제안하는 것 + +451 +00:34:48,699 --> 00:34:52,319 + 이러한 외부 세그먼트의 제안에 의존하지만, 여기에 우리는거야 + +452 +00:34:52,320 --> 00:34:56,870 + 우리는 그냥 막대기 우리 자신의 영역 제안이 너무 여기에 빠른 우리의 CNN을 좋아하는 배우 + +453 +00:34:56,869 --> 00:35:00,859 + 몇 가지 추가 길쌈 상단까지에 개최 부부는 논란 기능지도입니다 + +454 +00:35:00,860 --> 00:35:04,740 + 그 중 각 하나에 대한 관심의 여러 지역을 예측하는 것입니다 + +455 +00:35:04,739 --> 00:35:11,109 + 이미지 우리가 검출 작업에서 본 상자의이 아이디어를 사용하여 + +456 +00:35:11,110 --> 00:35:15,200 + 차이점은 우리는이 지역이 지금은 일단이 지역의 제안이 있었다이다 + +457 +00:35:15,199 --> 00:35:18,559 + 우리가 마지막에 본 매우 유사한 접근 방식을 사용하는 방법에 대한 거 세그먼트 + +458 +00:35:18,559 --> 00:35:24,380 + 이 제안 된 영역 각각에 대해 너무 미끄러이 투자 수익 (ROI)을 사용하려고 무엇을 + +459 +00:35:24,380 --> 00:35:28,579 + 그들은 뒤틀림이나 풀링 및 고정 된 사각형에 이르기까지 그들 모두를 뭉개 버려 ROI를 호출 + +460 +00:35:28,579 --> 00:35:33,000 + 크기하고 생성하는 컨볼 루션 신경망을 통해 각각 실행될 + +461 +00:35:33,000 --> 00:35:36,710 + 우리와 같은 이러한 과정 그림 지상 분할 마스크는 이전에보고 + +462 +00:35:36,710 --> 00:35:41,909 + 이 시점에서 이제 이전 슬라이드에서 우리는 우리가 가지고 우리의 이미지를 쪘 + +463 +00:35:41,909 --> 00:35:45,859 + 각 지역의 제안에 대한 지역 제안의 무리 지금 우리는 거친이 + +464 +00:35:45,860 --> 
00:35:49,240 + 전경 어느 한 부분으로 그 상자의 어느 부분의 아이디어는 배경입니다 + +465 +00:35:49,239 --> 00:35:54,489 + 지금 우리는 우리가 예측하는 것이 지금 마스킹의 이런 생각을하는거야 + +466 +00:35:54,489 --> 00:35:57,709 + 우리가 밖으로 마스크거야이 세그먼트의 각 전경 배경 + +467 +00:35:57,710 --> 00:36:02,889 + 배경을 예측 만 예측 전경에서 픽셀을 유지하고 + +468 +00:36:02,889 --> 00:36:07,179 + 과거 다른 몇 층을 통과 실제로 분류에 대한 분류하기 + +469 +00:36:07,179 --> 00:36:13,629 + 우리의 서로 다른 범주로 그 세그먼트 그래서이이 모든 일을 할 수있는 사람이다 + +470 +00:36:13,630 --> 00:36:18,380 + 단지 공동으로 두 배울 수 및 아이디어 우리는이 세 가지 있는데 그 + +471 +00:36:18,380 --> 00:36:22,490 + 우리의 네트워크의 중간 계층에 의미 해석 출력 및 + +472 +00:36:22,489 --> 00:36:26,589 + 그들 각각 우리는 그냥 그렇게이 지역 지상 진실 데이터를 감독 할 수 있습니다 + +473 +00:36:26,590 --> 00:36:29,900 + 지상 진실 섹스 객체는 객체와 이미지에 어디에 관심을 우리는 알고있다 + +474 +00:36:29,900 --> 00:36:34,349 + 이러한 분할이 우리 요청에 대해 우리는 그 출력에 감독을 제공 할 수 있습니다 + +475 +00:36:34,349 --> 00:36:37,929 + 우리가 감독을 줄 수있는 진정한 전경과 배경 우리 알고 + +476 +00:36:37,929 --> 00:36:42,759 + 그리고 우리는 우리가 분명히 그렇게 그 다른 세그먼트의 클래스를 알고있다 + +477 +00:36:42,760 --> 00:36:46,760 + 우리는 이러한 네트워크의 여러 계층에서 감독을 제공하고 거래를하려고 + +478 +00:36:46,760 --> 00:36:50,420 + 모든 다른 손실 조건 해제와 희망을 수렴 할 수있는 일을 얻을 수 있지만, + +479 +00:36:50,420 --> 00:36:53,670 + 이 실제로 훈련, 둘, 그들은 재미 있고 그것을에 발견되었다 + +480 +00:36:53,670 --> 00:36:59,809 + 정말 정말 잘 그래서 여기에 작동하는 결과가 우리가 보여해야한다는 그림이다 + +481 +00:36:59,809 --> 00:37:04,519 + 그래서 이러한 결과는 정말 나에게 적어도 예를 들어, 그래서 정말 인상적이다 + +482 +00:37:04,519 --> 00:37:09,159 + 이 입력 영상이 방에와 앉아이 모든 다른 사람들이 + +483 +00:37:09,159 --> 00:37:12,539 + 예상 출력은 모든 다른를 분리하는 정말 좋은 일을 + +484 +00:37:12,539 --> 00:37:15,360 + 사람들은 중복에도 불구하고 많은있다 그리고 그들은 매우있어 + +485 +00:37:15,360 --> 00:37:16,500 + 닫기 + +486 +00:37:16,500 --> 00:37:20,699 + 이 차와 같은 조금 더 쉽게하지만, 특히이이 백성 만든 + +487 +00:37:20,699 --> 00:37:24,629 + 나는 꽤 감동하지만, 때 당신은 그래서이 화분 완벽 하진 볼 수 있습니다 + +488 +00:37:24,630 --> 00:37:28,840 + 그에서 식물은 정말보다 여기 차단 그것은에이 의자를 혼동했다 + +489 +00:37:28,840 --> 00:37:32,230 + 사람에 대한 권리와 나는이 사람을 놓친하지만 전체 결과 + +490 +00:37:32,230 --> 00:37:36,300 + 아주 아주 인상적과 같은 나는이 모델 하나는 코코 세분화했다 + +491 +00:37:36,300 --> 00:37:43,250 + 분할의 개요 우리가 이것들을 가지고 있다는 것입니다, 그래서 올해에 도전 + +492 +00:37:43,250 --> 00:37:47,519 + 이 두 가지 작업 론적 분할 및 분할 인스턴트 + +493 +00:37:47,519 --> 00:37:52,210 + 의미 분할을 위해이 이것을 사용하는 것은 매우 흔한 일 + +494 +00:37:52,210 --> 00:37:56,800 + 콘데 호송은 접근 한 다음 예를 세분화 당신이와 끝까지 + +495 +00:37:56,800 --> 00:38:02,180 + 어떤이의 경우, 그래서 더 유사 이러한 파이프 라인 검출 객체하기 + +496 +00:38:02,179 --> 00:38:08,338 + 분할에 대한 마지막 순간의 질문에 나는 슈퍼 지금 그 대답을 시도 할 수 있습니다 + +497 +00:38:08,338 --> 00:38:14,329 + 분명 나는 우리가 서로에 거 이동을하고있어 꽤 멋진 아닌 것 같아요 + +498 +00:38:14,329 --> 00:38:18,150 + 흥미로운 항목이 너무 먹으 렴주의 모델은 내가 생각하는 뭔가 + +499 +00:38:18,150 --> 00:38:24,550 + 그래서 경우의 일종으로 관심과 작년과 지역 사회를 많이 가지고있다 + +500 +00:38:24,550 --> 00:38:29,780 + 연구 우리는 여기에서 확인하지만 같은 다른 인용문에서 모델에 대한 거 얘기 야 + +501 +00:38:29,780 --> 00:38:32,349 + 사례 연구의 일종으로 같은 + +502 +00:38:32,349 --> 00:38:35,190 + 이미지에 적용되는 우리는 저를 캡처 관심의 아이디어에 대해 이야기하는거야 + +503 +00:38:35,190 --> 00:38:39,530 + 그래서 나는이 모델이 재발 네트워크 강좌에 미리 있었다라고 생각하지만 + +504 +00:38:39,530 --> 00:38:43,740 + 나는 정리 해보 여기하지만 먼저 더 많은 세부 사항으로 단계하고자 원하는 단지 + +505 +00:38:43,739 --> 00:38:47,029 + 그래서 우리는 희망 당신은 내가하여 자막 작업을 놓친 방법을 알고 같은 페이지에있어 + +506 +00:38:47,030 --> 00:38:51,540 + 이제 숙제 때문에 몇 시간 예정이다 그러나 우리는 우리의 입력을거야 + +507 +00:38:51,539 --> 00:38:54,869 + 본 발명은하고 길쌈을 통해 그것을하지 실행하고 일부 기능을 얻을 + +508 +00:38:54,869 --> 00:38:58,869 + 이러한 기능의 첫 번째 숨겨진 상태를 초기화 아마 사용됩니다 우리 + +509 +00:38:58,869 --> 00:39:03,780 + 현재의 네트워크는 토큰 시작 멀리하거나 첫 번째 단어가 숨겨져 있음을 얻었다 + +510 +00:39:03,780 --> 00:39:06,609 + 상태는 우리가 단어 이상이 분포를 생성하는거야 우리 + +511 +00:39:06,608 --> 
00:39:11,940 + 어휘 것입니다 단순한 형식으로 배포 단어를 생성하는 것보다 및 것 + +512 +00:39:11,940 --> 00:39:16,429 + 그저 자막를 생성하기 위해이 프로세스 초과 근무를 반복 + +513 +00:39:16,429 --> 00:39:20,199 + 여기서 문제는이 네트워크는 일종의 보는 하나의 기회를 얻을 수 있다는 것입니다 + +514 +00:39:20,199 --> 00:39:23,899 + 입력 이미지와 그것이 전체 입력 영상에 모두를 찾고 않을 때 + +515 +00:39:23,900 --> 00:39:29,970 + 실제로 한번보고하는 기능에있는 경우 한 번 그리고 냉각기 수 있습니다 + +516 +00:39:29,969 --> 00:39:33,809 + 그것이 다른 부분에 초점을 맞출 수 있다면, 또한 입력 화상 여러번 + +517 +00:39:33,809 --> 00:39:41,969 + 작년에 나온 입력 이미지가 달렸다 정도로 하나의 정말 멋진 종이이었다 + +518 +00:39:41,969 --> 00:39:46,409 + 이 하나라는 쇼 교환은 원래 하나의 쇼를 말해 우리는 추가 말 + +519 +00:39:46,409 --> 00:39:51,289 + ㄱ - 열 부분과 아이디어는 우리가 걸릴 거 야, 그래서 매우 간단합니다 + +520 +00:39:51,289 --> 00:39:54,750 + 우리의 입력 영상 그리고 우리는 여전히 컨볼 루션 네트워크를 통해 실행거야 + +521 +00:39:54,750 --> 00:39:58,440 + 대신 마지막 완전히 연결 이후의 특징을 추출 + +522 +00:39:58,440 --> 00:40:01,659 + 대신 우리는 이전 선상 중 하나에서 거 풀 기능이있어 + +523 +00:40:01,659 --> 00:40:05,549 + 길쌈 상속인과는 우리에게 기능이 그리드를 줄 것 + +524 +00:40:05,550 --> 00:40:09,160 + 오히려이 때문에, 그래서이에서 오는 하나의 특징 벡터보다 + +525 +00:40:09,159 --> 00:40:13,460 + 길쌈 공기는 당신이 당신을 그 아마 왼쪽 위를 상상할 수 + +526 +00:40:13,460 --> 00:40:17,320 + 기능의 조약 공간 격자로하고 각 격자 안에이 생각할 수 + +527 +00:40:17,320 --> 00:40:21,130 + 그리드의 각 점은 어떤 부분에 해당하는 기능을 제공합니다 + +528 +00:40:21,130 --> 00:40:26,890 + 입력 이미지는 이제 다시 초기화하는이 이러한 기능을 사용합니다 + +529 +00:40:26,889 --> 00:40:30,099 + 어떤 방법으로 우리의 네트워크의 상태를 숨겨진 물건을 얻을 경우 지금 여기 + +530 +00:40:30,099 --> 00:40:34,400 + 다른 이제 우리는하지 계산하기 위해 우리의 숨겨진 상태를 사용하는거야 + +531 +00:40:34,400 --> 00:40:38,220 + 단어를 통해 분배하는 대신 서로 다른 이상 배포 + +532 +00:40:38,219 --> 00:40:43,459 + 우리의 길쌈 기능지도에서의 위치 때문에 다시이 것입니다 + +533 +00:40:43,460 --> 00:40:47,050 + 아마도 몹시 아마와 잘 연결 수로 구현 될 수 + +534 +00:40:47,050 --> 00:40:51,260 + 층 또는 두 후 일부 소프트 맥스는 당신에게 메일을주고 있지만, 우리는 단지 종료 + +535 +00:40:51,260 --> 00:40:54,410 + 우리에게 확률 분포를주는이 알 차원 벡터 최대 + +536 +00:40:54,409 --> 00:41:01,019 + 서로 다른 위치와 우리의 입력을 통해 지금 우리는이 확률을 + +537 +00:41:01,019 --> 00:41:05,780 + 분포는 실제로 이들의 가중 합을 얻기 위해 기다리는 읽을 사용 + +538 +00:41:05,780 --> 00:41:10,810 + 우리 학년 우리는이 걸릴 그래서 일단 우리의 다른 점에 특징 벡터 + +539 +00:41:10,809 --> 00:41:15,849 + 우리의 그리드를 받아 그것을 아래로 요약 기능의 가중 조합 + +540 +00:41:15,849 --> 00:41:22,420 + 이 하나의 요인과 질병 벡터의이 이런 종류의 입력을 요약 + +541 +00:41:22,420 --> 00:41:26,909 + 다른 유형의 몇 가지 방법으로 인해 이미지가 이것에 어떻게 할 + +542 +00:41:26,909 --> 00:41:30,619 + 확률 분포는 네트워크를 집중하는 능력을 준다 + +543 +00:41:30,619 --> 00:41:35,299 + 이미지의 다른 부분은 지금의이이 가중치를 간다 + +544 +00:41:35,300 --> 00:41:39,730 + 입력 기능에서 생성 이제 첫 번째 단어와 함께 공급됩니다 + +545 +00:41:39,730 --> 00:41:43,960 + 우리가 재발 네트워크의 재발을 할 때 우리는 실제로 세와 부품이 + +546 +00:41:43,960 --> 00:41:49,139 + 우리는 우리의 이전의 숨겨진 상태를 가지고 우리는이 참석 특징 벡터가 우리 + +547 +00:41:49,139 --> 00:41:52,929 + 생산이 함께 사용되는 모든 지금이 첫 번째 단어가 우리의 + +548 +00:41:52,929 --> 00:41:56,929 + 새로운 숨겨진 상태와 지금이 숨겨진 상태에서 우리가 실제로 갈거야 + +549 +00:41:56,929 --> 00:42:01,419 + 우리는 다른 새로운 유통을 통해 생산하는거야 두 개의 출력을 생성 + +550 +00:42:01,420 --> 00:42:04,940 + 위치와 우리의 입력 이미지와 우리는 또한 우리의 표준을 감소시키는거야 + +551 +00:42:04,940 --> 00:42:08,599 + 이 때문에 단어 이상 분포는 아마 몇으로 구현 될 수있다 + +552 +00:42:08,599 --> 00:42:13,679 + 의 활성 숨겨진 상태의 상단에 레이어 이제이 과정은 그렇게 반복 + +553 +00:42:13,679 --> 00:42:17,739 + 우리는 입력 기능 그랜드​​으로 돌아가 새로운 아마도 분포를 부여 + +554 +00:42:17,739 --> 00:42:22,949 + 그 닥터을 본 발명에 대한 새로운 요약 벡터에 걸릴 온다. 
+ +555 +00:42:22,949 --> 00:42:25,618 + 함께 뉴 헤이븐을 계산 문장의 다음 단어 + +556 +00:42:25,619 --> 00:42:34,930 + 국가의 생산은 확인 그래서 조금 나쁜 버릇이 있지만 벤에 의한 것 사실이야 + +557 +00:42:34,929 --> 00:42:50,109 + 캡션을 생성하는이 프로세스 초과 근무를 반복 그래 그래서 질문은 어떻게 + +558 +00:42:50,110 --> 00:42:54,190 + 여기서이 기능 좋은에서 오는가 당신이 때있을 때 대답은 + +559 +00:42:54,190 --> 00:42:57,510 + 당신은 예를 들어 당신이 CON- 싶어 나라에 와서 가지고 일을하고 알렉스있어 + +560 +00:42:57,510 --> 00:43:01,670 + 칸 푸르는로 와서 시간으로 다섯을 그 텐서의 형상이되어 온에 도착 + +561 +00:43:01,670 --> 00:43:05,960 + 그래서 오백열둘에 의해 일곱으로 칠 등 지금 뭔가 + +562 +00:43:05,960 --> 00:43:11,050 + 입력 및 각 격자를 통해 일곱 일곱하여 공간 격자에 해당 + +563 +00:43:11,050 --> 00:43:15,450 + 그래서 사람들은 그냥 뽑아되는 512 차원의 특징 벡터의 위치 + +564 +00:43:15,449 --> 00:43:27,858 + 길쌈 중 하나에서 네트워크 문제가있다 + +565 +00:43:27,858 --> 00:43:33,219 + 우리가 실제로있어 그래서 이렇게 질문이 아마 분포에 관한 것입니다 + +566 +00:43:33,219 --> 00:43:37,899 + 모든 시간에 두 개의 서로 다른 확률 분포를 생성하면 단계 + +567 +00:43:37,900 --> 00:43:42,400 + 이 D 벡터의 하나의 제와 푸른 그래서 그 유통을 통해 아마 + +568 +00:43:42,400 --> 00:43:46,920 + 어휘 단어 우리가 정상적인 이미지 캡션과도에서처럼 + +569 +00:43:46,920 --> 00:43:50,759 + 때마다 단계는이 이상의 두 번째 확률 분포를 생성합니다 + +570 +00:43:50,759 --> 00:43:55,170 + 우리가 원하는 위치를 입력 이미지의 끝에 위치는 우리에게 말하고되는 + +571 +00:43:55,170 --> 00:43:59,690 + 단계 동생이 아주 적합한 단지로 조정하고, 그래서 만약 실제로 다음에 봐 + +572 +00:43:59,690 --> 00:44:05,200 + 업 후 퀴즈 당신이 그들을 사용하고자하는 어떤 프레임 워크를 같이보고 싶었다로 + +573 +00:44:05,199 --> 00:44:09,679 + 개월 동안 우리는 약 어쩌면위한 좋은 선택이 될 것입니다 강렬한 r에 어떻게 이야기를 우리의 + +574 +00:44:09,679 --> 00:44:16,288 + 텐트는 흐름이고 나는 미친이 그렇게 될 때이 자격이 생각 나는 + +575 +00:44:16,289 --> 00:44:19,749 + 아마 조금 더 자세히 이야기하고 싶었 방법이 주목 벡터 + +576 +00:44:19,748 --> 00:44:24,308 + 이 요약 의사가 생성되는 방식이 문서가 실제로 회담 그래서 + +577 +00:44:24,309 --> 00:44:29,278 + 이러한 요인 때문에 아이디어로서 생성에 대해 두 가지 방법 + +578 +00:44:29,278 --> 00:44:33,559 + 우리가 마지막 슬라이드에서 본 우리가 우리의 입력 이미지를 가지고이 위대한를 얻을 수 있다는 것입니다 + +579 +00:44:33,559 --> 00:44:38,019 + 우리의 네트워크에서 길쌈 영역 중 하나에서 오는 교사와 + +580 +00:44:38,018 --> 00:44:41,899 + 각 시간이 확률 분포를 만들어 우리의 네트워크를 중지 + +581 +00:44:41,900 --> 00:44:45,789 + 위치 이상 그래서 이것은에 끝나가 소프트 토지의 전체 영향 것 + +582 +00:44:45,789 --> 00:44:50,329 + 그것을 정상화 지금 생각은 우리가이 위대한 기능을 수행 할 것입니다 + +583 +00:44:50,329 --> 00:44:54,249 + 이러한 확률 분포와 함께 벡터와 하나의 생산 + +584 +00:44:54,248 --> 00:44:59,798 + D-차원 요소 입력 영상 것을 요약하고 용지가있다 + +585 +00:44:59,798 --> 00:45:04,159 + 실제로 쉬운 방법은 그래서이 문제를 해결하는 두 가지 방법을 탐구 + +586 +00:45:04,159 --> 00:45:08,969 + 그녀는 Rd에는 차원 r에 그래서 그들은 부드러운 구금 부르는 것을 사용 + +587 +00:45:08,969 --> 00:45:13,518 + 벡터의 예는 그리드 여기서 모든 요소의 가중 합계 것 + +588 +00:45:13,518 --> 00:45:18,028 + 각 요소는 바로 아마 그 예측 확률에 의해 그것의에 의해 대기한다 + +589 +00:45:18,028 --> 00:45:23,318 + 이것은 또 다른 층과 같은 종류의 좋은 그것을 구현하기 위해 실제로 매우 간단합니다 + +590 +00:45:23,318 --> 00:45:28,599 + 신경망과이 문맥의 유도체 등이 그라디언트 + +591 +00:45:28,599 --> 00:45:32,588 + 대한 요인은 확률을 예측 P는 아주 좋은 쉽습니다 + +592 +00:45:32,588 --> 00:45:36,818 + 그냥 보통의 구배를 사용하여 우리가 실제로 훈련을 수있는이 일을 계산하기 + +593 +00:45:36,818 --> 00:45:40,019 + 하강 및 역 전파 + +594 +00:45:40,019 --> 00:45:44,559 + 그러나 실제로이 경쟁하는 다른 또 다른 옵션을 탐험 + +595 +00:45:44,559 --> 00:45:48,210 + 특징 벡터 그래서 대신 심장주의라는 그 뭔가 + +596 +00:45:48,210 --> 00:45:52,630 + 이 가중 합을 갖는 우리는 단지 하나의 요소를 선택 할 수 있습니다 + +597 +00:45:52,630 --> 00:45:57,940 + 그래서 당신이 할 수있는 매우 간단한 일을 상상하기에 참석하기 위해 업그레이드 + +598 +00:45:57,940 --> 00:46:02,440 + 단지 확률이 가장 높은 단지로 그리드의 요소를 선택합니다 + +599 +00:46:02,440 --> 00:46:07,269 + 그 부분 세금 위치에 대응하는 특징 벡터 빌려 당겨 + +600 +00:46:07,269 --> 00:46:13,150 + 이 공원 옆 경우 경우에이 카드가 최대에 대해 생각하면 문제는 지금 + +601 +00:46:13,150 --> 00:46:16,829 + 당신에 대한이 파생 상품에 대한 미분을 생각하는 우리의 + +602 +00:46:16,829 --> 00:46:18,360 + 배포 P + +603 +00:46:18,360 
--> 00:46:22,980 + 이것이 그래서 더 이상 역 전파에 대한 매우 친절 아니라고 밝혀 + +604 +00:46:22,980 --> 00:46:29,059 + 내가 실제로 가장 큰 그 PA한다고 가정 또는 경우에 우리의 다음 경우 상상 + +605 +00:46:29,059 --> 00:46:33,119 + 요소와 우리의 입력과 우리가 조금의 pH를 변경하는 경우 지금 무슨 일이 + +606 +00:46:33,119 --> 00:46:40,130 + 비트 레이트는 그래서 만약 그가 건축가이며, 우리는 확률을 가볍게 흔들다 + +607 +00:46:40,130 --> 00:46:44,869 + 유통 조금 NPA는 여전히 건축가가 될 것입니다 그래서 우리는 여전히거야 + +608 +00:46:44,869 --> 00:46:49,400 + 실제로 미분을 의미하는 입력에서 동일한 요소를 선택 + +609 +00:46:49,400 --> 00:46:53,990 + 이 요소의 대해 쉽게 예측할 확률은 0이 될 것입니다있다 + +610 +00:46:53,989 --> 00:46:58,689 + 지금 우리가 정말 사용할 수 없습니다 거의 모든 곳에서 그, 그래서 그것은 아주 나쁜 주입니다 + +611 +00:46:58,690 --> 00:47:02,970 + 그들이 제안하는 것을 알 수 있도록 역 전파 더 이상이 일을 훈련합니다 + +612 +00:47:02,969 --> 00:47:06,549 + 강화 학습을 기반으로 또 다른 방법은 실제의 모델을 학습합니다 + +613 +00:47:06,550 --> 00:47:12,710 + 당신이 원하는 이러한 상황은 단일 요소를 선택하지만 약간의 + +614 +00:47:12,710 --> 00:47:16,260 + 더 복잡한 우리는이 강의에서 그것에 대해 않을거야 말거야하지만 단지 수 있도록 + +615 +00:47:16,260 --> 00:47:18,900 + 그건 당신이 부드러운의 차이를 볼 수 있습니다 뭔가 알고 있음 + +616 +00:47:18,900 --> 00:47:26,010 + 실제로 지금 우리가 볼 수 하나를 선택 관심과 심장주의 + +617 +00:47:26,010 --> 00:47:30,450 + 우리가 실제로 발생하고 그렇게 때문에이 모델에서 일부 꽤 결과 + +618 +00:47:30,449 --> 00:47:34,480 + 그리드 위치 우리가 할 수있는 모든 시간이 정지 이상의 확률 분포 + +619 +00:47:34,480 --> 00:47:38,519 + 있습니다 우리는 예술의 각 단어를 생성로서 그 확률 분포를 시각화 + +620 +00:47:38,519 --> 00:47:44,039 + 새를 모두 다시 보여줍니다 생성 된 캡션 그럼이 입력 이미지들은 + +621 +00:47:44,039 --> 00:47:48,279 + 마음주의 모델 모두이 경우에 그녀의 부드러운주의 모델 모두 + +622 +00:47:48,280 --> 00:47:51,650 + 캡션에게 물주기의 몸에 비행 조류를 생산 + +623 +00:47:51,650 --> 00:47:57,090 + 이 두 모델들은 무엇을 그 확률 분포의 모양을 시각화 + +624 +00:47:57,090 --> 00:48:01,690 + 이 두 가지 모델처럼 상단은 부드러운주의를 할 수 있도록 보여줍니다 있도록 + +625 +00:48:01,690 --> 00:48:04,849 + 이 모든에서 확률을 평균이기 때문에 그것은 일종의 확산있어 볼 + +626 +00:48:04,849 --> 00:48:09,309 + 위치와 이미지 하단에 단지 하나의 요소를 보여주는 것 + +627 +00:48:09,309 --> 00:48:16,289 + 그것은 꺼내 실제로 아주 좋은 로맨틱 드라마의 의미를 당신에게 있다는 + +628 +00:48:16,289 --> 00:48:19,779 + 모델이 특히 부드러운 관심이 상단에있을 때 볼 수 있습니다 + +629 +00:48:19,780 --> 00:48:23,340 + 나는 새에 대해 얘기하고 얘기 할 때 매우 좋은 결과이다 생각 + +630 +00:48:23,340 --> 00:48:26,610 + 초점 종류의 비행에 대해 새에 적합한 다음이 얘기 할 때 + +631 +00:48:26,610 --> 00:48:30,820 + 물에 대해는 좀 다른 모든 것들 때문에 다른 일에 초점을 맞추고 + +632 +00:48:30,820 --> 00:48:34,269 + 지적 것은 대한 감독 및 교육 시간을받지 않은 것입니다 + +633 +00:48:34,269 --> 00:48:38,869 + 이미지의 일부 단지에 자신의 마음을 만들어에 참석해야한다 + +634 +00:48:38,869 --> 00:48:43,289 + 더 나은 일을 캡처 도움이 무엇 이건을 기반으로 그 부분에 참석 + +635 +00:48:43,289 --> 00:48:46,480 + 우리가 실제로 단지에서 이러한 해석 결과를 얻을 수 꽤 멋지다 + +636 +00:48:46,480 --> 00:48:51,920 + 이 자막 작업, 우리는 몇 몇 다른 결과의 원인을 볼 수 있습니다 + +637 +00:48:51,920 --> 00:48:56,340 + 우리가 볼 수 그들은 재미있어 그 우리가 던지는 한 여자를 던지고 개를 때 + +638 +00:48:56,340 --> 00:49:01,079 + 다양한에서 개에 대해 이야기 Presby 공원에서 프리즈는 인식 + +639 +00:49:01,079 --> 00:49:05,259 + 개, 특히 흥미로운 바로 때를 바닥에서이 사람이다 + +640 +00:49:05,260 --> 00:49:08,790 + 그것은 실제로 모든 것들에 초점을 맞추고 단어 나무를 생성 + +641 +00:49:08,789 --> 00:49:13,440 + 배경 다시뿐만 아니라 기린과는 전혀 나오고 있지이 + +642 +00:49:13,440 --> 00:49:22,179 + 감독은 모든 단지 캡션을 기반으로 네 질문을하거나 + +643 +00:49:22,179 --> 00:49:27,440 + 문제는 당신이주의 대 하드 선호하는 경우가 그래서 무엇이다 + +644 +00:49:27,440 --> 00:49:31,380 + 나는 사람들이 일반적으로 그녀의도에 원하는 줄 것을 일종의 두 동기의 생각 + +645 +00:49:31,380 --> 00:49:33,530 + 처음에 전혀 관심을 + +646 +00:49:33,530 --> 00:49:37,580 + 그 중 하나는 좋은 끝없는 출력을주고 당신이 얻을 생각하는 것입니다 + +647 +00:49:37,579 --> 00:49:42,710 + 두 경우 모두에서 좋은 해석 출력 적어도 이론적으로는 아마도 그녀의 + +648 +00:49:42,710 --> 00:49:46,130 + 구금는 확실히 꽤 있지만, 다른 동기 부여하지 않았다 것 같아요 + +649 +00:49:46,130 --> 00:49:49,970 + 주의를 사용하여 때를 특히 계산 부담을 완화하는 것입니다 + +650 +00:49:49,969 --> 00:49:54,989 + 매우 매우 큰이 있고 
실제로 계산 비용이 많이들 수 있습니다 넣어 + +651 +00:49:54,989 --> 00:49:58,619 + 각 시간 단계에서 그 전체의 입력을 처리하고 더 효율적일 수도 + +652 +00:49:58,619 --> 00:50:02,869 + 우리는 단지 각 시간 단계에서 입력 한 부분에 초점을 맞출 수 계산하는 경우 + +653 +00:50:02,869 --> 00:50:07,380 + 단지 작은 부분 집합 척 처리가 부드러운 관심 때문에 너무 + +654 +00:50:07,380 --> 00:50:10,730 + 우리는 우리가 어떤을하지 않는 모든 포지션에 걸쳐 평균 이런 종류의 일을하고 + +655 +00:50:10,730 --> 00:50:14,369 + 계산 저축은 여전히​​ 모든 시간에 전체 입력을 처리하는 + +656 +00:50:14,369 --> 00:50:17,799 + 단계하지만 마음의 관심과 우리는 실제로 계산 절감 효과를 얻을 수 있습니까 + +657 +00:50:17,800 --> 00:50:22,680 + 명시 적으로 I 있도록 입력의 일부 작은 부분 집합을 따기 (pic)의 한 사람 + +658 +00:50:22,679 --> 00:50:26,289 + 또한 그녀의 구금이 강화됩니다 즉 그 큰 혜택을의 생각 + +659 +00:50:26,289 --> 00:50:41,420 + 학습과 CRN 당신이 똑똑 그 종류의 그래 그래서 질문있어 보이게 확장 + +660 +00:50:41,420 --> 00:50:46,150 + 문제는 모든에서이 작업을 수행하는 방법을 내가 대답은 그것의 생각 + +661 +00:50:46,150 --> 00:50:49,789 + 정말 그것의 입력 오른쪽 상관 관계 구조의 종류를 학습 + +662 +00:50:49,789 --> 00:50:54,779 + 강아지와 이미지의 많은 예를 보지하고 강아지와 함께 많은 문장입니다 만 + +663 +00:50:54,780 --> 00:50:57,480 + 강아지와 함께 그 다른 이미지의 개는 다른에 표시하는 경향이 + +664 +00:50:57,480 --> 00:51:01,349 + 입력의 위치와 나는 그것이 최적화를 통해 밝혀 같아요 + +665 +00:51:01,349 --> 00:51:05,659 + 절차 실제로 장소에 더 무게를 두는 경우 개 + +666 +00:51:05,659 --> 00:51:10,399 + 실제로 실제로 존재 그렇게하지 ​​않도록 몇 가지 방법으로 자막 작업을하는 데 도움이 + +667 +00:51:10,400 --> 00:51:14,460 + 그냥 그냥도 난 작업 할 일이 아주 아주 좋은 답이 있다고 생각 + +668 +00:51:14,460 --> 00:51:18,500 + 확실하지 그래서 분명이 다음이에서 수치입니다 인물의 사진입니다 + +669 +00:51:18,500 --> 00:51:23,300 + 그것은 임의의 이미지를 어떻게 작동하는지 잘 잘 모르겠어요 그래서 논문은 임의의 결과가 마음에 들지 + +670 +00:51:23,300 --> 00:51:31,870 + 하지만 다른 점은 정말이 특히이 모델 소프트에 대한 지적합니다 + +671 +00:51:31,869 --> 00:51:35,739 + 구금은 제약 조건의 종류가에서이 고정 된 격자 점이다 + +672 +00:51:35,739 --> 00:51:41,199 + 우리 같은 이들보다이 좋은이 확산 점점 얻을 컨볼 루션 기능지도 + +673 +00:51:41,199 --> 00:51:44,449 + 일을 찾고 있지만, 사람들은 그저이이 밖으로 흐리게처럼 + +674 +00:51:44,449 --> 00:51:48,210 + 분포 모델 실제로보고 할 능력이없는 + +675 +00:51:48,210 --> 00:51:52,220 + 입력의 임의의 영역 만이 고정 그리드 볼 수있어 + +676 +00:51:52,219 --> 00:51:55,959 + 지역 + +677 +00:51:55,960 --> 00:51:59,690 + 또한 부드러운 관심이 아이디어는 정말 아니라고 지적한다 + +678 +00:51:59,690 --> 00:52:04,789 + 본 논문에서 소개 난 정말이 개념을 가지고 있었던 첫 번째 논문을 생각한다 + +679 +00:52:04,789 --> 00:52:09,159 + 부드러운 관심은이 유사 그래서 여기에 기계 번역에서 온 + +680 +00:52:09,159 --> 00:52:13,299 + 우리는 다음 스페인어, 여기에 몇 가지 입력 문장을하려는 의욕 + +681 +00:52:13,300 --> 00:52:17,960 + 영어로 출력 문장을 생성하고이 재발와 함께 할 것 + +682 +00:52:17,960 --> 00:52:22,179 + 우리는 먼저 판독 할 시퀀스 모델 신경망 시퀀스 우리 + +683 +00:52:22,179 --> 00:52:26,588 + 입력 재발 네트워크와 문장하고는 출력 시퀀스가​​ 생성 + +684 +00:52:26,588 --> 00:52:29,269 + 우리는 자막에서와 같은 매우 유사 + +685 +00:52:29,269 --> 00:52:33,119 + 그러나이 논문에서 그들은 실제로 입력을 통해 관심을 가지고 싶어 + +686 +00:52:33,119 --> 00:52:38,599 + 강렬한 그들은 조금으로 정확한 메커니즘 때문에 자신의 문장을 생성 된 + +687 +00:52:38,599 --> 00:52:43,080 + 다른 그러나 직관은 지금 우리가 처음이를 생성 할 때 같은있다 + +688 +00:52:43,079 --> 00:52:47,469 + 말씀은 우리를 통해 전력 분배를하지 계산하려는의 나 + +689 +00:52:47,469 --> 00:52:52,000 + 이미지의 대신 우린 있도록 입력 문장에서 단어를 통해 지역 + +690 +00:52:52,000 --> 00:52:55,289 + 거 희망을 스페인어로이 첫 번째 단어에 초점을 맞출 것이다 분포를 얻을 + +691 +00:52:55,289 --> 00:52:59,170 + 문장, 그리고, 우리는 각 단어의 일부 사진을 촬영 한 후 관련이 있습니다 + +692 +00:52:59,170 --> 00:53:03,780 + 이를 반복 것이이 프로세스의 다음 단계로 궤환 + +693 +00:53:03,780 --> 00:53:08,820 + 모든 시간에 너무 부드러운 구금이 아이디어는 매우있는 네트워크를 단계 + +694 +00:53:08,820 --> 00:53:12,230 + 이미지 캡처에뿐만 아니라 기계뿐만 아니라 쉽게 적용 + +695 +00:53:12,230 --> 00:53:18,990 + 번역 질문은 질문은 가변 길이에 대해이 작업을 수행 할 방법입니다 + +696 +00:53:18,989 --> 00:53:23,409 + 문장 그리고 내가 조금 이상 호도 뭔가하지만 아이디어는 당신입니다 + +697 +00:53:23,409 --> 00:53:26,980 + 이미지 캡션에 대한 그래서 주소 내용이라고 기반으로 무엇을 사용 + +698 +00:53:26,980 --> 00:53:31,559 + 우리 
모두는이 일곱 그리드에 의해 아마 일곱이 고정되어 있는지 미리 알 수 있도록 + +699 +00:53:31,559 --> 00:53:35,579 + 우리는 단지이 직접하는 대신 확률 분포를 생성 + +700 +00:53:35,579 --> 00:53:40,440 + 인코더와 같은 모델은 일부 벡터를 생성있어 입력 된 문장을 읽는 + +701 +00:53:40,440 --> 00:53:45,320 + 인코딩이 디코더에서 지금 입력 문장의 각 단어 대신 + +702 +00:53:45,320 --> 00:53:49,300 + 의 직접 확률 분포 확률 벡터를 생성 그것 + +703 +00:53:49,300 --> 00:53:52,900 + 방법은 각각과 내적을 얻을 것이다 벡터의 종류를 확산하기 + +704 +00:53:52,900 --> 00:53:57,000 + 그 코드 벡터와 입력 한 다음 그 위에 제품을 얻을 익숙해 + +705 +00:53:57,000 --> 00:54:02,159 + 재 정규화 및 분포로 변환 + +706 +00:54:02,159 --> 00:54:06,940 + 그래서 부드러운 구금이 아이디어를 구현하기 꽤 용이하고 + +707 +00:54:06,940 --> 00:54:10,970 + 꽤 훈련하기 쉬운 그래서 작년 정도에 매우 인기가있어와 + +708 +00:54:10,969 --> 00:54:14,489 + A와 부드러운 관심이 아이디어를 적용 논문의 전체 무리가있다 + +709 +00:54:14,489 --> 00:54:18,349 + 다른 문제의 전체 무리보고 몇 논문이 있었다 있도록 + +710 +00:54:18,349 --> 00:54:22,360 + 우리가 본대로 기계 번역 소프트 구금에서이되어왔다 + +711 +00:54:22,360 --> 00:54:24,230 + 실제로하고 싶은 몇 가지 서류 + +712 +00:54:24,230 --> 00:54:28,179 + 그들은 오디오 신​​호에 읽은 다음 나는 놓을 게요 음성 녹음 + +713 +00:54:28,179 --> 00:54:32,589 + 영어 단어는 너무 부드러운 관심을 사용하는 몇 가지 서류가되었습니다 + +714 +00:54:32,590 --> 00:54:37,130 + 해당 작업 주에 도움이 입력 오디오 시퀀스를 통해 거기에있었습니다 + +715 +00:54:37,130 --> 00:54:41,300 + 당신이 읽을 그래서 여기에 동영상 캡션 소프트 관심을 사용하는 방법에 적어도 하나의 종이 + +716 +00:54:41,300 --> 00:54:45,260 + 프레임의 일부 순서와 단어와 당신의 다음 출력을 어떤 순서 + +717 +00:54:45,260 --> 00:54:49,110 + 가있는 한, 입력 시퀀스의 프레임인지를 통해 장력을 갖도록 할 + +718 +00:54:49,110 --> 00:54:53,050 + 캡션을 생성하는 당신은 아마이 작은 비디오 것을 볼 수 있었다 + +719 +00:54:53,050 --> 00:54:57,240 + 시퀀스들은 출력 누군가가 냄비에 물고기 위해 노력하고 때 생성된다 + +720 +00:54:57,239 --> 00:55:01,169 + 단어 누군가가 실제로 비디오에서이 두 번째 프레임에 많은 참석 + +721 +00:55:01,170 --> 00:55:05,590 + 순서 그들은 단어 튀김이 마지막에 더 많은 참석 생성 할 때 + +722 +00:55:05,590 --> 00:55:11,480 + 비디오 시퀀스의 요소는이 작업에 대한 몇 가지 서류가 있었다 + +723 +00:55:11,480 --> 00:55:16,059 + 당신에게 그래서 여기에 설정을 응답 질문을하면 자연에서 읽을 수 있다는 것입니다 + +724 +00:55:16,059 --> 00:55:20,590 + 언어 질문 당신은 또한 이미지와 이미지를 읽고 모델에 필요 + +725 +00:55:20,590 --> 00:55:22,870 + 그 질문에 대한 답을 생산 + +726 +00:55:22,869 --> 00:55:28,139 + 그래서 자연 언어에서 그 질문에 대한 답을 생산하고 거기 + +727 +00:55:28,139 --> 00:55:31,869 + 이미지를 통해 공간주의의 아이디어를 탐구 커플 논문 + +728 +00:55:31,869 --> 00:55:35,420 + 또 다른 일을 응답 질문이 문제를 돕기 위해 + +729 +00:55:35,420 --> 00:55:38,860 + 지적하는 것은이 있었다, 그래서이 논문의 일부는 좋은 게임을 가지고있다 + +730 +00:55:38,860 --> 00:55:43,000 + 보여 앤 알려 쇼 교환이 있었다 나는거야 거기 십분 미만 + +731 +00:55:43,000 --> 00:55:45,039 + 주문 + +732 +00:55:45,039 --> 00:55:49,999 + II 정말 약 즐길 수 있도록이 하나 답변을 참석하도록 요청 + +733 +00:55:49,998 --> 00:55:56,808 + 나는이 작업 줄에 불과 해요 이름으로 창의력과 부드러운의이 아이디어 + +734 +00:55:56,809 --> 00:55:59,910 + 구금 그래서 많은 사람들이 두을 업로드 구현하기 매우 쉽다 + +735 +00:55:59,909 --> 00:56:05,899 + 하지만 작업의 톤은 우리가 구현 이러한 종류의와 함께이 문제를보고 기억 + +736 +00:56:05,900 --> 00:56:09,709 + 부드러운 관심의 그것은 우리가 지역을 중재하기 위해 참석 할 수 없습니다입니다 + +737 +00:56:09,708 --> 00:56:14,038 + 입력 대신 제약하고 만 주어진이 고정 된 그리드에 참석할 수 + +738 +00:56:14,039 --> 00:56:18,699 + 길쌈 기능지도에 의해, 그래서 문제는 우리가 이것을 극복 할 수 있는지 여부입니다 + +739 +00:56:18,699 --> 00:56:23,559 + 제한은 여전히​​ 참석하고 어떻게 든 임의의 입력 영역에 참석 + +740 +00:56:23,559 --> 00:56:28,089 + 내가 생각하는 다른 방법 + +741 +00:56:28,088 --> 00:56:32,900 + 작업 이러한 유형의 전구체는 알렉스에서이 논문은 그래서 2013 년에 다시 무덤이다 + +742 +00:56:32,900 --> 00:56:38,249 + 여기에 그는 같은 입력 자연 언어 문장을 읽고 다음과 같이 생성 원 + +743 +00:56:38,248 --> 00:56:43,598 + 작성하는 것처럼 일반 필기 될 출력 실제로 이미지 + +744 +00:56:43,599 --> 00:56:48,528 + 필기에 그 문장과 그가 실제로 관심을 가지고있는 방법이 + +745 +00:56:48,528 --> 00:56:53,418 + 멋진 방법의 종류이 출력 이미지 위에 우리는 지금 그가 실제로 예측하는 것하고 + +746 +00:56:53,418 --> 00:56:57,608 + 그런 다음 출력 영상 이상 현금과 혼합 모델의 파라미터 + +747 +00:56:57,608 --> 
00:57:02,739 + 실제적으로 출력 영상의 부분을 중재하는 것을 사용하고 참석 + +748 +00:57:02,739 --> 00:57:07,028 + 이 사실은 이들 중 일부는 오른쪽에 정말 잘 그래서 정말 작동 + +749 +00:57:07,028 --> 00:57:12,259 + 실제로 사람들에 의해 작성하고 나머지는 그의 그를 작성했습니다 + +750 +00:57:12,259 --> 00:57:16,269 + 네트워크는 그래​​서 당신은 현실에서 발생하는 차이를 알 수 있습니다 + +751 +00:57:16,268 --> 00:57:24,418 + 내가 할 수 없습니다 생성하는 그래서 상단 하나는 진짜 그가 있다고 밝혀 + +752 +00:57:24,418 --> 00:57:31,049 + 네트워크에 의해 생성 된 모든 바닥 + +753 +00:57:31,050 --> 00:57:35,580 + 그래 어쩌면 어쩌면 진짜 기적은 문자 나 사이에 많은 차이가 + +754 +00:57:35,579 --> 00:57:39,380 + 그런 일하지만, 이러한 결과는 정말 잘 작동 실제로 그는이 + +755 +00:57:39,380 --> 00:57:42,820 + 당신이 가서 그 방금 할 수있는 브라우저에서 실행하려고 할 수있는 온라인 데모 + +756 +00:57:42,820 --> 00:57:46,800 + 단어를 입력하고 재미 종류의 당신을위한 필기를 생성합니다 + +757 +00:57:46,800 --> 00:57:52,840 + 우리가 이미 본 다른 또 다른 종이 종류의 소요가 그 그릴입니다 + +758 +00:57:52,840 --> 00:57:56,500 + 다음을 통해 임의 구금이 아이디어는 몇 가지 더로 확장 + +759 +00:57:56,500 --> 00:58:01,050 + 실제 세계의 문제는 세대가 그래서 그들은 고려 하나의 작업은 필기하지 + +760 +00:58:01,050 --> 00:58:05,960 + 이미지 분류는 여기에 우리가하지만 그 과정에서이 숫자를 분류 할 + +761 +00:58:05,960 --> 00:58:09,920 + 분류의 우리는 실제로 입력 영역을 중재하기 위해 참석거야 + +762 +00:58:09,920 --> 00:58:14,639 + 이 분류 작업에 도움하기 위해 이미지 그래서 이것은이입니다 + +763 +00:58:14,639 --> 00:58:17,909 + 가지가 종류의 자체 학습 냉각하지만이 참석해야 + +764 +00:58:17,909 --> 00:58:22,710 + 순서대로 숫자는 영상 분류에 도움이 또한 철회합니다 + +765 +00:58:22,710 --> 00:58:27,849 + 유사한과 임의 출력 이미지를 생성하는 개념을 고려 + +766 +00:58:27,849 --> 00:58:31,589 + 우리가 가지고있는거야 필기 생성과 같은 동기 부여의 종류 + +767 +00:58:31,590 --> 00:58:35,740 + 출력 이미지를 통해 임의의 관심은 단지에이 출력을 생성 + +768 +00:58:35,739 --> 00:58:42,589 + 내 침대와 나는 우리가 전에이 동영상을 본 것 같아요하지만이 그래서 정말 멋지다 + +769 +00:58:42,590 --> 00:58:47,190 + 당신이 여기에서 우리는거야 볼 수 있도록 내 마음에서 무승부 네트워크는 우리가있어 그것을 할 + +770 +00:58:47,190 --> 00:58:51,200 + 분류 작업을하는 것은 일종의에서 영역을 중재에 참석하기 위해 배운다 + +771 +00:58:51,199 --> 00:58:55,439 + 우리는 우리가 지역을 중재하기 위해 참석거야 생성 입력 할 때 + +772 +00:58:55,440 --> 00:58:59,579 + 그것은 다수 생성 할 수 있도록 출력은 실제로 이러한 숫자를 생성 + +773 +00:58:59,579 --> 00:59:04,000 + 한 번에 숫자와 실제로이이 집 번호 다음를 생성 할 수 있습니다 + +774 +00:59:04,000 --> 00:59:10,639 + 집 번호는 그래서 이것은 정말 멋진 당신이 볼 수 있듯이 당신은이 지역을 좋아한다 + +775 +00:59:10,639 --> 00:59:13,920 + 실제로 종류의 성장과 초과 근무 축소되었다 참석했다 + +776 +00:59:13,920 --> 00:59:17,430 + 이미지 위에 연속적으로 이동 그것은 확실히에 구속되지 않았습니다 + +777 +00:59:17,429 --> 00:59:21,690 + 우리와 같은 고정 된 그리드이되도록하는 방법을 알려 쇼 교환 보았다 + +778 +00:59:21,690 --> 00:59:26,840 + 작동 종이 조금 조금 이상하고 깊은 일부 후속 작품입니다 + +779 +00:59:26,840 --> 00:59:34,260 + 내 모든 초점은 모든 괜찮 좀 더 분명 왜 하늘은 실제로 생각하는 마음 + +780 +00:59:34,260 --> 00:59:38,630 + 바로 그래서 매우 유사한을 사용 걸릴이 후속 논문이있다 + +781 +00:59:38,630 --> 00:59:43,500 + 기구의 관심은 특별한 전송 네트워크라고하지만 난 많은 생각 + +782 +00:59:43,500 --> 00:59:44,500 + 이해하기 쉽게 + +783 +00:59:44,500 --> 00:59:49,039 + 그리고 매우 깨끗한 방법으로 제시 한 아이디어는 우리가 갖고 싶어한다는 것입니다 수 있도록 + +784 +00:59:49,039 --> 00:59:53,369 + 입력 영상이 우리의 마음에 드는 새와 우리는 이런 종류의를 갖고 싶어 + +785 +00:59:53,369 --> 00:59:57,589 + 당신은 당신에게 있습니다에 참석하려는 우리에게 말하고 변수의 연속 세트 + +786 +00:59:57,590 --> 01:00:01,579 + 우리가 어떤 상자의 중심과 너비와 높이의 모서리를 상상 + +787 +01:00:01,579 --> 01:00:06,170 + 이 지역의 우리에 첨부 할 다음 우리는 몇 가지 기능을 갖고 싶어 그 + +788 +01:00:06,170 --> 01:00:10,240 + 우리의 입력 영상을 받아 이러한 지속적인 관심 좌표 + +789 +01:00:10,239 --> 01:00:14,919 + 다음 몇 가지 고정 된 크기의 출력을 생성하고 우리는이 작업을 수행 할 수 없습니다 + +790 +01:00:14,920 --> 01:00:21,840 + 미분 방법은 그래서 이것은 이것은 당신에 그 상상할 수 좀 하드 바로 보인다 + +791 +01:00:21,840 --> 01:00:26,250 + 자르기의 생각과 이상은 다음이 입력은 정말 연속이 될 수 없습니다 + +792 +01:00:26,250 --> 01:00:30,590 + 그들은 두 개의 정수 그렇게 우리 나라 픽셀 값의 정렬을해야하고 그렇지 않아 + +793 +01:00:30,590 --> 01:00:34,550 + 우리는이 함수가 연속 또는 차등 할 수있는 방법을 정확하게 정말 선택 + +794 +01:00:34,550 --> 
01:00:39,210 + 그리고 그들은 실제로 아주 좋은 해결책을 와서 생각은 우리가 걸이다 + +795 +01:00:39,210 --> 01:00:44,679 + 거 픽셀의 좌표에서지도하는 매개 변수화 기능을 적어 + +796 +01:00:44,679 --> 01:00:50,469 + 그래서 여기에 입력 픽셀의 좌표를 출력에 우리는거야 말거야 + +797 +01:00:50,469 --> 01:00:54,839 + 이이 왼쪽 오른쪽 상단 픽셀 것을 다른 가능성이있다 + +798 +01:00:54,840 --> 01:00:59,700 + 좌표는 출력에 TYT을 X와 우리는이를 계산 확인하는거야 + +799 +01:00:59,699 --> 01:01:04,480 + 이 민영화를 사용하여 입력 이미지에 액세스 및 백악관 좌표 + +800 +01:01:04,480 --> 01:01:08,900 + 즉 그, 그래서 좋은 함수는 우리가 할 수있는 좋은 미분 가능 함수의 + +801 +01:01:08,900 --> 01:01:13,349 + 다음 이들에 대해 벌금 전송에 따라을 우리가 할 수있는 차별화 + +802 +01:01:13,349 --> 01:01:17,059 + 에서 아마 상단 왼쪽 상단 픽셀 다시이 과정을 반복 + +803 +01:01:17,059 --> 01:01:21,219 + 출력 화상 우리의 좌표에 매핑이 행성 상승 기능을 사용하여 + +804 +01:01:21,219 --> 01:01:27,199 + 입력의 화소 이제 우리는 우리의 출력의 모든 픽셀에 대해이 작업을 반복 할 수있는 + +805 +01:01:27,199 --> 01:01:31,689 + 생각이 될 것입니다, 그래서 우리에게 샘플링 그리드라는 뭔가를 제공합니다 + +806 +01:01:31,690 --> 01:01:36,480 + 출력의 각 픽셀에 대해 다음 우리의 출력 이미지와 샘플링 그리드는 우리에게 알려줍니다 + +807 +01:01:36,480 --> 01:01:41,610 + 여기서 입력에 픽셀에서 온해야 얼마나 많은을 복용하는 사람 + +808 +01:01:41,610 --> 01:01:47,590 + 컴퓨터 그래픽 과정 많지 않은이 왼쪽에 보이는 질감처럼 좀 보인다 그래서 + +809 +01:01:47,590 --> 01:01:52,510 + 그들은 컴퓨터에 텍스처 매핑에서이 아이디어를 취할 수 있도록 매핑은하지 않습니다 + +810 +01:01:52,510 --> 01:01:56,300 + 단지 선형 보간하여 사용하고 그래픽 한 번 출력을 계산하기 + +811 +01:01:56,300 --> 01:01:57,720 + 우리는 샘플링 그리드가 + +812 +01:01:57,719 --> 01:02:02,669 + 그래서 우리가 가지고 이제 지금 무슨 일이 지금이 지금이 실제로 우리의 네트워크를 할 수 있습니다 + +813 +01:02:02,670 --> 01:02:07,450 + 입력과 좋은 미분 방법의 일부를 중재하기 위해 참석 우리 + +814 +01:02:07,449 --> 01:02:11,789 + 지금 바로이 변형을 예측합니다 네트워크는 PANA 것을 좌표 + +815 +01:02:11,789 --> 01:02:16,639 + 그래서 입력 영상의 영역을 중재에 참석하기 위해 전체를 수 + +816 +01:02:16,639 --> 01:02:20,199 + 그들은 좋은 작은 독립적 인 모듈에 모두이 일을 넣어 그 + +817 +01:02:20,199 --> 01:02:24,608 + 공간 변압기가 어떤 입력을 수신 그래서 그들은 특별한 변압기를 호출 + +818 +01:02:24,608 --> 01:02:29,679 + 우리의 원시 입력 이미지로하고 다음 실제로 실행 당신이 생각할 수있는 + +819 +01:02:29,679 --> 01:02:33,949 + 작은 완전히 연결 네트워크 또는 될 수있는 작은 현지화 네트워크 + +820 +01:02:33,949 --> 01:02:38,409 + 매우 얕은 길쌈 네트워크 및이이 현지화 네트워크의 뜻 + +821 +01:02:38,409 --> 01:02:44,500 + 실제로 지금이 출력이 좌표 데이터를 변환 계획으로 생산 + +822 +01:02:44,500 --> 01:02:48,829 + 아핀 변환 좌표는 이제 샘플링 그리드를 계산하는데 사용될 + +823 +01:02:48,829 --> 01:02:51,750 + 우리가보기 흉한에서 이러한 예측 한 것을 국산화로 변환 + +824 +01:02:51,750 --> 01:02:56,280 + 우리는 네트워크의 출력에서​​의 각 화소의 좌표를 각각의 화소를 매핑 + +825 +01:02:56,280 --> 01:03:02,280 + 출력을 다시 입력이 지금은 좋은 부드러운 미분 함수 + +826 +01:03:02,280 --> 01:03:06,230 + 우리가 샘플링 그리드를 일단 우리는 단지에 선형 보간에 의해 적용 할 수 있습니다 + +827 +01:03:06,230 --> 01:03:11,309 + 출력의 픽셀 값을 계산하고 경우에 당신이 생각하는 경우 + +828 +01:03:11,309 --> 01:03:15,588 + 어떻게이 일을하고있는 것은이 네트워크의 모든 단일 부품이 하나라는 것을 분명 + +829 +01:03:15,588 --> 01:03:21,159 + 이 일이 어떤없이 공동으로 관리 할 수​​ 있도록 지속적이고 두 개의 차동 + +830 +01:03:21,159 --> 01:03:26,579 + 11 종류의 비록 아주 좋은 미친 강화 학습 물건 + +831 +01:03:26,579 --> 01:03:31,789 + 당신이 선형으로 샘플링이 그것을 어떻게 작동하는지 알고있는 경우주의 사항은 바이 리니어 샘플링에 대해 알고 + +832 +01:03:31,789 --> 01:03:36,449 + 출력의 각 화소가 넷의 누적 평균가는 것을 의미 + +833 +01:03:36,449 --> 01:03:41,639 + 픽셀과 입력 그래서 그 기울기는 실제로 매우 로컬 그래서 이것은이다 + +834 +01:03:41,639 --> 01:03:45,549 + 연속과 미분 멋진하지만 난 당신의 전체 많이 얻을 생각하지 않습니다 + +835 +01:03:45,550 --> 01:03:50,300 + 세 번째 바이 리니어 샘플링을 통해하지만 당신이이 한 번 기울기 신호 + +836 +01:03:50,300 --> 01:03:54,410 + 특별이 ​​좋은 특수 전송 모듈 우리는 단지에 삽입 할 수 있습니다 + +837 +01:03:54,409 --> 01:03:58,739 + 네트워크에 존재하는 일종의 그들이 있도록 두 가지 참석을 배울 수 있도록 + +838 +01:03:58,739 --> 01:04:03,739 + 드롭 종이와 매우 유사이 분류 작업을 고려 어디 + +839 +01:04:03,739 --> 01:04:08,118 + 실제로 그들이 그렇게 사면 제트기의 이러한 변형 된 버전을 분류 할 + +840 +01:04:08,119 --> 01:04:09,519 + 여러 가지 다른 생각 + +841 +01:04:09,519 --> 
01:04:13,610 + 더 복잡한 변환하지 당신은 또한 할 수있는 단지 그가의 좋은 형질 전환 + +842 +01:04:13,610 --> 01:04:18,260 + 그의 화소에서 출력 픽셀 SPEKTR에서 매핑의 상상 + +843 +01:04:18,260 --> 01:04:21,470 + 우리는 아핀을 보였다 이전의 비행 변환 그러나 또한 고려 + +844 +01:04:21,469 --> 01:04:25,339 + 사영 변환도 얇은 판 스플라인하지만 아이디어는 당신에게 그냥 + +845 +01:04:25,340 --> 01:04:28,970 + 일부 민간 상승과 미분 가능 함수를 원하고 당신은 갈 수 + +846 +01:04:28,969 --> 01:04:34,829 + 그래서 여기에 왼쪽 네트워크의 일부 미친 그냥 분류하려고 + +847 +01:04:34,829 --> 01:04:38,380 + 왼쪽에 이렇게 일을하는이 자리는 우리의 서로 다른 버전이 + +848 +01:04:38,380 --> 01:04:43,340 + 이 가운데 콜린에 변형 된 자리는 서로 다른 얇은 판을 보이고있다 + +849 +01:04:43,340 --> 01:04:47,460 + 스플라인은 오른쪽에 다음 이미지의 일부에 참석하고 사용하고 있음 + +850 +01:04:47,460 --> 01:04:51,590 + 뿐만 아니라이 공간 변압기 모델의 출력을 보여줍니다 + +851 +01:04:51,590 --> 01:04:56,250 + 그 영역에 참석뿐만 아니라에 그 비행기에 대응에서 근무 + +852 +01:04:56,250 --> 01:05:01,730 + 오른쪽에 그들은은을 사용하고 오른쪽에 발산 찾기 앱을 사용하고 + +853 +01:05:01,730 --> 01:05:05,559 + 아핀이 실제로하고있는 것을 볼 수 있습니다 장소의 계획에 있지 변환 + +854 +01:05:05,559 --> 01:05:09,369 + 단지 입력에 참석하거나 실제로는 물론, 입력을 변화보다 + +855 +01:05:09,369 --> 01:05:14,849 + 그래서이 가운데 열에서 예를 들어이는 4이지만 실제로 회전있어 + +856 +01:05:14,849 --> 01:05:19,069 + 90도 같은으로 뭔가에 의해 때문에이 응용 프로그램을 사용하여 및 + +857 +01:05:19,070 --> 01:05:23,140 + 네트워크 변환 것이 아니라주의 전뿐만 아니라 회전 수로 + +858 +01:05:23,139 --> 01:05:27,839 + 적절한 직장에서 하류 분류에 대한 위치와이 전부입니다 + +859 +01:05:27,840 --> 01:05:31,930 + 아주 멋진 그리고 난 우리가 필요로하지 않는 부드러운 관심을 비슷한의 정렬 할 수 있습니다 + +860 +01:05:31,929 --> 01:05:35,949 + 이에 참석하고 싶어 그냥 스스로 결정할 수 있습니다 명시 적 감독 + +861 +01:05:35,949 --> 01:05:41,710 + 이 사람뿐만 아니라 매우있는 멋진 동영상을 가지고 있도록 문제를 해결하기 위해 + +862 +01:05:41,710 --> 01:05:53,860 + 인상적인 그래서 이것은 우리가 압축을 푼 여기 변압기 모듈입니다 + +863 +01:05:53,860 --> 01:05:58,930 + 우리는 실제로 지금이 실제로 분류 작업을 실행하는 바로 보여주는 것 + +864 +01:05:58,929 --> 01:06:03,389 + 그러나 우리가 입력을 변경하고 지속적으로 이러한 서로 다른 입력 것을 볼 수있다 + +865 +01:06:03,389 --> 01:06:08,429 + 네트워크 (22) 다음 실제로 경제 제휴에 참석 배운다 그 + +866 +01:06:08,429 --> 01:06:13,169 + 자리는 고정 알려진 포즈의 정렬하는 등 우리는 매우 입력이 주변으로 이동 + +867 +01:06:13,170 --> 01:06:18,500 + 이미지 네트워크는 여전히 자리에와에 잠금의 좋은 일을 + +868 +01:06:18,500 --> 01:06:23,059 + 바로 당신은 때때로 잘 그래서뿐만 아니라 회전을 해결할 수 있음을 알 수 + +869 +01:06:23,059 --> 01:06:26,809 + 여기 왼쪽에 실제로 그 자리 실제로 네트워크를 회전했다 + +870 +01:06:26,809 --> 01:06:31,619 + 다시 모두 부채와 경제 생활 투표와 회전 배운다 + +871 +01:06:31,619 --> 01:06:36,420 + 친구가 변형 또는 얇은 판 스플라인이도 미쳤 사용하여 + +872 +01:06:36,420 --> 01:06:40,389 + 예상 전송으로 휘게 그녀는 정말 좋은 일을 볼 수 있습니다 + +873 +01:06:40,389 --> 01:06:48,099 + 의 또한 자신의 작품에 참석하고 학습하고 다른 꽤 많이 할 + +874 +01:06:48,099 --> 01:06:52,829 + 대신 분류의 실험은이 일을 함께 추가 학습 + +875 +01:06:52,829 --> 01:06:58,369 + 가지 이상한 일이다하지만 그렇게 네트워크가 후퇴되어 작동 자리 + +876 +01:06:58,369 --> 01:07:05,389 + 두 개의 입력 입력 영상에 관해서는 나는 합계를 놓을 게하고 화상과 알지도 + +877 +01:07:05,389 --> 01:07:08,679 + 이 그것이 참석하고 작업에 필요가 있음을 알게 이상한 작업의 종류 + +878 +01:07:08,679 --> 01:07:15,659 + 이 때문에 그 이미지 최적화이 테스트입니다 기록 중입니다 + +879 +01:07:15,659 --> 01:07:20,009 + 공동 지역화라는 개념은 두 네트워크를 받으려고한다는 것이다 + +880 +01:07:20,010 --> 01:07:25,560 + 등의 입력 영상 아마 두 개의 서로 다른 이미지 네 다리를하고 작업이 말을하는 것입니다 + +881 +01:07:25,559 --> 01:07:31,179 + 여부 그 이미지는 다음 동일하거나 상이하고, 동일한 + +882 +01:07:31,179 --> 01:07:34,750 + 지역 공간 변압기를 사용하여 같은 것들을 지역화 학습 결국 + +883 +01:07:34,750 --> 01:07:38,139 + 잘 훈련의 과정을 통해 실제로 배운다 것을 볼 수 있습니다 + +884 +01:07:38,139 --> 01:07:42,239 + 우리가 가까이있을 때 매우 매우 정밀도이 일을 현지화 + +885 +01:07:42,239 --> 01:07:50,479 + 이러한 네트워크는 여전히 아주 아주 정확하게의 그 지역화하는 법을 배워야보다 이미지 + +886 +01:07:50,480 --> 01:07:58,280 + 꽤 멋진 깊은 마음의 최근 논문 + +887 +01:07:58,280 --> 01:08:11,519 + 특수 변압기에 대해 너무 다른 마지막 분 질문 그래 그래서 + +888 +01:08:11,519 --> 01:08:13,989 + 간단한 때문에 문제는이 일의 작업이 무엇인지 무엇이고 
있습니다 + +889 +01:08:13,989 --> 01:08:17,420 + 일을하고 바닐라 버전 적어도 그것 때문에 그냥 분류입니다 + +890 +01:08:17,420 --> 01:08:21,810 + 뒤틀린 될 수있는 입력 이런 종류의 수신 그녀의 어수선 또는 이것 저것하고 + +891 +01:08:21,810 --> 01:08:26,060 + 모두가 할 필요가 그 과정에서 일종의 분류 광고 예산입니다 + +892 +01:08:26,060 --> 01:08:29,839 + 또한 그것을 분류하는 학습 즉, 그래서 금이 부분에 참석하기 위해 계획 + +893 +01:08:29,838 --> 01:08:40,189 + 그건 내 개요이이 작품 오른쪽 정렬이의 정말 멋진 기능입니다 + +894 +01:08:40,189 --> 01:08:44,588 + 관심의 우리가 정말 쉬운 부드러운 관심을 가지고있다 + +895 +01:08:44,588 --> 01:08:49,119 + 고정 된 입력 위치의이 맥락에서 특히 구현하는 우리 단지 + +896 +01:08:49,119 --> 01:08:53,039 + 이상 분포를 생산하고 우리가 기다리는 사람을 넣어 우리는 그 사람들을 먹이 + +897 +01:08:53,039 --> 01:08:56,850 + 다시 어떻게 든 네트워크 요인이 많은에 구현하기 정말 쉽습니다 + +898 +01:08:56,850 --> 01:08:59,930 + 다른 문맥 및 다른 작업의 많은 구현 된 + +899 +01:08:59,930 --> 01:09:04,770 + 당신은 당신이 조금을 얻기 위해 필요한 것보다 지역을 중재하기 위해 참석하고자 할 때 + +900 +01:09:04,770 --> 01:09:09,130 + 비트 애호가와 나는 공간 변압기 우아한 아주 아주 좋은 생각 + +901 +01:09:09,130 --> 01:09:13,949 + 입력 이미지의 영역을 중재하기 위해 참석의 방법이 논문 많이 있습니다 + +902 +01:09:13,949 --> 01:09:17,889 + 실제로 그녀의 구금 작업이 아주 조금 더 도전 때문이다 + +903 +01:09:17,890 --> 01:09:21,579 + 열심히주의 용지 일반적으로 사용하는 그라디언트이 문제에 대한 + +904 +01:09:21,579 --> 01:09:26,199 + 강화 학습은 우리가 정말 그렇게 어떤 임의의 오늘에 대해 이야기하지 않았다 + +905 +01:09:26,199 --> 01:09:39,429 + 긴장 또는 있는지 확인에 대한 다른 질문 + +906 +01:09:39,429 --> 01:09:51,958 + 캡션 우리는 변압기를 얻었고, 그래 그 폐쇄하기 전에 + +907 +01:09:51,958 --> 01:09:56,649 + 캡션에 해당 네트워크에서이 스크립트를 기반으로 일 만에를 사용하여 생산됩니다 + +908 +01:09:56,649 --> 01:10:01,299 + 특히 나는 실제로 꽤 많은, 그래서 그것은 (14) (14) 그리드 생각 + +909 +01:10:01,300 --> 01:10:04,550 + 여전히 제한되어있어 위치하지만 그것의 그것을 훨씬에 착용 + +910 +01:10:04,550 --> 01:10:22,800 + 그래서 부드러운 관심과 그녀의 구금 사이에 보간에 대한 질문이 그래 + +911 +01:10:22,800 --> 01:10:26,279 + 당신이 상상할 수있는 11 점은 다음 부드러운 방식으로 네트워크를 훈련하고있다 + +912 +01:10:26,279 --> 01:10:29,929 + 당신이 종류의 분포가 선명하고 선명하게 처벌 훈련 동안과 + +913 +01:10:29,929 --> 01:10:32,949 + 선명하고 테스트 시간은 단지 전환과 그녀의 구금을 사용 + +914 +01:10:32,948 --> 01:10:37,938 + 대신에 나는 내가 그 짓하는 종이 기억할 수 있다고 생각하지만 난 꽤 해요 해요 + +915 +01:10:37,939 --> 01:10:43,130 + 확인은 어디 선가 그 생각을 본 적이 있지만, 실제로 나는 그녀와 함께 훈련 생각 + +916 +01:10:43,130 --> 01:10:46,099 + 구금은 선명 방식보다 더 잘 작동하는 경향이 있지만, 확실히이다 + +917 +01:10:46,099 --> 01:10:51,800 + 뭔가 확인을 시도 할 수 있다면 우리가 일을하고 있다고 생각하지 질문 + +918 +01:10:51,800 --> 01:10:54,179 + 몇 분 일찍 오늘은 그래서 당신의 숙제 완수 + diff --git a/captions/Ko/Lecture14_ko.srt b/captions/Ko/Lecture14_ko.srt new file mode 100644 index 00000000..a56f02d9 --- /dev/null +++ b/captions/Ko/Lecture14_ko.srt @@ -0,0 +1,4232 @@ +1 +00:00:00,000 --> 00:00:04,990 + 행정 난 당신이 내가 일을하지 않는 경우 모두가 지금 73 수행해야 말해 + +2 +00:00:04,990 --> 00:00:07,790 + 당신이 늦게 당신은 문제가 있다고 생각 + +3 +00:00:07,790 --> 00:00:11,280 + 이슬람 무덤은 우리가 아직도 그들을 통해 가서 거기에있어 매우 곧 될 것입니다 + +4 +00:00:11,279 --> 00:00:13,779 + 기본적으로있는 내가 다 생각하지만, 우리는 내가 보낼 것 몇 가지를 한 번 확인해야 + +5 +00:00:13,779 --> 00:00:14,199 + 그들을 밖으로 + +6 +00:00:14,199 --> 00:00:18,820 + 확인 그래서 우리는 클래스에 당신을 생각 나게 측면에서 어제 아주 보였다 + +7 +00:00:18,820 --> 00:00:22,629 + 간단히 분할에서 우리는 약간의 부드러운주의 모델 변전소 보았다 + +8 +00:00:22,629 --> 00:00:25,829 + 모델은 선택적으로 다른 부분에 주목 떨어져있는 + +9 +00:00:25,829 --> 00:00:28,028 + 당신의 처리와 같은 이미지는 재발 신경과 같은했다 + +10 +00:00:28,028 --> 00:00:32,020 + 네트워크 다행 당신이 선택적으로 장면의 어떤 부분에주의를 지불하고 + +11 +00:00:32,020 --> 00:00:35,450 + 이러한 기능을 강화하고이되는 특수 변압기에 대해 시작됩니다 + +12 +00:00:35,450 --> 00:00:38,929 + 이미지의 일부를 자르기 다른 방법으로 기본적으로 아주 좋은 방법 + +13 +00:00:38,929 --> 00:00:43,769 + 또는 수소 또는 변형 된 모양의 모든 종류의 하나 일부 기능에없는 + +14 +00:00:43,770 --> 00:00:48,579 + 당신은 내부 네트워크를 슬롯 수 있습니다 PC의 때문에 매우 흥미 종류 등 + +15 +00:00:48,579 --> 00:00:52,049 + 아키텍처는 그래서 오늘 우리는 동영상에 대해 얘기하자 + 
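Before the lecture moves on to videos, the spatial-transformer machinery recapped in the entries above comes down to two differentiable pieces: a parameterized (here affine) map from each output pixel's normalized coordinates to a source location in the input, and bilinear sampling at that location, so every output pixel is a weighted average of four input pixels. A minimal NumPy sketch under those assumptions (single-channel image; `theta` is an arbitrary example transform, and all names are illustrative, not from any paper's code):

~~~python
import numpy as np

def affine_grid(theta, H_out, W_out):
    """Map each output pixel's normalized (x, y) in [-1, 1]^2 through the
    2x3 affine matrix theta to a source location in the input image."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H_out),
                         np.linspace(-1, 1, W_out), indexing='ij')
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H_out * W_out)])
    return (theta @ coords).T.reshape(H_out, W_out, 2)  # the sampling grid

def bilinear_sample(img, grid):
    """Read the input at the grid locations with bilinear interpolation:
    each output pixel is a weighted average of 4 input pixels, which is
    what makes the whole module smooth and differentiable."""
    H, W = img.shape
    x = (grid[..., 0] + 1) * 0.5 * (W - 1)  # normalized -> pixel coords
    y = (grid[..., 1] + 1) * 0.5 * (H - 1)
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2); x1 = x0 + 1
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2); y1 = y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wy) * (1 - wx) * img[y0, x0] + (1 - wy) * wx * img[y0, x1]
            + wy * (1 - wx) * img[y1, x0] + wy * wx * img[y1, x1])

img = np.random.rand(28, 28)                       # stand-in input image
theta = np.array([[np.cos(0.3), -np.sin(0.3), 0.1],  # example: rotate + shift
                  [np.sin(0.3),  np.cos(0.3), 0.0]])
out = bilinear_sample(img, affine_grid(theta, 28, 28))
~~~

Because both steps are smooth in `theta` and in the pixel values, gradients flow through the module, which is what lets the localization network that predicts `theta` be trained jointly with the rest of the network, as described above.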
+16 +00:00:52,049 --> 00:00:56,229 + 구체적으로 현재 이미지 분류에서 이제 여부에 의해 잘 알고 있어야합니다 + +17 +00:00:56,229 --> 00:00:59,390 + 기본적인 전투는 당신이 그것을 재 처리 (A)에 오는 이미지가 설정 + +18 +00:00:59,390 --> 00:01:03,239 + 동영상의 경우 분류 예를 들어 우리는 단지 하나의 이미지가되지 않습니다 + +19 +00:01:03,238 --> 00:01:07,728 + 이 실제로있을 것이다 (32)에 의해 32의 이미지가 그래서하지만 여러 프레임을해야합니다 + +20 +00:01:07,728 --> 00:01:13,829 + 전체 동영상은 32 강 이뿌다 언젠가 범위 그래서 확인에 의해 그래서 32 프레임 + +21 +00:01:13,829 --> 00:01:17,340 + 나는 우리가 I와 이러한 문제를 접근하는 방법에 뛰어 전에 대해 이야기하고 싶습니다 + +22 +00:01:17,340 --> 00:01:21,170 + 우리가 진정 사용에 대해 문의를 해결하기 위해 사용하는 방법에 대한 매우 간략하게 + +23 +00:01:21,170 --> 00:01:25,629 + 오른쪽에 오기 전에 그래서 가장 인기있는 기능 중 일부를 방법을 PCR 기반 + +24 +00:01:25,629 --> 00:01:30,019 + 이 조밀 한 궤적 특징 곳 일 오늘은 매우 인기가 + +25 +00:01:30,019 --> 00:01:34,140 + 모든 매달려 개발 한 난 그냥 당신에게 간단한 맛을주고 싶어 + +26 +00:01:34,140 --> 00:01:36,989 + 정확히 어떻게 가지 흥미로운 때문에 이러한 기능이 근무하고 + +27 +00:01:36,989 --> 00:01:39,609 + 그들은 온 방법의 측면에서 나중에 개발의 일부를 영감 + +28 +00:01:39,609 --> 00:01:43,429 + 이 궤도에 그래서 실제로 작동하는 비디오를 작동 쇼는 무엇이고 우리 + +29 +00:01:43,430 --> 00:01:47,140 + 이렇게 우리는이 비디오 재생이되고, 우리는 이러한 키를 검출 할거야 + +30 +00:01:47,140 --> 00:01:50,709 + 좋은 점은 비디오에서 추적하고 우리는 그들을 추적 할거야 + +31 +00:01:50,709 --> 00:01:54,679 + 당신은 모든 작은 트랙 결국 우리가 실제로 추적하는 것을하자 + +32 +00:01:54,680 --> 00:01:57,759 + 그 트랙하자 약 기능의 영상 다음 제비에서 + +33 +00:01:57,759 --> 00:02:01,868 + 에 대해에게 단지 범죄를 축적 주변 기능, 그래서 그냥주는 + +34 +00:02:01,868 --> 00:02:06,549 + 당신은 어떻게에 대한 생각은 세 단계는 우리가 기본적으로이 대략있는 일 + +35 +00:02:06,549 --> 00:02:10,868 + 이미지에 서로 다른 규모에서 특징점을 검출 나는 나에게 간단히 말씀 드리죠 + +36 +00:02:10,868 --> 00:02:11,960 + 그 어떻게하는지에 대한 + +37 +00:02:11,960 --> 00:02:16,810 + 다음 광 광류 방법을 사용하여 시간이 지남에 따라 그 기능을 트랙으로 이동 + +38 +00:02:16,810 --> 00:02:20,270 + 흐름 방법은 매우 간단하게 설명 해결 그들은 기본적으로 당신에게 운동 필드를 제공 + +39 +00:02:20,270 --> 00:02:23,800 + 한 가지에서 다른 그들은 장면이 하나의 프레임에서 이동하는 방법을 알려 + +40 +00:02:23,800 --> 00:02:28,070 + 어느 정도 익스트림 다음 우리가 기능의 전체 무리를 추출거야하지만, + +41 +00:02:28,069 --> 00:02:30,609 + 중요한 것은 우리는 단지 고정하는 기능 세트를 추출하지 않을거야 + +42 +00:02:30,610 --> 00:02:33,930 + 이미지에 위치하지만 우리가 실제로거야 나를 공격한다 + +43 +00:02:33,930 --> 00:02:37,700 + 말을하고 로컬 좌표 시스템은 모든 단일 트랙하자 등 + +44 +00:02:37,699 --> 00:02:41,869 + 욕심 이러한 히스토그램 돼 흐름을 주장하고 우리가 가고있는 자원이 될 + +45 +00:02:41,870 --> 00:02:45,610 + 열심히 여기 트랙 재치 떨어져 좌표계를 추출 할 수 + +46 +00:02:45,610 --> 00:02:49,200 + 우리는 히스토그램 구배 및 2 차원 화상은 기본적으로봤을 + +47 +00:02:49,199 --> 00:02:51,750 + 그 일반화 너무 + +48 +00:02:51,750 --> 00:02:54,780 + 비디오 및 그래서 사람들이를 인코딩하는 데 사용되는 물건의 종류입니다 + +49 +00:02:54,780 --> 00:03:01,009 + 키 포인트 검출부의 관점에서 공간 - 시간 폭탄 테러가있었습니다 + +50 +00:03:01,009 --> 00:03:04,239 + 좋은 기능과 추적하는 비디오를 감지하는 방법을 정확하게에 작업 꽤 많은 + +51 +00:03:04,240 --> 00:03:07,930 + 직관적으로 당신은 그가 때문에 너무 부드러운 있습니다 비디오를 추적하지 않도록 + +52 +00:03:07,930 --> 00:03:11,580 + 기본적를 취득하기위한 방법이 임의의 시각 기능에 로그인 할 수 없습니다 + +53 +00:03:11,580 --> 00:03:16,620 + 이에 대한 몇 가지 서류가되도록 추적하고 비디오하기 쉬운 점 세트 + +54 +00:03:16,620 --> 00:03:19,509 + 그래서 당신은이 같은 기능의 무리를 검출 + +55 +00:03:19,509 --> 00:03:23,039 + 이 동영상에 광학 플로우 알고리즘 + +56 +00:03:23,659 --> 00:03:28,060 + 프레임 및 제 2 프레임을 그리고 모션 필드 해결할 것 + +57 +00:03:28,060 --> 00:03:32,409 + 이 방법은 여행 곳에서 모든 단일 위치에서 변위 벡터 + +58 +00:03:32,409 --> 00:03:35,919 + 내가 광학 플로우 결과의 몇 가지 예를 들어으로 무료 이동 + +59 +00:03:36,439 --> 00:03:42,270 + 기본적으로 여기에 모든 단일 픽셀은 방향에 의해 착색되는 것을 + +60 +00:03:42,270 --> 00:03:46,260 + 이는 예를 들어자가 갖도록 이미지의 부분은 현재 영상으로 이동 + +61 +00:03:46,259 --> 00:03:49,939 + 아마 당신은 수평 또는 뭔가 변환하는 모든 노란색 의미 + +62 +00:03:49,939 --> 00:03:53,680 + 추천 그것이 컴퓨팅 광학 흐름을 사용하기위한 두 가지 일반적인 방법 + +63 +00:03:53,680 --> 00:03:58,069 + 권투 말리크에서 블록으로 여기에 가장 일반적인 적어도 나 하나 
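The dense-trajectory pipeline above leans on dense optical flow: a per-pixel displacement field between consecutive frames. The lecture recommends the large-displacement method of Brox et al.; since that implementation is not a standard Python package, this sketch uses OpenCV's Farneback dense flow as a stand-in (a different algorithm, but the same kind of output). The frames here are random arrays standing in for real video:

~~~python
import cv2
import numpy as np

# Two consecutive grayscale frames (random stand-ins for real video frames).
prev_f = (np.random.rand(240, 320) * 255).astype(np.uint8)
next_f = (np.random.rand(240, 320) * 255).astype(np.uint8)

# Dense flow: one (dx, dy) vector per pixel -- the motion field described
# above. Positional args after None: pyr_scale, levels, winsize, iterations,
# poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev_f, next_f, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)  # (240, 320, 2)

# Per-pixel direction/magnitude -- what the color-coded flow images show.
mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
~~~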
+ +64 +00:03:58,069 --> 00:04:00,949 + 그래서 당신이 경우에 사용하는 디폴트로 같은 종류의 인 하나 + +65 +00:04:00,949 --> 00:04:03,399 + 자신의 프로젝트에 광학 흐름을 계산하는 것은 내가 사용하는 것이 좋습니다 것 + +66 +00:04:03,400 --> 00:04:08,950 + 이 큰 변위 광학 플로우 방법은 그래서이 광 흐름이 우리를 사용하여 + +67 +00:04:08,949 --> 00:04:12,199 + 우리가 알고있는 광학 플로우를 사용하여 모든 주요 장소로는 우리로 이동을했습니다 + +68 +00:04:12,199 --> 00:04:15,859 + 한 번에 약 십오 프레임 일 수있다 이러한 릴 트럭 분량을 추적 끝 + +69 +00:04:15,860 --> 00:04:20,509 + 그래서 우리는이와 끝까지 0.5 초 정도 트랙이 비디오를 통해 할 수 있습니다 및 + +70 +00:04:20,509 --> 00:04:21,519 + 우리는 인코딩 + +71 +00:04:21,519 --> 00:04:26,129 + 모든 이들 기술자와의 어떤이 트랙 주변 지역에 갔다 + +72 +00:04:26,129 --> 00:04:29,710 + 함께 플레이하는 데 사용되는 모든 피터슨 시각이 히스토그램 사람들을 축적 + +73 +00:04:29,709 --> 00:04:34,668 + 정확히 특별히 때문에 비디오를자를 어떻게 같은 다른 종류의 + +74 +00:04:34,668 --> 00:04:37,359 + 우리는 히스토그램을 독립적 히스토그램과의 모든 일을 할 겁니다 + +75 +00:04:37,360 --> 00:04:40,389 + 이러한 비즈니스 다음 우리는 기본적으로 모든 히스토그램을 만들거야 + +76 +00:04:40,389 --> 00:04:45,220 + 이러한 모든 시각적 기능을 갖춘 도시와이 일을 모두가 SVM에 가서 + +77 +00:04:45,220 --> 00:04:48,050 + 사람들이 이러한 문제의 해결 방법의 측면에서 바위 레이아웃의 종류 + +78 +00:04:48,050 --> 00:04:55,720 + 트럭 단지로 생각 과거는 다섯 프레임이 될 것입니다 그리고 그것은이다 + +79 +00:04:55,720 --> 00:05:01,639 + 단지 XY 위치 그렇게 15 XY는 다음 교살 및 좌표 우리 + +80 +00:05:01,639 --> 00:05:07,168 + 우리가 실제로 접근 방법의 관점에서 현재 로컬 좌표계 추출 + +81 +00:05:07,168 --> 00:05:13,859 + 그와 함께 이러한 문제는 그녀가 첫 번째 층에 알렉스 그물을 호출되지 작동 + +82 +00:05:13,860 --> 00:05:17,560 + 세에 의해 예를 들어 227 (227)에 대한 이미지 thatís을 받게되며 + +83 +00:05:17,560 --> 00:05:22,310 + 11 11 96 필터를 재 처리하면 오른쪽에 대한 등의 적용 + +84 +00:05:22,310 --> 00:05:27,978 + 우리는 알렉스 그물이 아흔여섯 볼륨에 의해 5555 결과 보았다 + +85 +00:05:27,978 --> 00:05:30,468 + 우리는 실제로 각에서 모든 필터의 모든 응답을 갖는 + +86 +00:05:30,468 --> 00:05:34,788 + 하나의 공간적 위치 당신이 경우 합리적인 방법이 될 것입니다 무슨 지금 + +87 +00:05:34,788 --> 00:05:38,158 + 우리는 단지이없는 경우에 작동하는 모든 작업을 수행 일반화하고 싶어 + +88 +00:05:38,158 --> 00:05:42,579 + 220 누군가가 23집니다하지만 일이 될 수는 인코딩 좋아하는 프레임 + +89 +00:05:42,579 --> 00:05:47,278 + 그래서 당신은에오고 그 15 227 227 배터리의 전체 블록이 + +90 +00:05:47,278 --> 00:05:50,180 + 달성 당신이 공간을 모두 에코하려는 일을 모두하고 + +91 +00:05:50,180 --> 00:05:54,209 + 시간적 패턴과 볼륨이 작은 블록 내부 그래서처럼 될 것이다 + +92 +00:05:54,209 --> 00:05:57,379 + 변경하는 방법에 대한 아이디어는 모든 일을 성취 + +93 +00:05:57,379 --> 00:06:00,379 + 이 경우에 일반화 + +94 +00:06:03,899 --> 00:06:27,609 + 나는 것으로 기대 확인 그 흥미로운 두 블록 등 그들을 배치 + +95 +00:06:27,610 --> 00:06:33,870 + 그게 문제가 관심의 종류 그래서 아주 아주 잘 작동하지 않는다 + +96 +00:06:33,870 --> 00:06:36,850 + 기본적으로 모든 신경에 의해 다음 단 하나의 프레임에서 찾고있다 + +97 +00:06:36,850 --> 00:06:39,720 + 당신이 당신과 함께 결국 주석의 끝이 그에 큰보고하고, + +98 +00:06:39,720 --> 00:06:43,310 + 더 큰 영역과 도전 그래서 결국 모두 볼 이러한 뉴런 + +99 +00:06:43,310 --> 00:06:46,470 + 귀하의 의견하지만 그들은 아주 쉽게 연관 할 수 없을 것입니다 + +100 +00:06:47,589 --> 00:06:52,589 + 이 이미지에서 조금 특별한 제어 패치 같은 사실은 확실하지 않다 + +101 +00:06:52,589 --> 00:07:04,149 + 정말 좋은 아이디어는 내가 그래서 우리는 그 중 몇 가지를 얻을 수있을 거라 생각 그것으로 만들어 놓을 않았다 + +102 +00:07:04,149 --> 00:07:07,149 + 그 같은 일을 + +103 +00:07:09,930 --> 00:07:25,199 + 효과적으로 45 채널을 가지고 그, 그래서 당신은에 코멘트를 넣을 수 + +104 +00:07:25,199 --> 00:07:28,919 + 모든 I에 도착 뭔가 당신은 내가는 생각하지 않는 것을 할 수 있다고 생각 + +105 +00:07:28,918 --> 00:07:44,049 + 그래서 당신이 시간의 한 조각의 일이 당신을 것을 말을하는지 예 '로 최고의 아이디어 + +106 +00:07:44,050 --> 00:07:48,379 + 다음 다른를 한 번에 기능과 유사한 종류의 압축을 + +107 +00:07:48,379 --> 00:07:48,990 + 시각 + +108 +00:07:48,990 --> 00:07:52,829 + 피터이기 때문에 특별히 공유 그 일의 동기 부여와 유사 + +109 +00:07:52,829 --> 00:07:55,909 + 여기에 당신이 재산 곳의 같은 종류 그래서뿐만 아니라 거기 유용 + +110 +00:07:55,910 --> 00:07:58,910 + 당신은 공간뿐만 아니라 무게와 시간을 공유하고 싶습니다 + +111 +00:07:59,689 --> 00:08:03,550 + 확인 그래서 사람들이 일반적으로 수행하는 것이 기본 일의 아이디어 위에 구축 + +112 
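The idea being set up here, and spelled out in the next few entries, is to extend first-layer filters with a small temporal extent (e.g. 11×11 in space times a few frames in time) so each filter slides over space *and* time, with weights shared along both. A naive single-channel NumPy sketch of that spatio-temporal convolution (shapes are illustrative; a real layer would also sum over color channels and have many filters):

~~~python
import numpy as np

def conv3d_naive(clip, w):
    """Slide one spatio-temporal filter over a video clip.
    clip: (T, H, W) grayscale frames; w: (t, h, w) filter."""
    T, H, W = clip.shape
    t, h, wd = w.shape
    out = np.zeros((T - t + 1, H - h + 1, W - wd + 1))
    for k in range(out.shape[0]):          # slide in time...
        for i in range(out.shape[1]):      # ...and in both spatial dims
            for j in range(out.shape[2]):
                out[k, i, j] = np.sum(clip[k:k+t, i:i+h, j:j+wd] * w)
    return out

clip = np.random.randn(15, 64, 64)   # a 15-frame chunk of video
w = np.random.randn(3, 11, 11)       # 11x11 spatial filter, temporal extent 3
print(conv3d_naive(clip, w).shape)   # (13, 54, 54): a smaller space-time volume
~~~

The output is itself a space-time volume, which is why these layers can be stacked just like ordinary conv layers.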
+00:08:03,550 --> 00:08:06,400 + 그들은 이러한 확장으로 상용 네트워크와 비디오를 적용 할 때 + +113 +00:08:06,399 --> 00:08:10,138 + 필터는 공간 필터를하지뿐만 아니라 할 수 있지만 이러한이 + +114 +00:08:10,139 --> 00:08:14,840 + 필터 우리가 Bielema (11)가 전에, 그래서 시간에 그들에게 소량의 확장 + +115 +00:08:14,839 --> 00:08:15,750 + 필터 + +116 +00:08:15,750 --> 00:08:21,709 + 몇 가지 작은 시간 정도 그렇게 예를 들어 말 티아 차 필터에 의해 1111 우리 + +117 +00:08:21,709 --> 00:08:28,759 + 세 가지 필터에 의해 그는 2011 년 30이었다 특히이 경우 최대 15로 사용할 수 있습니다 및 + +118 +00:08:28,759 --> 00:08:33,979 + 다음 세 가지에 의해 우리는 RGB를 가지고 있기 때문에 기본적으로이 필터는 당신이있어 지금 + +119 +00:08:33,979 --> 00:08:36,969 + 뿐만 아니라 공간에서 필터를 슬라이딩 생각하고 전체를 조각 + +120 +00:08:36,969 --> 00:08:40,469 + 활성화지도하지만 실제로뿐만 아니라 공간에서 필터를 슬라이딩하고 있지만, + +121 +00:08:40,469 --> 00:08:44,450 + 또한 시간에 그들은 시간에 작은 유한 한 시간 정도가 있고 + +122 +00:08:44,450 --> 00:08:48,379 + 당신이 도입하고, 그래서 확인 전체 활성화 볼륨을 조각 끝 + +123 +00:08:48,379 --> 00:08:51,909 + 시간은 모든 커널에 모든 죽어가는 단계에 언급하기를 + +124 +00:08:51,909 --> 00:08:55,899 + 그래서 회선을 수행 된 따라 추가 시간이 언급 + +125 +00:08:55,899 --> 00:08:59,659 + 그 사람들이 기능을 추출하는 방법 일반적으로 그리고 당신은이 속성을 얻을 + +126 +00:08:59,659 --> 00:09:04,009 + 안전 그래서 여기에 등 세 곳에 우리는 공간적 시간적를 수행 할 때 + +127 +00:09:04,009 --> 00:09:07,230 + 경쟁 우리는이 매개 변수를 공유하는 방식은 시간가는 결국 + +128 +00:09:07,230 --> 00:09:11,639 + 뿐만 아니라 당신이 그렇게 기본적으로 언급 한 바와 같이 어느 정도 모든 필터 시간과 + +129 +00:09:11,639 --> 00:09:14,360 + 우리는 공간뿐만 아니라 시간뿐만 아니라 회선을 + +130 +00:09:14,360 --> 00:09:18,800 + 활성 볼륨 정품 인증과 바람은 그래서 이들 중 일부 매핑 + +131 +00:09:18,799 --> 00:09:22,818 + 접근 방식은 이전의 것들의 예를 하나 아주 초기에 제안했다 + +132 +00:09:22,818 --> 00:09:28,238 + 활동 인식이 2010 년부터 아마이기 때문에 아이디어는 여기가 있음을했다 + +133 +00:09:28,239 --> 00:09:31,798 + 일의 대신 (40)에 의해 예순의 단일 입력을 받고 단지 몇 + +134 +00:09:31,798 --> 00:09:36,108 + 사진은 또한 우리는 마흔에 의해 사실 예순 일곱 프레임을 받고 자신의 + +135 +00:09:36,109 --> 00:09:40,119 + 우리가 그래서 이러한 필터들을 참조로 결론은 세 디컨 볼 루션 있습니다 + +136 +00:09:40,119 --> 00:09:44,220 + 예를 들어뿐만 아니라 우리가 3 차원으로 끝낼 같이 세 가지로 이제 일곱 판매 될 수 있지만 + +137 +00:09:44,220 --> 00:09:49,499 + 진정과 세 가지 조건은 여기에 모든 단일 단계에서 적용됩니다 + +138 +00:09:50,649 --> 00:09:55,208 + 2011 년 비슷한 종이하지만 같은 생각 우리는 친구의 블록을 + +139 +00:09:55,208 --> 00:09:59,518 + 들어오는 당신은 3 차원 완료 입체 필터에서 그들을 약속 + +140 +00:09:59,519 --> 00:10:03,229 + 이 상용 네트워크에있는 모든 단일 지점 그래서이 2011 아니다 + +141 +00:10:04,948 --> 00:10:08,748 + 매우 유사한 아이디어도 그렇게이 다음이 전에 실제로 알렉스 출신 + +142 +00:10:08,749 --> 00:10:12,889 + 접근 방식은 일이 그렇게 모든 작업을 수행하는 것이 작은 알고 같은 종류의 수 있습니다 + +143 +00:10:12,889 --> 00:10:16,829 + 이러한 대규모 애플리케이션의 제 1 종이 출신 + +144 +00:10:16,828 --> 00:10:19,828 + 용량에 의해 2014 년 멋진 종이의 모든 + +145 +00:10:20,830 --> 00:10:27,540 + 이 처리 동영상을 여기에 바로 오른쪽에있는 모델이 주 그래서 + +146 +00:10:27,539 --> 00:10:31,159 + 우리는 내가이는 지금까지 그렇게되게 같은 생각 느린 융합이라고 + +147 +00:10:31,159 --> 00:10:35,750 + 세 가지 차원 모두 시간과 공간에서 일어나는 경쟁 때문에 그건 + +148 +00:10:35,750 --> 00:10:38,879 + 느린 융합 천천히이 시간을 사용하고 있기 때문에 우리는 그것을 참조로 + +149 +00:10:38,879 --> 00:10:43,649 + 단지 우리는 이전과 정보는 천천히 이제 공간 정보를 사용하고 + +150 +00:10:43,649 --> 00:10:47,100 + 당신은 또한 왜 코미디 쇼 네트워크 및 그냥있는 수있는 다른 방법이 있습니다 + +151 +00:10:47,100 --> 00:10:51,769 + 몇 가지 컨텍스트를 제공하는 것은 역사적으로이 구글의 연구이며, 알렉스하자 + +152 +00:10:51,769 --> 00:10:55,039 + 그냥 와서 그들이 매우 잘 작동하기 때문에 모두가 슈퍼 흥분했다 + +153 +00:10:55,039 --> 00:11:00,579 + 이미지와 나는 구글 비디오 분석 팀에 있었고, 우리는에 실행하고 싶었다 + +154 +00:11:00,580 --> 00:11:04,060 + 유튜브 동영상하지만 그것은 일반화하는 방법을 정확하게 꽤 명확하지 않았다 + +155 +00:11:04,059 --> 00:11:07,809 + 우리는 여러 가지 탐구 그래서 당신은 동영상을 다음 상용 네트워크와 알고 + +156 +00:11:07,809 --> 00:11:11,389 + 당신이 실제로 그래서 수레이를 착용하지 수있는 방법 건축 재료의 종류 + +157 +00:11:11,389 --> 00:11:17,889 + 접근 조기 융합의 종류라는 차원으로 융합는이 아이디어 사람 + +158 +00:11:17,889 --> 00:11:21,230 + 
필요할 친구의 덩어리를 가지고 그냥 일어 났을 경우 앞에서 설명한 + +159 +00:11:21,230 --> 00:11:25,430 + 긴 채널은 45 등에 의해 227 227으로 끝낼 수 있습니다 + +160 +00:11:25,429 --> 00:11:29,500 + 이 종류의, 그래서 모든 것이 사들이고 당신은 그 위에 하나의 열을 + +161 +00:11:29,500 --> 00:11:35,200 + 맨 처음 통화하여 필터처럼 나중에 큰 시간적 범위를 가지고 있지만 + +162 +00:11:35,200 --> 00:11:38,780 + 다음 다른 모든부터 사실 두 차원의 경쟁은 우리 + +163 +00:11:38,779 --> 00:11:42,139 + 그는 매우 초기에 시간 정보를 거부했기 때문에 일찍 전화 + +164 +00:11:42,139 --> 00:11:45,879 + 다음 모두에의 첫 번째 편지는 당신이 상상할 수있는 호출 + +165 +00:11:45,879 --> 00:11:49,490 + 아이디어 알렉스 그물에 걸릴 여기 있도록 아키텍처는 가능성 회선입니다 + +166 +00:11:49,490 --> 00:11:53,169 + 우리는 그들을 떨어져 10 가지 그들이 그렇게 모두 독립적에 계산 말할 배치 + +167 +00:11:53,169 --> 00:11:57,169 + 이 10 점을 따로 따로 그리고, 우리는 완전히 연결에 많은 이상이어야합니다 + +168 +00:11:57,169 --> 00:12:00,620 + 레이어, 그리고, 우리는 단지보고 단일 청구 기준을했다 + +169 +00:12:00,620 --> 00:12:03,830 + 비디오의 한 프레임은 그래서 당신은 정확히 흰색 선까지로 재생할 수 있습니다 + +170 +00:12:03,830 --> 00:12:08,440 + 이 모델은 그들이 했어 상상할 수있는 아시아 모델을 보면 세 + +171 +00:12:08,440 --> 00:12:13,130 + 차원 대령은 이제 첫 번째 층은 실제로 그들을 시각화 할 수 있으며, + +172 +00:12:13,129 --> 00:12:16,210 + 이 다음은 동영상에 당신이 학습 결국 기능의 종류입니다 + +173 +00:12:16,210 --> 00:12:18,990 + 그들은 지금 때문에 이동하는 것을 제외하고 잘 알고 있었다 기본적 기능 + +174 +00:12:18,990 --> 00:12:22,680 + 이 필터는이 작은을 가지고 소량 및 시간을 연장된다 + +175 +00:12:22,679 --> 00:12:26,049 + 블롭을 이동하고, 그들 중 일부는 정적이고, 그들 중 일부는 이동 그들이있어 + +176 +00:12:26,049 --> 00:12:30,729 + 기본적으로 첫 번째 층에 움직임을 감지하고 그래서 당신은 멋진을 종료 + +177 +00:12:30,730 --> 00:12:31,960 + 폭탄 테러 이동 + +178 +00:12:31,960 --> 00:12:48,090 + 문제는 우리가 그에게거야 얼마나 내가 대답은 예 아마 생각 + +179 +00:12:48,090 --> 00:12:53,269 + 단지 공간에서이 경우 더 작은 필터를 작동하고 당신은 더 깊이가 + +180 +00:12:53,269 --> 00:12:56,370 + 같은 적용에 나는 시간에 생각하고 우리 것을 수행하는 아키텍처를 볼 수 있습니다 + +181 +00:12:56,370 --> 00:13:07,220 + 의미하지만 기대 + +182 +00:13:08,190 --> 00:13:13,580 + 이렇게 분류 우리는 영상이 여전히 카테고리의 수를 분류 한 + +183 +00:13:13,580 --> 00:13:17,970 + 매 프레임에서 그러나 지금 당신은 단지 하나의 프레임 것이 아니라 작동하지 않는 + +184 +00:13:17,970 --> 00:13:23,740 + 프레임 소수 어쩌면하여 예측이 양쪽 alot을 + +185 +00:13:23,740 --> 00:13:28,539 + 실제로 안전의 기능은 반에게 재미와 끝까지하는 제 2 비디오 음료 + +186 +00:13:28,539 --> 00:13:32,909 + 본 논문도 발표 동영상을 동영상을 그들은 하나 이상했다 + +187 +00:13:32,909 --> 00:13:36,639 + 이 실제로 이유에 대한 백만 동영상과 500 클래스는 주어진 컨텍스트 + +188 +00:13:36,639 --> 00:13:41,759 + 이 동영상 작업을 가지 어려운 지금은 내가 있기 때문에 생각 + +189 +00:13:41,759 --> 00:13:45,480 + 문제는 지금 내가 생각이 너무 많은 매우 큰 규모가 아니다 것입니다 + +190 +00:13:45,480 --> 00:13:49,820 + 당신은 이미지 것을 볼 매우 다양한 이미지의 수백만 같은 데이터 세트가 + +191 +00:13:49,820 --> 00:13:53,230 + 비디오 영역에서 그 어떤 정말 좋은 동등하지 않으며 그래서 우리는 함께 노력 + +192 +00:13:53,230 --> 00:13:56,730 + 이것은 그러나 2013 년 상태 및 다시 내가 그것이 실제로 우리가 충분히 달성 생각하지 않습니다 + +193 +00:13:56,730 --> 00:14:00,519 + 그와 나는 우리가 여전히 정말로 암살자을 잃었 아주 좋은 표시되지 않는 생각 + +194 +00:14:00,519 --> 00:14:03,579 + 비디오 및 그 우리는 또한 약간에서 당신의 일부를 낙담하는 이유 부분적이다 + +195 +00:14:03,580 --> 00:14:08,050 + 프로젝트에이 작업은 이러한 매우 강력한을 재교육 할 수 없기 때문에 + +196 +00:14:08,049 --> 00:14:12,969 + 기능 데이터 세트는 단지 확실히 거기에 다른 종류이기 때문에 + +197 +00:14:12,970 --> 00:14:16,100 + 당신이보고 우리가 때때로 사람을주의 이유는 흥미로운 것들 + +198 +00:14:16,100 --> 00:14:21,490 + 그 때문에 매우 빠르게 매우 정교을 동영상에 작업 점점에서 + +199 +00:14:21,490 --> 00:14:24,490 + 때때로 사람들은 동영상이 그들이 수행하려는 경우 매우 흥분 생각 + +200 +00:14:24,490 --> 00:14:27,810 + 3d 컬러 앨리스 팀을 표시하고는 모든 가능성에 대해 생각 + +201 +00:14:27,809 --> 00:14:31,469 + 그들을 위해 개방 실제로 단일 프레임 방법은 매우 것을 밝혀 + +202 +00:14:31,470 --> 00:14:34,820 + 강력한베이스와 나는 항상 첫 번째를하지 않는 실행하는 것이 좋습니다 것 + +203 +00:14:34,820 --> 00:14:37,710 + 동영상의 움직임에 대해 걱정하고 단지 첫 번째 작품 하나의 프레임을 시도 + +204 +00:14:37,710 --> 00:14:40,990 + 그래서이 논문의 예를 들어 우리는베이스 라인에서 하나에 대한 것을 발견 + +205 +00:14:40,990 
--> 00:14:44,610 + 우리의 데이터 세트에서 59.3 %의 분류 정확도 + +206 +00:14:44,610 --> 00:14:48,600 + 다음 우리가 실제로 계정 작은 지역의 움직임을 고려하기 위해 최선을 시도했지만 + +207 +00:14:48,600 --> 00:14:54,440 + 우리는 11.6 %에 의해 아래로 당김이 모든 추가 작업 모든 여분의 컴퓨터 그래서 결국 + +208 +00:14:54,440 --> 00:14:57,529 + 그리고 당신은 내가 당신에게 시도거야 상대적으로 작은 이익에 결국 + +209 +00:14:57,528 --> 00:15:02,088 + 그가 될 이유 기본적으로 비디오는 항상 당신이하는만큼 유용하지 않다 + +210 +00:15:02,089 --> 00:15:07,230 + 직관적으로 생각하고, 그래서 여기에 예측 종류의 몇 가지 예입니다 그 우리 + +211 +00:15:07,230 --> 00:15:11,800 + 스포츠와 우리의 예측 다른 데이터 세트는 내가 이런 종류의 생각 + +212 +00:15:11,799 --> 00:15:15,528 + 강조 약간 이유에 비디오를 추가하는 것은 일부 설정에서와 같이 도움이되지 않을 수도 있습니다 + +213 +00:15:15,528 --> 00:15:19,740 + 여기에 특히 당신은 스포츠를 구분하고 그것에 대해 생각하려고하는 경우 + +214 +00:15:19,740 --> 00:15:23,930 + 이 회전처럼 수영이나 뭔가에서 테니스 말을 구별하려고 + +215 +00:15:23,929 --> 00:15:26,729 + 당신이 있다면 당신은 실제로 아주 좋은 지역의 움직임 정보를 필요로하지 않는 것을 + +216 +00:15:26,730 --> 00:15:29,610 + 파란색 물건을 많이 오른쪽 많은 수영에서 테니스를 구별하려고 + +217 +00:15:29,610 --> 00:15:33,350 + 빨간색 물건의 이미지가 실제로 정보의 엄청난 금액을 가지고과 같이 + +218 +00:15:33,350 --> 00:15:36,240 + 당신은 추가 매개 변수를 많이 넣고이 후 이동하려는 + +219 +00:15:36,240 --> 00:15:40,959 + 대부분의 클래스의 대부분은 실제로 지역 운동은하고 있지만, 지역 운동 + +220 +00:15:40,958 --> 00:15:44,289 + 매우 중요하지 그들은 당신이 매우 세분화 된 경우에만 중요한 것 + +221 +00:15:44,289 --> 00:15:47,919 + 작은 움직임이 실제로 정말 많은으로 많은 문제 카테고리 + +222 +00:15:47,919 --> 00:15:52,419 + 이 동영상이 경우 당신은 미친 시간적 공간적 사용하는 경향됩니다 + +223 +00:15:52,419 --> 00:15:56,860 + 비디오 네트워크 그러나 나는 그 운동이 매우 약 열심히 생각 + +224 +00:15:56,860 --> 00:15:59,980 + 중요하고 그렇지 않은 경우 결과를 얻을 수 있기 때문에 당신은 설정하는 + +225 +00:15:59,980 --> 00:16:04,070 + 그는 작업을 많이 넣어 곳이 같은 그것은 잘 작동의를 살펴 보자되지 않을 수 있습니다 + +226 +00:16:04,070 --> 00:16:10,180 + 작동 다른 비디오 분류 그래서 이것은 2015 4월 자사의 + +227 +00:16:10,179 --> 00:16:14,698 + 상대적으로 인기가 그것은 바다 3d 및 아이디어라고 여기에 기본적이었다 있어요 + +228 +00:16:14,698 --> 00:16:18,528 + 네트워크는 두 가지로이 아주 좋은 그 3 개월 불러 아키텍처와 두가 + +229 +00:16:18,528 --> 00:16:22,110 + 여기에 생각에 걸쳐 풀 멋진의 정확한 같은 일을 할 수 있다는 것입니다하지만, + +230 +00:16:22,110 --> 00:16:25,169 + 시간에 모든 확장하므로 지점으로 돌아가는 당신은 매우 작은합니다 + +231 +00:16:25,169 --> 00:16:29,069 + 이 모든 세 가지입니다 때문에 필터가 내 나무를 구입하는 구입 기억 수도 있습니다 + +232 +00:16:29,070 --> 00:16:33,100 + 아키텍처 전반에 걸쳐 풀은 그래서 차원에서 큰 미국의 매우 간단한 종류의 + +233 +00:16:33,100 --> 00:16:36,528 + 접근 방식의 종류 및 그 합리적으로 잘 작동하고 당신이 볼 수 + +234 +00:16:36,528 --> 00:16:38,429 + 참조 용 종이 + +235 +00:16:38,429 --> 00:16:42,389 + 접근 방법의 또 다른 형태는 실제로는 카렌 시몽에서로 아주 잘 작동합니다 + +236 +00:16:42,389 --> 00:16:43,778 + 2014 년 + +237 +00:16:43,778 --> 00:16:48,299 + 같은과 같은 방법으로 그는 BG하지 그가 해낸 사람의 SIMONIAN + +238 +00:16:48,299 --> 00:16:51,828 + 또한 비디오 분류에 아주 좋은 종이를 가지고 있으며 여기에 생각이 있다는 것입니다 + +239 +00:16:51,828 --> 00:16:54,299 + 이 종류의 때문에 그는 세 가지 차원의 경쟁을하고 싶지 않았다 + +240 +00:16:54,299 --> 00:16:55,219 + 그것을 가지고 고통 + +241 +00:16:55,220 --> 00:17:00,360 + 98 그것을 발견하고 너무 너무에 그는 단지 컴파일하지만 아이디어를 측정하는 데 사용 + +242 +00:17:00,360 --> 00:17:05,179 + 여기에 우리가 와서해야 할 이미지를 찾고, 다른 하나는 점이다 + +243 +00:17:05,179 --> 00:17:10,298 + 이 두 단지 이미지 만 너무 비디오의 광학 흐름에 있습니다보고 + +244 +00:17:10,298 --> 00:17:14,699 + 광학 흐름은 기본적으로 상황이 이미지의 이동 방법을 알려줍니다 + +245 +00:17:14,699 --> 00:17:19,120 + 그래서이 둘은 평균 그물 같은 또는 알렉스 싫어하는처럼 그냥 가지입니다 + +246 +00:17:19,119 --> 00:17:23,139 + 그 중 하나의 이미지에 이들의 또 다른 가까운 하나가 추출이 + +247 +00:17:23,140 --> 00:17:28,059 + 광학 흐름은 전 브롱스 방법을 말한다 다음은 University of Florida의 사용을 허용하는 + +248 +00:17:28,058 --> 00:17:31,720 + 아주 늦은 결국 이렇게 두 가지의 정보를 몇 가지 아이디어에 대해 생각해 + +249 +00:17:31,720 --> 00:17:34,850 + 다음 그들이 비디오의 클래스의 관점에서보고있다 및 거부 + +250 +00:17:34,849 --> 00:17:37,859 + 그들이 그들이 예를 찾을 수 있도록 그들을 이용하는 방법은 다양 + +251 +00:17:37,859 --> 
00:17:42,979 + 당신은 그냥 특별한 코멘트는 이미지를 찾고 사용하는 경우 당신은 몇 가지를 얻을 + +252 +00:17:42,980 --> 00:17:47,120 + 방금 광 흐름에 와서 사용하는 경우 성능이 실제로도 수행 + +253 +00:17:47,119 --> 00:17:49,558 + 단지 원시 영상을보고보다 약간 더 + +254 +00:17:49,558 --> 00:17:54,178 + 이 경우 실제로 여기 광 흐름은 정보를 많이 포함 + +255 +00:17:54,179 --> 00:17:58,538 + 실제로 의해 여기 수 있도록 더 나은 지금 흥미로운 점을 끝낼 경우 + +256 +00:17:58,538 --> 00:18:01,879 + 방법은 당신 특히 여기 아키텍처의이 종류가있는 경우 + +257 +00:18:01,880 --> 00:18:05,700 + 세 가지 필터에 의해 많은 복잡한 역사는 실제로 것이라고 상상할 수 + +258 +00:18:05,700 --> 00:18:10,038 + 나는 그것이 실제로 당신이 좋겠 광학 흐름을 넣어하는 데 도움 않는 이유를 의미한다고 생각 + +259 +00:18:10,038 --> 00:18:13,158 + 중앙 및 프레임 워크에 우리가 이러한 의견 배울 것으로 기대하고 상상 + +260 +00:18:13,159 --> 00:18:16,049 + 특히 처음부터 모든 것을 그들이 뭔가를 배울 수 있어야합니다 + +261 +00:18:16,048 --> 00:18:20,599 + 즉, 광학 흐름을 계산하는 계산을 시뮬레이션하며 밝혀 + +262 +00:18:20,599 --> 00:18:24,230 + 때때로 비디오를 비교할 때 때문에 그 경우하지 않을 수 있음 + +263 +00:18:24,230 --> 00:18:29,440 + 만 병원에 네트워크 및 그것은 잘 작동 그래서 내가 생각 + +264 +00:18:29,440 --> 00:18:34,169 + 우리가 가지고 있지 않기 때문에 그 이유는 아마 실제로 데이터로 회복된다 + +265 +00:18:34,169 --> 00:18:37,900 + 충분한 데이터 우리가 당신이 실제로 아마이없는 생각 데이터의 소량 + +266 +00:18:37,900 --> 00:18:42,730 + 충분한 데이터가 실제로 기능 등 같은 아주 좋은 광학 흐름을 배울 수 + +267 +00:18:42,730 --> 00:18:45,599 + 실제로 하드에 갈 점점 왜 내 특정 대답을 것 + +268 +00:18:45,599 --> 00:18:48,819 + 너희들이에서 작업하는 경우 네트워크는 아마 대부분의 경우에서 돕는 당신의 + +269 +00:18:48,819 --> 00:18:51,839 + 내가 실제로 시도하는 것이 좋습니다 것입니다 비디오와 프로젝트는 이런 종류의 일하기 + +270 +00:18:51,839 --> 00:18:52,779 + 건축물 + +271 +00:18:52,779 --> 00:18:57,480 + 다음 광학 흐름과는 이미지의 척 당신은에 끝이 올 수 + +272 +00:18:57,480 --> 00:19:01,808 + 즉, 상대적으로 합리적인 접근 방식처럼 좋아 보인다 그래서 지금까지 우리는 얘기했습니다 + +273 +00:19:01,808 --> 00:19:06,339 + 시간의 작은 지역 정보에 대한 권리 그래서 우리는이 작은이 + +274 +00:19:06,339 --> 00:19:07,398 + 조각 + +275 +00:19:07,398 --> 00:19:10,069 + 블랙 0.5 초 적 좋을한다 활용하려 + +276 +00:19:10,069 --> 00:19:13,739 + 실제로 많은이 동영상이 경우 분류하지만 무슨 일이 + +277 +00:19:13,739 --> 00:19:14,489 + 더 길게 + +278 +00:19:14,489 --> 00:19:19,700 + 당신이 모델 같은 종속의 시간적 종류 그래서 그건뿐만 아니라 그 + +279 +00:19:19,700 --> 00:19:22,319 + 지역 운동은 중요하지만 실제로 어떤 이벤트가 걸쳐있다 + +280 +00:19:22,319 --> 00:19:25,548 + 비디오 네트워크와 실제로의 시간 규모에서 훨씬 더 큰 것을 + +281 +00:19:25,548 --> 00:19:29,618 + 문제 때문에 이벤트 이후에 발생하는 이벤트는 하나 몇 가지 클래스의 매우 나타낼 수 있습니다 + +282 +00:19:29,618 --> 00:19:33,999 + 당신이 실제로 그 모델이 그렇게 일하는 것이하려는 종류의은 + +283 +00:19:33,999 --> 00:19:39,659 + 실제로 당신은 얼마나 알고에 당신이 노력에 대해 생각하는 것이 접근 + +284 +00:19:39,659 --> 00:19:42,659 + 당신은 훨씬 더 긴 기간 이벤트 이러한 종류의 모델을 실제로 될까요 + +285 +00:19:44,618 --> 00:19:54,009 + 당신이있어 위에 어떤 긴장감을 가지고 같은 확인하므로주의 모델은 아마도 그래서 당신은 할 수있다 + +286 +00:19:54,009 --> 00:19:56,729 + 이 전체 비디오를 분류하려고하는 것은 어쩌면 통해 긴장을 갖고 싶어요 + +287 +00:19:56,729 --> 00:19:58,129 + 비디오의 다른 부분 + +288 +00:19:58,128 --> 00:20:12,689 + 그래 그게 내가보고 좋은 생각이 그래서 당신은 우리가 이러한 다중 스케일을 가지고 말을하는지이야 + +289 +00:20:12,690 --> 00:20:16,479 + 우리는 때때로 매우 낮은 상세 수준에 이미지를 처리​​하지만 어디 방법이다 + +290 +00:20:16,479 --> 00:20:20,298 + 우리는 이미지의 크기를 조정하고 아마 프레임으로 글로벌 수준에이를 처리 + +291 +00:20:20,298 --> 00:20:23,710 + 우리는 실제로 비디오의 속도를 내가 생각하지 않는에 코멘트를 넣어 원하는 수 있습니다 + +292 +00:20:23,710 --> 00:20:28,048 + 나는 그래서 네 생각은 매우 흔한 일이지만 상원 의원 재치있는 아이디어 + +293 +00:20:28,048 --> 00:20:33,618 + 문제는 대략 것을 기본적으로이 정도가 아마 열 번 너무 짧은 그것입니다 + +294 +00:20:33,618 --> 00:20:37,019 + 그래서 우리의 초를 소비하지 않는 방법을 우리가 아키텍처를 어떻게해야합니까 + +295 +00:20:37,019 --> 00:20:40,179 + 기능 훨씬 더 긴 시간 규모 및 예측 + +296 +00:20:42,150 --> 00:20:48,300 + 예 여기에 하나의 아이디어는 우리는이 동영상을 가지고 있으며 우리는 다른 클래스가 그 + +297 +00:20:48,299 --> 00:20:50,599 + 시간에 모든 단일 시점에서 예측하기 좋아하지만 우리는 것을 원하는 것 + +298 +00:20:50,599 --> 00:20:54,849 + 예측 함수가 될 조금까지 숨 막혀 15초뿐만 아니라 
실제로하기 + +299 +00:20:54,849 --> 00:20:59,149 + 당신이 실제로 사용으로 분별있는 생각 때문에 훨씬 더 긴 시간 비용 + +300 +00:20:59,150 --> 00:21:01,769 + 기록 작업에서 어딘가 현재 때문에 건축에있는 동안 + +301 +00:21:01,769 --> 00:21:04,990 + 네트워크는 당신이 모든 것을 통해 무한 상황과 주체를 가질 수 있도록 + +302 +00:21:04,990 --> 00:21:08,579 + 당신이 돌아갈 특히 최대 그때까지 당신을하기 전에 그 일이있다 + +303 +00:21:08,579 --> 00:21:12,119 + 이미 2011 년을 보여주는 한이 논문 그것은 그들이이 밝혀 + +304 +00:21:12,119 --> 00:21:16,289 + 전체 섹션 뺨이 걸릴 그들은 실제로 분석 팀이 곳 + +305 +00:21:16,289 --> 00:21:21,109 + 내가 그렇게 방법이야이 NLST라는 차원을 사용하여 2011에서 들여다가 있음을 정확히 수행 + +306 +00:21:21,109 --> 00:21:25,899 + 그들은 2011 년에 호출 그래서이 논문은 기본적으로 모두가 전에 + +307 +00:21:25,900 --> 00:21:29,920 + 3 차원 침착하고 대부분의 모델 글로벌 모션 모델 작은 지역 운동 + +308 +00:21:29,920 --> 00:21:34,860 + 엘라 자세 등으로 이들은 전체 연결 층 때문에 플레이에 스탬프를 넣어 + +309 +00:21:34,859 --> 00:21:37,849 + 그들은 다음이 재발와 완전히 연결 층을 함께 중독 + +310 +00:21:37,849 --> 00:21:40,939 + 당신은 모든 단일 프레임 클래스를 예측 할 때 당신은 무한 컨텍스트가 + +311 +00:21:40,940 --> 00:21:45,930 + 나는 꽤 시대를 앞서 생각하는이 논문이며, 그것은 기본적으로 모든 권한을 가지고 + +312 +00:21:45,930 --> 00:21:49,900 + 이 단지 65 시간에 설정되어 제외하고 나는 사람들이 더 많은 인기를 생각하지 않은 확실하지 않다 + +313 +00:21:49,900 --> 00:21:54,680 + 기본적으로이 이들 모두를 인식하는 방법 앞서 시간 종이입니다입니다 + +314 +00:21:54,680 --> 00:21:59,380 + 국가 대표팀 땀 나는 심지어 그 이후 그들에 대해 알고있다 전에 + +315 +00:21:59,380 --> 00:22:02,990 + 몇 가지 최근 %는 실제로 가지에서 매우 유사한 접근 방식을 + +316 +00:22:02,990 --> 00:22:07,190 + 제프 도나휴 2015은 모든 버클리에서 여기에 아이디어는 가지고있다 + +317 +00:22:07,190 --> 00:22:08,610 + 비디오 다시에 좋아 + +318 +00:22:08,609 --> 00:22:11,819 + 매 프레임을 분류하지만 그들은 보면 이러한 의견이 + +319 +00:22:11,819 --> 00:22:14,809 + 각각의 프레임은하지만 그들은 또한 앨리스는 해당 문자열 팀 한이 + +320 +00:22:14,809 --> 00:22:19,389 + 함께 일시적으로 나는이 구글이다 종이에서도 비슷한 생각 생각에서 + +321 +00:22:19,390 --> 00:22:24,160 + 그래서 여기에 아이디어는 광학 흐름을 가지고 이미지를 처리​​하는 것입니다 + +322 +00:22:24,160 --> 00:22:28,930 + 복잡하고 다시 당신은 시간이 지남에 그렇게 다시 병합 애널리스트 오전이 + +323 +00:22:28,930 --> 00:22:34,680 + 로컬 및 글로벌이이 조합은 그래서 지금까지 우리는 어떤 종류의 검토 한 + +324 +00:22:34,680 --> 00:22:37,789 + 당신의 분류를 달성 두 아키텍처 패턴이 + +325 +00:22:37,789 --> 00:22:43,170 + 실제로 계정 중요한 정보 모델링 운동에 소요되는 + +326 +00:22:43,170 --> 00:22:47,289 + 예를 들어 짐승 항목은 사용 광학 플로우를 요구 이상의 전역 움직임을 볼 수 있습니다 + +327 +00:22:47,289 --> 00:22:51,059 + 여기서 우리는 화학 함께 시퀀스 아침 시간 단계 또는 융합이 + +328 +00:22:51,059 --> 00:22:54,418 + 두 사람은 지금 실제로 나는이 있다는 점을 확인하는 등의 + +329 +00:22:54,419 --> 00:22:59,879 + 내가 최근 논문에서 본 다른 청소기 아주 좋은 흥미로운 아이디어와 + +330 +00:22:59,878 --> 00:23:03,689 + 그때는 훨씬 더 좋아하고 그래서 여기에 기본적으로의 바위 그림 무엇 + +331 +00:23:03,690 --> 00:23:08,330 + 지금 우리가 일부 비디오를 가지고 같은 것들을 우리는 차원이 말을 그 온 보일 + +332 +00:23:08,329 --> 00:23:13,038 + 그 사용 광학 플로우는 차원 열 또는 둘 모두를 사용하여 주문할 수 있습니다 + +333 +00:23:13,038 --> 00:23:17,898 + 프레임의 트렁크는 데이터를 크랭크 후 불행하게도 꼭대기에 자리 잡고있다 한 + +334 +00:23:17,898 --> 00:23:20,979 + 또는 장기 모델링을하고 그 그 때문에 종류의 같은 + +335 +00:23:20,980 --> 00:23:24,950 + 이 약의 종류 아주 좋은되지는 불안하다 그이 자신의 아들 + +336 +00:23:24,950 --> 00:23:29,499 + 이러한 구성 요소에 대한 추악한 비대칭이 당사자에게 3 차원 내부의 신경 세포가하는 + +337 +00:23:29,499 --> 00:23:33,079 + 당신은 비디오의 몇 가지 작은 지방 덩어리의 일부입니다 그 와서 + +338 +00:23:33,079 --> 00:23:35,849 + 맨 이러한 신경 세포가 그 비디오의 모든 우리의 기능 + +339 +00:23:35,849 --> 00:23:40,808 + 올 모든 일의 함수 자신의 기록 단위 때문에 + +340 +00:23:40,808 --> 00:23:45,288 + 그 전에 그래서 그것은 불안 비대칭 또는 뭔가처럼 같은 종류의 + +341 +00:23:45,288 --> 00:23:48,720 + 그래서 몇 주 전에에서 매우 영리한 어떤 생각을 가지고 종이가있다 + +342 +00:23:48,720 --> 00:23:54,249 + 모든 것이 아주 좋은 곳이 훨씬 더 좋은 균일 한 라이프 스타일입니다 + +343 +00:23:54,249 --> 00:23:58,118 + 어떻게 우리가 할 수 있었던 사람이 생각할 수있는 경우 마진과 간단하고 그래서 난 몰라 + +344 +00:23:58,118 --> 00:24:06,819 + 하지만 우리는 모든 것을 훨씬 더 청소기를 만들기 위해 할 수 있습니다 내가 할 수 없었던 나는 때문에 + +345 
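The LRCN-style setup just described -- run a ConvNet on every frame, then let a recurrent layer string the per-frame features together so the prediction at each time step has unbounded temporal context -- can be sketched with a vanilla RNN (the papers discussed use LSTMs; sizes and names here are made up):

~~~python
import numpy as np

# Pretend fc7-style CNN features: one 4096-d vector per frame (random stand-ins).
T, D, H, C = 16, 4096, 256, 10
feats = np.random.randn(T, D)

Wxh = np.random.randn(D, H) * 0.01   # input-to-hidden
Whh = np.random.randn(H, H) * 0.01   # hidden-to-hidden (the recurrence)
Why = np.random.randn(H, C) * 0.01   # hidden-to-class scores

h = np.zeros(H)
for x in feats:
    # h depends on every frame seen so far, not just a short clip.
    h = np.tanh(x @ Wxh + h @ Whh)
scores = h @ Why                     # classify from the final state
~~~

Swapping the tanh update for an LSTM or GRU cell changes the cell arithmetic but not the overall shape of the computation.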
+00:24:06,819 --> 00:24:09,019 + 이 아이디어를 제공하지만 난 그것을 읽고 무엇 멋진라고 생각하지 않습니다 + +346 +00:24:09,019 --> 00:24:22,399 + 주석이 실제로 어떤 것을 확실하지 않은 이미지 처리를 시작하기 전에 + +347 +00:24:22,398 --> 00:24:25,288 + 당신이 찢어진 것 참조 산산이 광 정보 및 의견을 줄 것이다 + +348 +00:24:25,288 --> 00:24:30,169 + 어떻게 든 당신이 확실히의 함수이다 신경을 것 위에 + +349 +00:24:30,169 --> 00:24:34,090 + 그것은하지만 모든 미국 팀이이 경우에 일을해야 될지 분명하지 않다 + +350 +00:24:34,089 --> 00:24:37,388 + 아마에서 처리 너무 낮은 수준의 픽셀을 흐리게 될 가능성이 + +351 +00:24:37,388 --> 00:24:51,678 + 그 시점은 다음 작품을 참을 같은 미디어를 많이있다 + +352 +00:24:51,679 --> 00:24:56,389 + 이 문제는 모든 비트를 찾고 있음을 다르게 시간적 해상도 + +353 +00:24:56,388 --> 00:25:04,038 + 모든 모든 여행 친구처럼 보이는 내가 그래서 당신의 말을 또 다른 시간 + +354 +00:25:04,038 --> 00:25:07,009 + 나는 당신이이 걸릴 경우 다른 사람이 지적한 것과 유사한 생각 아이디어 + +355 +00:25:07,009 --> 00:25:10,179 + 비디오 당신은 때 비디오를 빠르게 해당 동영상에 여러 저울에서 작동 + +356 +00:25:10,179 --> 00:25:14,778 + 당신은 비디오를 느리게 그리고 당신은 그 앞줄에있어 온 3D했습니다 + +357 +00:25:14,778 --> 00:25:23,989 + 그것은 현명한 생각이 같은 속도 또는 뭔가처럼 배경을 수행 할 수 있습니다 + +358 +00:25:23,989 --> 00:25:26,669 + 일을보기 위하여 흥미에 내가 그는 생각에 빼기 만 보면 + +359 +00:25:26,669 --> 00:25:30,639 + 내가 생각하는 합리적인 생각은 종류의 엔드 - 투 - 엔드를 갖는이 아이디어에 반하는 + +360 +00:25:30,638 --> 00:25:33,868 + 당신은 당신이 생각하는이 명시 적 계산과 같이 소개하고 있기 때문에 학습 + +361 +00:25:33,868 --> 00:25:37,759 + 그가 가지고로서 유용 + +362 +00:25:42,288 --> 00:25:48,658 + 3 차원 사이에 공유가 나오고 그들이 그 재미의 내가 아니에요 + +363 +00:25:48,659 --> 00:25:52,139 + 아르 논 때문에 확실히 백퍼센트는 상태 벡터와 행렬을 잤다된다 + +364 +00:25:52,138 --> 00:25:55,678 + 곱셈과 사물처럼하지만 진정 플레이어에서 우리는 공간을 싫어했다 + +365 +00:25:55,679 --> 00:26:05,369 + 구조 나 공유가 작동하는 방법을 실제로 모르겠지만 그래 좋아하므로 + +366 +00:26:05,368 --> 00:26:11,319 + 아이디어는 우리가 우리가있어 지금 없애하는거야 보게 될 것이다 + +367 +00:26:11,319 --> 00:26:14,408 + 기본적으로이에 걸릴 것 우리는 모든 단일 신경 세포를 만들거야 + +368 +00:26:14,409 --> 00:26:17,379 + 그 모든 같은 작은 재발 성 신경 네트워크로 나온다 + +369 +00:26:17,378 --> 00:26:21,648 + 하나의 신경 세포가 확인하는 방식 때문에이 작동합니다 진정에 재발된다 + +370 +00:26:21,648 --> 00:26:27,178 + 그리고 나는 그것이 아름다운 생각하지만, 자신의 사진이 그렇게 추한의 종류의 종류 + +371 +00:26:27,179 --> 00:26:29,730 + 많은이 말도 안돼 위해 이렇게 나를 약간이 설명하려고하자 + +372 +00:26:29,730 --> 00:26:36,278 + 우리가 대신 무엇을 할 거 야 다른 방법은 우리가 어딘가에 발신자를 가지고있다 + +373 +00:26:36,278 --> 00:26:40,278 + 신경 네트워크가 수술 이전에 침착 아래에서 입력을 받아 또는 + +374 +00:26:40,278 --> 00:26:43,398 + 우리는이를 통해 경쟁을하고있는 일이의 출력을 계산하기 + +375 +00:26:43,398 --> 00:26:47,528 + 여기에 아이디어는 우리가 매일 조금 오는 만들려고하고있다 우측 있도록 층 + +376 +00:26:47,528 --> 00:26:53,058 + 나중에 때문에 재발 플레이어의 종류 우리가 할 길을 우리가 그대로입니다 + +377 +00:26:53,058 --> 00:26:57,528 + 에 대한 우리는 우리 아래에서 입력을 받아 우리는 그 위에 오는 않지만 우리는 또한 우리를 취할 + +378 +00:26:57,528 --> 00:27:00,778 + 대신 이전 시간으로부터 이전 출력 + +379 +00:27:00,778 --> 00:27:05,638 + 그 외에도 이전 시간 단계에서이 발신자 그래서 거기 플레이어 + +380 +00:27:05,638 --> 00:27:09,408 + 이 때 물건과 우리가 모두이 이상 대회를 수행하는 것이 현재의 입력 + +381 +00:27:09,409 --> 00:27:13,830 + 하나 하나, 그리고, 우리는 종류의 우리가 우리가있을 때 호출하지 않습니다 알고있다 + +382 +00:27:13,829 --> 00:27:19,490 + 이전 복장에서 현재 입력하고 정품 인증에서 이러한 활성화 및 + +383 +00:27:19,490 --> 00:27:24,649 + 우리는 그들을 추가하거나 우리가 병합 같은 그 일처럼 재발을 수행하는 것이 같은 + +384 +00:27:24,648 --> 00:27:28,719 + 그 두 생산의 최대이며, 그래서 우리는 현재의 입력의 기능이야 + +385 +00:27:28,720 --> 00:27:34,730 + 뿐만 아니라 이전 활성화의 기능은 너무 감각을 만드는 경우 + +386 +00:27:34,730 --> 00:27:37,200 + 그것은이 두 차원을 사용하여 사실이었다 즉 대해 아주 좋다 + +387 +00:27:37,200 --> 00:27:41,149 + 여기에 대회 이들 모두는 어디 때문에 더 차원 수는 없다 + +388 +00:27:41,148 --> 00:27:44,678 + 이전 야그의 리암의 깊이 권한에 의해 높이로 폭은 매우 함께 + +389 +00:27:44,679 --> 00:27:49,309 + 이전 계층의 깊이와 우리는 이전 시간에서 높은 깊이있는 + +390 +00:27:49,308 --> 00:27:52,408 + 이들 중 일부는 두 가지 차원 대회하지만 우리는 종류와 끝까지 + +391 +00:27:52,409 --> 00:27:57,710 + 재발 여기에 프로세스 등 하나의 방법처럼 
재발과이를 볼 수 있습니다 + +392 +00:27:57,710 --> 00:28:00,659 + 우리가 바라 보았다 신경망은이 재발 위치를 가지고있다 + +393 +00:28:00,659 --> 00:28:03,980 + 당신은 상태에서 경쟁하기 위해 노력하고 있으며 이전 상태의 함수이다 + +394 +00:28:03,980 --> 00:28:07,878 + 현재 공격은 그래서 우리는 실제로 여러 가지 방법으로 보았다 + +395 +00:28:07,878 --> 00:28:14,058 + 연구 개의 포 엘 존중가 그래서 그 재발 또는 GRU GRU까지 배선 + +396 +00:28:14,058 --> 00:28:17,950 + LSD의 간단한 버전입니다 당신이 기억하지만 경우는 거의 항상 비슷한 있습니다 + +397 +00:28:17,950 --> 00:28:21,548 + 분석 팀에 성능이 약간 다른 업데이트 수식에 대한 GRU 그래서 + +398 +00:28:21,548 --> 00:28:24,499 + 실제로이 논문은에 무엇을 그 재발을 수행하고 참조 + +399 +00:28:24,499 --> 00:28:27,950 + 이 오스트리아의 간단한 버전이기 때문에 기본적으로 그들은 GRU을 그 + +400 +00:28:27,950 --> 00:28:31,899 + 단지뿐만 아니라 대신 모든 단일 매트릭스 작동하는 것은 일종의처럼 곱 + +401 +00:28:31,898 --> 00:28:36,758 + 진정으로 대체 당신은 당신이 상상할 수있는 수 있다면 그 모든 단일 행렬 + +402 +00:28:36,759 --> 00:28:41,819 + 여기에 곱하면 바로 전화가 그래서 우리는 우리의 입력을 통해 발전 할 수지고가 + +403 +00:28:41,819 --> 00:28:45,798 + 큰 출력을 포함하고는 이전의와 아래, 그리고, 우리는 결합 + +404 +00:28:45,798 --> 00:28:50,329 + 다만 미 GRU의 재발과 그들이 실제로 우리의 활성화를 가져올 수 및 + +405 +00:28:50,329 --> 00:28:57,158 + 이 같은 모습과 지금은 그냥 보이는 전에 그래서 우리는이 없습니다 + +406 +00:28:57,159 --> 00:29:01,179 + 일부 지역의 인터넷과 범위의 일부는 우리가 그냥이이 유한 한 우리의 + +407 +00:29:01,179 --> 00:29:05,679 + 소득은 모든 단일 층 전에하지만 컴퓨팅하지만 반환되는 경우 있음 + +408 +00:29:05,679 --> 00:29:06,410 + 또한 재미 + +409 +00:29:06,410 --> 00:29:11,610 + 이전 노력과 모두의 함수로 그에 따라서이 링크 + +410 +00:29:11,609 --> 00:29:14,990 + 균일 한의 매우 친절 그리고 좀 유전자처럼 그냥 233을 너무 많이 불리는 그 + +411 +00:29:14,990 --> 00:29:19,799 + 멕시코에서 인도 재발하고는 어쩌면 그건 내 간단한 그냥 대답의의 + +412 +00:29:19,799 --> 00:29:27,579 + 일이 이렇게 누군가 당신은 공간 시간 상용 네트워크를 사용하고 싶습니다 그래서 만약 + +413 +00:29:27,579 --> 00:29:30,819 + 당신의 프로젝트와 매우 흥분 때문에 동영상에 제일 먼저에 + +414 +00:29:30,819 --> 00:29:34,359 + 중지하면됩니다 그리고 당신은 당신이 정말로 필요 여부에 대해 생각해야 + +415 +00:29:34,359 --> 00:29:37,740 + 프로세스 운동 또는 전역 움직임이나 감정이 정말 중요합니다 당신의 + +416 +00:29:37,740 --> 00:29:41,839 + 분류 작업 당신이 정말로 운동이 그 다음 생각에 중요하다고 생각하는 경우 + +417 +00:29:41,839 --> 00:29:44,829 + 로컬 움직임이 그가 중요하다 모델링 할 필요가 있는지 여부에 대한 + +418 +00:29:44,829 --> 00:29:46,929 + 모든 전역 움직임을 위해 매우 중요하다 + +419 +00:29:46,930 --> 00:29:50,370 + 당신은 항상에이에 대해 당신이 시도해야 당신이의 힌트를 얻을에 기반 + +420 +00:29:50,369 --> 00:29:54,069 + 내가 말을 기준으로 한 해당 비교 한 다음 사용하여 시도해야 + +421 +00:29:54,069 --> 00:29:57,539 + 광학 플로우는 것 때문에 그 경우 데이터의 당신이 특히 적은 양의 그것 + +422 +00:29:57,539 --> 00:30:02,039 + 실제로는 아주 좋은 신호 세금 선취 특권 코드처럼 매우 중요하다 및 + +423 +00:30:02,039 --> 00:30:06,099 + 명시 적으로 광 흐름이 나와 보는 유용한 기능이라고 지정 + +424 +00:30:06,099 --> 00:30:09,609 + 당신이 지금 막 오후 일을보고있는이 박사를 시도하지만이를 생각할 수 + +425 +00:30:09,609 --> 00:30:12,599 + 실험도 최근 그래서 나는 실제로 내가 충분히 할 수있는 경우에 확실하지 않다 + +426 +00:30:12,599 --> 00:30:16,589 + 보증하거나 작동하는 경우가 아주 좋은 아이디어처럼 보인다하지만되지 않았습니다 + +427 +00:30:16,589 --> 00:30:21,849 + 아직 검증 그래서 그 행복 프로세스의 바위 레이아웃과 같은 종류의의의 + +428 +00:30:21,849 --> 00:30:25,339 + 현장에서 동영상 그래서 나는 저스틴 가고 있기 때문에 질문이 있는지 알고 + +429 +00:30:25,339 --> 00:30:28,339 + 다음에 올 + +430 +00:30:33,980 --> 00:30:43,289 + 이 일이 사용하지 않은보고있는 모든 P는 내가 안 좋은 질문 이잖아 + +431 +00:30:43,289 --> 00:30:46,879 + 내가 LLP 슈퍼 괜찮아요 전문가가 아니에요하지만이 생각하기 전에 보지 못했지만 그렇게 생각 + +432 +00:30:46,880 --> 00:30:52,980 + 그래서 나는 내가 너무 좋아 생각하지 않는다 그녀를 보지 못했다 추측 할 것 + +433 +00:31:18,880 --> 00:31:26,660 + 만 가진 측에 나는 확실히 뭔가 사람들이 할 말을 + +434 +00:31:26,660 --> 00:31:31,810 + 당신은 단지 사람들 때문에 둘 다 할 너무 많은 논문을 볼 수 없습니다 싶은 + +435 +00:31:31,809 --> 00:31:35,639 + 그리고 사람의 수면 문제의 종류와 같은 것은 어쩌면 그들을 해결되지 공동으로하지만, + +436 +00:31:35,640 --> 00:31:38,620 + 확실히 회사는 실제 시스템에 뭔가 작업을 얻으려고 노력하는 당신 + +437 +00:31:38,619 --> 00:31:42,869 + 그런 일을 할 것입니다하지만 난 당신이 할 것입니다 거기에 아무것도 생각하지 않습니다 + +438 
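The RCN idea summarized above keeps this recurrence but replaces every matrix multiply with a convolution, so each activation map is a function of the layer below at the current frame *and* of the same layer's map at the previous frame, while spatial weight sharing is preserved. A single-channel sketch with a plain tanh recurrence standing in for the paper's GRU gating (this shows the wiring, not the exact cell):

~~~python
import numpy as np
from scipy.signal import convolve2d

def conv_rnn_step(x, h_prev, w_x, w_h):
    """One recurrent-convolutional update: the new hidden map comes from
    3x3 convolutions over the current input map and the previous hidden
    map ('same' padding keeps the spatial size)."""
    return np.tanh(convolve2d(x, w_x, mode='same')
                   + convolve2d(h_prev, w_h, mode='same'))

H, W = 32, 32
w_x = np.random.randn(3, 3) * 0.1
w_h = np.random.randn(3, 3) * 0.1
h = np.zeros((H, W))
for t in range(10):                  # ten frames' worth of feature maps
    x_t = np.random.randn(H, W)      # stand-in for the layer below's output
    h = conv_rnn_step(x_t, h, w_x, w_h)
~~~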
+00:31:42,869 --> 00:31:45,449 + 당신은 아마 당신은이 말 융합 접근 방식으로이 작업을 수행 + +439 +00:31:45,450 --> 00:31:49,039 + 오디오에서 가장 잘 작동하고 밖으로 나왔다 무엇이든 동영상에 가장 적합 + +440 +00:31:49,039 --> 00:31:55,029 + 어딘가 나중에 어떻게 든하지만 내가 할 수있는 유일한 뭔가하고와 함께 주장 + +441 +00:31:55,029 --> 00:31:57,639 + 신경망 권리 매우 간단 당신은 그냥 선수가 있기 때문에 + +442 +00:31:57,640 --> 00:32:00,410 + 어떤 점에서 둘의 출력을보고 다음은로 분류하고 + +443 +00:32:00,410 --> 00:32:09,860 + 모두의 기능은 그래서 우리는 그들을 놀라게 할거야 그리고 나는 우리가 얻어야 할 것 같아요 + +444 +00:32:09,859 --> 00:32:11,179 + 이리 + +445 +00:32:11,180 --> 00:32:14,180 + 희망 그것은 작동 + +446 +00:32:29,148 --> 00:32:34,108 + 확인 그래서 우리가 완전히 완전히 거 스위치 기어를하고있어 대한 이야기​​ 같아요 + +447 +00:32:34,108 --> 00:32:38,199 + 자율 학습은 그래서 여기에 대비 약간을하고 싶습니다 + +448 +00:32:38,200 --> 00:32:42,460 + 먼저 우리의 기본 정의 어떤 종류의에 대한 거 얘기 야 + +449 +00:32:42,460 --> 00:32:46,009 + 자율 학습은 우리는 방법에 대한 두 개의 서로 다른 종류의 이야기거야 + +450 +00:32:46,009 --> 00:32:50,858 + 그 자율 학습은 최근에 그래서 사람을 추방에 의해 공격되었다 + +451 +00:32:50,858 --> 00:32:53,408 + 특히 우리는 자동차 인코더와의이 아이디어에 대한 이야기​​를 거 + +452 +00:32:53,409 --> 00:32:58,679 + 적대적 네트워크와 내가 바로 그렇게 꽤 많이 내 리모콘이 필요 같아요 + +453 +00:32:58,679 --> 00:33:03,259 + 우리가 지금까지이 클래스에서 본 적이 모든 기본 그래서지도 학습이다 + +454 +00:33:03,259 --> 00:33:07,128 + 거의 모든지도 학습 문제 뒤에 설치는 우리가 가정이다 + +455 +00:33:07,128 --> 00:33:11,769 + 우리의 데이터 세트는 각 데이터 포인트의 종류는 두 가지 부품의 종류를 가지고있다 우리는이 + +456 +00:33:11,769 --> 00:33:15,858 + 우리의 데이터 액세스 한 다음 우리는 우리가 원하는 것을 왜 어떤 라벨 또는 출력이 + +457 +00:33:15,858 --> 00:33:20,028 + 해당 입력에서 해당로부터 생산 및 감독 학습에서 우리의 전체 목표는 + +458 +00:33:20,028 --> 00:33:24,888 + 우리의 매입 세액에 걸리는 일부 기능을 학습하고이 출력을 생성합니다 + +459 +00:33:24,888 --> 00:33:29,538 + 또는 당신이 정말로 그것을 거의 거의 모든 것에 대해 생각하는 이유와 경우 레이블 + +460 +00:33:29,538 --> 00:33:33,088 + 우리가이 클래스에서 보았던 것은이지도 학습의 일부 예입니다 + +461 +00:33:33,088 --> 00:33:37,358 + 다음 이미지로 이미지를 분류 행위 같은 뭔가를 설정하고 + +462 +00:33:37,358 --> 00:33:41,960 + 물체 검출과 같은의 라벨은 왜 이미지 및 액세스 이유 + +463 +00:33:41,960 --> 00:33:46,119 + 가 될 수 이유를 찾을 수 없습니다 이미지에서 개체의 집합 어쩌면이다 + +464 +00:33:46,118 --> 00:33:50,238 + 이 될 수 왜 우리가 캡처 이름을보고 캡션 후 이제 비디오하고 수 + +465 +00:33:50,239 --> 00:33:55,838 + 레이블 또는 캡션 또는 거의 아무것도 아무것도 중 하나는 그래서 난 그냥 원하는 + +466 +00:33:55,838 --> 00:33:59,450 + 학습 감독 점이 강력한이 매우 매우 매우 강력하게 + +467 +00:33:59,450 --> 00:34:03,819 + 그리고 포함 일반적인 프레임 워크는 우리가에서 수행 한 모든 것을 포함 + +468 +00:34:03,819 --> 00:34:08,960 + 지금까지 클래스와 다른 점은지도 학습은 실제로 시스템을 만드는 것입니다 + +469 +00:34:08,960 --> 00:34:12,639 + 즉, 실제로는 정말 잘 작동 시스템을 작동하고 매우 유용합니다 + +470 +00:34:12,639 --> 00:34:14,628 + 실제 응용 + +471 +00:34:14,628 --> 00:34:17,898 + 내가 생각 자율 학습은 개방 연구의 조금 더 + +472 +00:34:17,898 --> 00:34:22,338 + 정말 멋진, 그래서이 시점에서 질문 나는 정말 생각 + +473 +00:34:22,338 --> 00:34:26,199 + 일반적으로 사람을 해결하기위한 중요하지만이 시점에서 그것은 아마도 약간의 + +474 +00:34:26,199 --> 00:34:30,028 + 영역의 유형에 대한 연구의 초점 이상의 비트는 또한 약간의 작은 + +475 +00:34:30,028 --> 00:34:34,568 + 우리가 일반적으로 우리 가정 자율 학습, 그래서 잘 정의 + +476 +00:34:34,568 --> 00:34:37,579 + 우리는 PACS를 그냥 데이터 우리는 어떤 이유가없는 한 + +477 +00:34:38,349 --> 00:34:44,009 + 및 자율 학습의 목표는 데이터의 역할과 일을하는 것입니다 + +478 +00:34:44,009 --> 00:34:48,199 + 우리가 정말로하려는 일이 너무 일부 그래서 문제에 따라 달라집니다 + +479 +00:34:48,199 --> 00:34:51,939 + 일반적으로 우리는 우리가에 잠재 구조의 몇 가지 유형을 발견 할 수 있기를 바랍니다 + +480 +00:34:51,940 --> 00:34:56,710 + 데이터는 명시 적으로 어떤 레이블에 대해 아무것도 모른 채 역할 + +481 +00:34:56,710 --> 00:34:59,650 + 당신이 이전의 기계 학습에서 볼 수도 고전적인 예 + +482 +00:34:59,650 --> 00:35:04,009 + 클래스는 그래서 수단과 같은 우리가 그냥있어 클러스터링 같은 것들이 될 것이다 + +483 +00:35:04,009 --> 00:35:07,728 + 점의 무리 우리는로를 구분하여 구조를 발견 + +484 +00:35:07,728 --> 00:35:13,268 + 클러스터는 자율 학습의 다른 고전적인 예는 것 + +485 +00:35:13,268 --> 00:35:18,248 + X이 시점에서 그냥 주성분 분석과 같은 + +486 
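k-means clustering, named above as a classic unsupervised example, fits the definition directly: it discovers structure in unlabeled points by alternating two steps, with no labels anywhere. A compact NumPy sketch (random 2-d data as a stand-in):

~~~python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain k-means: alternate assigning each point to its nearest
    centroid and recomputing each centroid as the mean of its points."""
    centroids = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids, assign

X = np.random.randn(300, 2)          # unlabeled points
centroids, assign = kmeans(X, k=3)
~~~

PCA, the other classic example the lecture turns to next, likewise needs only the unlabeled X.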
+00:35:18,248 --> 00:35:22,098 + 데이터의 우리는 그 중 일부 저 차원 표현을 발견 할 + +487 +00:35:22,099 --> 00:35:27,170 + 입력 데이터 그래서 자율 학습이 정말 종류의 멋진 지역 만입니다 + +488 +00:35:27,170 --> 00:35:30,519 + 조금 더 문제가 구체적이고 약간은 덜 잘 정의 + +489 +00:35:30,518 --> 00:35:37,228 + 아키텍처로 특정되어 있으므로 두 가지를 학습 감독 + +490 +00:35:37,228 --> 00:35:42,358 + 깊은 학습 사람들은이 아이디어로 자율 학습에 대해 수행 한 + +491 +00:35:42,358 --> 00:35:46,048 + 오디오 인코더의이 아이디어는 전통적인 오스만의 종류에 대해 이야기합니다 + +492 +00:35:46,048 --> 00:35:49,318 + 또한 변분에 대해 이야기하는 아주 아주 오랜 역사를 가지고 분기 + +493 +00:35:49,318 --> 00:35:54,308 + 뉴스의이 종류이다 자동 인코더는 것입니다 그들에 아시아 트위스트를 냉각 + +494 +00:35:54,309 --> 00:35:57,729 + 실제로 일부 생식 적대적 네트워크에 대해이 정말 좋은 이야기 + +495 +00:35:57,728 --> 00:36:06,718 + 생각하지만 당신은 너무 자연스러운 이미지의 이미지와 모델 샘플을 생성 할 수 + +496 +00:36:06,719 --> 00:36:09,548 + 인 오디오 인코더와 아이디어는 매우 간단하다 + +497 +00:36:09,548 --> 00:36:14,088 + 우리는 일부 데이터이며, 우리는이 입력 거 패스 야하는 우리의 입력 자루가 + +498 +00:36:14,088 --> 00:36:19,710 + 인코딩 네트워크의 어떤 종류를 통해 데이터에서 일부 기능을 생산하는 일부 + +499 +00:36:19,710 --> 00:36:24,440 + 이 단계를 생각할 수이 있도록 잠재 기능을 사용하면 약간을 생각할 수 + +500 +00:36:24,440 --> 00:36:28,219 + 우리는 우리의 입력을거야 학습 가능 주요 구성 요소 분석과 같은 비트 + +501 +00:36:28,219 --> 00:36:33,298 + 다음 데이터 그래서 그 많은 다른 기능 표현으로 변환 + +502 +00:36:33,298 --> 00:36:38,940 + 이 10 이미지 때문에이 여기에 표시됩니다 같은 시간은 이러한 액세스는 이미지가 될 것입니다 + +503 +00:36:38,940 --> 00:36:42,989 + 이 인코더 네트워크는 같은 뭔가를 이렇게 아주 복잡한 일을 할 수 + +504 +00:36:42,989 --> 00:36:47,228 + PCA는 그냥 간단한 선형 변환이야 그러나 일반적으로이 완벽하게 될 수 있습니다 + +505 +00:36:47,228 --> 00:36:51,799 + 연결된 네트워크 원래 종류의 아마 다섯 10 년 전 + +506 +00:36:51,800 --> 00:36:56,130 + 종종 하나의 그들은 그것의 현재 시그 모이 단위로 네트워크에 완벽하게 연결되어 + +507 +00:36:56,130 --> 00:37:00,410 + 트레일러 단위 종종 깊은 깊은 네트워크와이 또한 뭔가 될 수 있습니다 + +508 +00:37:00,409 --> 00:37:09,230 + 길쌈 바로 그렇게 작동하지처럼 우리는이 생각이있는 Z + +509 +00:37:09,230 --> 00:37:13,820 + 그래서 역할을보다 우리가 배울 수있는 기능의 크기는 일반적으로 작은 + +510 +00:37:13,820 --> 00:37:18,789 + 데이터 그래서 우리는 우리의 역할에 대해 우리는 유용한 기능의 일종 할 필요가 없습니다 + +511 +00:37:18,789 --> 00:37:22,610 + 그냥 몇 가지로 인터넷 전송에게 데이터를 변환하기 위해 네트워크를 원하지 않는다 + +512 +00:37:22,610 --> 00:37:26,370 + 쓸모없는 표현은 우리가 실제로 데이터를 분쇄 강제로 원하는 + +513 +00:37:26,369 --> 00:37:29,900 + 통계 및 희망 도움이 될 수있는 몇 가지 유용한 방법을 요약 + +514 +00:37:29,900 --> 00:37:34,720 + 사람 다운 스트림 처리하지만 문제는 우리가 정말 어떤을하지 않아도됩니다 + +515 +00:37:34,719 --> 00:37:39,219 + 명시적인 레이블 그래서 대신에 우리가 필요로하는이 다운 스트림 처리를 위해 사용하는 + +516 +00:37:39,219 --> 00:37:43,159 + 대리의 어떤 종류를 발명 우리가 단지 데이터를 사용하여 사용할 수있는 요청 + +517 +00:37:43,159 --> 00:37:50,159 + 자체 회로는 우리가 자주 자동 인코더에 사용하는 요구 있도록이 좋습니다 + +518 +00:37:50,159 --> 00:37:55,719 + 재건의 우리는 매핑을 대신 배울 수있는 지혜가없는 사람, 그래서 + +519 +00:37:55,719 --> 00:38:00,119 + 우리는 이러한 기능의 Z에서 데이터의 행위를 재현 단지 거 시도하고 있고 + +520 +00:38:00,119 --> 00:38:05,119 + 이러한 기능은보다 크기가 작은 특히 희망 그것은 강제합니다 + +521 +00:38:05,119 --> 00:38:07,139 + 네트워크 요약하는 역할을합니다 + +522 +00:38:07,139 --> 00:38:11,420 + 입력 데이터의 유용한 통계 요약 희망 발견 할 + +523 +00:38:11,420 --> 00:38:16,289 + 재건하지만 더 유용 하나가 될 수있는 몇 가지 유용한 기능 + +524 +00:38:16,289 --> 00:38:19,920 + 일반적으로 이러한 기능은 다른 작업에 유용 할 수 있습니다 수 있습니다 우리의 경우 + +525 +00:38:19,920 --> 00:38:26,340 + 나중에 어떤 감독 데이터를 얻을 그래서 다시이 디코더 네트워크는 꽤 될 수있다 + +526 +00:38:26,340 --> 00:38:30,050 + 숙소에서 자동 그래서 처음에 대한 왔을 때 복잡 + +527 +00:38:30,050 --> 00:38:33,720 + 종종 이들은 단지 간단한 선형 네트워크 또는 작은 하나 있었다 + +528 +00:38:33,719 --> 00:38:37,459 + 네트워크 신호하지만 지금은 깊이 네트워크와 종종이 될 수 있습니다 + +529 +00:38:37,460 --> 00:38:43,220 + 길쌈까지 될 것입니다 것은 너무 메이슨 작은 풍선 슬라이드, 그래서 좋은 시간입니다 + +530 +00:38:43,219 --> 00:38:46,869 + 자주이 디코더는 현재이 최대 길쌈 네트워크 중 하나가 될 것입니다 + +531 +00:38:46,869 --> 00:38:50,529 + 즉, 다시 수 있습니다 귀하의 입력 데이터보다 크기가 작은 당신의 기능을한다 + +532 
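Putting the autoencoder pieces described above into the smallest possible form: an encoder compresses x into a bottleneck z with fewer dimensions than the input, a decoder tries to reconstruct x from z, and (as the lecture notes shortly) the decoder can simply reuse the transposed encoder weights, with the whole thing trained on an L2 reconstruction loss -- no labels required. Sizes here are illustrative:

~~~python
import numpy as np

D, H = 784, 64                      # input dim, bottleneck dim (H < D)
W = np.random.randn(D, H) * 0.01    # encoder weights; decoder ties to W.T

def forward(x):
    z = np.maximum(0, x @ W)        # encoder: squeeze x down to H features
    x_hat = z @ W.T                 # decoder: reconstruct x from z
    return z, x_hat

x = np.random.rand(D)               # stand-in for a flattened image
z, x_hat = forward(x)
loss = np.sum((x_hat - x) ** 2)     # L2 reconstruction loss, from x alone
~~~

The bottleneck (H < D) is what stops the network from learning a trivial identity map and forces it to summarize the data.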
+00:38:50,530 --> 00:38:56,880 + 및 종류의 원본 데이터를 재생 내가 좋겠하는 크기까지 다시 불면 + +533 +00:38:56,880 --> 00:39:00,579 + 이러한 일들이 실제로 그렇게 훈련을 아주 쉽게 있다는 점을 확인하려면 + +534 +00:39:00,579 --> 00:39:04,610 + 바로 여기가 그래서 난 그냥 토치에서 요리하는 간단한 예제입니다 + +535 +00:39:04,610 --> 00:39:05,050 + 래리 + +536 +00:39:05,050 --> 00:39:09,210 + 최대 그들의 디코더에 대한 모든 작업을 수행되는 코드 + +537 +00:39:09,210 --> 00:39:12,420 + 컨볼 루션 네트워크 당신은 실제로 재구성 배운다있어 것을 알 수 있습니다 + +538 +00:39:12,420 --> 00:39:19,159 + 가끔 볼 꽤 잘 다른 것은 데이터가 이러한 인코더이다 + +539 +00:39:19,159 --> 00:39:23,799 + 및 디코더 네트워크는 때로는 종류의 같은과 가중치를 공유합니다 + +540 +00:39:23,800 --> 00:39:27,740 + 정규화 전략과 이러한 반대 있음이 직감으로 + +541 +00:39:27,739 --> 00:39:32,329 + 작업은 그래서 어쩌면 난 둘 정도 같은 대기를 사용하려고하는 것은 의미가 있습니다 + +542 +00:39:32,329 --> 00:39:36,659 + 당신이 완전히 연결에 대해 생각하면 그냥 구체적인 예를 들어 당신은에 있다면 + +543 +00:39:36,659 --> 00:39:39,980 + 네트워크는 아마도 사용자의 입력 데이터의 일부 치수 D를 갖는 + +544 +00:39:39,980 --> 00:39:44,070 + 그리고 당신은 늦게와 데이터는 약간 작은 치수 H를 가지고있는 경우 것 + +545 +00:39:44,070 --> 00:39:47,769 + 이 인코더는 가중치 그냥 될 것 그냥 완전히 연결 네트워크했다 + +546 +00:39:47,769 --> 00:39:51,630 + 이 두바이 시대의 매트릭스와 지금 우리가 디코딩을하고하려고 할 때 + +547 +00:39:51,630 --> 00:39:54,470 + 보다 원래의 데이터를 재구성 + +548 +00:39:54,469 --> 00:39:59,129 + 다시 D에 각각의 뒷면에서 매핑 그래서 우리는 단지이 동일한 가중치를 재사용 할 수 있습니다 + +549 +00:39:59,130 --> 00:40:06,420 + 우리가이 일을 훈련 할 때 두 가지 영역 우리는 너무 행렬의 전치을 + +550 +00:40:06,420 --> 00:40:10,300 + 우리는 비교하는 데 사용할 수있는 손실 함수의 어떤 필요 + +551 +00:40:10,300 --> 00:40:15,400 + 재구성 된 우리의 원래 데이터와 데이터를 다음 번 자주 것 다 + +552 +00:40:15,400 --> 00:40:20,220 + 우리가하면, 그래서 유클리드 손실 지옥 같은 간단한에 L이 일을 훈련합니다 + +553 +00:40:20,219 --> 00:40:24,659 + 우리의 인터넷 작업을 선택하고 우리는 번째 분기 네트워크와 기능을 선택한 후 + +554 +00:40:24,659 --> 00:40:28,329 + 우리는 다른 보통의 신경망처럼이 일을 훈련 할 수있는 우리 + +555 +00:40:28,329 --> 00:40:32,420 + 디코딩을 통해 우리를 통해 전달할 일부 데이터를 인코딩에 도착 우리는 통과 + +556 +00:40:32,420 --> 00:40:37,900 + 컴퓨터 법률 sweetback의 전파 모든 것이 우리가이 훈련 그래서 일단 좋은 + +557 +00:40:37,900 --> 00:40:41,880 + 물건은 자주 우리가 너무 많은 지출이 디코더 네트워크를 취할 것 + +558 +00:40:41,880 --> 00:40:46,700 + 시간 학습과 난 그냥 좀 이상한 보인다 그것을 멀리 던질거야하지만, + +559 +00:40:46,699 --> 00:40:52,129 + 이유는 자체적으로 재구성하므로 대신 같은 유용한 작업 없다는 것이다 우리 + +560 +00:40:52,130 --> 00:40:56,349 + 입니다 실제로 유용한 작업의 일종으로 이러한 네트워크를 적용 할 + +561 +00:40:56,349 --> 00:41:01,099 + 아마 설정하기 때문에 여기에 감독 학습 과제는 우리가 배운 것입니다 + +562 +00:41:01,099 --> 00:41:05,179 + 희망이 모든 자율 데이터로부터이이 엔코더 네트워크 + +563 +00:41:05,179 --> 00:41:08,799 + 데이터를 압축하고 몇 가지 유용한 기능을 추출하기 위해 배운 등장 + +564 +00:41:08,800 --> 00:41:13,190 + 그리고, 우리는 더 큰 부분을 초기화하기 위해 인코더 네트워크를 사용하는거야 + +565 +00:41:13,190 --> 00:41:17,650 + 우리가 실제로 어쩌면 몇 가지 작은에 액세스 할 경우 지금 감독 업무와 + +566 +00:41:17,650 --> 00:41:18,280 + 데이터 세트 + +567 +00:41:18,280 --> 00:41:22,590 + 다음 희망이 작업 대부분이 여기에있는 수있는 몇 가지 레이블이 + +568 +00:41:22,590 --> 00:41:26,309 + 처음에이 자율 훈련을 수행 한 후 우리는 할 수있는 한 + +569 +00:41:26,309 --> 00:41:29,699 + 전체이 더 큰 네트워크를 한 후 미세 조정이를 초기화하는 것을 사용 + +570 +00:41:29,699 --> 00:41:35,509 + 관리 대상 데이터의 희망을 아주 소량 것은 그래서 이것은의 종류 + +571 +00:41:35,510 --> 00:41:39,380 + 자율 기능 학습의 꿈 중 하나의 꿈 당신을 + +572 +00:41:39,380 --> 00:41:43,410 + 그냥 구글에 갈 수없는 레이블이 정말 큰 데이터 세트를 + +573 +00:41:43,409 --> 00:41:46,409 + 영원히 이미지를 다운로드하고 이미지를 많이 얻을 정말 쉽습니다 + +574 +00:41:46,969 --> 00:41:51,399 + 문제는 레이블 그래서 당신은 어떤 시스템을 싶어 수집하는 비용이있다 + +575 +00:41:51,400 --> 00:41:54,960 + 즉 자율 많은 데이터 엄청난 양 모두를 이용할 수도 + +576 +00:41:54,960 --> 00:41:59,570 + 또한 자동차 제조에서 감독 데이터의 단지 작은 양 그래서 + +577 +00:41:59,570 --> 00:42:03,940 + 이 밤 속성 만에이 제안되고있다 적어도 한 가지 + +578 +00:42:03,940 --> 00:42:07,670 + 내가 조금 인 너무 잘 작동하지 않는 경향이 생각하는 연습 + +579 +00:42:07,670 --> 00:42:12,010 + 이 아름다운 그런 생각이 다른 것 
때문에 불행한 그 I + +580 +00:42:12,010 --> 00:42:15,890 + 거의 다시 가서 읽으면하는 보조 노트로 지적해야 + +581 +00:42:15,889 --> 00:42:21,179 + 지난 10 년 이상에서 수천 중반부터이 일에 문학 + +582 +00:42:21,179 --> 00:42:25,129 + 사람들은 자신의 아내가 미리 훈련 증가이라는 재미있는 것은이 그 + +583 +00:42:25,130 --> 00:42:30,010 + 그들은 자동 인코더 훈련에 사용하고 생각했다 공유 그 시간에 + +584 +00:42:30,010 --> 00:42:35,410 + 매우 깊은 네트워크가 있었다 2,006 훈련은 도전이고 당신이 경우 당신은 찾을 수 있습니다 + +585 +00:42:35,409 --> 00:42:39,429 + 당신이 가지고있는 경우에도 아마 45 숨겨진 말과 같이 인용 및 논문 + +586 +00:42:39,429 --> 00:42:44,359 + 층이 있도록 네트워크를 훈련 당시 학생 당 극단적으로 도전했다 + +587 +00:42:44,360 --> 00:42:48,760 + 대신 곳 패러다임을 가진과 그 문제를 해결 얻을들이 + +588 +00:42:48,760 --> 00:42:53,560 + 한 번에 하나의 편지를 양성하려고 그들은이이 일을 사용하지만 난 것 + +589 +00:42:53,559 --> 00:42:57,139 + 싶어이있는 제한된 볼츠만 기계 호출에 너무 많이 얻을 해달라고 + +590 +00:42:57,139 --> 00:43:01,279 + 인쇄상의 모델 그리고 그들은 이러한 제한 볼츠만 기계를 사용하는 것 + +591 +00:43:01,280 --> 00:43:05,880 + 한 번에 하나씩 거기에이 작은에 연수생의 종류 그래서 우리는 먼저해야합니다 우리의 + +592 +00:43:05,880 --> 00:43:12,070 + 입력 이미지 크기 W 하나의 최대 크기가 될 수 있으며, 이것은 아마 뭔가 될 것 + +593 +00:43:12,070 --> 00:43:16,630 + PCA 또는 사진의 다른 종류의 같은 변환 한 후 우리는 희망 것 + +594 +00:43:16,630 --> 00:43:19,990 + 제한된 볼츠만 기계에게 관계의 어떤 종류를 사용하여 배울 수 + +595 +00:43:19,989 --> 00:43:25,359 + 그 첫 번째 자신의 기능과 몇 가지 높은 수준의 기능 사이에 때 한 번 + +596 +00:43:25,360 --> 00:43:27,940 + 우리는 이유에서이 층을 알게되면 + +597 +00:43:27,940 --> 00:43:30,840 + 그 기능의 상단에 다른 제한 볼츠만 기계 학습 + +598 +00:43:30,840 --> 00:43:36,000 + 이러한 유형의 접근법을 사용하여 다음 레벨의 기능으로되도록 접속하면하자 + +599 +00:43:36,000 --> 00:43:40,050 + 그 욕심 방법 및 그하자 이런 종류의에서 한 번에 하나의 층을 훈련 + +600 +00:43:40,050 --> 00:43:43,980 + 그들에게 희망이 더 큰 네트워크에 대한 정말 좋은 초기화를 찾을 수 + +601 +00:43:43,980 --> 00:43:48,369 + 그래서이 욕심 사전 교육 단계 이후 그들은 전체를 스틱 것 + +602 +00:43:48,369 --> 00:43:52,099 + 함께이 거대한 오디오 인코더 다음 미세 조정 오디오 인코더로 + +603 +00:43:52,099 --> 00:44:00,469 + 공동 요즘, 그래서 우리가 정말 선 리우와 같은 것들로이 작업을 수행 할 필요가 없습니다 + +604 +00:44:00,469 --> 00:44:04,139 + 적절한 초기화 및 bash는 정상화 약간 애호가 + +605 +00:44:04,139 --> 00:44:08,730 + 일의이 유형은 그래서으로 더 이상 정말 필요하지 않습니다 애호가 최적화 + +606 +00:44:08,730 --> 00:44:12,659 + 이전 슬라이드의 예를 우리는 래리 길쌈이를 보았다 + +607 +00:44:12,659 --> 00:44:16,409 + 내가 휴전에 훈련이 그냥 디컨 볼 루션 오디오 인코더 + +608 +00:44:16,409 --> 00:44:17,429 + 일을하려고 + +609 +00:44:17,429 --> 00:44:20,149 + 모든 현대적인 신경망 기술을 사용하면 주위에 엉망이 없습니다 + +610 +00:44:20,150 --> 00:44:25,039 + 미국 항공 훈련 그래서 이것은 정말 더 이상 수행되는 것이 아닙니다 + +611 +00:44:25,039 --> 00:44:27,800 + 하지만 난 당신이 아마에 있기 때문에 우리는 적어도 언급해야한다고 생각 + +612 +00:44:27,800 --> 00:44:35,990 + 당신이 그래서 이런 것들에 대한 문헌에서 다시 읽으면이 아이디어를 발생 + +613 +00:44:35,989 --> 00:44:39,949 + 기본적인 아이디어 또는 분기 자동차는 나는이 아름답다 아주 간단 생각한다 + +614 +00:44:39,949 --> 00:44:44,009 + 우리가 희망을 배울 자율 많은 양의 데이터를 사용할 수있는 아이디어 + +615 +00:44:44,010 --> 00:44:49,710 + 몇 가지 좋은 기능은 불행하게도 그 작동하지 않습니다하지만 괜찮아요하지만 거기에 + +616 +00:44:49,710 --> 00:44:53,639 + 아마 작업의 다른 좋은 유형 우리는 자율 데이터로 할 것 + +617 +00:44:53,639 --> 00:44:56,639 + 질문 첫 번째 + +618 +00:44:59,068 --> 00:45:10,308 + 어제 질문은 여기에서 일어나고 것은 바로 그래서 이것은 이것이 무엇이다 + +619 +00:45:10,309 --> 00:45:14,880 + 이것은 어쩌면 당신이 우리의 입력 때문에 세 계층 신경 네트워크에 대해 생각할 수있다 + +620 +00:45:14,880 --> 00:45:18,410 + 거 것은 그래서 우리는 단지이 신경 것을 바라고 출력과 동일 + +621 +00:45:18,409 --> 00:45:22,788 + 네트워크 식별 기능을 배울하지만 정말하고에있어 것 + +622 +00:45:22,789 --> 00:45:26,099 + 우리 끝에 일부 손실 함수를 갖는 항등 함수 학습하기 위해서 + +623 +00:45:26,099 --> 00:45:29,989 + 그 손실 성인과 같은 우리의 입력과 출력에 우리를 격려한다 + +624 +00:45:29,989 --> 00:45:35,429 + 같은 학습 식별 기능으로 아마 정말 쉬운 일입니다 + +625 +00:45:35,429 --> 00:45:39,379 + 수행하는 대신 우리는 쉽게 경로를하지하기 위해 네트워크를 강제하는거야 + +626 +00:45:39,380 --> 00:45:43,410 + 대신 희망이 아니라 단지 데이터를 토하는 및 학습보다 + +627 +00:45:43,409 --> 
00:45:46,909 + 쉬운 방법으로 식별 기능을 대신 우린 병목 현상이야 + +628 +00:45:46,909 --> 00:45:51,268 + 중간에이 숨겨진 레이어를 통해 표현은 그래서 다음거야 배울 수 있어요 + +629 +00:45:51,268 --> 00:45:54,798 + 신원 기능하지만, 네트워크의 중간에 거이다가 집어 넣은해야 + +630 +00:45:54,798 --> 00:45:59,829 + 아래 데이터를 요약하고 압축하고 잘하면 그 그 압축 것 + +631 +00:45:59,829 --> 00:46:04,339 + 그 조금이 될 수 있으므로 다른 작업에 유용한 기능을 야기 할 + +632 +00:46:04,338 --> 00:46:14,719 + 좀 더 배려 확인은 주장 PCA이 단지 해답이었다 의문을 제기 + +633 +00:46:14,719 --> 00:46:19,259 + 문제는 그래서 만 허용하는 경우 PCA 특정 감각에 최적 인 것은 사실이다 + +634 +00:46:19,259 --> 00:46:25,278 + 경이는 소득 및 디코더가 단지 하나의 경우 어디 하나를 수행합니다 + +635 +00:46:25,278 --> 00:46:30,259 + 당신이 있다면 어떤 의미에서 최적의 참 다음 PCA 변환하지만, 선형 + +636 +00:46:30,259 --> 00:46:34,170 + 분기 및 디코더는 잠재적으로 더 큰 더 복잡한 함수이다 그 + +637 +00:46:34,170 --> 00:46:39,059 + 더 어쩌면 다층 신경망은 어쩌면 PCA가 더있다 없다 + +638 +00:46:39,059 --> 00:46:43,209 + 더 이상 다른 점은 수있는 권리 솔루션은 PCA는 단지 최적이다 + +639 +00:46:43,208 --> 00:46:44,308 + 특정 감각 + +640 +00:46:44,309 --> 00:46:48,670 + 특히 LG의 재건에 대해 이야기하지만 실제로 우리는하지 않습니다 + +641 +00:46:48,670 --> 00:46:51,798 + 실제로 우리가이 일을 배울 것으로 기대하고 재건에 관심 + +642 +00:46:51,798 --> 00:46:56,538 + 다른 작업에 유용한 기능 연습 때문에이 조금 이상을 볼 것이다 + +643 +00:46:56,539 --> 00:47:00,259 + 나는 아마되는 것이기 때문에 사람들은 항상 더 이상에게 사용하지 않는 것이 + +644 +00:47:00,259 --> 00:47:04,719 + 사실에 매우 적합한 손실 그래 특징 + +645 +00:47:04,719 --> 00:47:14,348 + 이것은이다 래리의 군대의 데이터의 생성 적 모델의이 종류이다 + +646 +00:47:14,349 --> 00:47:18,250 + 당신이 내기 당신의 종류의 두 시퀀스가​​ 상상 데이터 + +647 +00:47:18,250 --> 00:47:19,108 + 이 작업을 수행 할 수 + +648 +00:47:19,108 --> 00:47:23,579 + 두 가지의 생식 모델링 그래서 당신은 들어갈 필요 + +649 +00:47:23,579 --> 00:47:26,440 + 이 텍스트는 정확히 손실 기능을 파악하는 것이 이유 중 꽤 많은 + +650 +00:47:26,440 --> 00:47:31,260 + 하지만, 이들로 데이터를 어​​떤 우도 추천되는 것을 끝낸다 + +651 +00:47:31,260 --> 00:47:35,470 + 당신이 관찰되지 않고 그것이 우리가 의지하는 것이 실제로 멋진 아이디어 잠복 상태 + +652 +00:47:35,469 --> 00:47:40,868 + 일종의의 하나 하나 있도록 변분 오디오 인코더에 다시 방문 + +653 +00:47:40,869 --> 00:47:45,280 + 전통적인 오디오 인코더 문제는 배우를 바라고 있다는 것입니다 + +654 +00:47:45,280 --> 00:47:49,590 + 즉 그 멋진 일이의 기능을하지만 다른 일이 우리가 것입니다 + +655 +00:47:49,590 --> 00:47:54,670 + 에 같은 단지 기능을 습득뿐만 아니라 멋진 새로운 데이터를 생성 할 수 없습니다 + +656 +00:47:54,670 --> 00:47:59,320 + 우리는 잠재적으로 자율 데이터에서 배운 수있는 작업은 희망입니다 + +657 +00:47:59,320 --> 00:48:03,030 + 우리 후루룩 소리 내며 먹기 수있는 모델과 이미지의 무리 그것은 일종의 그 것을 수행 한 후 + +658 +00:48:03,030 --> 00:48:06,990 + 자연 이미지의 모습과이 메일 내용은 다음 후 무엇을 배운다 + +659 +00:48:06,989 --> 00:48:11,449 + 그것은 희망 원래의 모습 가짜 이미지의 종류를 뱉어 수 + +660 +00:48:11,449 --> 00:48:17,949 + 이미지하지만 가짜 이것은 어쩌면 바로 작업을 처리하지 않습니다 + +661 +00:48:17,949 --> 00:48:22,319 + 분류 같은 것들에 적용 할 수 있지만, 중요한 일처럼 보인다 + +662 +00:48:22,320 --> 00:48:26,588 + 인간이 데이터를 찾고 그것을 요​​약에서 꽤 좋은 사람과 + +663 +00:48:26,588 --> 00:48:31,199 + 그렇게 희망을 갖고 우리의 모델도 할 수 있다면 어떻게 생겼는지의 아이디어를 얻기 + +664 +00:48:31,199 --> 00:48:34,969 + 작업의 이런 종류는 잘하면 그들은 몇 가지 유용한 배운 것 + +665 +00:48:34,969 --> 00:48:41,299 + 요약이나 데이터의 일부 유용한 통계 변동 오디오 그래서 + +666 +00:48:41,300 --> 00:48:45,539 + 인코더는 우리가 할 수있는 원래 순서에 깔끔한 트위스트의이 종류이다 + +667 +00:48:45,539 --> 00:48:50,690 + 희망 실제로 그래서 여기에 우리 배운다 데이터에서 새로운 이미지를 생성 우리는 필요 + +668 +00:48:50,690 --> 00:48:54,849 + 인내의 약간에 뛰어 것을이 세금은 그래서 이것은 뭔가가 우리 + +669 +00:48:54,849 --> 00:48:58,320 + 정말이 시점에 더 이상하지만까지이 클래스에 대해 전혀 이야기하지 않은 + +670 +00:48:58,320 --> 00:49:02,420 + 하지만 근처하지 않는 기계 학습의이 모든 다른 측면이있다 + +671 +00:49:02,420 --> 00:49:05,250 + 확률에 대한 정말 열심히 네트워크와 깊은 학습하지만 일 + +672 +00:49:05,250 --> 00:49:09,260 + 분포 및 부 합성 분포는 서로에 들어갈 수있는 방법 + +673 +00:49:09,260 --> 00:49:13,190 + 데이터 세트 다음 이유를 확률 적 데이터와에 대해 생성 + +674 +00:49:13,190 --> 00:49:16,670 + 그것은 당신에게 국가의 종류 수 있기 때문에 패러다임의이 유형은 정말 좋은 + +675 
+00:49:16,670 --> 00:49:17,970 + 명시 적 확률 + +676 +00:49:17,969 --> 00:49:22,000 + 당신이 당신의 데이터를 어​​떻게 생각하는지에 대한 가정이 생성 한 후 그 주어졌다 + +677 +00:49:22,000 --> 00:49:25,858 + 확률 적 가정은 다음과 데이터에 모델을 파악하려고 + +678 +00:49:25,858 --> 00:49:30,199 + 당신의 가정은 그래서 변화 놀라운 분기 우리는이 가정을하는지 + +679 +00:49:30,199 --> 00:49:35,589 + 우리가 가정 있도록 방법이 특정 유형은하는 우리의 데이터가 생성 된 + +680 +00:49:35,590 --> 00:49:39,800 + 우리는 거의 세계가 몇 가지 사전 분포를 존재했습니다 + +681 +00:49:39,800 --> 00:49:44,440 + 이러한 잠재 미국 Z를 생성하고 우리는 우리는 몇 가지 조건 가정했습니다 + +682 +00:49:44,440 --> 00:49:49,789 + 일단 우리는 우리가 샘플을 최고의 상태를 생성 할 수있는 유통 + +683 +00:49:49,789 --> 00:49:54,389 + 다른 분포 변동 오디오 인코더 따라서 데이터를 생성하도록 + +684 +00:49:54,389 --> 00:49:58,170 + 정말 우리의 데이터가이 꽤 간단한 과정에 의해 생성 된 상상 + +685 +00:49:58,170 --> 00:50:03,639 + 먼저 우리는 몇 가지가 RAZ의 B를 얻기 위해 얻기 위해 몇 가지 사전 분포에서 샘플링 있음 + +686 +00:50:03,639 --> 00:50:10,940 + 직관이 그 역할을하므로이 조건에서 샘플은 우리의 행위를 얻을 수 + +687 +00:50:10,940 --> 00:50:15,240 + 이미지와 Z 같은 아마 그것에 대해 몇 가지 유용한 물건을 요약 + +688 +00:50:15,239 --> 00:50:19,649 + 이 훨씬 이미지를 볼 수 있었다, 그래서 만약 이미지 어쩌면 상태에 누워 그 그녀는 수 + +689 +00:50:19,650 --> 00:50:23,800 + 이 개구리 나 사슴 또는 고양이 여부 이미지의 클래스 같은 것을하고 + +690 +00:50:23,800 --> 00:50:27,690 + 또한 고양이가 지향하거나 어떤 색 방법에 대한 변수를 포함 할 수 있습니다 + +691 +00:50:27,690 --> 00:50:29,269 + 또는 그런 일 + +692 +00:50:29,269 --> 00:50:33,719 + 그래서 이것은 매우 간단 꽤 간단한 아이디어를 가진의 좋은 종류의 종류 + +693 +00:50:33,719 --> 00:50:37,279 + 하지만 당신이되고 이미지의 이미지를 상상하는 방법에 대한 많은 이해 + +694 +00:50:37,280 --> 00:50:43,670 + 문제 때문에 발생 지금 우리는 이러한 매개 변수를 충족 물어보고 싶은 것입니다 + +695 +00:50:43,670 --> 00:50:48,470 + 종래 실제로 않고 조건부 모두 데이터 + +696 +00:50:48,469 --> 00:50:52,598 + 그 도전의 이러한 최신 날짜에 대한 액세스를 참조하고는이의의 + +697 +00:50:52,599 --> 00:50:57,588 + 문제는 그래서 우리는거야 간단한 당신이에서 많이 볼 일을 만들려면 + +698 +00:50:57,588 --> 00:51:00,769 + 베이지안 통계 및 난 그냥 전과가있어 샴푸를 가지고 있다고 가정합니다 + +699 +00:51:00,769 --> 00:51:07,088 + 취급이 용이하고, 조건은 또한 표시됩니다 수 있지만 될거야 될 것 + +700 +00:51:07,088 --> 00:51:11,489 + 조금 애호가 그래서 우리는 대각선 평균과 가진 가우스 있다고 가정합니다 + +701 +00:51:11,489 --> 00:51:16,729 + 대신 죄송 대각 공분산 어떤 의미하지만,과 단위 우리는 단지거야 + +702 +00:51:16,730 --> 00:51:19,650 + 넣어하지만 우리는 사람들을 얻기 위하여려고하고있는 방법은 우리가 그들을 계산하는거야입니다 + +703 +00:51:19,650 --> 00:51:24,800 + 신경 네트워크 그래서 우리가 어떤 부분에 대한 최신 의지를 가지고 있다고 가정 + +704 +00:51:24,800 --> 00:51:27,579 + 데이터 우리는 그 말 대신한다고 가정 + +705 +00:51:27,579 --> 00:51:32,160 + 몇 가지 큰 복잡한 신경이 될 수있는 몇 가지 디코더 네트워크로 이동합니다 + +706 +00:51:32,159 --> 00:51:36,078 + 네트워크와 지금 신경 네트워크는 거이 거의 두 가지를 뱉어입니다 + +707 +00:51:36,079 --> 00:51:40,079 + 데이터의 의미를 뱉어는거야 데이터의 의미를 뱉어 것 + +708 +00:51:40,079 --> 00:51:45,068 + 행위 또한 데이터의 상기 분산은 그래서 당신이 생각해야 작용 + +709 +00:51:45,068 --> 00:51:48,958 + 이것은 우리가 보통 오디오 인코더의 위쪽 절반 같은​​ 아주 많이 보인다 + +710 +00:51:48,958 --> 00:51:52,699 + 우리가 어떤 그건 것으로 알려져있는이 링크 상태 최신 팔에서 작동하지만, + +711 +00:51:52,699 --> 00:51:57,588 + 지금 대신 직접 데이터를 침 대신에 그것을 밖으로 뱉어 것 + +712 +00:51:57,588 --> 00:52:01,690 + 데이터의 평균이 보이는 것보다 데이터의 분산되지만 다른 + +713 +00:52:01,690 --> 00:52:07,528 + 매우 일반적인 오디오 인코더의 디코더 등이이 디코더 그래서 + +714 +00:52:07,528 --> 00:52:11,518 + 일반 오디오 인코더 다시 생각의 네트워크 종류는 간단한 수 있습니다 + +715 +00:52:11,518 --> 00:52:14,578 + 완전히 연결된 것은 아니면이 매우 큰 강력한 디컨 볼 루션 수 있습니다 + +716 +00:52:14,579 --> 00:52:22,269 + 네트워크 이들 모두 문제가있다 의한 지금 매우 일반적인 + +717 +00:52:22,268 --> 00:52:26,679 + 사전을 부여하고 조건 바질가 주어진다면 야구는 우리가 알 + +718 +00:52:26,679 --> 00:52:31,578 + 우리가 실제로이 모델을 사용하려면 우리가 할 필요가 있도록 주어진 것을 후부 + +719 +00:52:31,579 --> 00:52:35,209 + 상기 입력 데이터와 상기 방식에서 잠복 상태를 추정 할 수있는 것을 우리 + +720 +00:52:35,208 --> 00:52:38,659 + 입력 데이터에서 최고의 상태를 추정하면이를 작성하는 것입니다 + +721 +00:52:38,659 --> 00:52:42,899 + 쉽게 주어진 최신의 확률 사후 분포 + +722 
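The generative story assumed here, in code: draw the latent state from a unit-Gaussian prior, run it through the decoder network, which outputs a mean and a diagonal variance for the data, then sample the data point from that Gaussian. The decoder below is a toy, randomly initialized net (a trained VAE would have learned these weights), and the log-variance parameterization is a common convention, not something stated in the lecture:

~~~python
import numpy as np

Dz, Dh, Dx = 20, 100, 784                  # latent, hidden, data dims (made up)
W1 = np.random.randn(Dz, Dh) * 0.01        # toy decoder weights
W_mu = np.random.randn(Dh, Dx) * 0.01
W_logvar = np.random.randn(Dh, Dx) * 0.01

z = np.random.randn(Dz)                    # z ~ N(0, I): sample from the prior
h = np.tanh(z @ W1)                        # decoder network
mu, logvar = h @ W_mu, h @ W_logvar        # mean and diagonal (log-)variance
x = mu + np.exp(0.5 * logvar) * np.random.randn(Dx)  # x ~ N(mu, diag(var))
~~~

This same decoder is what later turns a dense scan over a 2-d latent space into the digit and face manifolds shown near the end of the lecture.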
+00:52:42,900 --> 00:52:47,519 + 관측 데이터 및 급여를 사용하여 우리는 쉽게 주위에이 플립과 그것을 쓸 수 있습니다 + +723 +00:52:47,518 --> 00:52:54,189 + 우리의 전 감독과 우리의 조건 지방의 관점에서 등 조건 우리 + +724 +00:52:54,190 --> 00:52:57,249 + 이스라엘 사용할 수 있습니다 실제로 주위에이 일을두고 측면에서 그것을 쓰기 + +725 +00:52:57,248 --> 00:53:02,409 + 우리가 이러한 역할을보고 난 후에 우리는 이것들을 분해 할 수 있도록 이러한 세 가지 + +726 +00:53:02,409 --> 00:53:06,818 + 세 가지 용어와 조건은 우리가 우리의 디코더를 사용하는 것이 우리는 볼 수 있습니다 + +727 +00:53:06,818 --> 00:53:11,558 + 네트워크와 우리는 쉽게 그에 대한 액세스 권한이 이전에 다시 우리가 접근 + +728 +00:53:11,559 --> 00:53:15,569 + 사전은 당신이 협상 것으로 가정 할에 그래서 다루기 쉽게하지만, + +729 +00:53:15,568 --> 00:53:19,458 + 당신은 당신이 운동하는 경우 경우이 분모의 역할이 확률​​은 밝혀 + +730 +00:53:19,458 --> 00:53:22,828 + 수학이 행에이 거대한 난치성 인 끝을 쓰는 + +731 +00:53:22,829 --> 00:53:26,579 + 그 완전히 다루기 힘든, 그래서 전체 선도적 인 상태 공간을 통해 더 없다 + +732 +00:53:26,579 --> 00:53:29,479 + 방법은 당신이 할 수 이제까지 포르노도 근사 여자와 그것이 될 것이라고 + +733 +00:53:29,478 --> 00:53:33,399 + 거대한 재난 그래서 대신에 우리는 심지어 여자에 그 평가하려고하지 않습니다 + +734 +00:53:33,400 --> 00:53:38,759 + 대신 우리는하려고 몇 가지 인코더 네트워크를 소개하는거야 + +735 +00:53:38,759 --> 00:53:40,179 + 직접 전 + +736 +00:53:40,179 --> 00:53:45,210 + 우리의 인쇄 재료에 때문에이 엔코더 네트워크는 데이터 포인트에 걸릴 것입니다 + +737 +00:53:45,210 --> 00:53:48,599 + 그리고 회의의 상태에 걸쳐 분포를 뱉어 것 + +738 +00:53:48,599 --> 00:53:53,210 + 공간은 그래서 다시는 매우 원래 오디오를 다시 찾고 보인다 + +739 +00:53:53,210 --> 00:53:57,449 + 몇 슬라이드에서 인코더 전이 매우 하단의 종류와 같은 모양 + +740 +00:53:57,449 --> 00:54:01,449 + 우리가 지금 데이터에 복용하는 기존의 오디오 인코더의 절반 + +741 +00:54:01,449 --> 00:54:04,789 + 대신 직접 최신 팔을 침으로 우리는거야 평균을 뱉어하고 + +742 +00:54:04,789 --> 00:54:09,519 + 및 주요 국가의 분산 다시 이번 분기 네트워크가 될 수 있습니다 + +743 +00:54:09,519 --> 00:54:13,639 + 뭔가 다소 논란의 네트워크 또는 어쩌면 약간의 깊은 수 있습니다 + +744 +00:54:13,639 --> 00:54:21,159 + 컨볼 루션 네트워크는 그래​​서 직관의 종류입니다이 만남 네트워크 + +745 +00:54:21,159 --> 00:54:25,259 + 별도의 완전히 다른 파괴하는 기능이있을 것입니다하지만 우리는 거 야 + +746 +00:54:25,260 --> 00:54:29,180 + 그것은 이러한 사후 분포에 근사하는 방식으로 훈련 시도 + +747 +00:54:29,179 --> 00:54:35,799 + 우리는 실제로 그렇게 할 때 우리는 아마 조각을 함께에 액세스하지 않는 것이 + +748 +00:54:35,800 --> 00:54:40,700 + 그 다음 우리는 이것에 상승을 줄이 모두 함께 스티치를 설정하고 얻을 수 있습니다 + +749 +00:54:40,699 --> 00:54:44,808 + 변화는 오디오 인코더 번, 그래서 우리는 우리가이 다음 함께 이러한 것들을 넣어 + +750 +00:54:44,809 --> 00:54:49,559 + 입력 데이터 포인트의 X 우리 것 우리의 인코더 네트워크와 통해거야 패스를 + +751 +00:54:49,559 --> 00:54:52,819 + 인코더 네트워크는 최고의 상태에 대한 분포를 뱉어 + +752 +00:54:52,818 --> 00:54:57,789 + 우리는 최신 날짜 이상이이 메일을 일단 당신이 상상할 수 + +753 +00:54:57,789 --> 00:55:01,650 + 그 분포에서 샘플링을 상상할 수있는 것은 얻기 위해 일부 일부 최고 + +754 +00:55:01,650 --> 00:55:07,700 + 우리가 한 번왔다하면보다 그 입력을 나에게 높은 확률의 상태를 보자 + +755 +00:55:07,699 --> 00:55:11,889 + 우리는 잠재 상태의 몇 가지 구체적인 예를 우리는 그것을 통과 할 수 + +756 +00:55:11,889 --> 00:55:16,409 + 이 디코더 네트워크 확률을 밖으로 확산되는있는 다음해야 + +757 +00:55:16,409 --> 00:55:20,469 + 우리가이 있으면 다시 다음 데이터의 확률을 가속화 + +758 +00:55:20,469 --> 00:55:24,439 + 우리는 그것에서 맛볼 수있는 데이터에 대한 분포는 실제로 뭔가를 얻을 수 + +759 +00:55:24,440 --> 00:55:29,950 + 그 희망이보고 끝이 때문에 원본 데이터 점처럼 보이는 + +760 +00:55:29,949 --> 00:55:34,269 + 우리는 우리가있어 우리의 입력 데이터를 복용하고 일반 오디오 인코더와 같은 매우 + +761 +00:55:34,269 --> 00:55:37,829 + 일부 잠재 상태를 얻기 위해이 엔코더를 통해 실행하거나에 전달 + +762 +00:55:37,829 --> 00:55:42,200 + 디코더는 완전히 원래의 데이터를 재구성하고이 훈련에 대해 갈 때 + +763 +00:55:42,199 --> 00:55:46,149 + 물건은 실제로 일반 오디오 인코더와 같은 매우 유사한 방법에서 훈련이야 + +764 +00:55:46,150 --> 00:55:50,230 + 우리는 과거이 있고이 이전 버전과의 유일한 차이점은 손실에 전달 + +765 +00:55:50,230 --> 00:55:55,490 + 기능 상단에 우리는이 재건 손실이 아닌 것을 있도록 + +766 +00:55:55,489 --> 00:56:01,078 + (SL2)에 의해 표시 대신에 우리는이 분포가 실제에 근접 할 + +767 +00:56:01,079 --> 00:56:07,349 + 입력 데이터와 우리는이를 우리가 원하는 중간에 나오는 용어를 잃었다 + +768 +00:56:07,349 --> 00:56:11,230 + 레이튼 미국을 통해이 발생 
+
+769
+00:56:11,230 --> 00:56:29,058
+ to the stated prior distribution that we wrote down at the very
+ beginning; once you put these pieces together you can try to train the
+ whole thing just like a normal autoencoder, with a normal forward pass,
+ a backward pass, and the loss on top; the only difference here is how
+ the loss is interpreted
+
+773
+00:56:29,059 --> 00:56:50,210
+ there was a question as we went through the setup: why choose a
+ diagonal covariance? the answer is really that it makes the math easy
+ to work with, though people have actually tried fancier things, and it
+ is something you can play around with
+
+777
+00:56:50,210 --> 00:57:12,039
+ once we have actually trained this variational autoencoder we can use
+ it to generate new data that looks kind of like the original data set;
+ the idea is that the prior, which might be a unit Gaussian or something
+ a little fancier, is at any rate a distribution we can easily draw
+ random samples from
+
+784
+00:57:15,989 --> 00:57:50,369
+ so to generate new data we just follow the data generation process:
+ first we sample a latent state from our prior, then we pass it through
+ the decoder network that we learned during training; the decoder spits
+ out a distribution over data in terms of a mean and a covariance, and
+ since the covariance is just diagonal we can easily sample from it to
+ generate some data
+
+791
+00:57:50,369 --> 00:58:09,280
+ another thing you can do once you have trained one of these is to work
+ with the latent space itself: rather than sampling latents from the
+ distribution, densely sample the latent space, to get an idea of the
+ type of structure the network has learned
+
+796
+00:58:09,280 --> 00:58:30,599
+ that is exactly what we did here: this variational autoencoder was
+ trained with a latent space that is just two-dimensional, so we can
+ scan over the latent space, and for each point in this two-dimensional
+ space pass it through the decoder to generate an image; you can see it
+ actually discovers this kind of beautiful structure
+
+801
+00:58:30,599 --> 00:58:50,049
+ it smoothly interpolates between the different digit classes: going
+ down the left side you see sixes kind of morph into zeros, and
+ elsewhere the sixes turn into fives and the fives into eights, with
+ some digits hanging somewhere in the middle; so this latent space
+ learned a beautiful disentangling of the data in a very nice
+ unsupervised way
+
+806
+00:58:50,050 --> 00:59:05,679
+ we can also set this up on our faces data set, and it is the same kind
+ of story: we train this two-dimensional variational autoencoder, and
+ once it is trained we sample densely from the latent space and try to
+ see what it has learned
+
+810
+00:59:13,018 --> 00:59:31,890
+ yeah, so the question is whether people try to force specific latent
+ variables to have some exact meaning, and yes, there is follow-up work
+ that does exactly that: a paper called Deep Convolutional Inverse
+ Graphics Network, from MIT, tries to do exactly this setup
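The dense scan of a 2-D latent space described in the cues above is straightforward to reproduce. A minimal numpy sketch, where `decode` is only a stand-in for a trained VAE decoder (a fixed random linear map here, not the lecture's trained network):

~~~python
# Scan a 2-D latent space on a regular grid and tile the decoded
# outputs, as in the MNIST visualization described above.
import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(28 * 28, 2) * 0.1          # placeholder "decoder" weights

def decode(z):
    """Map a 2-D latent vector to a 28x28 'image' (placeholder)."""
    return W.dot(z).reshape(28, 28)

grid = np.linspace(-3, 3, 15)            # ~3 std devs of a unit Gaussian prior
canvas = np.zeros((15 * 28, 15 * 28))
for i, zy in enumerate(grid):
    for j, zx in enumerate(grid):
        canvas[i*28:(i+1)*28, j*28:(j+1)*28] = decode(np.array([zx, zy]))
# `canvas` can now be displayed with e.g. matplotlib's imshow.
~~~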
+
+814
+00:59:31,889 --> 00:59:53,009
+ they force the network to learn a kind of renderer: they want it to
+ learn to render 3D images of objects, so they force some of the
+ variables in the latent space to correspond to the 3D angle of the
+ object, and maybe the class of the object, and the rest of the latent
+ space can learn whatever else it wants
+
+819
+00:59:53,009 --> 01:00:09,390
+ the neat experiment they have is that now, exactly as you said, you can
+ set those latent variables to particular values, and the rendered
+ objects actually rotate, which is pretty cool
+
+822
+01:00:09,389 --> 01:00:16,689
+ that setup is a lot fancier, but even with these faces you can still
+ see it interpolating between the different poses in a very nice way
+
+824
+01:00:16,688 --> 01:00:45,630
+ and I think there is actually a very nice motivation here for why we
+ chose the diagonal covariance: it has the probabilistic interpretation
+ of having independent variables in our latent space, and I think that
+ helps explain why you end up with such a nice separation between the
+ axes when you sample points in the space: it comes from the
+ probabilistic independence assumption contained in the prior; this idea
+ of a prior is very powerful, and you can do big things with these types
+ of directed models
+
+832
+01:00:45,630 --> 01:01:09,018
+ so I wrote down a bunch of math here that I do not think we really have
+ time to go through, but the idea is that classically, when training
+ generative models, there is this thing called maximum likelihood, where
+ you want to pick the model under which your data is most likely
+
+837
+01:01:09,018 --> 01:01:25,890
+ but it turns out that if you just try to run normal maximum likelihood
+ using this generative process, you run into exactly the problem you
+ might have imagined: you end up needing to marginalize this giant joint
+ distribution over the full latent state space, which is intractable and
+ not something we can do
+
+843
+01:01:25,889 --> 01:01:47,429
+ so instead the variational autoencoder does this thing called
+ variational inference, which is a really cool idea; the math is here in
+ case you want to go through it, but the idea is that instead of
+ maximizing the likelihood of the data directly, we cleverly insert an
+ extra term and break the objective into two different terms
+
+847
+01:01:47,429 --> 01:02:12,420
+ this is an exact equivalence that you can work out on your own: the log
+ likelihood can be written in terms of a term we call the ELBO plus a KL
+ divergence between two distributions; and we know the KL divergence
+ between two distributions is always non-negative, which means the ELBO
+ is a lower bound on the log likelihood of our data
+
+854
+01:02:12,420 --> 01:02:43,769
+ notice that in the process of writing down the ELBO we introduced an
+ additional parameter phi, which we can interpret as the parameters of
+ the encoder network that approximates the hard posterior distribution;
+ so instead of directly maximizing the log likelihood of the data we
+ maximize this lower bound, because maximizing the ELBO also has the
+ effect of pushing up the log likelihood
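The "bunch of math" skipped in the cues above can be stated compactly. Assuming an approximate posterior \(q_\phi(z|x)\) produced by the encoder, a prior \(p(z)\), and a decoder \(p_\theta(x|z)\), the exact identity being referred to is:

~~~latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]
      - D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big)}_{\mathcal{L}(\theta,\phi;x)\ \text{(ELBO)}}
  \;+\; D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big)
~~~

Since the last KL term is always non-negative, \(\mathcal{L}(\theta,\phi;x) \le \log p_\theta(x)\), so pushing up the ELBO pushes up the log likelihood, exactly as the cues state.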
+
+863
+01:02:53,360 --> 01:03:08,789
+ and these two terms of the ELBO actually have a beautiful
+ interpretation: the one in front is an expectation over the latent
+ state space of the probability of x given the latent state; you can
+ think of it as a data reconstruction term, saying that if we average
+ over all possible latent states we should end up with something
+
+867
+01:03:08,789 --> 01:03:22,059
+ similar to our original data; the other term is actually a
+ regularization term: it is approximately the KL divergence between the
+ posterior and the prior, a regularizer that tries to force those two
+ distributions together
+
+870
+01:03:22,059 --> 01:03:38,489
+ the first term you approximate by sampling, using the trick from the
+ paper; I will not get back into the other term, because everything
+ there can be done explicitly, in closed form
+
+874
+01:03:38,489 --> 01:03:59,050
+ so I think this is maybe the scariest slide in the class, but it is
+ actually just exactly this idea: we have this reconstruction term, and
+ then this penalty that moves the posterior back toward the prior;
+ question?
+
+878
+01:03:59,050 --> 01:04:16,089
+ so the question is how this corresponds to the usual way of thinking
+ about autoencoders: with a normal autoencoder we force the network to
+ try to reconstruct our data, hoping it will learn a useful
+ representation of the data along the way
+
+881
+01:04:16,090 --> 01:04:29,440
+ with normal autoencoders that representation is mostly used for feature
+ learning; when we move to the variational autoencoder we additionally
+ make the whole thing a proper generative model, so that we can generate
+ samples similar to our data
+
+884
+01:04:29,440 --> 01:04:54,340
+ this idea of generating samples that look like my data is really cool,
+ and everyone loves looking at these kinds of pictures; but there is
+ another idea that lets us generate really cool samples without all this
+ scary Bayesian math: it turns out there is a different twist on the
+ idea, called generative adversarial networks, that still lets you
+ generate samples that look like your data without worrying so
+ explicitly about posteriors and marginals and that kind of stuff
+
+891
+01:04:54,340 --> 01:05:07,060
+ the idea is that we will have a generator: it starts with some random
+ noise, probably drawn from a Gaussian, and then we have a generator
+ network, and this
+
+894
+01:05:07,059 --> 01:05:26,379
+ generator network actually looks very much like the decoder of a
+ variational autoencoder, or like the second half of a regular
+ autoencoder, in that it takes this random noise and spits out an image:
+ a fake, not-real image that we just generated using this network
+
+898
+01:05:26,380 --> 01:05:34,769
+ then we also attach a discriminator network, which looks at the fake
+ image and tries to decide whether that generated image is real or fake
+
+901
+01:05:34,769 --> 01:05:49,739
+ this second network is just doing a binary classification task: it
+ receives an input, and it just needs to say whether or not it is a real
+ image; that is just a classification task of the sort you could hook up
+ like any other
+
+905
+01:05:50,730 --> 01:05:55,349
+ so we can train this whole thing jointly, end to end
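A minimal numpy sketch of the loss just described: the reconstruction term plus the KL regularizer, with the latent sample drawn via the reparameterization trick of Kingma and Welling. The encoder and decoder outputs below are random placeholders, not a trained model:

~~~python
# One evaluation of the VAE objective for a single example, assuming a
# Gaussian encoder output and a unit Gaussian prior N(0, I).
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(784)                                  # input data point
mu, logvar = rng.randn(20), rng.randn(20) * 0.1    # encoder output q(z|x)

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
# so gradients can flow back through mu and logvar.
eps = rng.randn(20)
z = mu + np.exp(0.5 * logvar) * eps

x_hat = 1 / (1 + np.exp(-rng.randn(784)))   # placeholder decoder output p(x|z)

# Reconstruction term: log-likelihood of x under the decoder
# distribution (Bernoulli here), not an L2 distance.
recon = np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

# Regularization term: KL(q(z|x) || N(0, I)), closed form for Gaussians.
kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar))

elbo = recon - kl    # training maximizes this lower bound
~~~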
+
+906
+01:05:55,960 --> 01:06:12,640
+ our generator network will receive batches of random noise and spit out
+ images, and our discriminator network will receive batches that are
+ partly these generated images and partly real images from the data set
+
+909
+01:06:12,639 --> 01:06:21,358
+ and it will have to make this classification decision, answering which
+ are real and which are fake
+
+911
+01:06:21,358 --> 01:06:34,730
+ so this is another way we can hook up a kind of supervised learning
+ problem without real labels, since we make up the labels ourselves; we
+ train the two networks jointly and hope to see what happens
+
+913
+01:06:34,730 --> 01:06:46,549
+ these are examples from the original generative adversarial networks
+ paper: fake images generated by the network; you can see it does a
+ pretty nice job of generating fake digits that actually look real
+
+916
+01:06:46,550 --> 01:06:57,389
+ the middle column here actually shows the nearest neighbor from the
+ training set for the generated digit next to it, which hopefully tells
+ you it is not just memorizing the training set
+
+919
+01:06:57,389 --> 01:07:05,849
+ it also does a pretty good job of generating faces; but as anyone who
+ has worked in machine learning knows, these digit and face data sets
+ tend to be quite easy to generate samples from
+
+922
+01:07:05,849 --> 01:07:21,840
+ when we apply this to CIFAR, the samples look much less clean: here it
+ has some ideas about blue stuff and green stuff, but they do not really
+ look like real objects; so that is a problem
+
+927
+01:07:32,429 --> 01:07:44,080
+ and there is follow-up work on generative adversarial networks that
+ tries to make these architectures bigger and more powerful, to
+ hopefully generate nicer samples on more complex data sets; so one idea
+
+930
+01:07:44,079 --> 01:07:58,170
+ is multi-scale processing: rather than generating the image all at
+ once, we generate our images at multiple scales; first a generator
+ receives noise and generates an image at low resolution
+
+933
+01:07:58,170 --> 01:08:12,070
+ then we upsample that low-resolution image and apply a second generator
+ that receives a new batch of random noise and computes some delta on
+ top of the upsampled low-resolution image; we upsample again and repeat
+ the process
+
+937
+01:08:12,070 --> 01:08:25,329
+ multiple times, until we actually generate our final result; so this is
+ again an idea very similar to the original generative adversarial
+ network, except it generates at multiple scales
+
+940
+01:08:25,329 --> 01:08:39,039
+ the training here is actually a bit more complex: there is a
+ discriminator at each scale; but with that, the samples we see from
+ this thing are actually much nicer
+
+943
+01:08:39,039 --> 01:08:52,210
+ here they actually trained a separate model per class on CIFAR-10, so
+ one of these adversarial networks was trained just on planes, and you
+ can see its samples are starting to look like real planes somewhere
+
+947
+01:08:52,210 --> 01:09:04,278
+ these almost look like real trucks, and these kind of look like real
+ birds; then the next year people actually threw away this multi-scale
+ idea
+
+949
+01:09:04,279 --> 01:09:22,759
+ and just used a simpler, better, more principled convnet; the idea here
+ is: forget about the multi-scale stuff, use batch normalization, and
+ drop the architectural constraints we had, like fully connected layers;
+ it turns out that with the practices of the last few years these
+ adversarial networks work really well
+
+953
+01:09:22,759 --> 01:09:38,539
+ so here the generator is a very, very simple small convolutional
+ network, and the discriminator, again with batch normalization and
+ everything, is also just a simple network
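The joint training of the two networks described above optimizes the minimax objective from Goodfellow et al.'s original GAN paper:

~~~latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]
~~~

Here \(D\) is the binary real/fake classifier and \(G\) maps noise \(z\) to images; in practice the two networks are updated in alternating steps on mixed batches of real and generated images, as the cues describe.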
+
+956
+01:09:38,539 --> 01:09:47,810
+ when you hook this thing up with those other bells and whistles, they
+ get some amazing samples: in this paper they generate bedrooms from
+ these networks
+
+958
+01:09:47,810 --> 01:10:00,920
+ these results are actually quite impressive: they look almost like real
+ data; you can see it ends up doing a really good job of capturing
+ really detailed structure in the bedrooms: there are windows, there are
+ nightstands, there are light switches
+
+961
+01:10:00,920 --> 01:10:16,260
+ so these are really amazing samples; but it turns out that rather than
+ just generating samples, we can also play tricks and try to probe and
+ exploit the latent space, since these adversarial networks receive a
+ noise input: we can try to cleverly move the noise around
+
+966
+01:10:16,670 --> 01:10:29,920
+ and try to change the shape of the things these networks generate; so
+ one example of things we can try is to interpolate between bedrooms
+
+969
+01:10:36,050 --> 01:10:47,690
+ the idea is that for the image at the left end we have drawn a random
+ point from our noise distribution and used it to generate an image, and
+ for the right end we have drawn another random point and used it to
+ generate an image
+
+973
+01:10:47,689 --> 01:11:08,210
+ those two latent points give us two points on a line in the latent
+ space, so we interpolate between the two latents along that line and
+ use the generator to generate an image for each interpolated latent;
+ hopefully the images interpolate between the two endpoints as well
+
+977
+01:11:08,210 --> 01:11:22,169
+ and you can see this is pretty crazy: these rooms kind of morph from
+ one bedroom into a different bedroom in a very nice, smooth, continuous
+ way
+
+979
+01:11:22,170 --> 01:11:39,100
+ one thing to point out, if you imagine what is actually going on: if
+ this interpolation happened in pixel space it would just be a kind of
+ fading effect and would not look very good at all, but here you can see
+
+983
+01:11:39,100 --> 01:11:50,119
+ that the shapes and colors of these things actually transform
+ continuously from one side to the other, which is very fun
+
+985
+01:11:50,119 --> 01:12:02,189
+ another experiment in this paper is that they actually played around
+ with vector math on the types of things these networks generate; the
+ idea is that they generate a whole bunch of random samples from the
+ noise distribution
+
+988
+01:12:02,189 --> 01:12:14,500
+ push them all through the generator to generate a whole bunch of
+ samples, and then, using their human intelligence, they make some
+ semantic judgments about what those random samples look like
+
+991
+01:12:14,500 --> 01:12:26,819
+ so these are three images generated from the network that a human
+ decided all look like a smiling woman, and gave that label
+
+995
+01:12:26,819 --> 01:12:40,289
+ here in the middle are three samples from the network of a neutral,
+ not-smiling woman, and these are three samples of a neutral,
+ not-smiling man
+
+998
+01:12:40,289 --> 01:12:55,220
+ each of these people was generated from some latent state vector, so we
+ just compute averages: the mean latent state of the smiling women, of
+ the neutral women, and of the neutral men; and now that we have these
+ latent state vectors we can do some vector arithmetic
+
+1001
+01:12:55,220 --> 01:13:12,649
+ we take smiling woman, subtract neutral woman, and add neutral man; so
+ what would you hope that gives you? a smiling man; and this is what it
+ actually generates: that kind of looks like a smiling man
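The bedroom interpolation described above amounts to decoding points along a straight line between two latent vectors. A minimal sketch, with `G` a random placeholder standing in for the trained DCGAN generator:

~~~python
# Linear interpolation between two latent points, as in the DCGAN
# bedroom figure described above.
import numpy as np

rng = np.random.RandomState(0)
W = rng.randn(64 * 64, 100) * 0.01

def G(z):
    """Placeholder generator: latent vector -> 64x64 'image'."""
    return np.tanh(W.dot(z)).reshape(64, 64)

z0 = rng.randn(100)       # random point for the left-most image
z1 = rng.randn(100)       # random point for the right-most image

# Decode points along the straight line between the two latents;
# with a trained generator the images morph smoothly.
frames = [G((1 - t) * z0 + t * z1) for t in np.linspace(0, 1, 9)]
~~~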
+
+1004
+01:13:12,649 --> 01:13:19,199
+ that is pretty amazing; and we can do another experiment
+
+1005
+01:13:19,199 --> 01:13:31,140
+ we can take man with glasses, subtract man without glasses, and add
+ woman without glasses, and what this confusing equation of glasses and
+ no glasses will give us
+
+1007
+01:13:31,140 --> 01:13:47,369
+ is a woman with glasses; so that is pretty crazy: look at that
+
+1009
+01:13:47,369 --> 01:13:59,960
+ even though we do not force an explicit prior on the latent space, the
+ adversarial network still somehow manages to learn a really nice,
+ useful representation
+
+1011
+01:13:59,960 --> 01:14:11,239
+ so, also very quickly: there is a really cool paper that came out just
+ two weeks ago that puts all of these ideas together, along with a lot
+ of the other ideas we covered in this lecture
+
+1014
+01:14:11,239 --> 01:14:24,220
+ first we take a variational autoencoder as a starting point, so this
+ thing has the normal sort of variational autoencoder loss; but we saw
+ that these adversarial networks give really amazing samples, so why not
+ also attach an adversarial network
+
+1018
+01:14:24,220 --> 01:14:40,689
+ in addition to our variational autoencoder we now also have a
+ discriminator network that tries to tell the difference between real
+ data and samples from the variational autoencoder; but that is not
+ fancy enough
+
+1022
+01:14:40,689 --> 01:14:59,079
+ so why not also download an AlexNet, pass both images through it, and
+ extract AlexNet features for both the original image and the generated
+ image; now generated images should also have similar deep features,
+ giving a kind of perceptual loss on top
+
+1026
+01:14:59,079 --> 01:15:10,859
+ so we pull in the discriminator, we hope the generated samples also
+ have AlexNet features similar to the original, and once you stick all
+ these pieces together, hopefully you get really beautiful samples,
+ right?
+
+1029
+01:15:10,859 --> 01:15:21,109
+ so here is an example from that paper, where this whole thing was
+ trained; I actually think these images are pretty cool
+
+1031
+01:15:21,109 --> 01:15:35,760
+ contrast these with the multi-scale CIFAR samples we saw before, where,
+ remember, they actually trained a separate model per class, and with
+ those beautiful bedroom samples you saw, where the model was specific
+ to bedrooms
+
+1035
+01:15:35,760 --> 01:15:50,489
+ here this is actually one single model trained on all of ImageNet;
+ these are still not real images, but they are definitely getting toward
+ real-looking images; so I think this is pretty cool
+
+1038
+01:15:50,489 --> 01:16:02,460
+ it is also kind of fun to just stick everything together and hopefully
+ get really nice samples; and that is almost everything I have on
+ unsupervised learning; are there any questions?
+
+1042
+01:16:07,100 --> 01:16:17,110
+ [student asks what is going on in the interpolations]
+
+1043
+01:16:18,680 --> 01:16:30,729
+ yeah, so the question is whether the latent space has to be linear for
+ this to work; one way to think about it is to remember that we are
+ sampling from the noise and passing it through the generator
+
+1046
+01:16:30,729 --> 01:16:44,510
+ and the generator has decided to use the different noise channels in
+ such a nice way that when you interpolate between noise vectors, you
+ end up interpolating between images
+
+1049
+01:16:44,510 --> 01:17:00,310
+ hopefully in a nice smooth kind of way; it shows the network is not
+ just memorizing training examples, but generalizing in a nice way
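The smiling-woman and glasses examples above are plain vector arithmetic on averaged latent vectors. A sketch, with the three mean vectors as random placeholders for the human-grouped averages:

~~~python
# Latent vector arithmetic, as in the glasses example above. The z_*
# arrays stand in for means over latent vectors whose samples a human
# labeled as each category.
import numpy as np

rng = np.random.RandomState(0)
z_man_glasses = rng.randn(100)     # mean z over "man with glasses"
z_man_plain   = rng.randn(100)     # mean z over "man without glasses"
z_woman_plain = rng.randn(100)     # mean z over "woman without glasses"

z_new = z_man_glasses - z_man_plain + z_woman_plain
# Decoding z_new with the trained generator would, hopefully, produce
# a woman with glasses, as reported in the DCGAN paper.
~~~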
+
+1052
+01:17:00,310 --> 01:17:08,470
+ right, so just to wrap up everything we talked about today: we gave you
+ a lot of really useful practical tips for working with videos, and I
+ gave you a lot of very impractical tips for generating beautiful images
+
+1054
+01:17:08,470 --> 01:17:19,840
+ I think this stuff is really cool, but I am not sure what the use is
+ other than generating the images; still, it is cool, and it is
+ definitely fun
+
+1056
+01:17:19,840 --> 01:17:31,500
+ and do stick around, because next time we have a guest lecture from
+ Jeff Dean of Google; if you have been watching on the internet, this is
+ probably the one you want to come to class for; so I think that is
+ everything for today, and I will see you guys later
diff --git a/captions/Ko/Lecture15_ko.srt b/captions/Ko/Lecture15_ko.srt
new file mode 100644
index 00000000..ffbb057d
--- /dev/null
+++ b/captions/Ko/Lecture15_ko.srt
@@ -0,0 +1,3432 @@
+1
+00:00:00,000 --> 00:00:10,929
+ I wanted to point out that what I will be presenting today is partly my
+ own work, partly work done by other people, and some work I did jointly
+ with others; there is a really large group of many, many people who
+ were involved in these collaborations
+
+4
+00:00:10,929 --> 00:00:20,920
+ so you will see a lot of names throughout the talk; with that caveat,
+ today I am going to tell you about where Google has been in terms of
+ using deep learning across a lot of different projects in different
+ places
+
+7
+00:00:26,309 --> 00:00:44,170
+ I actually got involved starting in 2011, when Andrew Ng was spending
+ one day a week at Google; I happened to run into him in a micro-kitchen
+ and asked what he was working on, and it turned out that, way back, I
+ had worked on parallel training of neural nets myself
+
+13
+00:00:50,250 --> 00:01:03,599
+ I had always kind of liked neural nets as a computational model, but at
+ that time it was a little too early: we did not have big enough data
+ sets or enough computation to make them really interesting
+
+16
+00:01:03,600 --> 00:01:20,209
+ Andrew said they were starting to work now, so we said OK and kind of
+ jointly started the Brain project to push the size and scale of neural
+ net training
+
+19
+00:01:20,209 --> 00:01:34,400
+ in particular we were really interested in using large data sets and
+ large amounts of computation to attack perception problems; Andrew
+ later went off to Coursera and kind of drifted away from Google
+
+22
+00:01:34,400 --> 00:01:52,478
+ but since then we have done a lot of interesting work across a lot of
+ different research domains; one of the nice things is that these models
+ are incredibly applicable to many different kinds of problems, like the
+ ones I am sure you have seen in this class
+
+25
+00:01:52,478 --> 00:02:04,579
+ we have also deployed production systems that use neural nets in a wide
+ variety of different products; I will give you a sampling of a few of
+ the research projects and some of the production systems we have built
+ under the covers
+
+29
+00:02:04,578 --> 00:02:13,349
+ including some of the implementation things we do to make these kinds
+ of models run fast; I will focus mostly on that, but there are many
+ more techniques
+
+32
+00:02:13,349 --> 00:02:29,099
+ than the ones you have seen in the last few months: you can train many
+ different kinds of models, different kinds of reinforcement learning
+ algorithms, and other kinds of machine learning; OK
+
+35
+00:02:29,099 --> 00:02:45,819
+ one of the things I really like about the team we have put together is
+ that we have a really diverse mix of different kinds of expertise: we
+ have experts in machine learning research, people like Geoffrey Hinton,
+ and we have people who are large-scale distributed systems builders
+
+39
+00:02:45,819 --> 00:02:54,989
+ and people who think of themselves more in that mold, and people who
+ can do both; often the projects we work on collectively need a mix of
+ those skills
+
+41
+00:02:54,990 --> 00:03:09,670
+ when you put people with these different kinds of expertise together,
+ you can often do something that none of them could have done
+ individually, because it takes both the large-scale systems thinking
+ and the machine learning ideas
+
+44
+00:03:09,669 --> 00:03:22,280
+ it is fun, and you often kind of pick up and learn new things from the
+ other people; so, as an outline, you will basically see
+
+46
+00:03:22,280 --> 00:03:32,209
+ the progression of how deep learning has been applied across a lot of
+ different areas of Google; when we started the project we began
+ collaborating a bit with the speech team, and started doing some early
+
+49
+00:03:32,210 --> 00:03:46,550
+ computer vision kinds of problems; as we had some successes, other
+ teams at Google would say, hey, I have a similar problem, and would
+ come to us, or we would go to them and say, hey, we think we can help
+ with that particular problem
+
+52
+00:03:46,550 --> 00:03:58,539
+ and over time we have gradually expanded the set of teams and areas
+ this kind of thing gets applied to; you can see the breadth
+
+55
+00:03:58,539 --> 00:04:12,920
+ of the different kinds of areas: it is not only computer vision
+ problems, which is kind of nice, and it continues to grow; part of the
+ reason it applies to such a broad spectrum of things is that you can
+
+58
+00:04:12,919 --> 00:04:27,300
+ really think of this as a nice, really universal system: you can put in
+ many different kinds of inputs and get many different kinds of outputs
+ from them, with slight differences in the model you try
+
+62
+00:04:27,300 --> 00:04:40,400
+ but generally the same basic techniques work pretty well across all
+ these different domains, and that is true of the results you have heard
+ about in this class from lots of different areas: now almost every
+ computer vision
+
+65
+00:04:40,399 --> 00:04:54,519
+ problem, speech problems, lots of language understanding areas, and
+ other areas of science, like drug discovery, are starting to have
+ neural models that do better than the alternatives
+
+68
+00:04:54,519 --> 00:05:10,040
+ along the way we have built two different generations of our basic
+ system software for training and deploying neural nets
+
+70
+00:05:10,040 --> 00:05:20,479
+ the first was called DistBelief; we published a paper about it at NIPS
+ 2012; its advantage was that it actually scaled
+
+73
+00:05:20,480 --> 00:05:31,209
+ one of the first uses we put it to was some unsupervised training I
+ will tell you about in a minute, using 16,000 cores; it was fine for
+ production use with lots of parameters, but it was not super
+
+75
+00:05:31,209 --> 00:05:43,349
+ flexible for research: it was kind of hard to express more unusual
+ models, hard to express reinforcement learning algorithms, and it had
+ this kind of message-driven way of being driven from above
+
+78
+00:05:43,350 --> 00:05:57,339
+ it worked well for what it did, but we took a step back a bit more than
+ a year ago and started building our second-generation system,
+ TensorFlow, based on what we had learned from the first generation and
+
+81
+00:05:57,339 --> 00:06:13,329
+ what we had learned from the different kinds of open source packages
+ people were working with; it keeps a lot of the good features of
+ DistBelief but also makes things quite flexible for a wide variety of
+ research, and it is open source, which I
+
+84
+00:06:13,329 --> 00:06:27,819
+ have heard is one of its really nice properties; one thing I will pull
+ from that particular paper is a pair of graphs: scaling the amount of
+ training data and how
+
+87
+00:06:27,819 --> 00:06:33,109
+ accuracy increases with it, and also scaling the size of the neural net
+ and how accuracy increases with that
+
+89
+00:06:33,110 --> 00:06:42,180
+ the exact details do not matter, and you can find hundreds of these
+ kinds of trends in papers, but one of the really nice points is that if
+ you have more data, and you can make your model bigger, doing both of
+ those things together is generally better
+
+91
+00:06:42,180 --> 00:06:54,180
+ than scaling just one of them; you need a really big model in order to
+ capture the kinds of subtle trends that appear in bigger data sets; any
+ model will capture the obvious trends
+
+94
+00:06:54,180 --> 00:07:04,189
+ the obvious patterns, but the more subtle things are what you need the
+ bigger model to capture, and that extra capacity
+
+96
+00:07:04,189 --> 00:07:17,689
+ needs more computation; so we have focused a lot on scaling computation
+ so that we can train big models on big data sets
+
+98
+00:07:17,689 --> 00:07:28,879
+ one of the first things we did in this project: we said unsupervised
+ learning is going to be really important, and we put a big focus on it
+ early on; we quickly
+
+100
+00:07:28,879 --> 00:07:42,990
+ asked what would happen if we did unsupervised learning on random
+ YouTube frames: the idea was to take ten million random YouTube frames,
+ one frame from each of a bunch of random videos, and essentially train
+ a reconstruction
+
+103
+00:07:42,990 --> 00:07:54,459
+ model on that data: a multi-stage autoencoder-style model where in one
+ version you reconstruct the image, and at the higher stages you
+ reconstruct the representation below
+
+106
+00:07:54,459 --> 00:08:07,459
+ for this we used 16,000 cores; we did not have GPUs in our data centers
+ at the time, so we compensated by throwing more CPUs at it
+
+108
+00:08:07,459 --> 00:08:17,189
+ and we actually used asynchronous SGD, which I will talk about in a
+ minute when I get to optimization; the model had a lot of parameters
+ because it was not convolutional; this came before
+
+110
+00:08:17,189 --> 00:08:28,269
+ convolutions were all the rage, so we had local receptive fields, but
+ they are not convolutional, and the model learns separate
+ representations for this part of the image and for that part of the
+ image
+
+113
+00:08:28,269 --> 00:08:40,590
+ an interesting twist; I actually think it would be an interesting
+ experiment to re-run this work but with convolutional weight sharing;
+ in any case, the model learned nine layers of representation
+
+116
+00:08:40,590 --> 00:08:54,799
+ on top of these non-convolutional local receptive fields; one of the
+ things we thought would happen is that it would learn kinds of
+ high-level feature detectors, not just ones tied to particular pixels
+
+119
+00:08:54,799 --> 00:09:08,120
+ to test whether it learned high-level concepts, we had a data set where
+ half the images contained faces and half did not, and we looked around
+ for a neuron that was a good estimator of whether the image contains a
+ face
+
+122
+00:09:08,120 --> 00:09:19,610
+ here are some of the sample images that most excite that neuron; and if
+ you look at what stimulus causes
+
+124
+00:09:19,610 --> 00:09:32,669
+ the neuron to get most excited, you get this kind of creepy face-man,
+ which is kind of funny; the point is that we had no labels on any of
+
+127
+00:09:32,669 --> 00:09:48,399
+ the images in the data set, yet this neuron in the model we were
+ training captured the fact that faces are a thing: I am going to get
+ excited when I see a kind of frontal face in a YouTube frame; and, this
+ being YouTube, there was also a cat neuron, whose mean stimulus is this
+ kind of average tabby
+
+130
+00:09:55,179 --> 00:10:11,669
+ then you can take that unsupervised model and start supervised training
+ on top of it; the task we came to was not the ImageNet
+ one-thousand-class task but the harder twenty-odd-thousand-class
+133
+00:10:11,669 --> 00:10:21,490
+ one, where you are trying to distinguish among all twenty thousand or
+ so classes, which is a much harder task
+
+135
+00:10:21,490 --> 00:10:34,620
+ after training, we looked around at the kinds of images that cause
+ particular neurons to get excited, and they are picking up on very
+ high-level concepts: you know, yellow flowers, or water birds
+
+138
+00:10:34,620 --> 00:10:44,080
+ and this pretraining plus fine-tuning actually increased the
+ state-of-the-art accuracy on that particular task at the time
+
+140
+00:10:45,129 --> 00:10:54,860
+ after that we kind of lost our excitement about unsupervised learning,
+ so we started working with the speech team on supervised learning
+
+142
+00:10:54,860 --> 00:11:09,420
+ the speech task was basically to go from a small segment of audio data,
+ maybe a hundred and fifty milliseconds of time, and try to predict what
+ sound is being uttered in the middle ten milliseconds
+
+145
+00:11:09,419 --> 00:11:22,549
+ we used a stack of fully connected layers, and at the top you predict
+ one of roughly fourteen thousand phone states
+
+147
+00:11:22,549 --> 00:11:34,339
+ this could basically be trained very quickly, and it gave a huge
+ reduction in error: one of the people on the speech team said it was
+ the biggest single improvement they had seen in their twenty years of
+
+150
+00:11:34,340 --> 00:11:47,970
+ speech research; and it launched as part of the Android-based voice
+ search system around 2012
+
+151
+00:11:47,970 --> 00:11:57,149
+ one of the things we often find is that we have a lot of data for some
+ tasks but not very much data for other, similar tasks, and so we often
+ build multi-task and transfer learning systems
+
+154
+00:11:57,149 --> 00:12:09,030
+ there are several ways to do this, so let us look at an example of
+ where we used this in speech
+
+155
+00:12:09,029 --> 00:12:27,129
+ obviously for English we have a lot of data, and we had a really good
+ model with a low word error rate; Portuguese, on the other hand, did
+ not have that much training data at the time, so the word error rate
+ was quite a bit worse
+
+158
+00:12:27,129 --> 00:12:37,610
+ one of the first and simplest things you can do, the kind of thing you
+ do when you take a pretrained model and apply it to another problem
+ where you do not have much data, is to start training from the English
+ model's weights rather than from completely random initialization
+
+162
+00:12:37,610 --> 00:12:50,570
+ and that actually does improve the Portuguese word error rate, because
+ there is enough similarity in the kinds of features you want for
+ speech, generally, regardless of the language; something more
+ sophisticated you can do
+
+165
+00:12:50,570 --> 00:12:56,360
+ is to jointly train one model with a shared core across all languages,
+ in this case all
+
+167
+00:12:56,360 --> 00:13:07,939
+ the European languages, I believe, is what we used; so we jointly train
+ on all this data, and we actually got a pretty significant improvement
+
+169
+00:13:07,940 --> 00:13:17,739
+ for Portuguese, beyond just copying the English model; and surprisingly
+ we actually got a small improvement for English as well, because
+ summed across
+
+171
+00:13:17,740 --> 00:13:25,399
+ all the other languages we had roughly doubled the amount of training
+ data we were able to use, compared to the English-only model
+
+173
+00:13:25,399 --> 00:13:35,850
+ so basically, without much work, the languages with a lot of data all
+ improved a little, and the languages with little data improved a lot;
+ we did have
+
+175
+00:13:35,850 --> 00:13:47,620
+ to figure out a little bit about the language-specific top layers: the
+ top layers are per-language, and these are the kinds of human-guided
+ design choices you end up making
+
+178
+00:13:48,269 --> 00:13:53,149
+ now, the production speech models are a lot more involved than this
+ really simple
+
+179
+00:13:53,149 --> 00:13:57,778
+ feed-forward model; the last thing I will mention is that the models
+ used now handle time
+180
+00:13:57,778 --> 00:14:02,490
+ with recurrent layers, with different parts of the model running at
+ very different frequencies
+
+181
+00:14:02,490 --> 00:14:15,088
+ there is a paper published on this; you do not need to understand all
+ the detail, but there is a lot more complexity in these kinds of
+ models: much more sophisticated recurrent computational models than the
+ simple feed-forward one
+
+184
+00:14:15,089 --> 00:14:26,730
+ and the recent trend is to go fully end to end: rather than having an
+ acoustic model and then a separate language model that takes
+
+186
+00:14:26,730 --> 00:14:34,879
+ the marginalized output of the acoustic model, you go directly from the
+ audio waveform to producing the transcript, character by character
+
+188
+00:14:34,879 --> 00:14:44,169
+ over time; and I think that is going to be a really big trend, both in
+ speech and, more generally, in a lot of learning systems
+
+190
+00:14:44,169 --> 00:14:54,350
+ today a lot of systems are composed of a bunch of subsystems, some of
+ them learned pieces and some of them hand-coded pieces
+
+192
+00:14:54,350 --> 00:15:04,600
+ all glued together with a big pile of sticky glue code, and the pieces
+ are developed and optimized separately
+
+194
+00:15:04,600 --> 00:15:12,700
+ but when you optimize a subsystem in isolation, its statistics may not
+ be the right thing for the final task you actually care about
+
+196
+00:15:12,700 --> 00:15:25,649
+ so instead you would like one much bigger single system, one neural net
+ that goes all the way from the audio waveform directly to the end
+
+198
+00:15:25,649 --> 00:15:34,579
+ goal you care about, so that you can optimize end to end, with not a
+ lot of hand-written code in the middle
+
+200
+00:15:34,580 --> 00:15:46,250
+ I think that is a big trend you will see; OK, so vision: we have a ton
+ of vision problems where we have used various kinds of
+
+203
+00:15:46,250 --> 00:15:59,220
+ convolutional models; you know the big excitement around convolutional
+ neural networks: it first started with Yann LeCun reading checks, then
+ kind of died down for a while
+
+205
+00:15:59,220 --> 00:16:10,200
+ and then Alex Krizhevsky's 2012 paper blew the other competitors out of
+ the water
+
+207
+00:16:10,200 --> 00:16:20,500
+ in the ImageNet 2012 challenge, using a convolutional net; that put
+ these things back on everyone's map, and we said, well, we should be
+ using
+
+209
+00:16:20,500 --> 00:16:28,100
+ these for vision things because they work really well; the next year
+ something like twenty of the entries
+
+211
+00:16:28,100 --> 00:16:38,529
+ used them, where previously it had been just Alex; we had a bunch of
+ people at Google looking at various kinds of architectures for doing
+ better, and the
+
+213
+00:16:38,529 --> 00:16:45,889
+ Inception architecture came out of that: it is a complex module with
+ convolutions of several different sizes
+
+215
+00:16:45,889 --> 00:16:56,789
+ all connected together, and then you replicate that module a bunch of
+ times, and you end up with a very deep network; that turned out to work
+ pretty well
+
+219
+00:17:01,870 --> 00:17:07,740
+ there have been slight changes and additions to it since, and it has
+ become quite a bit more accurate, as you know if you have watched the
+ numbers over the years
+
+220
+00:17:07,740 --> 00:17:19,549
+ OK, so I was lazy and took these next slides from another talk; they
+ are about Andrej sitting down and labeling images
+
+222
+00:17:19,549 --> 00:17:31,269
+ Andrej, who helped manage the ImageNet competition, decided to see how
+ well a human could do: he sat down and trained himself
+
+224
+00:17:31,269 --> 00:17:41,449
+ on things like the tough distinctions between, I don't know, Australian
+ shepherd dogs and similar breeds; and he convinced one of his lab mates
+ to do it too, but the lab mate was not
+
+226
+00:17:41,450 --> 00:17:45,309
+ willing to put in the roughly hundred and twenty hours of training that
+ Andrej did
+
+227
+00:17:45,980 --> 00:17:52,380
+ so the lab mate trained a lot less, and
+ got tired after about twelve hours of labeling, ending up at, I think,
+ twelve percent error
+
+229
+00:17:56,269 --> 00:18:12,918
+ versus Andrej, who put in many long evenings and weekends and got his
+ error down to 5.1 percent
+
+230
+00:18:12,919 --> 00:18:23,220
+ anyway, there is a nice blog post about all of this, and I recommend
+ checking it out
+
+232
+00:18:23,220 --> 00:18:34,279
+ note that these models have a lot of parameters, but a typical human
+ has something like a hundred trillion connections, so
+
+234
+00:18:34,279 --> 00:18:43,440
+ these models, with their comparatively small number of parameters, more
+ or less fit on my mobile device, unlike a human
+
+236
+00:18:43,440 --> 00:18:52,509
+ the general trend, Inception included, is a smallish number of
+ parameters compared to AlexNet: most of AlexNet's parameters were in
+ the two huge fully connected layers at the top
+
+238
+00:18:52,509 --> 00:19:02,220
+ and the later models mostly got away from that; they use a smaller
+ number of parameters but
+
+239
+00:19:02,220 --> 00:19:12,379
+ more floating point operations per parameter; we released a retrained
+ Inception model as part of TensorFlow, so an updated version of the
+ model is available out there for people to use
+
+243
+00:19:24,089 --> 00:19:35,959
+ one of the really nice things about these models, and I think Andrej's
+ blog makes this point, is that they make very fine-grained distinctions
+ really well: the computer model is actually much
+
+246
+00:19:35,960 --> 00:19:42,179
+ better than a person at distinguishing the exact breeds of dogs, while
+ humans are better
+
+248
+00:19:42,179 --> 00:19:52,190
+ in other cases: if the label is a ping-pong ball, often a tiny object
+ in a scene of humans playing ping pong, humans are better at that than
+ the model
+
+250
+00:19:52,829 --> 00:20:01,109
+ the models tend to focus on whatever dominates the pixels, if that is
+ the kind of data you train them on
+
+252
+00:20:01,109 --> 00:20:08,690
+ and they generalize well as long as the scene is well represented in
+ the training data
+
+254
+00:20:08,690 --> 00:20:19,230
+ the mistakes they make are somewhat forgivable too: you look at one and
+ think, it is not a snake, but I understand why it said that; and I am
+ no expert, but I actually
+
+256
+00:20:19,230 --> 00:20:27,490
+ had to think carefully about whether the animal in the front here is a
+ donkey, and I am still not completely sure
+
+258
+00:20:27,490 --> 00:20:42,850
+ so, production uses: one is that we put these kinds of models into
+ Google Photos search; when we launched the Google Photos product, you
+ could search
+
+260
+00:20:42,849 --> 00:20:51,639
+ the photos you had uploaded without ever labeling anything: you just
+ type ocean, and suddenly all your ocean photos show up; so, for
+ example, this user
+
+262
+00:20:51,640 --> 00:21:04,879
+ posted publicly: hey, I did not label these, and it found this statue
+ of Buddha for me; people were pretty excited about that kind of thing
+
+264
+00:21:04,880 --> 00:21:18,339
+ so we are pretty pleased with that, and it handles lots of much more
+ specific kinds of searches too
+
+266
+00:21:18,339 --> 00:21:29,609
+ there are other essentially visual tasks, like the things we want to do
+ in Street View: we photograph all the roads of the world with these
+ camera-equipped cars, giving us images of street scenes, and then we
+ want to be able to read all the text we can find
+
+269
+00:21:29,609 --> 00:21:39,720
+ so first of all you have to find the text: you would like to find and
+ read all the addresses, to match them against the map, and all
+
+271
+00:21:39,720 --> 00:21:47,799
+ the other text as well; so we have a model that does a pretty good job
+ of predicting, at the pixel level, whether each pixel contains
+
+273
+00:21:47,799 --> 00:21:53,819
+ text or not, and it does this pretty well
+
+274
+00:21:53,819 --> 00:21:58,289
+ the training data had lots of text of many kinds marked
+275
+00:21:58,289 --> 00:22:08,569
+ covering all kinds of characters, Chinese characters as well as Roman
+ and Latin alphabet text like English, which it handles pretty well
+
+277
+00:22:08,569 --> 00:22:17,200
+ text in different fonts and sizes and colors, text very close to the
+ camera and very far away; and the training data here was
+
+279
+00:22:17,970 --> 00:22:27,809
+ just human labelers drawing polygons around pieces of text and then
+ transcribing them; on top of the detector we also trained an OCR model
+
+281
+00:22:30,880 --> 00:22:39,799
+ we have also gradually launched different kinds of products around
+ this: the Cloud Vision API lets you do a lot of these things, like
+ semantically labeling images
+
+283
+00:22:39,799 --> 00:22:48,349
+ for people who are not machine learning experts, or do not necessarily
+ want to become one, and just want to do cool things with images
+
+285
+00:22:48,349 --> 00:22:58,650
+ you just upload an image, it runs OCR and finds the text in the image,
+ and it gives you labels, like bicycle for this one
+
+287
+00:22:58,650 --> 00:23:06,689
+ and people have been pretty happy with what it generates
+
+289
+00:23:06,690 --> 00:23:13,600
+ internally, people keep thinking of creative uses of computer vision,
+ now that computer vision essentially actually works, compared to five
+ years
+
+291
+00:23:13,599 --> 00:23:23,250
+ ago; this one is from the team that launched Project Sunroof: they
+ basically process satellite imagery
+
+293
+00:23:23,250 --> 00:23:32,769
+ and predict the slope of each roof from multiple satellite views; you
+ get new satellite imagery every few months, so
+
+295
+00:23:32,769 --> 00:23:43,589
+ once we have several views of the same location we can predict the
+ slope of the roof from all those different views, and then predict how
+ much sun exposure you get, and from that
+
+298
+00:23:43,589 --> 00:23:53,930
+ how much energy you could generate if you were to install solar panels;
+ so, kind of cool: somewhat random stuff you can do now that vision
+ works
+
+300
+00:23:53,930 --> 00:24:08,029
+ OK, this class has mostly been about vision, so now I am going to talk
+ about other kinds of problems, like language understanding; one of the
+ most
+
+302
+00:24:08,029 --> 00:24:20,700
+ important problems there is obviously search; we care a lot about
+ search; if I issue the query "car parts for sale", I want to know which
+ of
+
+304
+00:24:20,700 --> 00:24:28,019
+ these two documents is more relevant; if you just look at the surface
+ forms, the first document looks super relevant
+
+306
+00:24:28,019 --> 00:24:34,609
+ because a lot of the words occur in it; but actually the second
+ document is a lot more
+
+307
+00:24:34,609 --> 00:24:47,269
+ relevant, given the query; so we would like to be able to understand
+ meaning; you have heard a lot about embedding models, so you already
+ know about
+
+310
+00:24:47,880 --> 00:24:58,200
+ embeddings; I will go quickly: basically you want to represent words or
+ other things from a sparse, high-dimensional space
+
+312
+00:24:58,200 --> 00:25:11,440
+ by mapping them into a dense space of maybe a few hundred dimensions,
+ instead of dimensions numbering in the size of the vocabulary, so that
+ similar things
+
+314
+00:25:11,440 --> 00:25:27,420
+ end up near each other in that space; things with similar meaning end
+ up close together: for example, porpoise and dolphin can be very near
+ each other, because they are very similar words that share some of the
+ same meaning and contexts
+
+319
+00:25:27,420 --> 00:25:39,069
+ and SeaWorld ends up kind of nearby, while unrelated words end up
+ pretty far away; so how can you train one of these embedding models?
+
+321
+00:25:39,069 --> 00:25:47,859
+ one thing you can do, even starting from scratch, is a technique that
+ my former colleague Tomas Mikolov came up with
+
+323
+00:25:47,859 --> 00:25:55,870
+ he published a paper about it, and the model is called word2vec;
+ essentially you pick a window of maybe twenty words
+
+325
+00:25:55,869 --> 00:26:06,419
+ you pick the word at the center, then pick another random word in the
+ window, and use the embedding representation of the center word to try
+ to predict that other word
+
+327
+00:26:06,420 --> 00:26:17,190
+ to train this, you basically adjust the weights of the softmax
+ classifier, and
+
+329
+00:26:17,190 --> 00:26:25,919
+ through backpropagation you also adjust the embedding representation of
+ the center word, so that next time you can do a better job of
+ predicting the words around it; and it actually works
+
+331
+00:26:25,920 --> 00:26:34,070
+ one of the really nice things is that, given enough training, you end
+ up with really amazing properties of the word vectors; these are
+
+333
+00:26:34,069 --> 00:26:44,319
+ the nearest neighbors in the vocabulary for three different words or
+ phrases, for example this column for tiger shark
+
+335
+00:26:44,319 --> 00:26:55,529
+ you take its embedding vector and ask which other words have the
+ nearest vectors; the car column is interesting too, and you can see why
+ this is useful for search
+
+337
+00:26:55,529 --> 00:27:07,079
+ people often hand-code information retrieval systems with simple kinds
+ of synonyms, plural handling, and stemming; this system just
+
+339
+00:27:07,079 --> 00:27:15,470
+ seems to know that car, cars, pickup truck, race car, passenger car,
+ and dealership are all related; you get this
+
+341
+00:27:15,470 --> 00:27:26,509
+ soft notion of similarity between words, rather than something
+ explicitly hand-coded; and it turns out that if you
+
+343
+00:27:26,509 --> 00:27:35,730
+ train with the word2vec approach, directions in the embedding space
+ also turn out to be meaningful, not just proximity
+
+345
+00:27:35,730 --> 00:27:47,288
+ so it turns out, if you look at capital-and-country pairs, that you go
+ in roughly the same direction and distance to get from a country to its
+ corresponding capital
+
+348
+00:27:47,288 --> 00:27:56,029
+ from a country you get to its capital, or vice versa, with the same
+ offset; you can also see
+
+350
+00:27:56,029 --> 00:28:05,889
+ other structure appear; this is a map of the embeddings projected down
+ to two dimensions with principal component analysis, and you kind of
+ see
+
+352
+00:28:05,890 --> 00:28:18,210
+ interesting structure around verb tenses, regardless of the particular
+ verb; it means you can solve analogies, like king is to queen as man is
+ to woman, with
+
+354
+00:28:18,210 --> 00:28:26,029
+ some simple vector arithmetic: you literally just take the embedding
+ vectors, add the difference vector, and you land almost exactly at the
+ right point
+
+356
+00:28:26,029 --> 00:28:40,668
+ so, we have been collaborating with the search team: one of the biggest
+ search ranking changes of the last few years is called RankBrain
+
+358
+00:28:40,669 --> 00:28:51,730
+ it is essentially a deep network using embeddings and a bunch of
+ layers, giving a score for how relevant a document is to this
+ particular query
+
+360
+00:28:51,730 --> 00:28:58,308
+ and out of the hundreds of ranking signals it became the third most
+ important one
+
+361
+00:28:58,308 --> 00:29:11,259
+ the next thing, called Smart Reply, was a collaboration with the Gmail
+ team; basically, typing replies to mail on a phone is kind of annoying
+
+363
+00:29:11,259 --> 00:29:21,900
+ so we wanted a system that can often predict what would be a good
+ response just from looking at the message; we have a small network that
+ predicts
+
+365
+00:29:21,900 --> 00:29:30,380
+ whether this is the kind of message that could have a short, concise
+ response, and if so, we activate a much bigger
+
+367
+00:29:30,380 --> 00:29:43,220
+ model; an example: one of my colleagues on the project received a
+ message from his brother saying, we would like to invite you to join us
+ for an early Thanksgiving next week; let us know if you can make it,
+ and what dish you would like to bring; RSVP
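One skip-gram update of the kind described above can be sketched in a few lines of numpy. This uses a full softmax for clarity (the actual word2vec implementation uses hierarchical softmax or negative sampling), and the word ids and sizes are placeholders:

~~~python
# One skip-gram training step: nudge the center word's embedding to
# better predict a sampled context word.
import numpy as np

rng = np.random.RandomState(0)
V, D = 1000, 50                   # vocabulary size, embedding width
E = rng.randn(V, D) * 0.01        # embeddings, one row per word
W = rng.randn(D, V) * 0.01        # softmax classifier weights
lr = 0.1

center, context = 3, 17           # placeholder word ids from a window

h = E[center]                             # center word's embedding
scores = h.dot(W)
p = np.exp(scores - scores.max()); p /= p.sum()

# Gradient of the cross-entropy loss wrt the scores, then backprop.
dscores = p.copy(); dscores[context] -= 1.0
E[center] -= lr * W.dot(dscores)          # adjust the embedding
W -= lr * np.outer(h, dscores)            # adjust the classifier
~~~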
+
+370
+00:29:43,220 --> 00:29:48,100
+ and the model predicted replies like "Count us in!" or "Sorry, we won't
+ be able to make it"
+
+371
+00:29:49,660 --> 00:29:58,169
+ which would be pretty good responses; if you get a lot of email, this
+ is fantastic, and some of the suggestions are quite amusing
+
+373
+00:30:02,250 --> 00:30:11,779
+ you can also do interesting things like run a mobile app in airplane
+ mode, so the model is actually running on the phone; that actually
+
+375
+00:30:11,779 --> 00:30:25,670
+ makes a lot of interesting things possible: here you essentially use
+ the camera image to detect text, by finding where the words are
+
+377
+00:30:25,670 --> 00:30:31,980
+ then run it through a translation model and render the result in place;
+ this particular picture is cycling through different languages
+
+379
+00:30:31,980 --> 00:30:43,460
+ normally you would set it to one language; and there are actually
+ interesting selection problems in choosing what output to show
+
+381
+00:30:43,460 --> 00:30:55,590
+ it is really nice when you are traveling somewhere where you cannot
+ read the signs
+
+385
+00:31:04,549 --> 00:31:15,789
+ there is a real cost question here, though: it is all very well to say,
+ wow, my model is big and awesome, if it just drains
+
+387
+00:31:15,789 --> 00:31:22,769
+ my phone's battery; or, even in my data center, where I have a lot of
+ machines, I cannot necessarily afford to run the
+
+389
+00:31:22,769 --> 00:31:31,720
+ huge model for everything; so there are tricks you can use: for
+ example, neural nets in particular are typically much more
+
+391
+00:31:31,720 --> 00:31:44,120
+ tolerant of very low precision arithmetic at inference time than during
+ training; we have found we can generally quantize all the way down
+
+393
+00:31:44,119 --> 00:31:52,139
+ to eight bits with quality that is still good but much cheaper; you
+ could probably do six bits, but it does not help that much
+
+395
+00:31:52,140 --> 00:32:01,850
+ that gives you something like a 4x memory reduction for storing the
+ parameters, and it also gives you computational efficiency, because you
+ can use the CPU's vector instructions to do many eight-bit multiplies
+ in place of thirty-two-bit float multiplies
+
+398
+00:32:08,809 --> 00:32:14,310
+ then there are kind of more exotic ways of getting more efficiency on a
+ mobile phone
+
+400
+00:32:14,309 --> 00:32:24,910
+ one is a technique called distillation, which I worked on with Geoffrey
+ Hinton; the problem setup is: you have a really giant model
+
+402
+00:32:24,910 --> 00:32:36,430
+ maybe an ensemble of models, that you are really pleased with, a
+ fantastic model, and now you want a small, cheap model that learns to
+ do almost the same thing
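The 8-bit quantization mentioned above can be illustrated with a simple linear quantizer; this is a sketch for illustration, not the exact production scheme:

~~~python
# Linear 8-bit quantization of a weight matrix: a 4x memory reduction
# versus float32, at a small accuracy cost, as discussed above.
import numpy as np

rng = np.random.RandomState(0)
w = rng.randn(256, 256).astype(np.float32)

lo, hi = w.min(), w.max()
scale = (hi - lo) / 255.0
q = np.round((w - lo) / scale).astype(np.uint8)    # stored 8-bit weights

w_hat = q.astype(np.float32) * scale + lo           # dequantized at run time
print("max abs error:", np.abs(w - w_hat).max())    # bounded by ~scale/2
~~~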
+
+404
+00:32:36,430 --> 00:32:48,530
+ your giant expensive model gives fantastic predictions, like: 0.95
+ jaguar, I am sure, and I am definitely sure it is not a car
+
+407
+00:32:48,529 --> 00:33:02,900
+ maybe ten to the minus four for car, and it could be a little bit lion;
+ so, really accurate, informative predictions; unfortunately for us, the
+ main idea here
+
+409
+00:33:02,900 --> 00:33:13,310
+ turned out to have been published in similar form by Rich Caruana in
+ 2006, in a paper called model compression; your giant accurate model or
+ ensemble implements
+
+411
+00:33:13,309 --> 00:33:22,720
+ an interesting function from inputs to outputs; forget the fact that it
+ has some structure inside; you just want to use the information
+
+413
+00:33:22,720 --> 00:33:30,730
+ contained in that function: how do we transfer the knowledge in that
+ really accurate function into a smaller
+
+415
+00:33:30,730 --> 00:33:40,740
+ function? so, when you normally train a model, you feed it something
+ like this image, and you give it the target it should try to
+
+418
+00:33:40,740 --> 00:33:52,819
+ reach: 1.0 jaguar, zero for everything else; these hard targets are the
+ kind of ideal the model is trying to
+
+420
+00:33:52,819 --> 00:34:09,990
+ achieve, and you feed it hundreds of thousands or millions of training
+ images; it learns to approximate that, but it never quite reaches the
+ hard ground truth; instead the big model gives you a nice spread-out
+ probability distribution over the classes for each image
+
+424
+00:34:09,989 --> 00:34:22,079
+ so, with our giant expensive model in hand, one thing we can do is
+ actually smooth this distribution further, into
+
+426
+00:34:22,079 --> 00:34:34,500
+ what Geoffrey Hinton calls dark knowledge: you smooth it by basically
+ dividing all the logits by a temperature, maybe
+
+428
+00:34:34,500 --> 00:34:44,159
+ five or ten or something, and you get this softened representation, a
+ somewhat spread-out probability distribution that says: it is mostly
+ jaguar
+
+429
+00:34:44,159 --> 00:34:56,878
+ but still call it a little bit cougar, maybe a little bit lion, and
+ definitely not a car; and that softened distribution is something you
+ can then train on
+
+431
+00:34:56,878 --> 00:35:08,170
+ it carries more information about each image, and it approaches the
+ function implemented by this big ensemble, which does a really good job
+ of giving you a probability distribution for that image
+
+434
+00:35:08,170 --> 00:35:32,089
+ so you can train the small model where, instead of training on the hard
+ targets alone, you train on a combination of the hard targets plus the
+ soft targets, two objective functions the model tries to match at once
+
+438
+00:35:32,090 --> 00:35:42,039
+ and this works surprisingly well; here is an experiment we did with a
+ big speech model: we start with a baseline model that classifies 58.9
+ percent of frames
+
+440
+00:35:42,039 --> 00:35:50,829
+ correctly; that is our big, accurate model; now we use it to provide
+ soft targets for a small model, which also sees the hard
+
+442
+00:35:50,829 --> 00:35:57,690
+ targets, and we train that new model with only 3 percent of the data
+
+443
+00:35:57,690 --> 00:36:12,800
+ with soft targets it keeps almost the same accuracy, 57 percent, while
+ training with just the hard targets on 3 percent of the data gets 44.5
+ percent and overfits badly; so soft targets are a really
+
+445
+00:36:12,800 --> 00:36:21,739
+ good regularizer; the other thing is that, because the soft targets
+ carry so much more information than a single one-hot label, you
+
+447
+00:36:21,739 --> 00:36:33,358
+ train much, much faster: you get to that accuracy in hours instead of
+ something like a week, which is pretty nice; you can do this
+
+449
+00:36:33,358 --> 00:36:45,269
+ ensemble-to-single-model, or big-model-to-small-model; it is a somewhat
+ underappreciated technique, I think; OK, let us see
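A minimal sketch of the recipe in the cues above: soften the big model's logits with a temperature to get the "dark knowledge" targets, then train the small model on a weighted mix of hard and soft cross-entropies (the T-squared factor on the soft term follows Hinton et al.'s distillation paper). Logits and label here are random placeholders:

~~~python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.RandomState(0)
teacher_logits = rng.randn(10)      # placeholder big-model logits
student_logits = rng.randn(10)      # placeholder small-model logits
hard = np.zeros(10); hard[3] = 1.0  # one-hot ground-truth label

T, alpha = 5.0, 0.5                 # temperature and mixing weight
soft_t = softmax(teacher_logits / T)           # softened teacher targets
soft_s = softmax(student_logits / T)

soft_loss = -np.sum(soft_t * np.log(soft_s))   # match the teacher
hard_loss = -np.sum(hard * np.log(softmax(student_logits)))
loss = alpha * hard_loss + (1 - alpha) * (T ** 2) * soft_loss
~~~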
+
+452
+00:36:51,980 --> 00:36:59,259
+ so, when we thought about building TensorFlow, we took a step back and
+ asked what we really want: you want a system that lets you
+
+454
+00:36:59,260 --> 00:37:08,489
+ do many different things, and it is hard to balance all the things you
+ really care about; a researcher should be able to take a cool previous
+ research idea and try it out
+
+457
+00:37:15,119 --> 00:37:37,219
+ as an example of why flexibility matters, I once tried to reproduce a
+ model where, instead of a thousand-wide fully connected layer, the
+ layer was rather smaller
+
+458
+00:37:37,219 --> 00:37:51,399
+ like 600, or actually 500, which makes a real difference; but OK, I
+ have probably misremembered the details of that paper; the point is
+
+460
+00:37:51,400 --> 00:38:00,689
+ you want to be able to take research ideas, run them quickly, make
+ things reproducible, run the same thing on a data center
+
+461
+00:38:00,690 --> 00:38:10,730
+ and probably on a phone; you want to go from a good research idea to a
+ production system without needing a rewrite into a different system
+
+463
+00:38:10,730 --> 00:38:25,519
+ with those kinds of things in mind, we considered it and decided to
+ open-source TensorFlow
+
+465
+00:38:25,519 --> 00:38:34,340
+ the core bits of TensorFlow: it is portable, it runs on lots of
+ different devices and operating systems; there is this core graph
+ execution engine, and on top of
+
+468
+00:38:34,340 --> 00:38:41,819
+ that we have different front ends for expressing the kind of
+ computation you want to run; there is a C++ front end, which most
+ people do not use
+
+470
+00:38:41,820 --> 00:38:49,339
+ most of you are probably Python people, so the Python front end is
+ probably what you would use; but there is nothing preventing people
+ from building
+
+472
+00:38:49,340 --> 00:38:58,269
+ front ends in other languages; I would like it to be fairly language
+ neutral, with front ends in
+
+474
+00:38:58,269 --> 00:39:09,440
+ different kinds of languages; and you want to be able to take that
+ model and run it on a wide variety of platforms; the basic computation
+ model
+
+476
+00:39:09,440 --> 00:39:17,179
+ is a dataflow graph; I do not know how much of this has come up in your
+ class, but the things flowing along the edges of the graph are tensors
+
+478
+00:39:17,179 --> 00:39:29,269
+ arbitrary n-dimensional arrays, as the basic type; it is not actually a
+ pure dataflow model, because there is state in the
+
+480
+00:39:29,269 --> 00:39:37,019
+ graph: things like variables, and operations that update them, so parts
+ of the system state persist across steps of the computation
+
+482
+00:39:37,019 --> 00:39:45,329
+ and things like gradient computations, and the weight and bias
+ adjustments based on the gradients, are themselves expressed as part of
+ the graph
+
+483
+00:39:45,329 --> 00:39:55,670
+ executing a graph goes through a series of steps, and one important
+ step is deciding, given a whole bunch of computation nodes and a set of
+ devices, which
+
+485
+00:39:55,670 --> 00:40:06,650
+ node runs on which device: for example, here we might have CPUs, in
+ blue, and GPU cards, in green, and we can run the same
+
+487
+00:40:06,650 --> 00:40:13,160
+ graph across them, with most of the heavy computation actually
+ happening on the GPU
+
+488
+00:40:13,159 --> 00:40:22,760
+ these placement decisions are kind of tricky, so we let the user give
+ the system hints and constraints; a hint is not necessarily hard
+
+490
+00:40:22,760 --> 00:40:33,300
+ it might say, really run this operation on a GPU, or, try to place this
+ on task seven, I do not care which device
+
+492
+00:40:33,300 --> 00:40:44,159
+ and then we basically want to minimize the execution time of the graph,
+ subject to all kinds of other constraints we have to respect, like the
+ memory available on each device
+
+494
+00:40:44,159 --> 00:40:54,639
+ I actually think it would be fun to use some reinforcement learning
+ here, because you can actually measure the objective
+
+496
+00:40:54,639 --> 00:41:02,500
+ you know, if I place this node here and that node there, this is how
+ fast my graph runs; I think that would be a pretty interesting
+ reinforcement learning problem
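A minimal TensorFlow sketch of the programming model in the cues above, in the graph-and-session style of early (0.x/1.x) TensorFlow: build the dataflow graph once, give the runtime a device placement hint, then run the graph many times. Device strings and shapes are arbitrary examples:

~~~python
import tensorflow as tf  # graph/session API, as in TF 0.x/1.x

# Build the graph once; tensors flow along its edges.
with tf.device("/cpu:0"):            # a placement hint for these ops
    x = tf.placeholder(tf.float32, [None, 784])
    w = tf.Variable(tf.random_normal([784, 10], stddev=0.01))
    y = tf.matmul(x, w)              # runs wherever the op was placed

# Then run the same graph many times with different inputs.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y, feed_dict={x: [[0.0] * 784]})
~~~

Because the graph is set up once and executed repeatedly, the runtime can afford to spend time up front on placement and other optimizations, which is exactly the trade-off the talk describes.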
+00:41:02,929 --> 00:41:09,139 + 문제 13 우리는 전송을 삽입 한 후 물건을 배치하는 의사 결정을 통해 만든 + +500 +00:41:09,139 --> 00:41:12,500 + 기본적으로 모든 통신 시스템을 캡슐화 노드를받을 + +501 +00:41:12,500 --> 00:41:16,800 + 그래서 기본적으로는 노드 전송이 한 장소에서 다른 답변을 이동하려면 + +502 +00:41:16,800 --> 00:41:21,200 + 그들은 더 검사를받지 그들이 때까지 종류의 단지 텐서에 개최 + +503 +00:41:21,199 --> 00:41:26,669 + 정말에 대한 데이터를 사랑하고 당신은 십자가의 모든 가장자리에 대해이 작업을 수행 + +504 +00:41:26,670 --> 00:41:32,150 + 장치 경계 및 수신 전송 파리의 다른 의미를 가지고 + +505 +00:41:32,150 --> 00:41:36,220 + 장치에 따라서 예를 들어, GPU는 동일한 방법에 있으면 볼 + +506 +00:41:36,219 --> 00:41:39,779 + 기계 종종있을 하나의 GPU 메모리에서 직접 우리의 DNA를 수행 할 수 있습니다 + +507 +00:41:39,780 --> 00:41:44,410 + 그들은 시스템에서 다른 시스템에있어, 당신이 경우 RBC 네트워크는 수도 + +508 +00:41:44,409 --> 00:41:50,868 + 그냥 직접 도달 할 때 사용하는 네트워크 및 I 케이스에서 지원 RDMA + +509 +00:41:50,869 --> 00:41:56,920 + 남부 기계와 당신이 할 수있는 신용의 남부 GPU 메모리에 + +510 +00:41:56,920 --> 00:42:00,210 + 아주 쉽게 새로운 운영 및 대령을 정의 + +511 +00:42:00,210 --> 00:42:06,920 + 당신은 일반적으로 실행할 수있는 그래프를 실행하는 방법을 이러한 인터페이스는 그 본질적 + +512 +00:42:06,920 --> 00:42:10,940 + 한 번 그래프를 설정하고 우리가 가지 가질 수 있도록 당신은 많은 실행 + +513 +00:42:10,940 --> 00:42:17,068 + 시스템은 원하는 본질적 방법에 대해 최적화 많은 의사 결정을 할 + +514 +00:42:17,068 --> 00:42:22,199 + 그것을 만들 않는 등 몇 가지 실험을 할 아마도 더 후 경쟁을 배치 없습니다 + +515 +00:42:22,199 --> 00:42:26,068 + 더 감각으로부터 중복을 광고 할 수 있습니다 여기 여기이기 때문에 그것을 넣어 + +516 +00:42:26,068 --> 00:42:30,969 + 저자 브라이언은 단일 프로세스 구성 모든 실행 하나를 호출 + +517 +00:42:30,969 --> 00:42:35,509 + 과정을 그리고 그것은 단지 종류의 간단한 절차는 분산 환경에서 호출이야 + +518 +00:42:35,510 --> 00:42:38,440 + 노동자의 무리는이 클라이언트 프로세스는 마스터 프로세스이고 그 + +519 +00:42:38,440 --> 00:42:43,608 + 나는 서브 그래프를 실행하려면 같은 장치와 마스터 톤의 클라이언트가 + +520 +00:42:43,608 --> 00:42:47,568 + 마스터는 내가 처리 할 얘기가에 그들에게 싶었 의미 괜찮 말한다 + +521 +00:42:47,568 --> 00:42:54,808 + 당신이 실제로 데이터에 공급할 수 물건을하고 그것이 내가 종류의를 가질 수 있음을 의미합니다 + +522 +00:42:54,809 --> 00:42:59,619 + 더 복잡한 그래프 그러나 나는 단지 내가 만하면 원인 그것의 작은 비트를 실행해야 + +523 +00:42:59,619 --> 00:43:05,440 + 계산에 대한 부분을 실행하는 내내 출력 우리의 + +524 +00:43:05,940 --> 00:43:14,940 + 필요에 우리가 이것을 확장 할 수있는에 많은 초점 이야기를 기반으로 + +525 +00:43:14,940 --> 00:43:19,099 + 분산 환경 실제로 우리의 가장 큰 것 중 하나 때 우리가 처음 열려 + +526 +00:43:19,099 --> 00:43:23,210 + 일주일 동안 소스 센터는 꽤 오픈 소스 모바일 떨어져 조각하지 않았다 + +527 +00:43:23,210 --> 00:43:28,269 + 이 전화 번호 (23)를 가지고하는 방법 있도록 분산 구현은 좋았다 + +528 +00:43:28,269 --> 00:43:33,259 + 헤이 분산 버전의 우리의 릴리스의 일처럼 이내에 제출 + +529 +00:43:33,260 --> 00:43:39,839 + 그게거야 더 좋은, 그래서 우리는 처음 출시 된 지난 목요일했다 + +530 +00:43:39,838 --> 00:43:43,619 + 포장하지만 순간에 당신의 종류 및 여러 프로세스를 구성 할 수 있습니다 + +531 +00:43:43,619 --> 00:43:48,710 + 다른 프로세스의 이름 그는 우리가있어 관련 IP 주소의 중요성입니다 + +532 +00:43:48,710 --> 00:43:55,150 + 내가 주 더 앞으로 몇이야하지만 그건 좋은 및 거 패키지 + +533 +00:43:55,150 --> 00:43:59,250 + 그것을 가지고 온 이유는 훨씬 더 나은 처리 시간을 할 것입니다 + +534 +00:43:59,250 --> 00:44:05,889 + 실험은 모드 훈련 및 실험에 있다면 그렇게 + +535 +00:44:05,889 --> 00:44:09,769 + 반복 당신이 있다면 정말 정말 좋은 분 또는 몇 시간의 종류 + +536 +00:44:09,769 --> 00:44:15,159 + 한 달보다 종류의 희망처럼 더 같은 여러 주 모드 + +537 +00:44:15,159 --> 00:44:19,279 + 당신이 당신은 일반적으로 작업을 수행 할 또는 당신이 할 경우, 당신은 할 나의 여행 오 ​​같은거야 + +538 +00:44:19,280 --> 00:44:26,130 + 왜 우리가 정말 우리 그룹에서 많이 강조 다시 그래서 그냥되는 것이 했는가 + +539 +00:44:26,130 --> 00:44:31,269 + 합리적으로 빨리 실험을 할 수있는 사람을 만들 수 + +540 +00:44:33,920 --> 00:44:39,250 + 그래서 두 가지 일이 우리는 내가 대해 얘기하자 문제 속에서 우리의 모델 평행선을 + +541 +00:44:39,250 --> 00:44:46,588 + 모두 당신이 조금 또는 확인이 이야기 한 당신이 할 수있는 가장 좋은 방법은 이렇게 + +542 +00:44:46,588 --> 00:44:52,279 + (9) 교육 시간을 단축하는 것은 그래서 정말 좋은 중 하나를 시간을 중지 감소 + +543 +00:44:52,280 --> 00:44:56,329 + 속성 대부분의 노트북을 많이하고 고유의 병렬 권리 등 많이있다 + +544 +00:44:56,329 --> 00:44:59,329 + 당신이 계산 모델에 대해 생각하면 
병렬 많이있다 + +545 +00:45:00,539 --> 00:45:04,119 + 각 층의 모든 공간 위치는 거의 무관하므로 + +546 +00:45:04,119 --> 00:45:06,280 + 당신은 단지 그들 주위에 실행할 수 있습니다 + +547 +00:45:06,280 --> 00:45:10,680 + 다른 장치에 병렬로 문제가 통신하는 방법을 알아낼 수있다 + +548 +00:45:10,679 --> 00:45:17,889 + 같은 방법으로 그 계산을 배포하는 것은 당신을 경우 죽이지 않는 방법 + +549 +00:45:17,889 --> 00:45:21,389 + 도움이 사람이 길쌈 신경 같은 지역의 전도성 당신 생각 + +550 +00:45:21,389 --> 00:45:25,299 + 매트는 일반적으로 오에 의해처럼 찾고이 좋은 특성을 가지고 + +551 +00:45:25,300 --> 00:45:31,070 + 오 그 아래의 데이터를 패치하고 다른 아무것도 신경 세포가 필요하지 않습니다 + +552 +00:45:31,070 --> 00:45:35,289 + 그것은 그것을 위해에 필요한 데이터와 중복 훨씬로서 그 옆에 + +553 +00:45:35,289 --> 00:45:41,099 + 타워 사이 거의 또는 전혀 연결을 통해 먼저 신경 UCAV 타워 그래서 + +554 +00:45:41,099 --> 00:45:46,179 + 마다 몇 층 당신은 약간의 의사 소통 수도 있지만 대부분은 당신이 동의하지 않습니다 + +555 +00:45:46,179 --> 00:45:50,399 + 종이 그래서 기본적으로 대부분 우연히 두 개의 별도의 시간을 한 것으로 한 + +556 +00:45:50,400 --> 00:45:55,880 + 다른 CPU에 GPU가와에 대한 처벌은 때때로 몇 가지 정보를 교환 + +557 +00:45:55,880 --> 00:45:59,220 + 당신은 몇 가지 예를 들어 모델 매력적인 여자의 전문 부품을 얻을 + +558 +00:45:59,219 --> 00:46:06,759 + 당신은 그냥 순진하게있을 때 그래서 병렬을 악용 할 수있는 방법이 많이있다 + +559 +00:46:06,760 --> 00:46:10,630 + 아마 이미 GCC 또는 무언가 그 많은 행렬 곱셈 코드를 컴파일 + +560 +00:46:10,630 --> 00:46:16,880 + 인텔의 CPU 점수에 명령 병렬 선물을 활용 + +561 +00:46:16,880 --> 00:46:23,420 + 당신은 통신 기기에서 스레드 영웅주의와 가지 방법을 사용할 수 있습니다 + +562 +00:46:23,420 --> 00:46:27,760 + 종종 꽤 당신에게 한정 학대자 사이에 30 ~ 40 배처럼이 + +563 +00:46:27,760 --> 00:46:31,950 + 로컬 팀 구성원에 더 밴드 여행 당신은 다른 좋아 할 수있는 + +564 +00:46:31,949 --> 00:46:36,750 + 동일한 시스템에서 GPU 카드 메모리와 시스템에서 일반적으로 아래도 + +565 +00:46:36,750 --> 00:46:41,519 + 더 나쁜 종류의 당신이 할 수있는 지역의 많은 데이터를 유지하기 때문에 매우 중요하고 + +566 +00:46:41,519 --> 00:46:48,159 + 당신이있어입니다 기본 개념에 너무 많이하지만 모델 평행선을 먹고 피 + +567 +00:46:48,159 --> 00:46:51,929 + 그냥 어떻게 든 아마 계산 모델을 분할하는 것 + +568 +00:46:51,929 --> 00:47:01,710 + 특히 층으로하고, 예를 들어이 경우이 어쩌면 층 등 + +569 +00:47:01,710 --> 00:47:05,730 + 내가해야 할 유일한 통신이 경계 당신이 몇몇을 알고있다 + +570 +00:47:05,730 --> 00:47:09,039 + 청원의 데이터가 해당 파티션의 입력에 필요한 가지고 일하지만, + +571 +00:47:09,039 --> 00:47:16,949 + 대부분 모든 당신이 속도에 사용할 수있는 다른 기술 지역입니다 + +572 +00:47:16,949 --> 00:47:21,419 + 컨버전스는 일부 데이터 병렬 처리가 다른 많은 사용하려고하는 경우입니다 + +573 +00:47:21,420 --> 00:47:24,608 + 동일 모델 구조의 복제본 그들은 모든 협력거야 + +574 +00:47:24,608 --> 00:47:30,949 + 매개 변수를 잡고 일부는 서버의 공유 세트에 있도록 업데이트 매개 변수 + +575 +00:47:30,949 --> 00:47:36,629 + 상태 속도 향상은 속도 모델의 종류에 많은 10-40 X이 될 수 의존 + +576 +00:47:36,630 --> 00:47:42,720 + 모든에 대해 같은 정말 큰 묻어 450 복제본 스파 스 모델을 + +577 +00:47:42,719 --> 00:47:44,769 + 인간에게 알려진 어휘 단어 + +578 +00:47:44,769 --> 00:47:48,469 + 대부분의 업데이트는 업데이트 사촌 일반적으로 더 많은 병렬 처리를보고 할 수 없습니다 + +579 +00:47:48,469 --> 00:47:53,129 + 매립 항목의 소수 문장은 10와 같은 독특한 단어를 가지고 있습니다 + +580 +00:47:53,130 --> 00:47:57,630 + 만 밖으로 당신은 수백만을 가질 수 있고, 수백만의 수천 + +581 +00:47:57,630 --> 00:48:03,088 + 기본 개념 및 데이터 병렬 처리는 당신이 그래서 일을 많이하고 복제본 + +582 +00:48:03,088 --> 00:48:07,019 + 서로 다른 모델 복제본 유지하는 중앙 집중식 시스템을 거 가지고있다 + +583 +00:48:07,019 --> 00:48:10,519 + 단지 하나의 기계와 아마 많이하지 않을 수 있습니다 매개 변수 추적 + +584 +00:48:10,519 --> 00:48:16,338 + 기계의 당신은 때로는 모든 유지하기 위해 네트워크 대역폭을 많이 필요로하기 때문에 + +585 +00:48:16,338 --> 00:48:19,900 + 그래서이 모델 복제 표준 매개 변수 당신은 우리의 큰 설정에서 알 수 있습니다 + +586 +00:48:19,900 --> 00:48:24,950 + 내 뒤에 (27) 기계 후 중지 된 것을 당신은 당신이있을 수 있습니다 알고 + +587 +00:48:24,949 --> 00:48:29,259 + 다섯 거기 모델의 복제본과 모든 모델 복제하기 전에 + +588 +00:48:29,260 --> 00:48:34,430 + 그것은 당신에게 백을 좋아 말한다, 그래서 그거야 매개 변수를 잡아 일치하지 않습니다 + +589 +00:48:34,429 --> 00:48:39,179 + 및 스물일곱 기계는 나에게 매개 변수를 제공 한 후는 않습니다 + +590 +00:48:39,179 --> 00:48:44,289 + 미니 배지 주변의 조합 동의 것이기 때문에 그것을해야한다 + +591 +00:48:44,289 --> 00:48:47,869 + 매개 
변수 서버 라우터로 다시 분해 시간의 속도에 적용되지 않습니다 + +592 +00:48:47,869 --> 00:48:52,829 + 서버는 그 후 이전에 현재의 매개 변수 값을 업데이트 + +593 +00:48:52,829 --> 00:48:58,039 + 다음 단계 우리는 같은 일이 정말로 집중에 따라 네트워크했다 당신의 + +594 +00:48:58,039 --> 00:49:01,690 + (가) 매우 많은 매개 변수가없는 여기에 도움이 모델로있는 모델 일 + +595 +00:49:01,690 --> 00:49:06,068 + 대회는 그 점 엘라는 그런 점에서 표준화 정말 멋지다 + +596 +00:49:06,068 --> 00:49:11,250 + 당신은 재사용보다 본질적이기 때문에 모든 매개 변수는 너무 시간을 잠글 + +597 +00:49:11,250 --> 00:49:16,929 + 당신은 이미 당신은 그러나 더 큰 배치 크기를 알고 사용하는 것은의 모델에 + +598 +00:49:16,929 --> 00:49:20,088 + 당신이 백을 사용할 수 있습니다 통해 자녀의 228는 거 압력을 가지고있어와 + +599 +00:49:20,088 --> 00:49:23,900 + 경기의 모든 열에 대한 스물여덟 시간 만은 길쌈을 + +600 +00:49:23,900 --> 00:49:28,970 + 모델은 지금 넌 아마 같은 $ 10 재사용의 추가 팩터를 얻을 수있어 + +601 +00:49:28,969 --> 00:49:30,019 + 다른 위치 + +602 +00:49:30,019 --> 00:49:34,769 + 층에 당신은 당신이 백을 풀다 경우를 분석 오후를 사용하는 거라고 + +603 +00:49:34,769 --> 00:49:41,460 + 시간은 그냥 그 종류를 줄이기 위해 그것을 백 번을 다시 사용할 수 있습니다 단계 + +604 +00:49:41,460 --> 00:49:47,220 + 모델이 물건의 정렬 계산을 많이 적은 매개 변수가 + +605 +00:49:47,219 --> 00:49:50,109 + 박사의 경쟁은 일반적으로 잘 작동과 평행 한 것 + +606 +00:49:50,110 --> 00:49:57,340 + 환경은 지금 당신이 그렇게 그 작업을 수행하는 방법에 따라 명백한 문제가있다 + +607 +00:49:57,340 --> 00:50:00,720 + 당신이 할 수있는 한 가지 방법은 완전하게 비동기 적으로 모든 모델 복제본입니다 + +608 +00:50:00,719 --> 00:50:05,459 + 미니 배지 난방을하는 루프에 앉아 매개 변수를 설정 + +609 +00:50:05,460 --> 00:50:09,210 + 복사가 그것을 보내고 당신은 비동기 적으로 다음 그라데이션을 그렇게 할 경우 + +610 +00:50:09,210 --> 00:50:13,710 + 계산하여이 매개 변수는 곳에 대해 완전히 부패 할 수있다 + +611 +00:50:13,710 --> 00:50:17,030 + 지금 현재이 매개 변수 값에 자신의 뒤쪽에 그것을 계산하지만, + +612 +00:50:17,030 --> 00:50:20,810 + 한편 10 다른 지원자가 만든은을 통해 사행하는 매개 변수를 호출 + +613 +00:50:20,809 --> 00:50:27,529 + 여기 그리고 지금 당신은 당신이이 만드는 여기에 대한라고 생각 그라데이션을 적용 + +614 +00:50:27,530 --> 00:50:31,080 + 그것의 사촌 이미 불편이 추가 매우 불편 + +615 +00:50:31,079 --> 00:50:38,619 + 완전히 비 윤리 문제 그러나 좋은 소식은 그것이 어떤까지 일 + +616 +00:50:38,619 --> 00:50:43,670 + 수준 당신이 알고있는 조건을 이해 정말 좋은 것 + +617 +00:50:43,670 --> 00:50:48,059 + 작품과 이론적 기초하지만, 실제로는 꽤 작동하도록 보인다 + +618 +00:50:48,059 --> 00:50:51,710 + 잘 당신이 할 수있는 다른 일이 완전히 동 기적으로 당신이 할 수있는 그래서 이렇게이다 + +619 +00:50:51,710 --> 00:50:55,800 + 확인 모든 사람들이 그들 모두가 매개 변수를 가서 소리 하나 운전 루프가 + +620 +00:50:55,800 --> 00:50:58,610 + 그들은 모두 계산 그라디언트 다음은 그라데이션을 기다리는는 표시와해야 할 일 + +621 +00:50:58,610 --> 00:51:03,820 + 그녀의 주위에 그들에게 큰 노력 뭔가하고 효과적으로 단지 + +622 +00:51:03,820 --> 00:51:09,269 + 거대한 일괄처럼 보인다는 당신이 우리의 회를 알고처럼 보이는 그 복제본 + +623 +00:51:09,269 --> 00:51:14,300 + 때때로 당신이 가지 얻을 작동 개개의 일괄 처리 크기 + +624 +00:51:14,300 --> 00:51:18,950 + 더 큰 배치 크기 만 더 훈련에서 수익을 감소 + +625 +00:51:18,949 --> 00:51:21,169 + 예 당신이 + +626 +00:51:21,170 --> 00:51:26,159 + 더 관대 한 당신은 더 큰 바이트는 일반적으로 크기 조 훈련을하다 + +627 +00:51:26,159 --> 00:51:30,420 + 당신이 천 확인의 크기에 대해 알고 예는 백만 훈련을 + +628 +00:51:30,420 --> 00:51:36,068 + 천 그리 좋은하지 못 외부의 예 + +629 +00:51:36,639 --> 00:51:41,289 + 내가 훨씬 더 복잡한 선택은 당신이 할 수 있습니다 거기에 루이스했다 생각 + +630 +00:51:41,289 --> 00:51:52,650 + 유럽​​의 권리를 설명 끝처럼 현재의 모델은 좋은했다 + +631 +00:51:52,650 --> 00:51:57,829 + 데이터 병렬 정말 정말 실제로 그래서 그들은 매개 변수를 많이 재사용 + +632 +00:51:57,829 --> 00:52:02,740 + 우리의 모델의 거의 모든 중요한 그것은 우리가 지점에 도착 방법 + +633 +00:52:02,739 --> 00:52:10,669 + 같은 반에서 교육 모델은 하루 일반적으로 하루 그래서 당신은 당신이 어떤 참조 알고 + +634 +00:52:10,670 --> 00:52:19,180 + 사용 설정의 거친 종류의 이곳은의 예 훈련 그래프이다 + +635 +00:52:19,179 --> 00:52:25,489 + 이미지 네트 모델 하나의 GPU 10기가바이트 52 뷰를 사용하고 최대 속도의 종류가있다 + +636 +00:52:25,489 --> 00:52:26,239 + 아직 + +637 +00:52:26,239 --> 00:52:29,759 + 같은 때때로 이러한 그래프는 10의 차이처럼 받고있다 + +638 +00:52:29,760 --> 00:52:34,220 + 50 년 라인과 같은 큰 각 다른 종류의 가까운 것을하지 않는 것 + +639 +00:52:34,219 --> 00:52:39,489 + 군인하지만 
실제 사실 10과 50의 차이는 같다 + +640 +00:52:39,489 --> 00:52:43,798 + 그 요인 4.1처럼 보이지 않도록 네 점의 요인은 무엇인가를 원하는 + +641 +00:52:43,798 --> 00:52:51,920 + 차이를 수행하지만, 그래 당신이없이 원하는만큼 당신이 그것을 할 방법입니다 + +642 +00:52:51,920 --> 00:52:59,150 + 하나의 위기 지점 여섯 칠천 위기 지점 확인 + +643 +00:52:59,150 --> 00:53:04,490 + 그래서 내가 당신에게 당신이에 모델에 입찰 할 수 있도록 약간의 개조하면 되겠 어의 일부를 보여 드리겠습니다 + +644 +00:53:04,489 --> 00:53:08,149 + 병렬 처리의 서로 다른 종류의 악용 우리가 원하는 것들 중 하나 + +645 +00:53:08,150 --> 00:53:13,280 + 병렬 개념의 이러한 종류이었다 그렇게 표현 꽤 쉽게하기 + +646 +00:53:13,280 --> 00:53:17,500 + 것들 중 하나는 난의 종류에 꽤 잘 매핑 약 20 분을 좋아한다 + +647 +00:53:17,500 --> 00:53:22,949 + 이 얘기 아니에요 있도록 연구 논문에서 볼 수있는 일이 읽을 수있는 모든 것을하지만, + +648 +00:53:22,949 --> 00:53:30,189 + 당신은 당신이 좀 좋은되어서는 안 볼 것입니다 무엇보다 너무 다른 아니에요 + +649 +00:53:30,190 --> 00:53:37,940 + 간단한 줄기 세포처럼이 시퀀스 모델에 순서입니다 만 + +650 +00:53:37,940 --> 00:53:43,079 + 보조 기관이 신속하게 2014 년에 출판 된 모든 우리는 본질적으로있어 + +651 +00:53:43,079 --> 00:53:47,849 + 입력 시퀀스를 가지고 매핑하는 시도는 이것이하다 서열을 밝혀 + +652 +00:53:47,849 --> 00:53:51,679 + 연구의 정말 큰 영역은 모델 이러한 종류의에 적용 할 수 있습니다 밝혀 + +653 +00:53:51,679 --> 00:53:56,849 + 문제의 종류를 많이하고 많이 다른 그룹 많이하고있다 + +654 +00:53:56,849 --> 00:54:07,369 + 그래서 여기에이 지역에서 흥미로운 비활성 작업은 최근의 단지 몇 가지 예입니다 + +655 +00:54:07,369 --> 00:54:13,269 + 어떤 다른 실험실 주변에서이 지역의 마지막 년 반에서 일 + +656 +00:54:13,269 --> 00:54:17,630 + 당신이 이미 그것에 대해 얘기했습니다 세계 + +657 +00:54:17,630 --> 00:54:26,320 + 당신은 픽셀 단위로 넣을 수 있습니다 그냥 대신 시퀀스의 자막 호출은 당신입니다 + +658 +00:54:26,320 --> 00:54:31,890 + 당신은 당신의 초기 상태의 당신이 CNN을 통해 갔다 픽셀에 넣고 + +659 +00:54:31,889 --> 00:54:34,889 + 꽤 놀라운 캡션을 생성 할 수 있습니다 + +660 +00:54:36,030 --> 00:54:42,019 + 삼십오년 전 나는 잠시 동안 해리 R에 대해 그렇게하지를 생각하지 않는에 기여했다 + +661 +00:54:42,019 --> 00:54:46,730 + 당신은 실제로 할 수있는 다음 당신이 생성 할 수 있도록이 생식 모델 말할 + +662 +00:54:46,730 --> 00:54:51,320 + 분포를 탐구하여 다른 문장은 내가 우리 모두를 생각하는지 + +663 +00:54:51,320 --> 00:54:56,870 + 선장은 인간 하나의 매우 정교한가 안하지 않은 것은하지 않습니다 + +664 +00:54:56,869 --> 00:55:01,230 + 종종 사물의 하나입니다 참조 + +665 +00:55:01,230 --> 00:55:07,639 + 당신이 모델 조금 훈련하면 경우는 그녀의 트레이너 정말 중요합니다 + +666 +00:55:07,639 --> 00:55:13,210 + 그렇게 나쁘지 않아 빛이 있기 때문에 모델이 수렴하지만 당신은 것을 훈련하는 경우 + +667 +00:55:13,210 --> 00:55:17,070 + 모델 이상 같은 모델은 훨씬 더있어 + +668 +00:55:21,079 --> 00:55:25,139 + 트랙에 앉​​아있다 여기에 같은 일을 바로 훈련은 예 그건 사실이야 + +669 +00:55:25,139 --> 00:55:30,909 + 하지만 사람이 더 나은하지만 그녀는 여전히 볼 수있는 사람은 훨씬 더 세련가 + +670 +00:55:30,909 --> 00:55:35,480 + 그들은의 저장소 근처에 트랙을 교차하고 있음을 알 권리처럼 + +671 +00:55:35,480 --> 00:55:42,199 + 모델이 귀여운 다른 종류의에 데리러 것을 더 미묘한 물건의 종류 + +672 +00:55:42,199 --> 00:55:48,750 + 을 사용하여 실제로 매우 시원 그래프 모든 종류의 문제를 해결하는 데 사용할 수 있거나 + +673 +00:55:48,750 --> 00:55:56,440 + 도 마라 포르투나 및 FTP 당신의 톤으로 시작이 일을 yalls + +674 +00:55:56,440 --> 00:56:03,059 + 포인트는 그게 잘 작동에 대한 외판원을 예측하려고 + +675 +00:56:03,059 --> 00:56:11,559 + 잔디의 볼록 선체 또는 Delonte 삼각 측량을위한거야 전화 + +676 +00:56:11,559 --> 00:56:14,199 + 그것은 단지의 순서에 당신 위업에 대한 시퀀스 문제 비밀 알고 + +677 +00:56:14,199 --> 00:56:18,129 + 점하고 출력은 어떤 문제에 대한 포인트의 설정 오른쪽은 당신 + +678 +00:56:18,130 --> 00:56:21,130 + 에 대한 관심 + +679 +00:56:21,780 --> 00:56:28,519 + 내가 사기를거야, 그래서 당신이 한 번, 그래서 확인 응답이 내가 당신을 보여 앨리스 오후 cellco + +680 +00:56:28,519 --> 00:56:35,530 + 거기에 당신의 당신이 네 가지를 원 가정 해 봅시다 시간 스무 시간 단계에 등록 할 수 있습니다 + +681 +00:56:35,530 --> 00:56:37,680 + 시간 단계 당 층 대신 하나 + +682 +00:56:37,679 --> 00:56:42,389 + 잘 당신은 당신의 코드를 변경의 약간을 만들 것입니다 그리고 당신은 지금 그렇게 + +683 +00:56:42,389 --> 00:56:47,690 + 계산의 4 층 당신이 할 수있는 일의 2011 실행이 + +684 +00:56:47,690 --> 00:56:51,840 + 그래서 다른 GPU에 그 층의 각각의 변화가 톤을 만들 것입니다 + +685 +00:56:51,840 --> 00:56:56,869 + 의 작업을 수행 발생하고 당신이 그래서 이런 모델을 할 수 있습니다 + +686 +00:56:56,869 --> 
00:57:01,289 + 내 장식 조각이 나는 시간 단계 당이 난 층이야 다른 깊은 질투하다 + +687 +00:57:01,289 --> 00:57:08,190 + 첫 번째 조금 후 나는 점점 더 많은 GPU를을 가지지고 시작할 수 있습니다 + +688 +00:57:08,190 --> 00:57:10,349 + 과정에 관여 + +689 +00:57:10,349 --> 00:57:15,579 + 당신은 기본적으로 파이프 라인 전체 것은 상기 거대한 소프트 팩있다 + +690 +00:57:15,579 --> 00:57:19,710 + 당신의 상단은 아주 쉽게 모델에 그렇게 유지에 걸쳐 분할 할 수 있습니다 + +691 +00:57:19,710 --> 00:57:25,500 + 병렬 바로 우리가 지금이 사진을 우리가 실제로 분할을 사용하여 여섯 GPU가있어 + +692 +00:57:25,500 --> 00:57:30,909 + 그 부드러운 최대 국경을 남용하고 남자는 그렇게 모든 복제본은 GPU 것 + +693 +00:57:30,909 --> 00:57:36,109 + 동일한 시스템에서 카드를 따라 흥얼의 모든 종류 그리고 당신은 사용할 수 있습니다 + +694 +00:57:36,110 --> 00:57:37,849 + 그 외에도, 데이터 병렬성 + +695 +00:57:37,849 --> 00:57:45,989 + 빨리 훈련하는 AGP 카드 복제의 무리를 양성하는 우리는 QS의이 개념이 + +696 +00:57:45,989 --> 00:57:50,509 + 그는 종류의 그녀가 잔뜩 할 사진이 한 다음 고통을 수 있습니다 + +697 +00:57:50,510 --> 00:57:55,860 + 다음 EQ와 나중에는 D로 시작 사진과 시간의 또 다른 비트가 + +698 +00:57:55,860 --> 00:58:00,789 + 청문회 물건 후 십여 가지 하나 하나의 예는 그래서 + +699 +00:58:00,789 --> 00:58:04,650 + 변환하는 JPEG 디코딩을 할 이유를 다음 입력을 프리 페치와 할 수 있습니다 + +700 +00:58:04,650 --> 00:58:09,240 + 배열의 종류에 어쩌면 약간의 미백을하고 임의 자르기 + +701 +00:58:09,239 --> 00:58:16,149 + 당신 같은 사람의 물건을 선택하고 당신은 다른 GPU에 DQ 수 있습니다 + +702 +00:58:16,150 --> 00:58:22,769 + 카드 또는 뭔가 우리 또한 할 수있는 번역 작업의 우리에 대한 그룹 유사한 예 + +703 +00:58:22,769 --> 00:58:27,869 + 당신의 배치 예에 무리가되도록 실제로 문장의 길이에 의해 버킷 + +704 +00:58:27,869 --> 00:58:32,449 + 그 모두 거의 같은 문장 길이 모두 13 216 단어 문장 + +705 +00:58:32,449 --> 00:58:37,539 + 단지 우리가 심지어해야 만 정확하게 많은 펼쳐진 실행 의미 일 + +706 +00:58:37,539 --> 00:58:42,210 + 당신이 임의의 다음 문장 길이 잘 알고보다는 단계 + +707 +00:58:42,210 --> 00:58:46,099 + 임의 회원 큐는 단지 전체 무리입니다 셔플 도전 + +708 +00:58:46,099 --> 00:58:49,099 + 예를 들면 다음 밖으로 임의의 사람을 얻을 + +709 +00:58:55,130 --> 00:59:02,269 + 데이터 병렬 바로 그래서 다시 우리는이 많은 복제본을 가질 수 있도록하려면 + +710 +00:59:02,269 --> 00:59:09,309 + 것은 그래서 당신은 우리있어 꽤 행복하지 않은 변경의 적당한 양을 + +711 +00:59:09,309 --> 00:59:13,769 + 하지만 변화의 양이 감독자가 무엇 당신이 할의 종류 + +712 +00:59:13,769 --> 00:59:19,429 + 그것은 당신이 지금 압력 장치가 말할와 준비 사물의 무리가 + +713 +00:59:19,429 --> 00:59:25,509 + 세션 후 다음 라운드의 각 로컬 루프 당신은 유지하지 + +714 +00:59:25,510 --> 00:59:28,000 + 얼마나 많은 단계의 트랙 모두에 걸쳐 전 세계적으로 적용되었습니다 + +715 +00:59:28,000 --> 00:59:32,500 + 곧 다른 복제본과 모든 사람들의 누적 합계가 큰입니다 + +716 +00:59:32,500 --> 00:59:38,829 + 동기 훈련을 위해 충분히 그 세 가지 별도의 클라이언트처럼 ​​좀 보인다 + +717 +00:59:38,829 --> 00:59:43,929 + 그래서 모든 매개 변수와 함께 큰 중 하나를 세 가지 별도의 복제본을 구동 두려워 + +718 +00:59:43,929 --> 00:59:47,119 + 우리는 분리가없는 경우 불신에서 의미가 흐르는 경향하기 + +719 +00:59:47,119 --> 00:59:54,359 + 매개 변수 서버 개념 우리가 포함 된 답변 변수 변수를 + +720 +00:59:54,360 --> 00:59:59,590 + 답변 그들은 그래프의 단지 다른 부분이고 일반적으로 당신이 그들을지도 + +721 +00:59:59,590 --> 01:00:04,250 + 장치의 작은 세트에 그들은 당신에게 매개 변수를 거 보유하고 있지만 전부 + +722 +01:00:04,250 --> 01:00:07,269 + 나는 그 대답을 보낸다 여부 종류의 같은 프레임 워크에 통합 + +723 +01:00:07,269 --> 01:00:12,829 + 매개 변수 또는 정품 인증 또는이 문제가되지 않습니다 어떤이의 종류 + +724 +01:00:12,829 --> 01:00:16,750 + 동기는 하나의 클라이언트를 가지고 난 그냥 세에서 내 배치를 분할 할 + +725 +01:00:16,750 --> 01:00:22,989 + 복제 기울기를 가지고 있었고, 꽤 것으로 판명 수 있습니다 알고 적용하고 + +726 +01:00:22,989 --> 01:00:31,239 + 감소 정밀도의 허용 그렇게 FB (16)가 실제로 있고 난 트리폴리 변환 + +727 +01:00:31,239 --> 01:00:36,869 + 16 ~ 14 점은 현재 지점을두고 표준 내가 할 지금 대부분의 CPU 사용하지 꽤 + +728 +01:00:36,869 --> 01:00:42,719 + 아직 우리가 우리 자신의 여섯 비트 형식을 구현 지원 + +729 +01:00:42,719 --> 01:00:45,719 + 기본적으로 우리는 32 비트 부동는 나에게 구입 잘려 수있다 + +730 +01:00:47,429 --> 01:00:55,889 + 당신은 종류의 확률 공공 새로운하지만 우리가 종류의 확인은 안해야 + +731 +01:00:55,889 --> 01:01:01,389 + 어떤에서 작성하여 다른 측면에서 32 비트 변환 동의하면 바로 알 + +732 +01:01:01,389 --> 01:01:15,098 + 그것을 위해 여전히 모델링 및 데이터하면서 매우 졸린 지붕 친화적 인 종이입니다 + +733 
+01:01:15,099 --> 01:01:19,500 + 함께 바인딩에서 병렬 정말 빠르게 모델을 훈련 좋아 + +734 +01:01:19,500 --> 01:01:24,639 + 즉,이 모든 정말로에 대한 시도 연구 아이디어를 가지고 할 수있는입니다 무엇 + +735 +01:01:24,639 --> 01:01:28,250 + 큰 데이터 세트에 그것을 밖으로는 상관 문제의 대표 + +736 +01:01:28,250 --> 01:01:29,000 + 약 + +737 +01:01:29,000 --> 01:01:34,199 + 꽤 쉽게로 실험의 다음 세트 밖으로 그 일의 숫자를 파악 + +738 +01:01:34,199 --> 01:01:38,039 + 어딘가에 집중 하중을위한 너무 행복하지 않은 데이터 프로파일을 표현하는 + +739 +01:01:38,039 --> 01:01:44,889 + 동기 병렬 처리는 일반적으로 우리가 오픈 소스가 너무 나쁜 아니지만 + +740 +01:01:44,889 --> 01:01:49,480 + 센터 흐름을 우리가보다 쉽게​​ 연구 기록을 공유 할 수있을 거라 생각하기 때문에 + +741 +01:01:49,480 --> 01:01:56,338 + 우리는 당신이 외부 시스템을 사용하는 많은 사람들을 가지는 알고 생각입니다 + +742 +01:01:56,338 --> 01:01:59,849 + 구글을 개선하고 우리가하지 않는 아이디어를 가져 오는 좋은 일이 있었다 + +743 +01:01:59,849 --> 01:02:05,200 + 반드시이에 기계 학습 시스템을 구축하는 것이 매우 쉽게하는 방법 + +744 +01:02:05,199 --> 01:02:09,298 + 실제 제품은 당신이 뭔가를 실행에 우리의 연구 아이디어에서 갈 수 있기 때문에 + +745 +01:02:09,298 --> 01:02:13,059 + 상대적으로 쉽게 전화 외부 수십 사용자의 커뮤니티 + +746 +01:02:13,059 --> 01:02:16,609 + 구글은 멋진 사물의 모든 종류의 일을하는 방법 좋은 인 성장 I + +747 +01:02:16,608 --> 01:02:21,130 + 고른 게시 얻을 사람들이 수행 한 일의 몇 가지 임의의 예 + +748 +01:02:21,130 --> 01:02:28,769 + 이 안드레처럼 하나의 방법 데일 스 포드가에서 실행에서이 불만을 가지고 + +749 +01:02:28,769 --> 01:02:32,920 + 브라우저를 사용하여 자바 스크립트와 그가 약간 게임의 한 것들 중 하나 + +750 +01:02:32,920 --> 01:02:38,798 + 노란색 점을 학습 보강 배운다 진짜 먹고 얻을 수 배운다 + +751 +01:02:38,798 --> 01:02:42,769 + 긴급 녹색 점은 누군가에 그것을 다시 구현하도록 빨간색 점을 피하기 위해 + +752 +01:02:42,769 --> 01:02:47,059 + 흐름의 관점 실제로 추가 오렌지 도트 정말 나쁜 + +753 +01:02:50,650 --> 01:02:54,550 + 누군가가에 틸 부르 흐 대학에서이 정말 좋은 종이를 구현 + +754 +01:02:54,550 --> 01:02:59,590 + 막스 플랑크 연구소는 당신이 사진 이미지를 촬영하는이 작품을 볼 수과 + +755 +01:02:59,590 --> 01:03:05,269 + 일반적으로 다음 그림과 해당 용지의 스타일에서 해당 사진을 렌더링 + +756 +01:03:05,269 --> 01:03:14,820 + 당신은 당신이 문자가 알고 나쁜처럼 멋진 물건으로 끝날 + +757 +01:03:14,820 --> 01:03:19,550 + 높은 수준의 라이브러리의 인기 정렬 외부 여기 모델을 만드는 + +758 +01:03:19,550 --> 01:03:25,640 + 쉽게 메일 매트를 표현하는 사람이 신경 자막 모델을 구현 + +759 +01:03:25,639 --> 01:03:31,099 + 중국어로 번역에 낮은 측면에서 우리의 노력이 진행되고있다 + +760 +01:03:31,099 --> 01:03:39,349 + 멋진 위대한 마지막 것은 우리가했습니다 뇌 레지던시 프로그램에 대해 이야기합니다 + +761 +01:03:39,349 --> 01:03:44,349 + 실험의 비트 올해이 프로그램을 시작하고 그래서 이것은 더 + +762 +01:03:44,349 --> 01:03:47,769 + 참고로 내년 원인 또는 응용 프로그램에 대한 폐쇄 사제관으로 + +763 +01:03:47,769 --> 01:03:53,420 + 사람들은 것 이번 주에 우리의 최종 후보를 선택한 다음 생각은 + +764 +01:03:53,420 --> 01:03:57,789 + 깊은 학습 연구를하고 우리 그룹의 올해 투자 및 희망이다 + +765 +01:03:57,789 --> 01:04:02,750 + 그들은 나올 것입니다 및 제출 아카이버 논문의 몇 가지를 발표했다 + +766 +01:04:02,750 --> 01:04:08,039 + 회사에하고 흥미로운 기계의 종류를하는 것에 대해 많은 것을 배울 + +767 +01:04:08,039 --> 01:04:16,170 + 연구를 배우고 지금 우리에 대해 분명히 내년 사람을 찾고있어 + +768 +01:04:16,170 --> 01:04:24,670 + 당신은 애플리케이션을 다시 할 수업을 아는 사람에 우리의 강한 + +769 +01:04:24,670 --> 01:04:25,990 + 가을 + +770 +01:04:25,989 --> 01:04:34,439 + 내년에 기회처럼 졸업 거기 당신은 무리가 더있어 이동 + +771 +01:04:34,440 --> 01:04:36,909 + 이 읽기 + +772 +01:04:36,909 --> 01:04:42,949 + 당신의 사촌을 시작 난의 전체 세트를 만들기 위해 흰 종이에 많은 일을했다 + +773 +01:04:42,949 --> 01:04:52,169 + 참조를 클릭 한 다음 확인을 그래서 250 다른 인물을 통해 귀하의 방법을 클릭합니다 I + +774 +01:04:52,170 --> 01:04:53,820 + 초기에 수행 된 + +775 +01:04:53,820 --> 01:04:56,820 + 백 육십 오 + +776 +01:05:02,730 --> 01:05:31,599 + 예 그래서 사물의 그 종류는 실제로 까다로운 그리고 우리는 실제로 꽤있다 + +777 +01:05:31,599 --> 01:05:37,329 + 당신이 당신에 대해 얘기를 알고있는 것 것들에 대한 광범위한 세부 과정 + +778 +01:05:37,329 --> 01:05:43,119 + 일 똑똑 응답 이러한 종류의 사용자의 개인 정보를 사용 + +779 +01:05:43,119 --> 01:05:47,559 + 이제까지 생성됩니다 기본적으로 모든 응답이 단어는 것들 + +780 +01:05:47,559 --> 01:05:52,710 + 수천 명의 사용자로 말했다되었습니다 있도록하기위한 모델에 입력 + +781 +01:05:52,710 --> 01:05:57,380 + 교육 방법에 대해 사람들하지만 단지에 
대해 일반적으로하지 않은 이메일입니다 + +782 +01:05:57,380 --> 01:06:02,480 + 지금까지 제안합니다 일이 당신이 알고에 의해 응답으로 생성되는 것들 + +783 +01:06:02,480 --> 01:06:07,670 + 고유 한 사용자의 의심 번호를 넣어 사용자의 개인 정보를 보호하기 위해 + +784 +01:06:07,670 --> 01:06:10,710 + 면 같은 제품을 설계 할 때 약 당신이 생각하는 물건의 종류 + +785 +01:06:10,710 --> 01:06:16,400 + 실제로 카렌의 많은 당신에 갈 생각된다 우리가이가 될 것이라고 생각을 알고 + +786 +01:06:16,400 --> 01:06:22,119 + 훌륭한 기능 그러나 우리는 사람들의 프라이버시를 보장하는 방식으로이 작업을 수행 할 수있는 방법 + +787 +01:06:22,119 --> 01:06:25,119 + 보호 + +788 +01:06:52,670 --> 01:07:30,108 + 우리는 아마 그것을 보장해야하는만큼 그냥 가지 중 하나가되었습니다 + +789 +01:07:30,108 --> 01:07:32,548 + 우리가했던 모든 다른 것들에 비해 다시 버너에 일 + +790 +01:07:32,548 --> 01:07:37,679 + 나는 전문가의 개념은 내가 그것에 대해 얘기하지 않았다고 생각 할 작업 + +791 +01:07:37,679 --> 01:07:42,489 + 기본적으로 모든하지만 우리는 종류의 임의의 이미지를했다 모델을 가지고 그 + +792 +01:07:42,489 --> 01:07:46,868 + JFT 같은 분류 모델은 만칠천 손실 또는 같은 인 + +793 +01:07:46,869 --> 01:07:51,220 + 뭔가는 우리가 할 수있는 좋은 일반 모델을 내부 데이터 훈련있어 그 + +794 +01:07:51,219 --> 01:07:57,539 + 모든 클래스에 대처하고 우리는 흥미로운 혼동이 계산 가능한 발견 + +795 +01:07:57,539 --> 01:08:01,719 + 세계에서 버섯의 모든 종류의 같은 알고리즘이다 클래스 + +796 +01:08:01,719 --> 01:08:06,539 + 데이터가 풍부한이 유일한 골이 한 세트에 우리는 전문가를 훈련했다 + +797 +01:08:06,539 --> 01:08:11,909 + 버섯 주로 데이터와 가끔 임의의 이미지와 우리는 할 수 + +798 +01:08:11,909 --> 01:08:16,179 + 물건의 종류에 좋은 도달 쉰 같은 모델을 훈련받을 + +799 +01:08:16,179 --> 01:08:24,440 + 꽤 상당한 정확도는 우리가 그것을 증류 할 수 있었다 우리시에 증가 + +800 +01:08:24,439 --> 01:08:27,588 + 단일 모델로 꽤 잘 우리가 정말 너무 많은 것을 추구하지 않은 + +801 +01:08:27,588 --> 01:08:31,899 + 밝혀졌다 그냥 역학 쉰 별도의 모델을 훈련하고있다 + +802 +01:08:31,899 --> 01:08:34,899 + 조금 다루기로 증류 + +803 +01:08:38,170 --> 01:09:20,630 + 이 명확하게 보여줍니다 말한대로 14 탐사 및 추가 연구가있다 + +804 +01:09:20,630 --> 01:09:25,920 + 우리가 걸 내가 그것을 다른 목적이해야 할 모델을 이야기 한 뜻 + +805 +01:09:25,920 --> 01:09:31,048 + 바로 우리는이 어려운 라벨을 사용하거나이 어려운 라벨을 사용하고 그것을 말하는 거 + +806 +01:09:31,048 --> 01:09:36,189 + 여기처럼 말한다이 믿을 수 없을만큼 풍부한 그라데이션의 백 다른 신호를 얻을 수 + +807 +01:09:36,189 --> 01:09:41,379 + 정보 어떤 의미에서 불공정 한 비교 바로 당신이 그것을 많이 이야기하고 그래서 + +808 +01:09:41,380 --> 01:09:46,829 + 내 경우는 그래서 때로는 그렇지 않은 모든 예에 대한 더 많은 물건 너무 많은 + +809 +01:09:46,829 --> 01:09:49,119 + 작업은 어쩌면 우리도해야거야 느낌 + +810 +01:09:49,119 --> 01:09:53,960 + 단지 하나의 이진 레이블보다 설교자 신호를 공급하는 방법을 알아내는 + +811 +01:09:53,960 --> 01:09:59,569 + 우리의 모델 그게 우리가 생각 나는을 추구하는 아마 흥미있는 영역이라고 생각 + +812 +01:09:59,569 --> 01:10:05,349 + 모든 훈련 집합 모델의 큰 앙상블을 갖는 아이디어에 대한 + +813 +01:10:05,350 --> 01:10:08,449 + 그 예측의 형태로 정보를 교환의 일종이다 오히려 + +814 +01:10:08,448 --> 01:10:12,779 + 해당 매개 변수보다 나는 훨씬 저렴 이상의 네트워크 친화적 인 방법이 될 수 있습니다로 + +815 +01:10:12,779 --> 01:10:19,099 + 의의 공동 당신이 훈련의 1 % 않았다 정말 큰에서 훈련 + +816 +01:10:19,100 --> 01:10:22,100 + 하루라도 스왑 예측 + +817 +01:10:39,729 --> 01:10:49,779 + 그래 내가 라디오의 모든 종류의 캡션을 추구하는 가치가있다 생각 의미 + +818 +01:10:49,779 --> 01:10:55,039 + 흥미로운 근로자하지만 당신 경향이 많은 적은 라벨을 갖는 경향이 + +819 +01:10:55,039 --> 01:11:02,550 + 캡션 우리는 지터 재규어 같은 하드 라벨의 종류에 이미지를 가지고 + +820 +01:11:02,550 --> 01:11:06,810 + 깨끗한 방법으로 제조되는 적어도 나는 많은 거기에 내가 알고 있어요 실제로 생각 + +821 +01:11:06,810 --> 01:11:11,539 + 트릭에 대해 쓴 문장과 이미지로 식별되는 + +822 +01:11:11,539 --> 01:11:26,430 + 문장있는 이미지 문제는 당신이 필요하지 않습니다 알고있는 몇 가지 문제에 대한 + +823 +01:11:26,430 --> 01:11:29,510 + 정말 음성 인식 등의 광산에 훈련은 그렇지 않은 좋은 예입니다 + +824 +01:11:29,510 --> 01:11:35,670 + 인간의 성대가 종종 단어를 변경처럼 그렇게 조금 변경 말한다 + +825 +01:11:35,670 --> 01:11:38,670 + 우리 재배포은 매우 고정하지 경향이있다 + +826 +01:11:39,640 --> 01:11:45,460 + 단어 모두가 공동으로 말한다처럼 내일 것과 매우 유사하다 + +827 +01:11:45,460 --> 01:11:50,640 + 그들은 오늘하지만 롱 아일랜드 초콜릿 축제 같은 미묘한 차이가 수도 말 + +828 +01:11:50,640 --> 01:11:55,220 + 갑자기 더 눈에 띄는 다음 2 주 그 종류 이상이 될 + +829 +01:11:55,220 --> 01:11:58,930 
+ 일의 당신은 당신이 원하는 사실을 인식 할 필요가 알고
+
+830
+01:11:58,930 --> 01:12:03,079
+ 이 양성하는 것입니다 할 그 효과의 종류와 가지 방법 중 하나를 캡처하여
+
+831
+01:12:03,079 --> 01:12:07,380
+ 모델과 작은 언젠가 그는 이렇게 온라인으로 할 필요하지는 않지만 좋아
+
+832
+01:12:07,380 --> 01:12:10,770
+ 예를 받고 즉시 당신의 모델을 업데이트 할 수 있지만 당신은 알고
+
+833
+01:12:10,770 --> 01:12:16,180
+ 펜티엄 문제 5 분마다, 10 분 시간 또는 하루에 충분하다
+
+834
+01:12:16,180 --> 01:12:23,940
+ 대부분의 문제이지만 것은 아닌 고정을 위해 그렇게 매우 중요
+
+835
+01:12:23,939 --> 01:12:28,949
+ 그런 시간이 지남에 따라 변경 광고 나 검색 쿼리 나 물건 같은 문제
+
+836
+01:12:28,949 --> 01:12:33,738
+ 권리
+
+837
+01:12:33,738 --> 01:12:42,428
+ 내가 네 말을 할 수없는 세 번째 가장 중요한
+
+838
+01:12:45,819 --> 01:12:57,170
+ 그래 나는 훈련 데이터 세트에서 잡음이 실제로 모든 시간 유명한 선수를 어떻게 의미
+
+839
+01:12:57,170 --> 01:13:01,340
+ 예 때때로 당신이 건너거야 같은 당신은 이미지를 볼 경우에도
+
+840
+01:13:01,340 --> 01:13:02,328
+ 당신의 인생에서 하나
+
+841
+01:13:02,328 --> 01:13:06,670
+ 에서 일하고있는 사실 난 그냥 어떤 사람과의 만남에 앉아 있었다
+
+842
+01:13:06,670 --> 01:13:10,929
+ 시각화 기술과 지금까지 볼 수 있었다 시각화 된 것들 중 하나
+
+843
+01:13:10,929 --> 01:13:14,779
+ 입력 데이터들은 모두 C 네 코어 프레젠테이션 이런 있었다
+
+844
+01:13:14,779 --> 01:13:18,920
+ 예는 모두 4 × 4 화소 각각 월에 같은에 매핑
+
+845
+01:13:18,920 --> 01:13:22,819
+ 자신의 육만 이미지 화면과 마이크가 가지 일을 선택할 수
+
+846
+01:13:22,819 --> 01:13:28,219
+ 출력 및 방향을 선택하고 여기에 예측 모델을 좋아 하나
+
+847
+01:13:28,219 --> 01:13:33,948
+ 높은 신뢰성하지만 잘못했고, 그것은 말했다 모델로 그녀의 비행기가
+
+848
+01:13:33,948 --> 01:13:40,518
+ 비행기 당신은 이미지를보고는 비행기의 레이블은 아니다
+
+849
+01:13:40,519 --> 01:13:49,690
+ 주로 내가 실행 야지 왜 당신은 당신이 원하는 알고는 그래서 이해 좋아
+
+850
+01:13:49,689 --> 01:13:53,288
+ 있는지 확인 데이터 집합 교육 사촌 가능한 한 깨끗하고 잡음이 데이터가
+
+851
+01:13:53,288 --> 01:13:56,488
+ 로 일반적으로 좋지 않다
+
+852
+01:13:56,488 --> 01:14:00,819
+ 그것을 세정하지만 한편 것을 청소 너무 많은 노력을 늘리지
+
+853
+01:14:00,819 --> 01:14:06,969
+ 종종 더 많은 종류의 몇 가지 필터링 가지 작업을 수행하는 그 가치보다 더 많은 노력
+
+854
+01:14:06,969 --> 01:14:12,788
+ 일의 당신은 일반적으로 더 명백한 나쁜 물건을 던져하지 않습니다
+
+855
+01:14:12,788 --> 01:14:15,788
+ 시끄러운 데이터는 최대 그것은 덜 깨끗한보다 종종 더 낫다
+
+856
+01:14:18,739 --> 01:14:28,649
+ 문제에 따라 달라집니다하지만 당신이 있다면 약 한 것은 다음 시도하고
+
+857
+01:14:28,649 --> 01:14:34,159
+ 결과에 만족하지 왜 질문 조사
+
+858
+01:14:34,159 --> 01:14:39,210
+ 좋아 감사합니다
+
diff --git a/captions/Ko/Lecture1_ko.srt b/captions/Ko/Lecture1_ko.srt
new file mode 100644
index 00000000..72eb704c
--- /dev/null
+++ b/captions/Ko/Lecture1_ko.srt
@@ -0,0 +1,2909 @@
+1
+00:00:00,000 --> 00:00:03,899
+ 늦게 들어오시는 분들은 측면에 더 많은 좌석이 있습니다.
+
+2
+00:00:03,899 --> 00:00:19,868
+ 이 수업은 CS231n
+
+3
+00:00:19,868 --> 00:00:23,969
+ Deep Learning Neural Network Class for Visual Recognition 입니다.
+
+4
+00:00:23,969 --> 00:00:33,549
+ 수업 잘못 들어오신 분 있나요? 좋아요. 환영합니다, 행복한 새해, 그리고 행복한 겨울 학기의 첫날 입니다.
+
+5
+00:00:33,549 --> 00:00:41,069
+ 이 수업은 CS231n 의 두번째 개강입니다.
+
+6
+00:00:41,070 --> 00:00:48,738
+ 우리는 지난 번에 180명의 수강생에서 이번엔 거의 350명으로
+
+7
+00:00:48,738 --> 00:00:55,939
+ 말 그대로 수강 인원을 두 배로 늘렸습니다.
+
+8
+00:00:55,939 --> 00:01:02,570
+ 알아두셔야 할 것이 우리는 지금 동영상 촬영 중입니다.
+
+9
+00:01:02,570 --> 00:01:10,680
+ 그러니 만약 오늘 촬영이 불편하시면 카메라 뒤로 이동하거나
+
+10
+00:01:10,680 --> 00:01:18,280
+ 카메라가 비추지 않는 쪽으로 이동하세요.
+
+11
+00:01:18,280 --> 00:01:25,228
+ 후에 동영상 촬영에 동의를 구하는 서류양식을 보내드릴겁니다.
+
+12
+00:01:25,228 --> 00:01:32,200
+ 좋아요. 저는 컴퓨터 과학부의 교수이며 이름은 Fei-Fei Li 입니다.
+
+13
+00:01:32,200 --> 00:01:37,960
+ 이 수업동안 제가 두 명의 대학원생과 함께 가르치는데,
+
+14
+00:01:37,961 --> 00:01:45,839
+ 그 중에 한명은 지금 이 자리에 있습니다. 안드레, 모두에게 인사하세요.
+
+15
+00:01:45,840 --> 00:01:48,659
+ 안드레에 대해서는 많은 소개가 필요 없을 듯 합니다. 많은 분들이 아마 그를 알고 있을 겁니다.
+
+16
+00:01:48,659 --> 00:01:53,960
+ 그의 블로그나 트위터를 팔로우하면서 말이죠.
+
+17
+00:01:53,961 --> 00:02:02,509
+ 안드레가 저 보다 팔로워수가 훨씬 많아요. 엄청 유명하죠.
+
+18
+00:02:02,510 --> 00:02:08,200
+ 그리고 다른 한명 저스틴 존슨은 아직 여행중인데 며칠 내로 돌아올 겁니다.
+
+19
+00:02:08,201 --> 00:02:14,509
+ 안드레와 저스틴이 상당한 양의 강의를 진행하게 됩니다.
+
+20
+00:02:14,509 --> 00:02:20,039
+ 오늘은 제가 첫 강의를 진행 하겠지만 보시다시피 제가 곧, 몇 주안에, 아기를 낳을 예정입니다.
+
+21
+00:02:20,039 --> 00:02:28,239
+ 그래서 아마 여러분들은 안드레와 저스틴을 강의 시간에 더 많이 보게 될 것입니다.
+
+22
+00:02:28,239 --> 00:02:34,189
+ 또한 강의를 마치기 전에 전체 조교들을 소개해 드릴 것입니다.
+
+23
+00:02:34,189 --> 00:02:38,959
+ 아직 좌석을 찾고있는 분은 밖으로 돌아서 들어오시면
+
+24
+00:02:38,959 --> 00:02:47,039
+ 저쪽에 자리가 많이 있습니다.
+ 이 수업에서 우리는
+
+25
+00:02:47,039 --> 00:02:53,519
+ 이 강좌에 대한 소개, 우리가 풀고있는 문제들과 도구들을 다룰 것입니다.
+
+26
+00:02:53,519 --> 00:03:03,530
+ 다시 한번, Vision 수업 CS231n 강의에 오신 것을 환영 합니다.
+
+27
+00:03:03,530 --> 00:03:09,140
+ 이 수업은 매우 구체적으로 Neural Network 이라는 모델링 아키텍처를 다루게 되며
+
+28
+00:03:09,141 --> 00:03:16,000
+ 더 자세히는 Convolutional Neural Network 을 다룹니다.
+
+29
+00:03:16,000 --> 00:03:23,799
+ Deep Learning Network 이라고도 부르는 이 용어를 아마 언론매체 기사에서 접하게 될 텐데.
+
+30
+00:03:23,799 --> 00:03:34,239
+ Vision은 인공지능 분야에서도 제일 빠르게 성장하고 있는 분야중 하나입니다.
+
+31
+00:03:34,239 --> 00:03:40,920
+ 실제로, CISCO사의 자료에 의하면,
+
+32
+00:03:40,921 --> 00:03:50,018
+ 지금 이 시간, 2016년의 4번째 날,
+ 인터넷 사이버공간 데이터의 85% 이상이
+
+33
+00:03:50,019 --> 00:03:56,230
+ pixel 형태로 존재합니다.
+
+34
+00:03:56,231 --> 00:04:05,329
+ 멀티미디어라고 하죠. 우리는 다시말해 이미지들과 영상 - Vision의 시대에 들어 온 것입니다.
+
+35
+00:04:05,330 --> 00:04:12,530
+ 비젼이 이렇게 큰 부분을 차지하는 이유는
+
+36
+00:04:12,530 --> 00:04:20,858
+ 데이터 캐리어로써의 인터넷과 센서들 덕분입니다.
+
+37
+00:04:20,858 --> 00:04:25,930
+ 우리는 이미 지구상의 인구수보다 많은 센서들을 가지고 있지요.
+
+38
+00:04:25,930 --> 00:04:32,000
+ 모든 사람이 스마트 폰, 디지털 카메라를 가지고 있고
+
+39
+00:04:32,000 --> 00:04:37,879
+ 길을 달리는 차들도 카메라가 있죠.
+
+40
+00:04:37,879 --> 00:04:46,500
+ 정말 센서들은 폭발적인 양의 시각 데이터를 인터넷으로 불러왔죠.
+
+41
+00:04:46,500 --> 00:04:55,209
+ 하지만 시각 데이터 또는 픽셀 데이터는 또한 가장 다루기 힘든 데이터입니다.
+
+42
+00:04:55,209 --> 00:05:07,810
+ 저와 다른 Computer Vision 교수들은 시각 데이터를 인터넷의 암흑 물질이라고 부릅니다.
+
+43
+00:05:07,810 --> 00:05:13,879
+ 왜 암흑 물질일까요? 이유는 바로 우주의 85%가 암흑 물질로 이루어져 있으며,
+
+44
+00:05:13,879 --> 00:05:19,409
+ 이 암흑 에너지 또한 아주 관측하기 어렵기 때문입니다.
+
+45
+00:05:19,410 --> 00:05:25,919
+ 우리는 수학적 모델로써 암흑 에너지를 추론해 볼 수 있죠.
+
+46
+00:05:25,920 --> 00:05:30,649
+ 인터넷에서는 픽셀 데이터가 바로 우리가 잘 모르는, 우리가 내용을 알아내기 힘든 암흑물질인 것이죠.
+
+47
+00:05:30,649 --> 00:05:36,239
+ 여기에 여러분이 고려해야할 아주 간단한 측면이 있습니다.
+
+48
+00:05:36,240 --> 00:05:39,090
+ 오늘
+
+49
+00:05:39,091 --> 00:05:49,560
+ 매 60초마다 유튜브 서버들로 150시간 이상되는 분량의 동영상이 업로드됩니다.
+
+50
+00:05:49,560 --> 00:05:54,089
+ 매 60초마다.
+
+51
+00:05:54,089 --> 00:06:02,739
+ 데이터의 양에 대해 생각해보면 인간의 눈으로는 이 방대한 데이터를
+
+52
+00:06:02,740 --> 00:06:07,829
+ 가려내고
+
+53
+00:06:07,829 --> 00:06:14,009
+ 분류하여 내용을 묘사할 방법이 없습니다.
+
+54
+00:06:14,009 --> 00:06:20,980
+ YouTube 팀이나 또는 Google 회사의 관점에서 생각해보면
+
+55
+00:06:20,980 --> 00:06:25,640
+ 그들이 이 데이터들을 검색하고 분류하고 또는 그들을 위한 광고를 넣고
+
+56
+00:06:25,641 --> 00:06:31,529
+ 무엇을 하려고 하던지간에 이건 답이 없어요.
+
+57
+00:06:31,529 --> 00:06:38,919
+ 아무도 직접 손으로 분류를 할 수가 없기 때문이죠.
+ 우리가 가진 유일한 희망은 Vision 기술입니다.
+
+58
+00:06:38,920 --> 00:06:44,640
+ 사물이나 풍경들을 알아내고
+
+59
+00:06:44,641 --> 00:06:50,349
+ 코비 브라이언트가 끝내주는 슛을 날리는 농구 비디오를 찾아내는 거죠.
+
+60
+00:06:50,350 --> 00:06:57,320
+ 이런 것들이 지금 우리가 직면하고있는 문제들입니다.
+
+61
+00:06:57,321 --> 00:07:02,860
+ 엄청난 양의 데이터, 즉 인터넷의 암흑 물질에 대한 도전이죠.
+
+62
+00:07:02,860 --> 00:07:07,379
+ 컴퓨터 비젼분야는 다른 많은 분야와 맞닿아 있습니다.
+
+63
+00:07:07,379 --> 00:07:12,740
+ 마치 여기 앉아계신 여러분들 중에서도
+
+64
+00:07:12,740 --> 00:07:18,050
+ 어떤 분들은 컴퓨터 과학에서, 어떤 분들은 생물학, 심리학,
+
+65
+00:07:18,050 --> 00:07:24,389
+ 자연어 처리, 그래픽스, 로보틱스,
+
+66
+00:07:24,389 --> 00:07:30,680
+ 또는 의료 영상등 여러분야에서 온것 처럼요.
+ 컴퓨터 비전은 실제로
+
+67
+00:07:30,680 --> 00:07:37,329
+ 여러 학문이 관련된 분야입니다.
+ 우리가 풀고있는 문제들, 사용하는 모델들은
+
+68
+00:07:37,329 --> 00:07:43,849
+ 물리학, 생물학, 심리학, 컴퓨터 과학과 수학까지 관련되어 있죠.
+
+69
+00:07:43,850 --> 00:07:51,030
+ 좀 더 개인적인 이야기를 하자면, 저는 스탠포드에서 컴퓨터 비젼 연구실을 맡고 있는데요.
+
+70
+00:07:51,031 --> 00:07:58,589
+ 대학원생, 박사후 연구원과
+
+71
+00:07:58,589 --> 00:08:04,669
+ 학부생들까지 함께 우리의 연구를 위해 아주 많은 주제를 다룹니다.
+
+72
+00:08:04,670 --> 00:08:10,540
+ 그들 중 일부는 음.. 안드레와 저스틴도 우리 연구실에 있고요,
+
+73
+00:08:10,540 --> 00:08:17,780
+ 많은 조교들도 우리 연구실 출신입니다.
+ 우리는 머신러닝,
+
+74
+00:08:17,781 --> 00:08:26,109
+ 즉 딥러닝을 포함하는 큰 분야를 연구하며,
+ 자연어 처리와 음성의 교차점으로서
+
+75
+00:08:26,110 --> 00:08:31,270
+ 인지과학과 신경과학에 대해서도 많은 연구를 합니다.
+
+76
+00:08:31,269 --> 00:08:40,399
+ 제 연구실에 대한 간략한 소개였습니다.
+
+77
+00:08:40,399 --> 00:08:45,600
+ 자, 그럼 조금 다른 관점에서 생각해볼 수 있도록 어떠한 Vision 수업들이
+
+78
+00:08:45,600 --> 00:08:51,050
+ 스탠포드 컴퓨터 과학부에서 열리는지 알아보도록 하죠.
+
+79
+00:08:51,049 --> 00:08:59,629
+ 여러분들은 지금 CS231n 수업을 듣고 계시고요.
+ 이 중에서 컴퓨터 비전수업을 한번도 들어 본 적이 없고
+
+80
+00:08:59,629 --> 00:09:06,220
+ 컴퓨터 비전이란 말을 처음 듣는 분들이 있다면
+
+81
+00:09:06,220 --> 00:09:14,730
+ 전 분기에 열었던 CS131 수업을 들으셨어야해요.
+
+82
+00:09:14,730 --> 00:09:19,779
+ 그리고 원래는 이번 분기에서 올해만 다음 분기로 미뤄진
+
+83
+00:09:19,779 --> 00:09:25,069
+ 매우 중요한 대학원 수준의 컴퓨터 비젼 클래스
+
+84
+00:09:25,070 --> 00:09:31,840
+ CS231a 를 로보틱스 3D 비젼의 Silvio Savarese 교수가 가르칩니다.
+
+85
+00:09:31,840 --> 00:09:47,230
+ 많은 분들이 CS231n 과 CS231a 수업이 서로 같은지 물어보시는데
+
+86
+00:09:47,230 --> 00:09:56,639
+ 같지 않습니다. 만약 좀 더 넓은 범위의
+
+87
+00:09:56,639 --> 00:10:03,220
+ 컴퓨터 비젼분야의 주제와 도구사용 및
+
+88
+00:10:03,220 --> 00:10:11,009
+ 3D 비젼, 로보틱스 비젼과 시각인지에 관한 기본적인 주제들에 관심이 있다면
+
+89
+00:10:11,009 --> 00:10:17,269
+ 더 포괄적인 231a 수업 수강을 고려해 보세요.
+
+90
+00:10:17,269 --> 00:10:26,039
+ 오늘부터 시작하는 231n은 좀 더 세부적인
+
+91
+00:10:26,039 --> 00:10:33,329
+ 문제와 모델을 다룹니다. 대부분 신경망 모델을 이용한
+
+92
+00:10:33,330 --> 00:10:38,580
+ 시각적 인식이죠. 당연히 두 수업에서
+
+93
+00:10:38,580 --> 00:10:47,990
+ 중복되는 부분도 있겠죠.
+ 그리고 아마 다음 분기에
+
+94
+00:10:47,990 --> 00:10:55,590
+ 심화된 수준의 세미나 수업이 몇몇 열릴 것 같아요. 하지만
+
+95
+00:10:55,590 --> 00:11:01,649
+ 아직 미정이니 후에 강의목록을 확인하셔야 할겁니다.
+
+96
+00:11:01,649 --> 00:11:11,409
+ 스탠포드 컴퓨터 비전 수업들을 대략적으로 소개했습니다. 질문있나요? 네.
+
+97
+00:11:11,409 --> 00:11:20,879
+ 131이 이 수업의 선수과목은 아닙니다만,
+
+98
+00:11:20,879 --> 00:11:25,570
+ 컴퓨터 비전 수업을 처음 듣는다면
+
+99
+00:11:25,570 --> 00:11:33,830
+ 이 수업은 컴퓨터 비전에 대한 기본적인 이해를 요구하기 때문에
+
+100
+00:11:33,830 --> 00:11:42,560
+ 강의노트 등을 통한 예습을 하길 권해드립니다.
+
+101
+00:11:42,559 --> 00:11:49,619
+ 좋아요. 오늘은 컴퓨터 비전의 역사를 간단히 다루고
+
+102
+00:11:49,620 --> 00:11:55,519
+ 231n 수업이 어떻게 구성되어 있는지 이야기 해볼 것입니다.
+
+103
+00:11:55,519 --> 00:12:01,409
+ 저는 사실 이 컴퓨터 비전의 역사를 다루는 것을 중요하게 생각하는데
+
+104
+00:12:01,409 --> 00:12:07,480
+ 여러분 대부분은 이 딥러닝이라는 매우 흥미있는 도구에 관심이 있어 이 자리에 있을텐데요,
+
+105
+00:12:07,480 --> 00:12:11,990
+ 그것이 이 수업의 목적이기도 하고요.
+
+106
+00:12:11,990 --> 00:12:16,370
+ 이 수업은 딥러닝 모델이 무엇인지
+
+107
+00:12:16,370 --> 00:12:22,470
+ 심도있게 다룰 것입니다.
+
+108
+00:12:22,470 --> 00:12:28,050
+ 하지만 문제가 속해있는 영역에 대한 이해, 문제에 대한 깊은 고찰없이는
+
+109
+00:12:28,051 --> 00:12:37,849
+ 비젼분야의 새로운 문제를 해결하는
+
+110
+00:12:37,850 --> 00:12:43,320
+ 새로운 모델을 만들거나
+
+111
+00:12:43,320 --> 00:12:52,379
+ 어려운 문제들을 푸는 일에 주요한 일을 하기 매우 어려울 것입니다.
+
+112
+00:12:52,379 --> 00:12:58,860
+ 또한 일반적인 문제 영역과 모델링 도구 자체는
+
+113
+00:12:58,860 --> 00:13:00,129
+ 결코 완전히 서로 분리될 수 없어요.
+
+114
+00:13:00,129 --> 00:13:05,360
+ 그들은 서로에게 정보를 제공합니다. 딥 러닝의 역사를 통해 알게 되겠지만
+
+115
+00:13:05,360 --> 00:13:13,000
+ 네트워크 아키텍처의 연합은 비젼 문제들을 해결하고자 하는
+
+116
+00:13:13,000 --> 00:13:15,289
+ 필요에 의해 생겨납니다.
+
+117
+00:13:15,289 --> 00:13:23,449
+ 그리고 다시 비전 문제는 딥 러닝 알고리즘을 발전시키는데 도움을 주게되죠.
+
+118
+00:13:23,450 --> 00:13:29,350
+ 그래서 매우 중요합니다. 여러분들이 이 수업을 마치고
+
+119
+00:13:29,350 --> 00:13:34,300
+ 컴퓨터 비젼과 딥러닝 수업의 학생임을 자랑스러워 하길 바래요.
+
+120
+00:13:34,301 --> 00:13:39,528
+ 여러분들은 문제해결을 위한 도구들과 그 도구들을 사용하는 방법에 대한 깊은 이해를 갖게 될 거에요.
+
+121
+00:13:39,528 --> 00:13:46,750
+ 비전의 역사는 간략하지만,
+
+122
+00:13:46,750 --> 00:13:54,149
+ 짧은 역사는 아닙니다. 이야기는 5억 4천만년 전으로 거슬러 올라갑니다.
+
+123
+00:13:54,149 --> 00:14:00,110
+ 왜 5억 4천만년 전 일까요?
+
+124
+00:14:00,110 --> 00:14:09,240
+ 제가 매우 구체적인 시간대를 이야기했는데요.
+
+125
+00:14:09,240 --> 00:14:14,049
+ 여러분이 이미 알고 있는지는 모르지만
+ 이시간대는 지구의 역사 중에서도 매우 흥미로운 시간대 입니다.
+
+126
+00:14:14,049 --> 00:14:23,539
+ 생물 학자들은 이 시기를 진화의 빅뱅이라고 부릅니다.
+ 매우 간단한 생명체들이 있었죠.
+
+127
+00:14:23,539 --> 00:14:27,679
+ 5억4천만년 전 그 이전의 지구는
+
+128
+00:14:27,679 --> 00:14:37,989
+ 아주 평화로운 물이 담긴 큰 냄비였어요.
+ 매우 간단한 유기체들이 있었는데,
+
+129
+00:14:37,990 --> 00:14:46,049
+ 이들은 마치 물에 떠 다니는 동물들처럼
+ 일상적인 먹고 살아가는 방법은
+
+130
+00:14:46,049 --> 00:14:53,838
+ 단지 둥둥 떠다니면서 먹을 것이 주변에 다가오면
+
+131
+00:14:53,839 --> 00:15:01,160
+ 그들은 입에 넣는 거죠.
+
+132
+00:15:01,160 --> 00:15:09,969
+ 이 당시에는 그렇게 많은 종류의 동물들이 있지 않았어요.
+ 그런데 아주 이상한 일이 일어났어요.
+
+133
+00:15:09,970 --> 00:15:18,430
+ 5억 4천만년 전 이후의 화석에서는 엄청난 양의 종들이 발견된 것이죠.
+
+134
+00:15:18,430 --> 00:15:27,729
+ 생물학자들은 이 것을 종의 분화라고 하죠.
+ 아주 갑자기 무슨 이유인지 지구에서
+
+135
+00:15:27,730 --> 00:15:35,230
+ 생물들이 다양화하기 시작하고 매우 복잡한
+
+136
+00:15:35,230 --> 00:15:41,039
+ 포식자와 먹이감의 관계 그리고 살아남기 위한 수단들을 가지기 시작했어요.
+
+137
+00:15:41,039 --> 00:15:46,698
+ 그럼 과연 무엇이 이 변화들의 시발점이었는가 하는 질문이 남아있었죠.
+
+138
+00:15:46,698 --> 00:15:53,269
+ 사람들은 유성이 지구에 떨어졌다던가 환경이 바뀌었다던가하는 주장들을 했어요.
+
+139
+00:15:53,269 --> 00:16:00,198
+ 그중에서도 제일 설득력있는 이론은 Andrew Parker의 이론입니다.
+
+140
+00:16:00,198 --> 00:16:03,159
+ 그는
+
+141
+00:16:03,159 --> 00:16:09,490
+ 호주의 현대 동물학자로서 아주 많은 화석을 연구했죠.
+
+142
+00:16:09,490 --> 00:16:19,278
+ 눈의 탄생이 그의 이론이었죠.
+
+143
+00:16:19,278 --> 00:16:25,688
+ 최초로 삼엽충이 "눈"을 갖게되었어요. 매우 간단한 눈인데
+
+144
+00:16:25,688 --> 00:16:30,779
+ 핀홀 카메라처럼 단지 빛을 포착하고 투영해서
+
+145
+00:16:30,779 --> 00:16:34,750
+ 주변환경의 정보를 받아들이죠.
+
+146
+00:16:34,750 --> 00:16:41,080
+ 이제부터는 삶이 달라집니다.
+
+147
+00:16:41,080 --> 00:16:44,889
+ 제일 먼저 먹이를 찾아갑니다, 먹이가 어디있는지 보이거든요.
+
+148
+00:16:44,889 --> 00:16:51,809
+ 더이상 장님처럼 둥둥 떠 다니기만 하지 않아도 되죠.
+ 자 이제 당신이 먹이들을 찾아갈 수 있게 되었어요.
+
+149
+00:16:51,809 --> 00:16:57,399
+ 이제 먹이들도 당신에게서 도망가기 위해 눈이 필요해졌죠.
+
+150
+00:16:57,399 --> 00:17:02,590
+ 그렇지 않으면.. 먹이들은 인생 끝난거죠.
+ 처음으로 눈을 가진 녀석은 그야말로
+
+151
+00:17:02,590 --> 00:17:11,380
+ 구글에서 제공하는 무제한 뷔페 식당에 앉아 끝내주는 시간을 보내는거죠.
+
+152
+00:17:11,380 --> 00:17:18,170
+ 찾을 수 있는 모든 먹이를 먹고다닙니다.
+ 우리는 이 눈의 탄생과 함께
+
+153
+00:17:18,170 --> 00:17:28,400
+ 생물학적으로 군비 경쟁이 시작된 것을 알 수 있죠. 모든 동물들은
+
+154
+00:17:28,400 --> 00:17:34,170
+ 생존을 위해 무엇인가를 개발하는 법을 배워야 했어요.
+
+155
+00:17:34,170 --> 00:17:40,190
+ 갑자기 천적과 먹이의 관계, 그리고 종의 분화가 시작됐어요.
+
+156
+00:17:40,190 --> 00:17:47,870
+ 5억 4천만년 전, 그때가 바로 비전이 시작된 시기입니다.
+ 그 뿐아니라 비전은
+
+157
+00:17:47,870 --> 00:17:53,189
+ 종의 분화와 진화를 불러온 주요 원동력중의 하나 입니다.
+
+158
+00:17:53,190 --> 00:17:58,980
+ 좋아요. 우리는 진화에 대해 자세히 다루진 않을거예요.
+
+159
+00:17:58,980 --> 00:18:08,710
+ 비전 엔지니어링에 대한 또 다른 중요한 일이
+
+160
+00:18:08,710 --> 00:18:19,220
+ 르네상스시기에 일어났어요. 바로 레오나르도 다 빈치에 의한 것이었죠.
+
+161
+00:18:19,220 --> 00:18:23,740
+ 르네상스 이전에도 인간 문명화의 과정에서 아시아, 유럽,
+
+162
+00:18:23,740 --> 00:18:30,400
+ 인도에서 아랍세계에 걸쳐 카메라의 모델들이 있어왔어요.
+ 아리스토텔레스는
+
+163
+00:18:30,400 --> 00:18:36,360
+ 나뭇잎 잎사귀들을 통해서 보는 카메라를 제안했구요. 중국의 철학자 Mozi는
+
+164
+00:18:36,359 --> 00:18:40,939
+ 구멍이 있는 상자를 통해 바라보는 카메라를 제안했어요.
+
+165
+00:18:40,940 --> 00:18:47,750
+ 하지만 처음으로 현대의 카메라와 닮은 카메라의 기록을 보면
+
+166
+00:18:47,750 --> 00:18:49,180
+ Camera Obscura 라는 카메라가 있어요.
+
+167
+00:18:49,180 --> 00:18:56,610
+ 레오나르도 다빈치에 의해 기록되었는데 자세히 다루진 않겠어요.
+
+168
+00:18:56,609 --> 00:19:07,240
+ 그러나 이것은 카메라에 현실세계에서 반사된 빛을 포착하는
+
+169
+00:19:07,240 --> 00:19:12,240
+ 렌즈 혹은 적어도 구멍이 있다는 것을 보여줍니다. 또한,
+
+170
+00:19:12,240 --> 00:19:20,319
+ 현실세계의 상에서 정보를 얻어 투영하는 과정이 이루어짐을 알 수 있습니다.
+
+171
+00:19:20,319 --> 00:19:27,779
+ 이것이 현대 Vision 공학의 시작이라고 할 수 있습니다.
+
+172
+00:19:27,779 --> 00:19:36,170
+ 비전은 세상을 복사하고 싶어서, 시각적인 세상을 복사하고 싶어서 시작되었죠.
+
+173
+00:19:36,170 --> 00:19:42,350
+ 이 시기까지는 시각적인 세상을 공학적으로 이해할 단계는 아니며
+
+174
+00:19:42,349 --> 00:19:46,879
+ 단지 세상을 복제하고 있죠.
+
+175
+00:19:46,880 --> 00:19:53,760
+ 하지만 여전히 기억해야할 중요한 업적이지요.
+
+176
+00:19:53,759 --> 00:20:01,299
+ 물론 Obscura 카메라 이후에 많은 발전이 있어 왔어요.
+
+177
+00:20:01,299 --> 00:20:07,539
+ 여러분도 알다시피 필름이 개발되고 Kodak에서
+
+178
+00:20:07,539 --> 00:20:12,329
+ 상업 카메라를 개발하고 우리는 캠코더들까지 가지게 되었죠.
+
+179
+00:20:12,329 --> 00:20:21,889
+ Vision 을 공부하는 학생으로서
+
+180
+00:20:21,890 --> 00:20:28,050
+ 여러분들이 알고 있어야할 또 다른 중요한 점은 사실 공학적인 요소가 아닌
+
+181
+00:20:28,049 --> 00:20:32,710
+ 과학적인 요소로써 이런 질문을 던집니다.
+
+182
+00:20:32,710 --> 00:20:38,130
+ Vision 은 우리의 생물학적 뇌안에서 어떻게 작동할까요?
+
+183
+00:20:38,130 --> 00:20:45,760
+ 우리는 이제 5억 4천만년에 걸친 진화를 통해서
+
+184
+00:20:45,759 --> 00:20:54,579
+ 포유류와 인간이 가진 끝내주는 시각 시스템이 생겼다는 것을 배웠어요.
+ 하지만 이 기간동안의 진화과정에서 무슨 일이 일어난 걸까요?
+
+185
+00:20:54,579 --> 00:21:01,759
+ 삼엽충의 간단한 눈에서 오늘날의 여러분과 저의 눈까지 어떠한 구조를 발달시켜온 걸까요?
+
+186
+00:21:01,759 --> 00:21:07,950
+ 매우 중요한 연구가 하버드에서
+
+187
+00:21:07,950 --> 00:21:12,690
+ 박사후 과정에 있었던 당시 매우 젊고 열정있는 Hubel과 Wiesel 두 사람에 의해 이루어졌어요.
+
+188
+00:21:12,690 --> 00:21:21,500
+ 그들은 고양이를 깨어있는 상태로 마취를 시키고
+
+189
+00:21:21,500 --> 00:21:28,529
+ Electrode라는 작은 바늘을
+
+190
+00:21:28,529 --> 00:21:35,129
+ 두개골이 열린 상태의 고양이의 뇌로,
+
+191
+00:21:35,130 --> 00:21:42,180
+ 일차 시각 피질이라고 알려진 부위에 넣습니다.
+
+192
+00:21:42,180 --> 00:21:49,490
+ 일차 시각 피질 영역에서는 많은 뉴런들이 시각 정보를 처리합니다.
+
+193
+00:21:49,490 --> 00:21:54,779
+ 하지만 Hubel과 Wiesel 이전엔 일차 시각 피질이 정확히 무슨 일을 하는지 몰랐죠.
+
+194
+00:21:54,779 --> 00:22:02,369
+ 단지 시각처리과정의 초기 단계라는 것과
+
+195
+00:22:02,369 --> 00:22:07,299
+ 엄청난 양의 뉴런이 있다는 것만 알고 있었어요.
+
+196
+00:22:07,299 --> 00:22:12,419
+ 이 일차 시각 피질은 뇌에서 시각 처리의 시작지점이기 때문에 꼭 알아야만 합니다.
+
+197
+00:22:12,420 --> 00:22:20,300
+ 그렇게 그들은 일차 시각 피질에 전극을 넣습니다.
+
+198
+00:22:20,300 --> 00:22:25,930
+ 여기에서 또 다른 흥미로운 사실이 있습니다.
+
+199
+00:22:25,930 --> 00:22:34,880
+ 이것 좀 내려놓고 설명할게요. 일차 시각 피질,
+ 시작이 어디냐에 따라 첫번째 혹은 두번째의 시각 처리 과정이 이루어 지는 곳이죠.
+
+200
+00:22:34,880 --> 00:22:40,910
+ 간략히 말해 이 시각 피질들의 첫번째 시각 처리 과정은
+
+201
+00:22:40,910 --> 00:22:47,180
+ 눈 근처가 아닌 뇌 뒷편에서 이루어 집니다. 매우 흥미로운 점은
+
+202
+00:22:47,180 --> 00:22:51,788
+ 후각 대뇌 피질은 코 바로 뒤에 있어요.
+
+203
+00:22:51,788 --> 00:22:58,519
+ 청각 피질은 귀 바로 뒤에 있지요.
+
+204
+00:22:58,519 --> 00:23:05,798
+ 그런데 시각 피질은 눈에서 가장 먼 곳에서 이루어지죠.
+
+205
+00:23:05,798 --> 00:23:11,099
+ 사실, 일차 시각 피질뿐만 아니라 많은 다른 부분들이 시각처리에 관여합니다.
+
+206
+00:23:11,099 --> 00:23:17,888
+ 거의 50%의 뇌가 Vision과 관련되어있어요.
+ Vision은 뇌에서 가장 어렵고 중요한 감각 지각체계입니다. + +207 +00:23:17,888 --> 00:23:22,608 + 제가 다른 체계들이 중요하지 않다는 것은 아니지만, + +208 +00:23:22,608 --> 00:23:29,839 + 자연이 이 감각 체계를 발달 시키는데에 오랜 시간이 걸렸고, + +209 +00:23:29,839 --> 00:23:37,579 + 이렇게 큰 공간을 차지하고 있어요. + +210 +00:23:37,579 --> 00:23:43,148 + 왜그럴까요? 너무 중요하고 엄청나게 어렵기 때문입니다. + +211 +00:23:43,148 --> 00:23:50,959 + 그래서 이렇게 큰 공간을 차지하고 있는거죠. + 자 Hubel과 Wiesel로 돌아가보면, 그들은 매우 야심찼어요. + +212 +00:23:50,960 --> 00:23:56,028 + 그들은 일차 시각 피질이 무엇을 하는지 알고 싶었어요. + 이것이 바로 Deep Learning Neural Network 연구의 시작이기 때문이죠. + +213 +00:23:56,028 --> 00:24:02,878 + 그들은 한 방에 고양이를 두고 + +214 +00:24:02,878 --> 00:24:07,709 + 신경 활동을 기록했어요. 이 신경 활동을 기록한다는 것은 다시 말해서 + +215 +00:24:07,710 --> 00:24:11,659 + 제가 전극을 여기에 넣고 + +216 +00:24:11,659 --> 00:24:18,059 + 무언가를 보았을 때 뉴런이 활발하게 활동하는지를 보려고 하는 거죠. + +217 +00:24:18,059 --> 00:24:25,308 + 예를 들어 그들이 고양이에게.. + +218 +00:24:25,308 --> 00:24:30,519 + 만약 제가 고양이에게 생선을 보여주면, 그 당시에는 분명히 고양이들은 콩사료보다는 물고기를 먹었죠. + +219 +00:24:30,519 --> 00:24:42,019 + 고양이의 뉴런이 기뻐 펄쩍뛰는 모습을 기대하는 것이죠. + +220 +00:24:42,019 --> 00:24:48,128 + 과학적 발견의 재미있는 점은 과학적 발견은 행운이 따라주면서 + +221 +00:24:48,128 --> 00:24:52,449 + 관심과 깊은 고민의 과정이 있을 때 나타납니다. + +222 +00:24:52,450 --> 00:24:58,740 + 그들은 고양이에게 생선, 쥐, 꽃등을 보여주었습니다만, 그 어떤 것도 효과가 없었어요. + +223 +00:24:58,740 --> 00:25:02,839 + 고양이의 일차 시각 피질은 조용했습니다. + +224 +00:25:02,839 --> 00:25:09,079 + 아주 약간의 활동만을 보여 그들은 매우 불만스러웠죠. 그러나 좋은 소식은 + +225 +00:25:09,079 --> 00:25:14,509 + 그 당시에 컴퓨터가 존재하지 않았다는 점 입니다. 그들은 이 고양이에게 + +226 +00:25:14,509 --> 00:25:21,740 + 자극적인 것들을 보여주기 위해서 슬라이드 프로젝터를 사용해야했어요. + +227 +00:25:21,740 --> 00:25:26,799 + 그들은 생선이 그려진 슬라이드를 넣고 뉴런이 활동하는지 지켜봅니다. + +228 +00:25:26,799 --> 00:25:29,960 + 활동이 없다면 다른 슬라이드로 교체를 했죠. + +229 +00:25:29,960 --> 00:25:38,630 + 그런데 슬라이드를 바꿀 때 마다 뉴런이 활동하는것을 발견했어요. + +230 +00:25:38,630 --> 00:25:46,890 + 그 사각형의 필름있죠? 유리로 되있는지 필름인지 기억은 안나지만, 아무튼 뉴런이 반을을 했죠. + +231 +00:25:46,890 --> 00:25:51,940 + 실제 쥐, 생선이나 꽃에 뉴런이 반응하진 않았어요. + +232 +00:25:51,940 --> 00:25:59,759 + 슬라이드를 꺼내거나 집어넣는 움직임에 뉴런이 반응을 했죠. + +233 +00:25:59,759 --> 00:26:03,140 + 고양이가 단지 "아 드디어 새로운 물체를 보여주려는 구나~" 하고 생각하는 것일 수도 있겠죠. + +234 +00:26:03,140 --> 00:26:13,410 + 알고보니 그들이 슬라이드를 교체하면서 투영된 선이 있었어요. + +235 +00:26:13,410 --> 00:26:18,240 + 그 정사각형인가 네모난 판 말이에요. + +236 +00:26:18,240 --> 00:26:28,120 + 그 움직이는 선이 뉴런을 활동하게 만들었고 + 그들은 그 점에 대해 조사하기 시작했죠. + +237 +00:26:28,120 --> 00:26:34,859 + 그들이 좌절을 했다거나 부주의 했다면 아마 놓쳐버릴 수도 있었죠. + +238 +00:26:34,859 --> 00:26:41,359 + 하지만 그들은 계속 단서를 쫒았고 일차 시각피질의 뉴런들이 + +239 +00:26:41,359 --> 00:26:48,279 + 기둥모양으로 구성되어 있으며, 각각의 뉴런기둥은 + +240 +00:26:48,279 --> 00:27:01,309 + 특정한 방향성을 가지고 자극을 본다는 것을 알게됩니다. + +241 +00:27:01,309 --> 00:27:02,980 + 물고기 혹은 쥐 보다 간단한 하나의 방향을 가진 선을 보는 것이죠. + +242 +00:27:02,980 --> 00:27:07,519 + 제가 이 이야기를 단순화시켜서 말하고는 있지만 + +243 +00:27:07,519 --> 00:27:10,940 + 일차시각피질에는 우리가 무엇에 반응하는지 알아내지 못한 뉴런들이 아직 있어요. + +244 +00:27:10,940 --> 00:27:17,570 + 그 뉴런들은 단순한 방향성을 가진 선에 반응하지는 않죠. + 하지만 Hubel과 Wiesel은 시각처리의 시작이 + +245 +00:27:17,570 --> 00:27:23,779 + 시각 처리는 전체적인 생선이나 쥐의 모습이 아니라는 것을 발견했어요. + 시각 처리의 시작은 + +246 +00:27:23,779 --> 00:27:29,178 + 우리 세상의 간단한 구조들 입니다. + +247 +00:27:29,179 --> 00:27:40,890 + 선, 방향을 가진 선들이죠. 이것은 신경생리학, 신경과학 + +248 +00:27:40,890 --> 00:27:47,870 + 또한 공학 모델링에 있어서도 매우 중요한 발견입니다. + 우리가 이후에 Deep Neural Network feature들을 시각화 할 때에도 + +249 +00:27:47,870 --> 00:27:57,069 + 간단한 선과같은 구조가 우리의 모델에 사용되는것을 볼 수 있어요. + +250 +00:27:57,069 --> 00:28:03,298 + 이 발견은 50년대 후반과 60년대 초반에 이루어 졌지만, + +251 +00:28:03,298 --> 00:28:12,039 + 그들이 노벨의학상을 받은 건 1981년 이었습니다. 
+
+252
+00:28:12,039 --> 00:28:25,928
+ 이 발견은 비젼과 시각처리와 관련해 매우 중요한 업적이었던 것입니다.
+ 그럼 과연 컴퓨터 비전은 언제 시작되었을까요?
+
+253
+00:28:25,929 --> 00:28:35,620
+ 이것도 재미있는 역사의 한 부분입니다.
+ 현대학문의 분야로써 컴퓨터 비전의 시발점은
+
+254
+00:28:35,620 --> 00:28:42,779
+ 1963년 Larry Roberts의 "조각 세계" 라는 특별한 논문이었어요.
+
+255
+00:28:42,779 --> 00:28:49,889
+ Hubel과 Wiesel이 우리 뇌에서 세상을 시각적으로
+
+256
+00:28:49,890 --> 00:29:00,380
+ 선과 같은 구조로 받아들이는 것을 연구한 것처럼
+
+257
+00:29:00,380 --> 00:29:06,350
+ Larry Roberts는 컴퓨터 과학 박사과정 학생으로서
+
+258
+00:29:06,349 --> 00:29:08,980
+ 이러한 선과 같은 구조와 이미지를 공학적으로 추출하려는 노력을 했어요.
+
+259
+00:29:08,980 --> 00:29:16,210
+ 이 특별한 연구의 목표는..
+
+260
+00:29:16,210 --> 00:29:22,210
+ 음 여러분과 저처럼 사람들은 상자가 어떤식으로 변한다해도
+ 우리는 그 상자를 인지할 수가 있죠?
+
+261
+00:29:22,210 --> 00:29:28,009
+ 조명이 변하고 방향이 달라져도
+
+262
+00:29:28,009 --> 00:29:33,019
+ 이 두 상자는 같아요. 그의 요지는
+
+263
+00:29:33,019 --> 00:29:40,720
+ Hubel과 Wiesel이 말한 것처럼 구조를 정의하는 것은
+
+264
+00:29:40,720 --> 00:29:46,419
+ 가장자리의 선들이라는 거죠.
+ 이 선들이 모양을 정하고, 내부의 모든 것들과 다르게 이 선들은 변하지 않습니다.
+
+265
+00:29:46,419 --> 00:29:53,290
+ 그래서, Larry Roberts는 이 선들을 추출하는 것을 주제로 박사 학위 논문을 썼어요.
+
+266
+00:29:53,289 --> 00:29:59,250
+ 아시다시피 지금 컴퓨터 비전의 박사과정 학생에게는
+
+267
+00:29:59,250 --> 00:30:03,990
+ 이 작업은 학사과정의 작업이며 박사논문이 될 수 없었겠지요.
+
+268
+00:30:03,990 --> 00:30:10,210
+ 하지만 이 논문이 최초의 선구자적인 컴퓨터 비전 논문이었습니다.
+
+269
+00:30:10,210 --> 00:30:18,819
+ Larry Roberts는 이후 컴퓨터 비전에 관련된 연구를 포기했습니다.
+
+270
+00:30:18,819 --> 00:30:27,189
+ 그리고는 DARPA에 들어갔죠. 인터넷의 창시자중의 한 명이었습니다.
+
+271
+00:30:27,190 --> 00:30:34,490
+ 컴퓨터 비전을 포기했지만 나쁘지 않은 업적이네요.
+ 현대 학문으로써의 컴퓨터비전의 생일은
+
+272
+00:30:34,490 --> 00:30:43,960
+ 1966년 여름이라고 합니다. 1966년 여름, MIT의
+
+273
+00:30:43,960 --> 00:30:49,548
+ 인공 지능 연구소가 설립되었어요. 그전에 여러분이
+
+274
+00:30:49,548 --> 00:30:55,819
+ Stanford 학생으로써 자부심을 느낄만한 이야기가 있어요.
+
+275
+00:30:55,819 --> 00:31:02,579
+ 세계에서 선구적인 인공 지능 연구실이 두곳이 있습니다.
+
+276
+00:31:02,579 --> 00:31:10,329
+ 1960년대 초에 한 곳은 Marvin Minsky에 의해 MIT에서,
+ 또 하나는 John McCarthy에 의해 Stanford에서 세워집니다.
+
+277
+00:31:10,329 --> 00:31:15,369
+ Stanford에서 인공 지능 연구실은 컴퓨터 과학대학 전에 설립되었어요.
+
+278
+00:31:15,369 --> 00:31:21,479
+ 그리고 인공지능 연구실을 세운 John McCarthy 교수가
+
+279
+00:31:21,480 --> 00:31:22,490
+ 인공지능이라는 말을 만든 것이지요.
+
+280
+00:31:22,490 --> 00:31:26,450
+ 여러분들이 자랑스러워할 Stanford 역사를 조금 이야기 해드렸어요.
+
+281
+00:31:26,450 --> 00:31:31,720
+ 하지만 컴퓨터 비전을 시작한 업적은 MIT에게 돌아갑니다.
+
+282
+00:31:31,720 --> 00:31:41,380
+ 1966년 여름 MIT의 인공지능 연구실 교수는 비전 연구를 시작하기로 결정합니다.
+
+283
+00:31:41,380 --> 00:31:46,630
+ 그로인해 인공지능 연구실이 설립되고 우리는 이런 저런 로직들을 이해할 수 있게 됩니다.
+
+284
+00:31:46,630 --> 00:31:55,010
+ 이것이 아마 그 시간에 발명되었다는 것을 보여준다고 생각해요.
+
+285
+00:31:55,009 --> 00:32:01,109
+ 아무튼 비전은 너무 쉬워요. 눈을 뜨고 세상을 바라보면 됩니다.
+ 이게 어려우면 얼마나 어렵겠어요?
+
+286
+00:32:01,109 --> 00:32:04,109
+ 그러니 이 문제를 여름안에 풀어봅시다! MIT학생들은 똘똘하잖아요?
+
+287
+00:32:04,109 --> 00:32:18,729
+ 그래서 여름 비전 프로젝트는 여름동안의 인력을 효율적으로 사용해서
+ 우리의 시각 시스템의 상당한 부분을 만들어내려는 시도였어요.
+
+288
+00:32:18,730 --> 00:32:24,329
+ 그리고 이 시도가 그 해 여름동안의 계획이었어요.
+ 하지만 아마 그들이 충분히 효율적이지 못했었기 때문일까요?
+
+289
+00:32:24,329 --> 00:32:30,490
+ 컴퓨터 비전의 문제는 그 해 여름안에 해결되지 않았죠.
+
+290
+00:32:30,490 --> 00:32:35,740
+ 하지만 그 이후로 컴퓨터 비전과 인공지능은 가장 빠르게 성장하는 분야가 되었습니다.
+
+291
+00:32:35,740 --> 00:32:43,679
+ 오늘날 CVPR이나 ICCV와 같은 유명한 컴퓨터 비전 컨퍼런스에는
+
+292
+00:32:43,679 --> 00:32:52,160
+ 전 세계적으로 2천에서 2천 5백명이 넘는 연구자들이 참가하고 있어요.
+
+293
+00:32:52,160 --> 00:33:00,620
+ 학생들을 위한 매우 현실적인 이야기를 해보자면
+
+294
+00:33:00,619 --> 00:33:05,369
+ 여러분이 훌륭한 머신러닝 혹은 비전을 공부한 학생들이라면
+ 실리콘 밸리 혹은 어떤 곳에서도 취업걱정할 일은 없을거예요.
+
+295
+00:33:05,369 --> 00:33:11,569
+ 비전은 실제로 가장 흥미로운 분야 중 하나이며 그때가 바로 비전의 생일입니다.
+
+296
+00:33:11,569 --> 00:33:19,210
+ 말인즉슨 올해가 컴퓨터 비전의 탄생의 50주년입니다.
+
+297
+00:33:19,210 --> 00:33:25,829
+ 아주 신나는 연도이며
+
+298
+00:33:25,829 --> 00:33:28,529
+ 지금까지 비전은 참으로 먼 길을 걸어왔어요.
+
+299
+00:33:28,529 --> 00:33:31,660
+ 자 다시 컴퓨터 비전의 역사로 돌아가 봅시다.
+
+300
+00:33:31,660 --> 00:33:38,169
+ 여러분이 기억해야할 사람이 하나 있습니다.
+ David Marr, 그 또한 그 당시 MIT에서
+
+301
+00:33:38,169 --> 00:33:50,240
+ Shimon Ullman, Tommy Poggio와 같은 많은 영향력있는 컴퓨터 비전 연구자들과 작업했습니다.
+
+302
+00:33:50,240 --> 00:33:58,808
+ 그리고 그 자신은 80년대 초에 일찍 세상을 떠났지만 "Vision" 이라는 매우 영향력있는 책을 펴냈어요.
+ 매우 얇은 책이죠.
+
+303
+00:33:58,808 --> 00:34:08,148
+ David Marr가 가진 비전에 대한 생각들에 있어서 그는 신경과학에서 많은 영감을 받았어요.
+
+304
+00:34:08,148 --> 00:34:14,868
+ 우리는 Hubel과 Wiesel이 발견한 단순한 구조에 대한 컨셉을 이미 다루었죠.
+
+305
+00:34:14,869 --> 00:34:16,539
+ 비전은 단순한 구조에서 시작합니다.
+
+306
+00:34:16,539 --> 00:34:23,259
+ 비전은 온전한 물고기나 쥐에서 시작하지 않았어요.
+
+307
+00:34:23,260 --> 00:34:28,679
+ David Marr는 그 다음 가장 중요한 컨셉에 대한 통찰을 합니다.
+ 이 두 가지 통찰이 바로
+
+308
+00:34:28,679 --> 00:34:35,740
+ 딥러닝의 구조의 시작인데 그것은 바로 비전은 계층적이라는 것이죠.
+
+309
+00:34:35,740 --> 00:34:44,029
+ Hubel과 Wiesel은 간단한 구조로부터 시작한다고 했지만, 간단한 구조로 끝난다고 말하진 않았죠.
+ 시각적인 세계는 매우 복잡합니다.
+
+310
+00:34:44,030 --> 00:34:49,540
+ 제가 사진을 찍습니다. 아이폰으로 보통 사진을 오늘 찍어요.
+
+311
+00:34:49,540 --> 00:34:58,309
+ 음.. 정확히 내 아이폰의 해상도는 모르지만 10 메가픽셀이라고 가정합시다.
+
+312
+00:34:58,309 --> 00:35:05,059
+ 그 해상도 안에서 하나의 그림을 구성할 수 있는 픽셀들의 조합개수는
+
+313
+00:35:05,059 --> 00:35:11,429
+ 우주에 존재하는 원자들의 수보다도 많아요.
+ 그만큼 비전은 복잡해질 수가 있어요.
+
+314
+00:35:11,429 --> 00:35:18,539
+ 비전은 정말 정말 복잡합니다.
+ Hubel과 Wiesel은 간단한 구조로 시작하라고 했으며,
+ David Marr는
+
+315
+00:35:18,539 --> 00:35:25,130
+ 계층적인 모델을 만들라고 했어요.
+ 물론 David Marr가 Convolutional Neural Network를 만들라고 하지는 않았죠.
+
+316
+00:35:25,130 --> 00:35:29,400
+ 우리는 나머지 분기동안 그에 대해 다룰 것입니다.
+
+317
+00:35:29,400 --> 00:35:36,990
+ 그의 아이디어는 이렇습니다. 하나의 이미지를 표현하거나 생각을 할 때,
+
+318
+00:35:36,989 --> 00:35:42,129
+ 우리는 여러 계층으로 나누어 생각을 합니다. 그가 생각하는 계층들 중의 첫번째는 선으로 이루어진 이미지입니다.
+
+319
+00:35:42,130 --> 00:35:49,110
+ 분명히 Hubel과 Wiesel에게서 영감을 받았죠.
+
+320
+00:35:49,110 --> 00:35:52,579
+ 그는 개인적으로 이것을 Primal Sketch라고 부릅니다.
+
+321
+00:35:52,579 --> 00:35:55,730
+ 이름이 그 계층에 대해 설명해줍니다.
+
+322
+00:35:55,730 --> 00:36:02,400
+ 그 다음에 우리는 2.5 차원으로 생각을 합니다. 이 계층이 바로
+
+323
+00:36:02,400 --> 00:36:08,829
+ 당신이 2D 이미지를 3D 세상으로 인식하기 시작하는 계층입니다.
+ 당신은
+
+324
+00:36:08,829 --> 00:36:15,679
+ 층들이 존재한다는 것을 인지합니다. 지금 제가 여러분을 바라볼 때,
+ 제가 여러분들이 머리와 목만 있다고
+
+325
+00:36:15,679 --> 00:36:17,239
+ 생각하지는 않아요.
+ +326 +00:36:17,239 --> 00:36:22,799 + 그게 내가이 표시되는 모든 비록 당신이 모든 행에 체결 거 알아 + +327 +00:36:22,800 --> 00:36:29,680 + 당신이 문제의 전면 해결하기 위해 문제를 게시 할 예정입니다 + +328 +00:36:29,679 --> 00:36:38,118 + 자연은 광범위한 차원 이미지의 2D 때문에 해결하기 위해 확률값 반대하는 것으로했다 + +329 +00:36:38,119 --> 00:36:45,210 + 자연은 내 첫 번째 하드 작업 트릭은 우리가 그들이 하나를 사용했던 아이스하는 것을보고 + +330 +00:36:45,210 --> 00:36:49,389 + I하지만 거 야를 돌출하는 괭이 소프트웨어 트릭의 전체 무리가있을 수있어 + +331 +00:36:49,389 --> 00:36:53,868 + 컴퓨터 비전과 같은 일 때문에 두 눈의 형성과 더스 우리 + +332 +00:36:53,869 --> 00:36:59,280 + 그것도를 해결하고 차에 문제가있는 그리고 그들은 결국 우리에게있다 + +333 +00:36:59,280 --> 00:37:03,180 + 우리가 실제로 좋은 3D 모델을 함께 있도록 모든 것을 넣어 + +334 +00:37:03,179 --> 00:37:08,629 + 세계는 왜 우리가 살아남을 가지고 우리가 세계의 3D 모델을해야합니까 + +335 +00:37:08,630 --> 00:37:15,309 + 나는 손을 흔들 때 내가 정말 알아야 할 세계를 조작 이동 + +336 +00:37:15,309 --> 00:37:16,509 + 당신은 알고 어떻게 + +337 +00:37:16,510 --> 00:37:22,320 + 내 손을 외부와가의 3 차원 모델링입니다 올바른 방법을 향하고 잡아 + +338 +00:37:22,320 --> 00:37:26,000 + 세계 그렇지 않으면 나는 때 올바른 방법으로 당신의 머리를 잡아 할 수 없습니다 + +339 +00:37:26,000 --> 00:37:34,219 + 그건 그래서, 그래서 그 데이비드 마르의의 찻잔에게 같은 일을 데리러 + +340 +00:37:34,219 --> 00:37:39,899 + 높은 수준의 추상적 인 아키텍처의 비전 아키텍처 그것을 + +341 +00:37:39,900 --> 00:37:45,490 + 정말 수학적 모델링 정확히 어떤 종류의 정보를 통보하지 않습니다 우리는해야 + +342 +00:37:45,489 --> 00:37:51,439 + 그것은 학습 과정의 정보를 통보하지 않으며, 그들은 정말 않습니다 + +343 +00:37:51,440 --> 00:37:55,599 + 우리는 깊은 학습을 통해에 도착합니다 추론 절차 + +344 +00:37:55,599 --> 00:38:02,759 + 그 단어 아키텍처하지만이 아닌 그의 중요의 높은 수준의보기이다 + +345 +00:38:02,760 --> 00:38:06,250 + 그것은 배울 수있는 중요한 개념이다 + +346 +00:38:06,250 --> 00:38:08,619 + 구상 우리는 이것을 호출 + +347 +00:38:08,619 --> 00:38:16,859 + 표현 정말 중요한 작업 및이에 약간의 물건 첫번째 여행이다 + +348 +00:38:16,860 --> 00:38:25,180 + 다만 즉시이에 대해 생각이 중요한 방법을지도로 보여 + +349 +00:38:25,179 --> 00:38:31,879 + 영상 인식 알고리즘의 첫 번째 물결은 3D 모델 이후 갔다 + +350 +00:38:31,880 --> 00:38:38,280 + 그 오른쪽에 상관없이 같은 목표이기 때문에 어떻게 단계을 나타냅니다 + +351 +00:38:38,280 --> 00:38:45,519 + 여기에 목표는 인식 개체를 복원하는 것입니다이 정말 합리적이다 + +352 +00:38:45,519 --> 00:38:52,380 + 우리는 당신의 일에 그렇게 이들 모두를 세계로 이동 할 때이 있기 때문에 + +353 +00:38:52,380 --> 00:38:58,829 + 팔로 알토 (Palo Alto)에서 유래 합계 41까지로 투자 수익 (ROI) 상투 메에서 그 중 하나는 예전 + +354 +00:38:58,829 --> 00:39:00,440 + 스탠포드 교수 + +355 +00:39:00,440 --> 00:39:05,760 + 나는 그와 그의이 직접 브룩스가 처음으로 11을 제안 사랑 + +356 +00:39:05,760 --> 00:39:10,430 + 살루 모델까지 일반화 된 소위 아니에요거야 세부 사항에 들어가 있지만, + +357 +00:39:10,429 --> 00:39:17,129 + 아이디어는 세상이 같은 간단한 형태로 구성되어 있다는 것입니다 + +358 +00:39:17,130 --> 00:39:23,150 + 블록을 궁금해하고 실제 세계의 객체는이 단지 조합 + +359 +00:39:23,150 --> 00:39:28,340 + 간단한 형태는 특정 느낌을 주어 이동이 매우이었다 + +360 +00:39:28,340 --> 00:39:37,970 + 70 년대 영향력있는 시각적 인식 모델이되기 위해 계속 + +361 +00:39:37,969 --> 00:39:47,239 + MIT 연구소의 이사 그는 또한 아이 로봇 회사 룸바의 창립 멤버였다 + +362 +00:39:47,239 --> 00:39:51,379 + 이 모든 그래서 그래서 그는 매우 영향력을 계속했다 + +363 +00:39:51,380 --> 00:39:56,930 + 나는 일을하고 아무도 흥미로운 모델은 지역에서 오는 + +364 +00:39:56,929 --> 00:40:05,009 + 연구소는 나는 엘 카미노이 인에서 나는 길 건너 본 것 같아요 + +365 +00:40:05,010 --> 00:40:15,260 + 화보 구조 모델은 확률의 차원 맛이 덜하지만 더있다 + +366 +00:40:15,260 --> 00:40:21,570 + 맛은 개체가 여전히 간단한 부분 만들어진 것입니다 + +367 +00:40:21,570 --> 00:40:28,059 + 같은 사람의 머리는 눈, 코 또는 입 만들어진 부품은 CuMn되어 있었다 + +368 +00:40:28,059 --> 00:40:34,679 + 확인 우리의 감각을 받고 일부 변형을 허용 스프링에 의해 행동 + +369 +00:40:34,679 --> 00:40:40,069 + 세계를 인식하지 당신의 모든 하나는 정확히 같은 눈을 가지고 + +370 +00:40:40,070 --> 00:40:45,150 + 눈 사이의 거리 때문에이 드문 변화의 어떤 종류의 수 + +371 +00:40:45,150 --> 00:40:50,450 + 변화의 시작의 개념이 같은 모델에 도입하려면 및 + +372 +00:40:50,449 --> 00:40:56,309 + 이가 너무 나는 당신을 보여주고 싶은 이유를 알고이 같은 모델을 사용하여 + +373 +00:40:56,309 --> 00:41:02,710 + 표시 방법 애타게이 있었던 최악의 단순 가장 영향력있는 중 하나였다 + 
+374 +00:41:02,710 --> 00:41:09,670 + 실제 개체와 전체 용지를 인식하는 80 년대 모델 + +375 +00:41:09,670 --> 00:41:18,900 + 실세계의이 겉보기 사용자이지만 모서리를 사용하여 간단한 + +376 +00:41:18,900 --> 00:41:26,010 + 따뜻한 모양이지만 서로 다른 재료 또는하여이를 인식하는 에지 + +377 +00:41:26,010 --> 00:41:33,980 + 졸업 그건 그래서 그 컴퓨터 비전의 입사 세계의 종류의의 + +378 +00:41:33,980 --> 00:41:39,699 + 바람 것은 흑백 또는 합성 이미지가 시작 본적이되고 + +379 +00:41:39,699 --> 00:41:46,529 + 구십 우리는 마침내 컬러 현실 세계의 이미지를 좋아하는 이동하기 시작하고 + +380 +00:41:46,530 --> 00:41:55,210 + 다시 큰 변화 여기 매우 매우 영향력있는 작업이되지이었다 + +381 +00:41:55,210 --> 00:42:01,150 + 객체를 인식하는 것은 대해 특히에 대해 어떻게은을 개척 좋아합니까 + +382 +00:42:01,150 --> 00:42:08,990 + 당신이이 방을 입력하면 합리적인 부분으로 이미지를 바로 그래서 방법 당신이 없습니다 + +383 +00:42:08,989 --> 00:42:15,559 + 시각 시스템은 단지 그룹이되었다 내가 이렇게 많은 사진을 볼 세상에 당신을 말할 것입니다 + +384 +00:42:15,559 --> 00:42:22,259 + 일이 당신은 헤드 헤드 영토 의자 무대 플랫폼 조각이 참조 + +385 +00:42:22,260 --> 00:42:26,640 + 가구이 가장 오래된에 지각 그룹화 지각이라고 + +386 +00:42:26,639 --> 00:42:28,309 + 나 중 하나로 그룹화 + +387 +00:42:28,309 --> 00:42:34,779 + 우리가하지 않으면 가장 중요한 문제는 생물학적 또는 인공 구상 + +388 +00:42:34,780 --> 00:42:39,420 + 그들은이 지각 그룹핑 문제를 해결하는 방법을 알고 + +389 +00:42:39,420 --> 00:42:46,690 + 정말 하드 시간은 깊이 시각적 세계를 이해하고 단어 수 없습니다 + +390 +00:42:46,690 --> 00:42:53,450 + 정지하지으로 기본이 수업이 과정에 문제의 끝 + +391 +00:42:53,449 --> 00:42:57,859 + 우리는 많은 진전 이전을 한 경우에도 컴퓨터 비전에 해결 + +392 +00:42:57,860 --> 00:43:04,390 + 우리는 여전히 문제로 최종 솔루션을 파악하고 deplaning 이후에 출발하는 + +393 +00:43:04,389 --> 00:43:10,650 + 나는이 소개에서 당신을주고 싶어 왜이 같은 그래서 이것은 다시 I입니다 + +394 +00:43:10,650 --> 00:43:16,950 + 당신은 또한 깊은 문제를 회피하고 당시 알고 있어야하는 그들은 + +395 +00:43:16,949 --> 00:43:22,730 + 에 상기 도전 우리에도 모든 문제를 해결할 수없는 구상 + +396 +00:43:22,730 --> 00:43:29,079 + 올가미는 우리가 개발 터미네이터에서 멀리있는 것처럼 당신이 알고있는이야 어떤 사람 수 + +397 +00:43:29,079 --> 00:43:34,860 + 모든 것을 할 일이 조각이 정상화 컷이라고 있도록 중 하나입니다 무엇인가 + +398 +00:43:34,860 --> 00:43:42,390 + 에 실제 이미지와 시도 걸리는 최초의 컴퓨터 비전 작업 + +399 +00:43:42,389 --> 00:43:52,420 + 고위 컴퓨터 비전 연구원이 교수에 지금 문제를 해결 + +400 +00:43:52,420 --> 00:43:56,000 + 버클리 또한 스탠포드 졸업 + +401 +00:43:56,000 --> 00:44:01,989 + 결과는 내가이 클래스의 모든 침전을 포함하지 않습니다 좋은하지 않습니다 + +402 +00:44:01,989 --> 00:44:08,459 + 당신이 보는 곳에서 우리는 진전이 있지만, 이것은의 시작입니다 + +403 +00:44:08,460 --> 00:44:15,510 + 불러 및 지불 내가하고 싶은 또 다른 매우 캐주얼 작업이 원하는 + +404 +00:44:15,510 --> 00:44:22,410 + 비록이 작품에 대한 찬사 우리는의 나머지 부분을 커버하지 않는 + +405 +00:44:22,409 --> 00:44:26,679 + 물론하지만 난 당신이 될 꽤 중요한 비전 학생이 생각 + +406 +00:44:26,679 --> 00:44:31,199 + 이 알고 있기 때문에뿐만 아니라 우리가 원하는 중요한 문제를 소개합니다 + +407 +00:44:31,199 --> 00:44:36,730 + 그것을 해결하는 것은 또한 당신에게 필드하자의 발전의 관점을 제공합니다 + +408 +00:44:36,730 --> 00:44:40,480 + 작업이 호출 빌라 존스 얼굴 검출기 + +409 +00:44:40,480 --> 00:44:46,030 + 이 때문에 대학원생 신선한​​ 대학원 학생으로 내 마음을 매우 사랑이다 + +410 +00:44:46,030 --> 00:44:51,650 + 칼 테크에서 그것은 내가 대학원생 때과 같이 첫 번째 논문의 하나 + +411 +00:44:51,650 --> 00:44:56,150 + 나는이 내 고문 그것에 대해 아무것도 모르는 내가 실험실까지 + +412 +00:44:56,150 --> 00:45:02,090 + 당신은 우리 모두가 그들을 이해하려는 알고 작품의 놀라운 조각 + +413 +00:45:02,090 --> 00:45:08,690 + 내가 셀틱 졸업 시간이 매우 작품은 처음에 전달된다 + +414 +00:45:08,690 --> 00:45:16,510 + 이 최초의 디지털 카메라와 같은 2006 년에 후지 필름에 의해 스마트 디지털 카메라 + +415 +00:45:16,510 --> 00:45:22,390 + 보기의 얼굴 검출기 지금까지 내 이송 펌프 기술 이전 시점이되었다 + +416 +00:45:22,389 --> 00:45:28,789 + 매우 빠르고 시각적 첫 번째 성공적인 높은 수준의 일이 있었다 + +417 +00:45:28,789 --> 00:45:35,849 + 소비자 제품에 사용되고 인식 알고리즘 그래서 그냥 작업 할 수 + +418 +00:45:35,849 --> 00:45:41,059 + 얼굴을 감지 배운다 더 이상 빨리 당신이 알고 함께 야생에서 직면하지 + +419 +00:45:41,059 --> 00:45:47,920 + 시뮬레이션 그들은 매우 이들은 비록 모든 사진 및 있습니다 고안되어 + +420 +00:45:47,920 --> 00:45:53,329 + 그는 깊은 학습 풍미를 많이 갖는 깊은 학습 네트워크를 사용하지 않은 + +421 +00:45:53,329 --> 00:46:01,179 + 기능은 기능을 간단한 
기능을 찾을 수있는 알고리즘을 배운다을 알게되었다 + +422 +00:46:01,179 --> 00:46:06,919 + 당신이 우리에게 가장 좋은 줄 수있는이 흑백 필터 기능 등 + +423 +00:46:06,920 --> 00:46:14,639 + 얼굴의 현지화 그래​​서 이것은 하나의 작품의 매우 영향력있는 작품이다 + +424 +00:46:14,639 --> 00:46:24,679 + 컴퓨터를 배포하고 실제 로밍 할 수있는 첫 번째 컴퓨터 영상 작품의 + +425 +00:46:24,679 --> 00:46:31,019 + 그 비교 알고리즘 전에 시간은 용지가 실제로 매우 느린했다 + +426 +00:46:31,019 --> 00:46:36,699 + 이 부여 된 실시간 얼굴 인식이라고 나는 알고하지 않는 팁에 그를 보내 + +427 +00:46:36,699 --> 00:46:41,409 + 사람이 칩의 종류를 기억하지만 느린 채팅 아니었지만 그럼에도 불구하고 + +428 +00:46:41,409 --> 00:46:48,569 + 그것은 또 다른 매우 중요한 예술 작품도 한 번 더 있었다 실시간으로 실행 + +429 +00:46:48,570 --> 00:46:53,380 + 일이 유일한 일없는이시기에 지적하는 + +430 +00:46:53,380 --> 00:46:59,170 + 그러나 이것은 금주 모임 정말 좋은 표현 모랄레스 시간의 초점이다 + +431 +00:46:59,170 --> 00:47:06,250 + 컴퓨터 비전은 미스터을했습니다 기억 이동하고있다 + +432 +00:47:06,250 --> 00:47:14,699 + 작업 초기에 지금 우리가하고있는 세에게 물체의 형상을 모델링하기 위해 노력했다 + +433 +00:47:14,699 --> 00:47:23,439 + 우리가 정말 개체에 대한 약간이 무엇인지 인식을 할 수 있습니다 이동 + +434 +00:47:23,440 --> 00:47:27,400 + 이러한 단계를 재구성 여부를 컴퓨터 비전의 전체 분기있다 + +435 +00:47:27,400 --> 00:47:34,200 + 그래픽은 그 작업을 계속 단계하지만 컴퓨터 비전의 큰 부분은 아니다 + +436 +00:47:34,199 --> 00:47:38,730 + 세기의 전환기 주위에이 시간에 인식에 초점을 맞추고있다 + +437 +00:47:38,730 --> 00:47:47,539 + 즉, 컴퓨터 비전과 오늘에게 가장 중요한 부분을 가져이다 + +438 +00:47:47,539 --> 00:47:55,480 + 컴퓨터 비전 작업이 인식 등이인지 질문을 집중하고, + +439 +00:47:55,480 --> 00:47:57,369 + 내가 질문 + +440 +00:47:57,369 --> 00:48:06,150 + 작품의 또 다른 매우 중요한 부분 때문에 주위의 기능에 초점을 시작 + +441 +00:48:06,150 --> 00:48:12,950 + 사람들이 그것을 실현하기 시작 얼굴 인식의 시간은 정말 열심히 정말 + +442 +00:48:12,949 --> 00:48:19,829 + 난 그냥 말했듯이 모든 일을 설명하여 객체를 인식하는 당신은 내가 알고 + +443 +00:48:19,829 --> 00:48:25,960 + 너희가 많이 있었다 나는 당신의 몸통 I의 나머지 부분을 볼 수 없습니다 결론을 내렸다 참조 + +444 +00:48:25,960 --> 00:48:31,690 + 정말 첫 번째 행에에 다리 중 하나를 볼 수 있지만 나는 당신을 인식하지 않고 + +445 +00:48:31,690 --> 00:48:39,230 + 나는 그래서 어떤 사람들은 그녀가 이것이 재미 실현하기 위해 시작 개체로 전나무 당신을 애 + +446 +00:48:39,230 --> 00:48:44,240 + 정말 글로벌 형상은 지금 우리가 물체를 인식하기 위해 후 가야 + +447 +00:48:44,239 --> 00:48:50,319 + 우리는 중요한 기능을 우리가 할 수있는 객체를 인식하는 경우 아마이 기능의 + +448 +00:48:50,320 --> 00:48:53,090 + 먼 길을 가서 많은 이해 + +449 +00:48:53,090 --> 00:48:57,930 + 당신이 밖으로 것을 인식 할 필요가 없습니다 당신을 사냥하는 경우 진화에 대해 생각 + +450 +00:48:57,929 --> 00:49:03,909 + 모양 호랑이 몸 전체는 몇이 알고 도망 할 필요가 결정하는 + +451 +00:49:03,909 --> 00:49:06,588 + 관통 호랑이의 첫 번째 패치 + +452 +00:49:06,588 --> 00:49:12,679 + 우리가 빨리 듣고 필요가 너무 있도록 충분히 아마 멋진 팔을 잎 + +453 +00:49:12,679 --> 00:49:16,429 + 의사 결정 야구의 버전은 정말 빠르다 + +454 +00:49:16,429 --> 00:49:22,308 + 이에 의해 이동 비용을 부담해야하므로이 많은 온라인 중요한 기능을 발생 + +455 +00:49:22,309 --> 00:49:28,539 + 데이비드 낮은 다시 다시 그 이름을보고 중요한 중요한 학습에 관한 것입니다 + +456 +00:49:28,539 --> 00:49:34,009 + 객체에 기능과 당신은 단지 몇 이러한 중요한 기능을 배우면 + +457 +00:49:34,009 --> 00:49:38,400 + 당신이 할 수있는 개체에 대한 그들 실제로 완전히에이 객체를 권장합니다 + +458 +00:49:38,400 --> 00:49:45,548 + 다른과 교훈을 유지하도록하도록 징수 복잡 장면에 이동 + +459 +00:49:45,548 --> 00:49:54,880 + 약 10 년 전 필드의 2010 년 또는 2012 년 연구 선거 + +460 +00:49:54,880 --> 00:50:00,229 + 컴퓨터 비전에 모델을 구축하기 위해 이러한 기능을 사용하는 방법에 초점을 맞추고 있었다 + +461 +00:50:00,228 --> 00:50:05,538 + 객체 및 장면을 인식하고 우리는 우리가 먼 길을 갔어요 훌륭한 일을 했어 + +462 +00:50:05,539 --> 00:50:12,609 + 깊은 그 단어를 배우는 이유 중 하나는 더 이상 설득력이되었다 + +463 +00:50:12,608 --> 00:50:17,690 + 많은 사람들이 우리가 볼 수있는 기능은 그 깊은 학습이 그 + +464 +00:50:17,690 --> 00:50:22,880 + 학습자는 화려한하여 이러한 설계 기능과 매우 유사 + +465 +00:50:22,880 --> 00:50:30,229 + 엔지니어는 필요한 경우 우리가 그들을 필요 알지 심지어 종류의 확인 있도록 + +466 +00:50:30,228 --> 00:50:34,929 + 아래 먼저이 일을 갖추고 있으며, 우리는 더 나은 개발을 시작하는 우리에게 얘기를 + +467 +00:50:34,929 --> 00:50:38,978 + 수학적 모델은 그 자체로 이러한 기능을 배울 수 있지만 확인 + +468 +00:50:38,978 --> 00:50:46,210 + 서로 너무 너무 
역사적 당신은 안이 작품의 중요성을 알고 + +469 +00:50:46,210 --> 00:50:52,028 + 감소 된이 작품은 우리의 지적 기반의 하나입니다 + +470 +00:50:52,028 --> 00:50:57,858 + 우리의 지적 기반은 실현하기 위해 얼마나 중요한지 또는 얼마나 유용한 지 그 + +471 +00:50:57,858 --> 00:51:07,018 + 이러한 깊은 학습 기능은 우리가 그들을 배울 경우 그냥 간단히 때문에 말할 수 있습니다 + +472 +00:51:07,018 --> 00:51:12,379 + 이 기능의 저와 다른 많은 연구자들은 우리가 사용할 수없는 우리에게 + +473 +00:51:12,380 --> 00:51:18,239 + 그 장면 인식과 그 시간 기계 학습 주위를 배울합니다 + +474 +00:51:18,239 --> 00:51:24,719 + 도구는 우리가 주로 사용하거나 그래픽 모델 또는 지원 벡터 기계와 + +475 +00:51:24,719 --> 00:51:29,479 + 이 하나의 영향력 작업 지원 벡터 기계와 대령을 사용하여에 + +476 +00:51:29,478 --> 00:51:43,358 + 모델 2222은 일을 인식하지만 난 여기에 간단한과 마지막 깊은 학습 모델이 될 수 있습니다 + +477 +00:51:43,358 --> 00:51:50,578 + 이 기능 또는 기능 야구라는 변형 부분은 왈도입니다입니다 우리 + +478 +00:51:50,579 --> 00:51:57,420 + 사람의 일부처럼 개체의 일부를 배우고 우리는 그들이 그림을 오는 방법 + +479 +00:51:57,420 --> 00:52:08,519 + 공간에서 서로 소득 그림에 모델을 지원 벡터 머신의 종류를 사용 + +480 +00:52:08,518 --> 00:52:16,179 + 2009 년의이시기에 인간과 병 같은 객체를 인식 + +481 +00:52:16,179 --> 00:52:21,419 + 2010 년 컴퓨터 비전 분야는 우리가 최선을 다하고 충분히 성숙 + +482 +00:52:21,420 --> 00:52:25,659 + 중요한 심장이 아마 보행자 인식과 + +483 +00:52:25,659 --> 00:52:30,828 + 더 이상 인위적인 문제가 뭔가있어 차를 인식하지 것은 다른 사람이었다 + +484 +00:52:30,829 --> 00:52:37,219 + 때문에 우리가하지 않으면 지금 진행 필드로 부분적으로 자신의 벤치가 필요 + +485 +00:52:37,219 --> 00:52:44,039 + 좋은 벤치 마크는 다음 모두가 이미지 집합 느낌과 정말 열심히 정말 + +486 +00:52:44,039 --> 00:52:50,369 + 가장 중요한 기준 중 하나가 통과 목표라고 있도록 글로벌 표준을 설정 + +487 +00:52:50,369 --> 00:52:57,608 + V OC는 물체 인식 벤치 일부는 유럽의 노력 유럽 생물의 그 + +488 +00:52:57,608 --> 00:53:04,190 + 연구진은 20 종류의 이미지 수만에 의해 함께 넣어 + +489 +00:53:04,190 --> 00:53:13,019 + 광학 및이 고양이처럼 하나의 예 개체 당 기준 요금은 소 영화를 숭배하지 + +490 +00:53:13,018 --> 00:53:17,808 + 고양이는 소 비행기 병 개 + +491 +00:53:17,809 --> 00:53:20,048 + 말 훈련 + +492 +00:53:20,048 --> 00:53:27,268 + 다음 더스와 우리는 매년 우리의 컴퓨터 비전 연구자를 사용 + +493 +00:53:27,268 --> 00:53:34,948 + 그리고 바퀴는 최고의 여자 객체에 대한 모든 물체 인식 작업을 경쟁 올 + +494 +00:53:34,949 --> 00:53:41,188 + 당신이 년을 통해처럼 알고 과거를 통해 인식 문제와 + +495 +00:53:41,188 --> 00:53:47,949 + 성능은 계속 증가하고 우리가 느끼기 시작할 때 + +496 +00:53:47,949 --> 00:53:52,929 + 그 때의 필드의 진행 흥분 + +497 +00:53:52,929 --> 00:53:59,729 + 여기에 가까운 우리에게 더 가까이 이야기를 통해 조금이다 그건 그 내 사랑 내 + +498 +00:53:59,728 --> 00:54:05,718 + 학생들은 현실 세계가 진짜 약 20 개체를 알고 생각했다 + +499 +00:54:05,719 --> 00:54:12,489 + 세상은 그렇게 파스코 시각의 작품 다음 작은 20 개 이상의 광학입니다 + +500 +00:54:12,489 --> 00:54:18,239 + 물체 인식 문제는 우리가 함께이 방대한 대규모 프로젝트를 넣어 + +501 +00:54:18,239 --> 00:54:23,889 + 여러분 중 일부는이 클래스에서 당신이 될 것 이미지 들었을 수 있다는 이미지 + +502 +00:54:23,889 --> 00:54:30,098 + 이미지의 작은 부분을 사용하여 해당 과제 그 이미지의 일부에 + +503 +00:54:30,099 --> 00:54:36,759 + 그 모두가 내 손을 청소하고 5 천만 이미지의 데이터 세트입니다 + +504 +00:54:36,759 --> 00:54:47,000 + 그것을 청소 학생들에게 20,000 객체 클래스를 주석 + +505 +00:54:47,000 --> 00:54:54,469 + 내 삶의 다양한 영역의 습관의 크라우드 소싱 플랫폼을 제거 + +506 +00:54:54,469 --> 00:54:59,969 + 당신이 함께이 플랫폼이 퍼팅 알고에서 글래디스는 고통을 몰라 그 + +507 +00:54:59,969 --> 00:55:08,599 + 그러나 그것은 매우 흥미로운 일 우리가 함께 넣어하기 시작 시작되지 않습니다이다 + +508 +00:55:08,599 --> 00:55:15,900 + 대회는 매년 이미지라고 그 물체 인식을위한 경쟁 + +509 +00:55:15,900 --> 00:55:22,440 + 예를 들어 이모 겐에 의한 영상 분류의 표준 경쟁은은이다 + +510 +00:55:22,440 --> 00:55:28,710 + 거의 150 만 이미지와 알고리즘을 통해 천 개체 클래스에 경쟁 + +511 +00:55:28,710 --> 00:55:34,220 + 성능은 그래서 사실 난 그냥 소셜 미디어에 있던 사람이 들었어요 + +512 +00:55:34,219 --> 00:55:38,589 + 나는 매우했다 컴퓨터 비전의 올림픽 도전 이미지 참조 + +513 +00:55:38,590 --> 00:55:40,240 + 유망한 + +514 +00:55:40,239 --> 00:55:55,649 + 그 도전 2010 그래서 그렇게 사람들을 역사에 우리가 가까이 가져 + +515 +00:55:55,650 --> 00:56:00,369 + 그 시간 패스 주위에 실제로 것은 어디 동료를 알아가는 사람들 + +516 +00:56:00,369 --> 00:56:05,309 + 그들은 우리가 직면 그래서 20 개체의 자신의 문제를 단계적으로 
폐지하는 거 처음이야 우리에게 말했다 + +517 +00:56:05,309 --> 00:56:12,039 + 천 개체의 이미지에 도전하는 이유는 에러율에 액세스 + +518 +00:56:12,039 --> 00:56:18,199 + 우리는 매우 중요한 오류와 함​​께 시작에 우리는 시작 물론 당신은 알고있다 + +519 +00:56:18,199 --> 00:56:28,029 + 매년 감소하지만 특히 세 정말 감소가 그 + +520 +00:56:28,030 --> 00:56:38,960 + 올해는 뜨거운 거의 IS 2012 2012입니다 절단 한 것 승리 아키텍처 + +521 +00:56:38,960 --> 00:56:45,769 + 이미지 그 문제는 내가 말할 것 네트워크의 회선이었다 + +522 +00:56:45,769 --> 00:56:53,250 + 그것은 어떻게 모든 새로운 스피커의 느낌에도 불구하고 2012 년에 발명되지 않았습니다 대해 + +523 +00:56:53,250 --> 00:56:58,190 + 이 블록 주위에 새로운 일이처럼 그것은 다시 발명되었다 아니에요 + +524 +00:56:58,190 --> 00:56:59,349 + 칠십 년대와 80 년대 + +525 +00:56:59,349 --> 00:57:05,279 + 그는 그러나 사물의 융합에 회선에 대해 이야기합니다 보내고 당신의 + +526 +00:57:05,280 --> 00:57:10,519 + 네트워크는 대용량으로 그 거대한 힘을 보여 주었다 훈련을 종료 + +527 +00:57:10,519 --> 00:57:18,219 + 큰 차이로 제치고 그였다 이미지 아키텍처와 왕 + +528 +00:57:18,219 --> 00:57:24,829 + 보기의 AA 수학적 관점에서 매우 역사적인 순간이 아니었다 그것은 + +529 +00:57:24,829 --> 00:57:30,079 + 볼이 내 엔지니어링 전에 새로운 및 해결 실제 포인트가 + +530 +00:57:30,079 --> 00:57:35,090 + 당신이 많은 알고에 의해 작품의 조각이 덮여있는 역사적 순간이었다 + +531 +00:57:35,090 --> 00:57:42,400 + 시간이 모든 문제는이 발병입니다 학습의 시작입니다 + +532 +00:57:42,400 --> 00:57:48,869 + 혁명 당신은 그것을 호출이 이것 때문에이 클래스의 전제 인 경우 + +533 +00:57:48,869 --> 00:57:54,609 + 우리가 컴퓨터의 간단한 역사를 통해 갔다, 그래서 나는 거 스위치 해요 가리 + +534 +00:57:54,610 --> 00:57:59,539 + 540,000,000년 비전 + +535 +00:57:59,539 --> 00:58:05,869 + 이 클래스의 개요 다른 질문이 있습니다 + +536 +00:58:05,869 --> 00:58:13,969 + 우리가 많이 얘기 종류의 압도적 이었지만 확실히 그래서 우리는 심지어 얘기 + +537 +00:58:13,969 --> 00:58:20,559 + 컴퓨터 비전에서 다른 작업을 찾는 것에 대해 31에 보인다 초점을 맞출 것입니다 + +538 +00:58:20,559 --> 00:58:27,849 + 시각적 인식 문제에도 특히 대부분의를 통해 확대 + +539 +00:58:27,849 --> 00:58:29,509 + 기초 강좌 + +540 +00:58:29,510 --> 00:58:35,750 + 우리가 얘기 분류하지만 지금은 당신이 알고있는 모든 것을 할 거입니다 + +541 +00:58:35,750 --> 00:58:41,480 + 우리는 우리가 다른 얻고 있었다됩니다 설정 분류 그 이미지를 기반으로 + +542 +00:58:41,480 --> 00:58:47,900 + 시인성 시나리오이지만, 화상 분류 문제이다 메인 + +543 +00:58:47,900 --> 00:58:52,780 + 명심하십시오 의미 우리는 엠마의 클래스에 초점을 맞출 것이다 문제 + +544 +00:58:52,780 --> 00:58:56,600 + 시각적 인식은 바로 3 차원 거기에 그냥 이미지 분류되지 않습니다 + +545 +00:58:56,599 --> 00:59:01,339 + 모델링이 분할의 그룹이었다하고 있지만,이 모든입니다 + +546 +00:59:01,340 --> 00:59:06,250 + 그건 우리가에 초점을 맞출 것이다 나는 미스 당신이 그냥 전화도 할 필요가 없습니다 것 + +547 +00:59:06,250 --> 00:59:11,000 + 애플리케이션 현명한 이미지 분류는 매우 유용 문제 + +548 +00:59:11,000 --> 00:59:17,929 + 당신은 큰 큰 상업적인 인터넷 기업들에게 관점을 알고부터 + +549 +00:59:17,929 --> 00:59:22,449 + 시작 아이디어 당신은 당신이 인식 할 객체를 인식 할 알고 + +550 +00:59:22,449 --> 00:59:29,119 + 음식은 이동할 수 있도록 당신이 우리에게 고문 앨범을 원하는 온라인 상점 모바일 쇼핑을 + +551 +00:59:29,119 --> 00:59:35,710 + 분류 소식은 많은 많은에 대한 생계 작업이 될 수있다 + +552 +00:59:35,710 --> 00:59:44,650 + 중요한 문제 두 분류와 관련이있어 문제가있다 + +553 +00:59:44,650 --> 00:59:49,329 + 오늘은 당신이 차이를 이해하는 기대하지 않는다 그러나 나는 듣고 싶어 + +554 +00:59:49,329 --> 00:59:55,659 + 이 클래스가 있는지 확인 반면 것을 당신은의 미묘한 차이를 이해하는 법을 배워야 + +555 +00:59:55,659 --> 01:00:01,879 + 시각적 인식의 다른 맛의 세부 내용 이미지 + +556 +01:00:01,880 --> 01:00:07,700 + 분류는 영상 자막이 그리고 이것들이 가지고있는 물체 감지 무엇 + +557 +01:00:07,699 --> 01:00:14,529 + 그는이 분류를 만든 예를 들어 다른 맛을 알고 내 + +558 +01:00:14,530 --> 01:00:19,740 + 로 전체의 큰 이미지 객체 검출에 초점 곳 가지를 알려 + +559 +01:00:19,739 --> 01:00:23,579 + 정확히 차가 보행자입니다 같다 + +560 +01:00:23,579 --> 01:00:30,159 + 망치와 단어가 등등 객체와의 관계 + +561 +01:00:30,159 --> 01:00:35,529 + 이 클래스에 대해 학습한다 그들의 뉘앙스 및 세부 사항을 사회 + +562 +01:00:35,530 --> 01:00:43,840 + 나는 이미 CNN 말했다 또는 네트워크의 연합은 깊이의 한 종류입니다 + +563 +01:00:43,840 --> 01:00:50,910 + 아키텍처하지만 계획 아키텍처 압도적으로 성공이고 + +564 +01:00:50,909 --> 01:00:54,909 + 이것은 우리가 집중되고 단지로 다시 이동합니다 아키텍처 + +565 +01:00:54,909 --> 01:01:02,849 + 이미지 9 도전은 그래서 
역사적 년이는 연도 2012 인 + +566 +01:01:02,849 --> 01:01:14,349 + 나는 그것이 칠 생각 제프 힌튼이이 길쌈을 제안 소풍 있습니다 + +567 +01:01:14,349 --> 01:01:20,500 + 이전 모델에 도전 이미지를 승리 네트워크 길쌈 층 + +568 +01:01:20,500 --> 01:01:22,318 + 올해 + +569 +01:01:22,318 --> 01:01:30,548 + 기능을 SIFT 플러스 벡터 머신 아키텍처 그것은 여전히​​ 계층 구조를 지원 + +570 +01:01:30,548 --> 01:01:38,449 + 하지만 두의 맛을 가지고 2015 년에 앞으로 빠른 학습하지 않습니다 + +571 +01:01:38,449 --> 01:01:43,798 + 승리 아키텍처는 아직도 당신이 그것의 걱정하지 않은 결론이다 + +572 +01:01:43,798 --> 01:01:56,599 + 사냥꾼 (51) 층은 마이크로 소프트 아시아 연구소 연구원을 구입 구입하고 그것은 분명 + +573 +01:01:56,599 --> 01:02:03,048 + 이유가 잔류 잔류 그래서 커버에 대한 확신 아니에요 + +574 +01:02:03,048 --> 01:02:09,369 + 그 확실히 실제로 무엇을 하나 하나 층 알고 기대하지 않습니다 + +575 +01:02:09,369 --> 01:02:17,269 + 그들은 2012 우승 구조 때문에 마음 만 매년 자체를 반복 + +576 +01:02:17,268 --> 01:02:23,548 + 이미지의 그 문제는 내가 같은 깊은 학습 기반의 아키텍처입니다 + +577 +01:02:23,548 --> 01:02:32,369 + 나는 또한 당신이 역사는 발명되지 존중하고 싶은 것을하는 것은 하룻밤 많이있다 + +578 +01:02:32,369 --> 01:02:37,979 + 오늘하지만 당신이 알고있는 영향력있는 선수의 구축 많은 사람들이있다 + +579 +01:02:37,978 --> 01:02:41,879 + 기초 사실 나는 슬라이드를 기억해야 할 한 가지 중요한 일이 없습니다 + +580 +01:02:41,880 --> 01:02:50,910 + 쿠니히코 후쿠시마 contigo 솔루션은 구축 일본의 과학자했다입니다 + +581 +01:02:50,909 --> 01:02:58,798 + 모델 corneil 홍콩 트럭 및 그 새로운 네트워크의 시작 + +582 +01:02:58,798 --> 01:03:04,318 + 건축과 노란색 색상도 매우 영향력있는 사람이며 그는 정말 + +583 +01:03:04,318 --> 01:03:10,248 + 젊은 쿠데타의 내 의견에 혁신적인 작업에 출판되었다 + +584 +01:03:10,248 --> 01:03:16,348 + 그래서 19 구십 한 수학자의 한 제프 힌튼 + +585 +01:03:16,349 --> 01:03:22,479 + 포함 된 모든 항목을 포함 고문은 다시 전파 학습을했다 + +586 +01:03:22,478 --> 01:03:28,088 + 아래에 아무것도 삭제이 있다면 전략은 몇 당신을 말할 것이다 + +587 +01:03:28,088 --> 01:03:34,528 + 주하지만,하지만, 수학적 만도는 80 년대 거칠게하고, + +588 +01:03:34,528 --> 01:03:34,920 + 그만큼 + +589 +01:03:34,920 --> 01:03:40,869 + 속옷이 있었다 그것이 AT & T 벨 연구소에서 근무하고 해당 지역의 + +590 +01:03:40,869 --> 01:03:47,160 + 그 때 놀랄만한 장소들이 있다고 더 이상 오늘날에는 보석 UPS가 없습니다 + +591 +01:03:47,159 --> 01:03:50,949 + 정말 야심 찬 프로젝트를 진행하고 그는 숫자를 인식하는 데 필요한 + +592 +01:03:50,949 --> 01:03:57,019 + 심지어 미국의 게시물에 우리의 가방에 제공된 해당 제품을 떠나 있기 때문에 + +593 +01:03:57,019 --> 01:04:03,380 + 사무실은 어렵고 검사 및 변태 건설하는 연합을 인식하는 + +594 +01:04:03,380 --> 01:04:08,068 + 그는 그가 HUBEL과 위젤 그는 영감을 어디 네트워크에서이입니다 + +595 +01:04:08,068 --> 01:04:14,500 + 일부 수영장에서 찾고 의해 시작은 구조와 같은 가장자리와 이미지는이 마음에 들지있어 + +596 +01:04:14,500 --> 01:04:20,099 + 전체 편지 정말 가장자리에 필요가있어 팔과 계층 별 층 + +597 +01:04:20,099 --> 01:04:25,539 + 이 가장자리를 끌어 필터들 함께 풀을 필터링 한 다음 필드이 + +598 +01:04:25,539 --> 01:04:36,230 + 아키텍처 20121 알렉스 kruschev 스키와 제프 힌튼하면 거의 정확하게 일 + +599 +01:04:36,230 --> 01:04:40,900 + 아키텍처는 차에 참여 + +600 +01:04:40,900 --> 01:04:47,900 + 몇 가지 변경이 도전을 상상하지만 승리가 될 + +601 +01:04:47,900 --> 01:04:54,920 + 이 아키텍처는 그래서 우리는 lib 디렉토리의 세부 사항 변경에 대한 자세한 말씀 드리죠 + +602 +01:04:54,920 --> 01:05:02,380 + 거기에 무어의 법칙이 우리를 도왔 때문에 용량 모델은 조금 증가했다 + +603 +01:05:02,380 --> 01:05:08,220 + 대한 모양의 약간의 변화도 매우 매우 상세한 기능 + +604 +01:05:08,219 --> 01:05:14,828 + 대부분의 Signori (224)는 그 형태하지만 무엇에 파일뿐만 몇있다 + +605 +01:05:14,829 --> 01:05:19,130 + 큰 아무것도에 의해 정말 작은 변화 만 변경했다 + +606 +01:05:19,130 --> 01:05:26,490 + 수학적하지만 중요한 것은 변화 않았고 그 깊은 학습을 성장 + +607 +01:05:26,489 --> 01:05:35,379 + 그 르네상스 하나에 Architektur 검정 잉크 승무원은 음식물의 한 입처럼이며, + +608 +01:05:35,380 --> 01:05:41,180 + 이들은 매우 높은 높은 있기 때문에 하드웨어 하드웨어는 큰 차이를 만들어 + +609 +01:05:41,179 --> 01:05:44,669 + 용량 모델 일 델라 크루즈 + +610 +01:05:44,670 --> 01:05:50,720 + 때문에 계산의 병목이 고통스럽게 느린 그는 할 수 없었다 + +611 +01:05:50,719 --> 01:05:55,209 + 그래서 당신은 큰 수없는 완벽에 추가 할 수는 없지만이 모델에게 너무 큰 구축 + +612 +01:05:55,210 --> 01:06:00,670 + 15 이상 거기에 기계 학습의 관점에 대한 잠재력을 실현하고 + +613 +01:06:00,670 --> 01:06:07,780 + 이러한 모든 문제는 당신도하지만 
지금 우리는 훨씬 더 빨리 더 큰 트랜지스터를 가질 수있다 + +614 +01:06:07,780 --> 01:06:16,410 + 엔비디아에서 트랜지스터 마이크로 칩 및 GPU는 깊은에 큰 차이를 만들어 + +615 +01:06:16,409 --> 01:06:22,358 + 우리가 지금 적당한 양의 모델을 연수생 수있는 학습의 역사 + +616 +01:06:22,358 --> 01:06:27,358 + 그들은 거대한이고 다른 사람들이 우리가 밖으로 데리고해야합니까 생각하는 경우에도 시간 + +617 +01:06:27,358 --> 01:06:37,159 + 작품 자체가 그냥있는 빅 데이터했던이었다 데이터의 데이터 가용성이다 + +618 +01:06:37,159 --> 01:06:41,078 + 그것은 아무것도 의미하지 않는다 알고는 있지만 그것을 사용하는 방법을 모르는 경우 + +619 +01:06:41,079 --> 01:06:45,869 + 깊은 학습 Architektur 데이터 고용량 구동력 될 + +620 +01:06:45,869 --> 01:06:52,390 + 모델은 뭐하는 교육을 활성화하는 진정한 진정한 진정한 도움 피하기 overfitting 때 + +621 +01:06:52,389 --> 01:06:57,608 + 당신은 픽셀 수를 보면 당신은 그래서 당신을 알 수 있도록 당신은 충분한 데이터를 가지고 그 + +622 +01:06:57,608 --> 01:07:05,639 + 기계 학습 사람들은 나선형 그것이 거대 1998를 가진 대 2012 년에 있었다 + +623 +01:07:05,639 --> 01:07:06,469 + 차이 + +624 +01:07:06,469 --> 01:07:14,469 + 크기의 주문은 그래서 그래서 그래서이 (231)의 초점이었다 + +625 +01:07:14,469 --> 01:07:21,098 + 뿐만 아니라 갈 것입니다 오, 내가이 생각을 침을 흘리고있어 중요한 마지막이기도 + +626 +01:07:21,099 --> 01:07:27,048 + 나는 어떤을 원하지 않는 시각적 지능 물체 인식을 넘어 않습니다 + +627 +01:07:27,048 --> 01:07:31,039 + 이 과정에서 나오는 것은 우리는 당신이 우리가했습니다 알고있는 모든 일을했습니다 딩키 + +628 +01:07:31,039 --> 01:07:38,889 + 시각적 인식의 전체 공간을 비행 할 도전은이 사실이 아니에요 + +629 +01:07:38,889 --> 01:07:44,460 + 여전히 멋진 많은 문제가 당신이 알고 예를 들어 해결하는 것은 라벨링을한다 + +630 +01:07:44,460 --> 01:07:51,650 + 모든 단일 픽셀이 속한 곳 지각 그룹과 전체 장면은 그래서 나는 알고있다 + +631 +01:07:51,650 --> 01:07:52,329 + 에 + +632 +01:07:52,329 --> 01:07:56,900 + 그 여전히 함께 지속적인 문제입니다 + +633 +01:07:56,900 --> 01:08:02,740 + 3 차원으로 인식이 정말 흥분을 많이가 거기 무슨 일이 일어나고입니다 + +634 +01:08:02,739 --> 01:08:09,349 + 비전과는이이 로봇의 교차로는 그 확실히 하나의 영역입니다 + +635 +01:08:09,349 --> 01:08:15,039 + 다음 아무것도 국경의 움직임과와와 함께 할 수있는이 또 다른입니다 + +636 +01:08:15,039 --> 01:08:33,289 + 당신은 그냥 거 이상으로 알고 연구 작업의 큰 개방 영역은 당신이 실제로 원하는 노래 + +637 +01:08:33,289 --> 01:08:35,689 + 깊이 승자를 이해 + +638 +01:08:35,689 --> 01:08:39,489 + 어떤 사람들이하고있는 것은 서양의 개체 사이의 관계 무엇인가 + +639 +01:08:39,489 --> 01:08:45,029 + 객체 사이의 상기 관계에 RD와이 진행중인 프로젝트 + +640 +01:08:45,029 --> 01:08:49,759 + 학생들의 수는 단지 내 무릎에 시각적 게놈이라고 + +641 +01:08:49,760 --> 01:08:55,739 + 관련이 지금까지 우리가에 대한 이야기​​ 잡초의 이미지 분류 넘어 + +642 +01:08:55,739 --> 01:09:03,639 + 우리의 거룩한 Grails에의 한 것입니다 사회 이것의 성배의 일 동안 + +643 +01:09:03,640 --> 01:09:09,260 + 바로 그래서 인간으로 당신에 대해 생각하는 장면의 이야기를 할 수 있어야합니다 + +644 +01:09:09,260 --> 01:09:11,180 + 당신은 당신의 눈을 열어 + +645 +01:09:11,180 --> 01:09:17,840 + 당신은 당신이 할 수있어 눈을 뜨고 순간 당신이 실제로 무엇을보고 설명하기 + +646 +01:09:17,840 --> 01:09:24,940 + 심리학 실험은 우리는 당신이 사람들에게 단에 사진을 표시하는 경우에도 발견 + +647 +01:09:24,939 --> 01:09:30,659 + 말 그대로 두 번째 사람의 절반이다 오백 밀리 자 + +648 +01:09:30,659 --> 01:09:36,769 + 그들은하지 않았다, 그래서 그것에 대해 에세이를 작성 우리는 그들에게 $ 시간당 10을 지불 + +649 +01:09:36,770 --> 01:09:42,410 + 그것은 그 길지 않았다하지만 우리는 더 많은 돈을 이야기하면 당신은 내가 그림을 알고 그들이 + +650 +01:09:42,409 --> 01:09:47,970 + 아마 더 이상 윤리를 작성하지만 요점은 우리의 시각 시스템이 있다는 것입니다 수 + +651 +01:09:47,970 --> 01:09:54,390 + 매우 강력한 내 셀입니다 우리는 이야기를하고 나는이 꿈을 꿀 것 + +652 +01:09:54,390 --> 01:10:02,560 + 논문을 옷을 위해 우리는 당신이 당 컴퓨터 한 그림을 제공주고 있고 + +653 +01:10:02,560 --> 01:10:03,960 + 결과 + +654 +01:10:03,960 --> 01:10:09,159 + 이 같은 설명 당신은 당신이이 줄이 표시됩니다 내가 거기에 도착하고 있었다 알고 + +655 +01:10:09,159 --> 01:10:15,149 + 크메르어 올림픽 TUR 당신에게 한 문장을 제공하거나 하나의 선택이 켜져 수가 줄 + +656 +01:10:15,149 --> 01:10:20,319 + 하지만 짧은 문장으로 우리는 아직 여기 아니에요하지만 홀더 중 하나 + +657 +01:10:20,319 --> 01:10:26,250 + 블루와 다른 들고 성장 내가 생각이이 계속이 작업을 계속하고있다 + +658 +01:10:26,250 --> 01:10:33,659 + 오드리의 블로그가 정말 잘 요약하면 바로 다음과 같이 알고있다 + +659 +01:10:33,659 --> 01:10:42,300 + 당신이뿐만 아니라 즐길 얻을이 그림이 너무 많은 뉘앙스를 정제 + +660 +01:10:42,300 --> 
01:10:47,890 + 전역은 매우 지루한 오래된 컴퓨터가 당신이 말할 수있을 것입니다 그것을 추구 인식 + +661 +01:10:47,890 --> 01:10:53,650 + 객실에 객실 규모 + +662 +01:10:53,649 --> 01:10:58,238 + 그것의 사물함에서 어떤 유형의 당신은 당신이 인식 여기 알고있는 그들은 + +663 +01:10:58,238 --> 01:11:00,569 + 트릭을 인식하고 있습니다 + +664 +01:11:00,569 --> 01:11:06,009 + 오바마 대통령은 당신이 유머를 인식 상호 작용의 종류를 인식 할 것입니다 + +665 +01:11:06,010 --> 01:11:11,250 + 너무 많이 알고,이 세상의 하나입니다 우리에 관한 것입니다 거기에 인식 + +666 +01:11:11,250 --> 01:11:18,719 + 때뿐만 아니라 탐색 생존하는 경향이 시각 간호사에게 우리의 능력을 사용 + +667 +01:11:18,719 --> 01:11:26,000 + 재생 그러나 우리가 세계를 이해하기 위해 즐겁게 교제하는 데 사용 + +668 +01:11:26,000 --> 01:11:32,929 + 모든 책의 비전은 비전의 목표를 읽을 곳은 그래서 나는 점이다 + +669 +01:11:32,929 --> 01:11:39,630 + 우리의 세계 a를 만들 것이다 당신에게 해당 컴퓨터 시각 기술을 설득 할 필요가 없습니다 + +670 +01:11:39,630 --> 01:11:46,550 + 당신이 집 심지어하지만 알고 거기 어떤 무서운 이야기에도 불구하고 더 나은 곳으로 + +671 +01:11:46,550 --> 01:11:51,029 + 업계에서 오늘뿐만 아니라 연구의 세계 우리가 컴퓨터를 사용하는 + +672 +01:11:51,029 --> 01:11:58,349 + 더 나은 로봇을 구축하는 비전은 이제 분석을 탐험 깊이 갈 생명을 저장합니다 + +673 +01:11:58,350 --> 01:12:02,860 + 확인 그래서 나는 35 분 왼쪽 무엇 이분처럼이 + +674 +01:12:02,859 --> 01:12:10,839 + 좋은 시간은 컬러 강사가 나를 팀과 정의를 소개하자 + +675 +01:12:10,840 --> 01:12:16,989 + 내가 그에게 인사 일어 서서하시기 바랍니다 그래야 될지도 + +676 +01:12:16,989 --> 01:12:22,639 + 당신은 빨리이 안전하게 이름을 좋아하고 당신은 그냥 포기하지 않는 것 같아 수 있습니다 + +677 +01:12:22,640 --> 01:12:49,180 + 연설 그러나 예 + +678 +01:12:49,180 --> 01:13:42,240 + 사람들의 집단 소송이 우리를 도와 때문에 처리하기 때문에 사람이 존중 + +679 +01:13:42,239 --> 01:14:04,739 + 기밀 개인적인 문제가 다시 나는 우리의 조건에 예정하고 떠날거야하지만, + +680 +01:14:04,739 --> 01:14:09,939 + 월 말부터 몇 주 사회 당신은 당신을 결정하십시오 + +681 +01:14:09,939 --> 01:14:15,379 + 당신 같은 사람이 그들이 취할 것입니다하지 않는 한 나에게 이메일을 보내려면 + +682 +01:14:15,380 --> 01:14:20,770 + 그것에 대해 내가 답장 가능성이있어 당신을 즉시 죄송합니다 + +683 +01:14:20,770 --> 01:14:25,420 + 우선 순위 + +684 +01:14:25,420 --> 01:14:34,739 + 우리의 철학에 대한 그리고 우리는 우리가 진정으로 원하는 세부에 도착하지 않는 + +685 +01:14:34,738 --> 01:14:39,448 + 이것은이 정말 내가 신용을 많이하는 줄 매우 실제적인 프로젝트를 할 수 + +686 +01:14:39,448 --> 01:14:46,419 + 저스틴과 앙드레는 이러한 실습을 통해 걷기에 매우 좋은 + +687 +01:14:46,420 --> 01:14:51,840 + 이 클래스 나올 때와 세부 있도록뿐만 아니라 I에게 있습니다 + +688 +01:14:51,840 --> 01:14:57,719 + 이해를 사랑하지만 당신이 가지고 당신은 구축 할 수있는 정말 좋은 능력을 가지고 + +689 +01:14:57,719 --> 01:15:02,010 + 자신의 깊은 학습 코드 우리는 당신이 예술의 상태로 노출 할 + +690 +01:15:02,010 --> 01:15:08,730 + 당신이거야 일을 학습 할 수있는 재료 정말 2015로 신선한 그리고 그것은거야 + +691 +01:15:08,729 --> 01:15:11,859 + 이 같은 일을하는 재미를 얻을 수 + +692 +01:15:11,859 --> 01:15:18,960 + 아니 모든 시간이 있지만, 하나의 목표 나 또는이 이상으로 시간과 같은 사진 + +693 +01:15:18,960 --> 01:15:27,489 + 일이 모든 중요한 작업에 추가하여 재미 클래스를 알 수있을 것입니다 당신이 당신 + +694 +01:15:27,488 --> 01:15:33,589 + 당신은 우리가 이러한 다른 웹 사이트에 있습니다 등급을 매기는 정책을 수행 배우 + +695 +01:15:33,590 --> 01:15:44,929 + 다시 사람들을 식당 하나 당신이 좋아하는 성장 업을 성장 명확 + +696 +01:15:44,929 --> 01:15:51,989 + 우리가 과정의 끝에서 아무것도하지 않는 어른들 내 교수 원하는이다 + +697 +01:15:51,988 --> 01:15:56,359 + 날이 회의에 가서 내가 세처럼이 있어야 더 늦게 그들은 아무 말도 + +698 +01:15:56,359 --> 01:16:03,630 + 당신은 당신이 7 늦게 사용할 수있는 사용에 대한 책임 총 팔일 있습니다 + +699 +01:16:03,630 --> 01:16:11,079 + 어떤 방법으로 당신은 모든 10 처벌은 벌금을해야 모든 일 + +700 +01:16:11,079 --> 01:16:18,069 + 정말 정말 뛰어난 의료 가족 비상 같다 + +701 +01:16:18,069 --> 01:16:21,799 + 개별 기준으로하지만, 무엇에 우리 이야기 + +702 +01:16:21,800 --> 01:16:29,539 + 회의는 왜 다른 사람은 마침내 당신이 누락 고양이 또는 무엇처럼 알고 + +703 +01:16:29,539 --> 01:16:37,850 + 우리는 우리가 우리가 칠일에이 하나의 또 다른 자신의 명예 감기 그 예산을 책정 우리 + +704 +01:16:37,850 --> 01:16:43,190 + 내가 가진 것은 당신이 그런 권한이 있습니다 정말 진지한 얼굴로 대답 + +705 +01:16:43,189 --> 01:16:50,710 + 기관 당신은 당신이 당신이 명예에 대한 책임하려는 어른들되어 있습니다 + +706 +01:16:50,710 --> 01:16:55,239 + 코드이 수업을 하나 하나 Stampfer 학생은 다른 알아야 + +707 +01:16:55,239 --> 
01:16:58,619 + 공동 당신이 변명이 없다하지 않으면 당신은 돌아 가야한다 + +708 +01:16:58,619 --> 01:17:04,840 + 매우 심각하게 나는 거의 통계적으로 그런 말을 싫어 협력을 기다립니다 + +709 +01:17:04,840 --> 01:17:10,380 + 주어진 계급 큰 단어 알라는 몇 가지 경우가 있지만 나는 또한 당신이되고 싶어요 + +710 +01:17:10,380 --> 01:17:16,210 + 심지어 크기와 뛰어난 클래스이 큰 우리는 무엇을보고 싶어하지 않는 + +711 +01:17:16,210 --> 01:17:22,399 + 대학 명예 코드를 침해하므로 협력 정책과 위험을 읽을 수는 있지만 + +712 +01:17:22,399 --> 01:17:31,960 + 이것은 정말 당신이 할 수있는 모든 습득 조건으로 생각하는 자신을 존중하는 것을 + +713 +01:17:31,960 --> 01:17:38,149 + 당신은 어떤 굽기가 내가 말하고 싶은 무슨 상관 읽을 수 + +714 +01:17:38,149 --> 01:17:47,569 + 당신이 예 물어 가치가 느끼는 질문 + +715 +01:17:47,569 --> 01:18:06,689 + 그래 + diff --git a/captions/Ko/Lecture2_ko.srt b/captions/Ko/Lecture2_ko.srt new file mode 100644 index 00000000..ba230f61 --- /dev/null +++ b/captions/Ko/Lecture2_ko.srt @@ -0,0 +1,2948 @@ +1 +00:00:00,000 --> 00:00:03,750 + 우리는 좋은처럼 기록 단지 당신을 다시 생각 나게하는 + +2 +00:00:03,750 --> 00:00:08,160 + 당신이 카메라에 불편 말하기를하는 경우, 그래서 안녕하세요 가장 가까운 기록 + +3 +00:00:08,160 --> 00:00:15,929 + 아니 그림에 있지만 음성은 당신이 수 확인 위대한 기록에있을 수 있습니다 + +4 +00:00:15,929 --> 00:00:19,589 + 참조 또한 화면은해야보다 넓은 내가 그것을 해결하는 방법을 잘 모르겠어요 + +5 +00:00:19,589 --> 00:00:21,300 + 함께 사는 열심히 + +6 +00:00:21,300 --> 00:00:25,269 + 가능성이 시각 피질 그래서 스트레칭에 아주 좋은 아주 불변이다 + +7 +00:00:25,268 --> 00:00:26,118 + 이것은 문제가되지 않는다 + +8 +00:00:26,118 --> 00:00:32,259 + 우리는 클래스로 다이빙을하기 전에 확인 그래서 일부 관리 것들로까지 무엇이야 + +9 +00:00:32,259 --> 00:00:36,100 + 첫 임무는 월을하다 오늘 밤 또는 이른 내일 나올 것입니다 + +10 +00:00:36,100 --> 00:00:41,289 + 20 정확히 2 주 당신이 분류 이전 분류를 작성하는 것이 + +11 +00:00:41,289 --> 00:00:44,159 + 작은 두 계층 신경망 당신의 전체를 작성 할 수 있습니다 + +12 +00:00:44,159 --> 00:00:47,979 + 22 층 신경 네트워크의 역 전파 알고리즘을 모두 충당 할 수 + +13 +00:00:47,979 --> 00:00:54,459 + 2 주 아침에 재료에 의해 일부는 마지막에서가 + +14 +00:00:54,460 --> 00:00:57,350 + 그들이하지 기쁘게 있도록 올해도 우리는 할당을 변경하고 + +15 +00:00:57,350 --> 00:01:02,890 + 과에 대한주의해야 할 뭔가 2,015 할당에 완료하여 + +16 +00:01:02,890 --> 00:01:07,109 + 경쟁은하지만 파이썬과 파이를 사용하는 것 또한 제공됩니다 + +17 +00:01:07,109 --> 00:01:11,030 + 에서 기본적으로 가상 머신 인 것입니다 터미널 닷컴 + +18 +00:01:11,030 --> 00:01:13,939 + 당신이 그렇게에 아주 좋은 노트북을 가지고하지 않을 경우 사용할 수있는 클럽 + +19 +00:01:13,938 --> 00:01:17,250 + 그것에 대해 세부 사항으로 이동하지만 난 그냥 제에 대한 것을 지적하고자 + +20 +00:01:17,250 --> 00:01:21,090 + 할당 우리는 당신이거야 파이썬 비교적 잘 알고있을거야 가정 + +21 +00:01:21,090 --> 00:01:24,859 + 어디 조작에이 최적화 된 NumPy와 식을 작성한다 + +22 +00:01:24,859 --> 00:01:28,438 + 이 행렬과 벡터 예를 들어 당신이 있다면, 그래서 매우 효율적인 형태 + +23 +00:01:28,438 --> 00:01:31,908 + 이 코드를보고 그 다음 당신에게 아무것도 의미하지 않는 것은 봐주세요 + +24 +00:01:31,909 --> 00:01:35,880 + 이 저스틴에 의해 작성된 것뿐만 아니라 웹 사이트에 가입 우리의 파이썬 튜토리얼에서 + +25 +00:01:35,879 --> 00:01:39,489 + 매우 좋은이며, 그래서 통과하고 익숙해 + +26 +00:01:39,489 --> 00:01:42,328 + 표기법 당신과 같은 코드를 많이 작성을 볼 수있을 것이기 때문에 + +27 +00:01:42,328 --> 00:01:47,048 + 그들은 충분히 빨리 실행하는 것, 그래서 우리가하고있는 곳이이 작업을 최적화하는 모든 + +28 +00:01:47,049 --> 00:01:51,610 + CPU에서 지금은 전체의 관점에서이 있다는 것입니다 금액 기본적으로 무엇을 + +29 +00:01:51,609 --> 00:01:54,599 + 당신에게 할당에 대한 링크를 제공합니다 당신은 웹 페이지로 이동합니다 당신은 볼 수 있습니다 + +30 +00:01:54,599 --> 00:01:58,309 + 이 같은 일이 설정되어 클라우드에서 가상 머신이다 + +31 +00:01:58,310 --> 00:02:01,420 + 과제의 모든 종속성까지 그들은 모두 이미 설치되어있는 + +32 +00:02:01,420 --> 00:02:05,618 + 데이터가 이미 존재하고에 그래서 당신은 점심 시스템에서 클릭하고이는거야 + +33 +00:02:05,618 --> 00:02:09,580 + 기본적으로이 이런 일에 당신을 데려 동생을 실행하고 + +34 +00:02:09,580 --> 00:02:13,060 + 이것은 기본적 AWS 위에 얇은 층 + +35 +00:02:13,060 --> 00:02:17,209 + 기계 여기 그래서 UI 층에는 노트북과 조금 아이팟이 + +36 +00:02:17,209 --> 00:02:20,739 + 터미널 당신은 주위에 갈 수있는이 클라우드에 그냥 기계처럼이며, + +37 +00:02:20,739 --> 00:02:24,310 + 그래서 그들은 어떤 CPU 제품을 가지고 그들은 또한 당신이 할 수있는 일부 GPU 시스템을 + +38 +00:02:24,310 --> 00:02:25,539 
+ 그래서 사용 + +39 +00:02:25,539 --> 00:02:29,090 + 일반적으로 단말기 비용을 지불해야하지만, 그래서 당신에게 크레딧을 분배한다 + +40 +00:02:29,090 --> 00:02:33,709 + 당신은 단지 당신이 TA에 이메일을 보내 비트에 결정할 것이다 따 특정 손실과 + +41 +00:02:33,709 --> 00:02:36,950 + 돈을 요구하는 것은 당신에게 돈을 보내 우리는 우리가로 전송 얼마나 많은 돈을 추적합니다 + +42 +00:02:36,949 --> 00:02:40,799 + 모든 사람들은 그래서 당신은 그래서 이것이 자금과 책임을 가지고 + +43 +00:02:40,800 --> 00:02:55,689 + 또한 옵션은 모든 세부 사항 당신이 읽을 수 수 있습니다 확인처럼 사용하기 위해 + +44 +00:02:55,689 --> 00:02:57,680 + 당신은 당신의 의견이 필요하지있어 좋아하는 경우 + +45 +00:02:57,680 --> 00:03:03,879 + 하지만 당신은 아마 주위에서 얻을 수 그래 좋아 샘이 일어날 것을 말한다 + +46 +00:03:03,879 --> 00:03:07,870 + 강의는 이제 오늘 우리가 이야기 할 것입니다있는 분류 및 특수 것 + +47 +00:03:07,870 --> 00:03:13,219 + 우리는 분류의 기본을 이야기하도록 선형 분류에 시작 + +48 +00:03:13,219 --> 00:03:17,560 + 작업은 우리가 범주의 일부 수 있도록 개 고양이 트럭 평면 또는 말을해야한다는 것입니다 + +49 +00:03:17,560 --> 00:03:20,799 + 우리는 그 다음이 무엇인지를 결정하는 얻을에 이미지를 촬영하기 이전 요청 + +50 +00:03:20,799 --> 00:03:24,950 + 어떤 숫자의 거대한 품종이며, 우리는이 중 하나에 변환해야 + +51 +00:03:24,949 --> 00:03:29,169 + 라벨 우리는이 문제가 지출 범주 중 하나에 그것을 구축해야 + +52 +00:03:29,169 --> 00:03:32,548 + 우리의 대부분의 시간은 구체적으로이 일에 대해 이야기하지만 당신은 하나를 수행하려는 경우 + +53 +00:03:32,549 --> 00:03:36,349 + 이러한 검출 이미지 캡처 어떤 분할 등의 컴퓨터 비전에서 다른 작업 + +54 +00:03:36,349 --> 00:03:40,108 + 또는 어떤 다른 당신은 찾을 그가 분류 방법에 대해 알고 나면 그 + +55 +00:03:40,109 --> 00:03:43,569 + 다른이 이루어집니다 모든 그래서 당신이 알 수있을 것입니다 그것의 상단에 내장 단지 작은입니다 + +56 +00:03:43,568 --> 00:03:47,060 + 이 개념에 대한 정말 좋은, 그래서 좋은 위치는 다른 작업을 수행 할 수 + +57 +00:03:47,060 --> 00:03:50,840 + 이해하고 우리는 간단하게 구체적인 예로서 그 통해 작동합니다 + +58 +00:03:50,840 --> 00:03:54,819 + 처음에 일이 지금 왜이 문제는 하드 그냥 생각를 제공한다 + +59 +00:03:54,818 --> 00:03:58,518 + 문제는 우리가 거​​대한 여기에 의미 론적 차이로이 이미지를 참조 할 것입니다 + +60 +00:03:58,519 --> 00:04:01,739 + 숫자의 격자 이미지가 컴퓨터에 표시되는 방식이 있다는 + +61 +00:04:01,739 --> 00:04:06,299 + 세 오 기본적으로 가속화 세에 의해 약 300 백으로 말 + +62 +00:04:06,299 --> 00:04:09,620 + 적색, 녹색, 청색 세 가지 색상 채널에서 차원 배열과 열로 + +63 +00:04:09,620 --> 00:04:13,590 + 그래서 그 이미지의 일부를 확대 할 때 기본적으로 거대한 중대하다 + +64 +00:04:13,590 --> 00:04:18,728 + 0에서 255 사이의 숫자 것은 그래서 우리는이 숫자와 함께 일해야 무엇 + +65 +00:04:18,728 --> 00:04:21,370 + 밝기의 양마다 모든 세 개의 컬러 채널을 나타낸다 + +66 +00:04:21,370 --> 00:04:25,569 + 단일 이미지의 위치 및 임의의 사양이되도록 이유 + +67 +00:04:25,569 --> 00:04:26,269 + 어려운 + +68 +00:04:26,269 --> 00:04:29,519 + 당신은 우리가 수백만처럼 괜찮은 작업해야 것에 대해 생각할 때 + +69 +00:04:29,519 --> 00:04:33,899 + 그 형태의 번호와 가지고는 신속하게 고양이 등을 분류하기 + +70 +00:04:33,899 --> 00:04:38,339 + 태스크의 복잡성은 명백 해졌다 그래서 예를 들면 카메라가 될 수있다 + +71 +00:04:38,339 --> 00:04:42,689 + 이 고양이를 중심으로 회전하며 확대 할 수 있고, 아무것도를 이동하지 않았다 + +72 +00:04:42,689 --> 00:04:46,769 + 초점 속성과 카메라가 다른 수행과에 대해 생각할 수있는 트랜스 액슬 + +73 +00:04:46,769 --> 00:04:49,769 + 무슨 일이 밝기 값으로 발생하고 실제로 모든 할만큼 큰 + +74 +00:04:49,769 --> 00:04:52,779 + 카메라와 함께 이러한 변환은 완전히 모든 패턴이 출시 될 예정이다 + +75 +00:04:52,779 --> 00:04:56,559 + 변경하고 우리는이 모두에 강력한 될 수 있습니다 많은 다른있다 + +76 +00:04:56,560 --> 00:05:00,709 + 예를 들어, 요금에 대한 문제는 여기에 조명까지 우리는 긴 고양이가 + +77 +00:05:00,709 --> 00:05:07,728 + 흰 고양이는 우리가 실제로 그 두 가지를 가지고 있지만 한 고양이가 넘어 당신은 볼 수 있습니다 + +78 +00:05:07,728 --> 00:05:11,098 + 명확하게 그것을 꽤 만든, 다른 하나는 아니지만 여전히 인식 할 수 + +79 +00:05:11,098 --> 00:05:14,750 + 두 고양이 등의 수준에 대해 다시 밝기 계곡을 생각한다 + +80 +00:05:14,750 --> 00:05:18,329 + 그는 모든 다른 모든 것들을 변화와 같이 그리드 무엇을 그들에게 발생 + +81 +00:05:18,329 --> 00:05:21,279 + 우리가 세상에서 가질 수있는 가능한 조명 방식은 견고하기 + +82 +00:05:21,279 --> 00:05:28,179 + 모두에게 많은 클래스를 형성 떨어져 이상한 많은 문제가 있음 + +83 +00:05:28,180 --> 00:05:33,668 + 이러한 개체의 배열은 매우 오는 캐스트 그렇게 인식하고 싶습니다 + +84 +00:05:33,668 --> 00:05:37,468 + 슬라이드와 다른 포즈 난 그들이 거기에 아주 건조이야을 만들 때 + +85 +00:05:37,468 --> 00:05:41,449 + 즉, 그래서 수학이 과학의 
많은 내가 재미를 얻을 수있는 유일한 시간이다 내가 + +86 +00:05:41,449 --> 00:05:45,939 + 이러한 긍정의 모든 강력한으로 발생 그냥 어떻게 든 모든 + +87 +00:05:45,939 --> 00:05:50,189 + 당신은 여전히​​ 자신의 문제에도 불구하고 고양이이 모든 이미지를 인식 할 수 있습니다 + +88 +00:05:50,189 --> 00:05:54,240 + 그래서 가끔 우리는 원양가 표시되지 않을 수 있습니다하지만 당신은 여전히​​ 그건 인식 + +89 +00:05:54,240 --> 00:06:00,340 + 고양이 물 한 병 뒤에 고양이 택시는 소파 내부가 또한있다 + +90 +00:06:00,339 --> 00:06:06,068 + 도 기본적으로 거기에이 클래스의 10 개 조각을보고있는 것처럼 + +91 +00:06:06,069 --> 00:06:10,500 + 배경 혼란에 문제가 일들이 우리가 가지고있는 환경에 혼합 할 수 있습니다 + +92 +00:06:10,500 --> 00:06:15,300 + 그에게 상기시켰다 그래서 고양이 실제로도 내 수준의 변화 거기에있다 + +93 +00:06:15,300 --> 00:06:19,728 + 이 고양이 단지 종의 엄청난 양이다 그래서 그들은 다르게 보일 수 있습니다 + +94 +00:06:19,728 --> 00:06:23,240 + 나는 그래서 모두에게 당신의 상사와 방법은 감사하는 것처럼 + +95 +00:06:23,240 --> 00:06:26,718 + 우리는 이러한 독립적 중 하나 고려 작업의 복잡성은 어렵다 + +96 +00:06:26,718 --> 00:06:31,908 + 당신이 모든 다른 것들의 크로스 제품을 고려하고 그러나이 + +97 +00:06:31,908 --> 00:06:35,769 + 아무것도 전혀 작동하는지 실제로 아주 놀라운 것을 모두에서 작동합니다 + +98 +00:06:35,769 --> 00:06:39,539 + 사실 그것은 작동하지만 거의 여기 정말 잘 작동 않습니다뿐만 아니라, + +99 +00:06:39,540 --> 00:06:43,740 + 이 같은 카테고리의 정확성과 우리는 수십 밀리 초 단위로이 작업을 수행 할 수 있습니다 + +100 +00:06:43,740 --> 00:06:49,040 + 그래서 현재의 기술과 함께 그는이 클래스에 대한 자세한 내용은 무엇입니까 + +101 +00:06:49,040 --> 00:06:54,390 + 기본적으로 우리는 우리가하고 싶은 영역을 통해이 복용하고 같은 분류보기 + +102 +00:06:54,389 --> 00:06:57,539 + 클래스 레이블을 생성하고 내가 원하는 때 그는 더 있다는 것을 눈치없는거야 + +103 +00:06:57,540 --> 00:07:01,569 + 확실한 방법까지 실제로 인코딩하고 정액이 분류의이 권리 + +104 +00:07:01,569 --> 00:07:04,790 + 일찍 클래스에 모두 좋은 복용하는 말처럼 간단한 알고리즘은 없다 + +105 +00:07:04,790 --> 00:07:08,379 + 컴퓨터 과학 교육 과정 당신의 쓰기 거품 정렬 또는 당신이 뭔가를 작성하는 + +106 +00:07:08,379 --> 00:07:11,939 + 당신은 모든 가능한 단계에와 당신을 직관적으로 할 수 있습니다 특정 작업을 수행 할 다른 + +107 +00:07:11,939 --> 00:07:15,300 + 그것들을 열거하고이를 수 있습니다 여기에 함께 연주하고 그것을 분석 할 수 있지만, + +108 +00:07:15,300 --> 00:07:18,530 + 어떤 알고리즘이 모든 변화에서 고양이를 검출 없다 그것의있다 + +109 +00:07:18,529 --> 00:07:21,509 + 는 IS 당신이 실제로을 작성하는 방법에 대해 생각하기가 매우 어렵습니다 + +110 +00:07:21,509 --> 00:07:26,039 + 작업의 순서는의 고양이를 감지하는 임의의 이미지를 할 것 + +111 +00:07:26,040 --> 00:07:28,629 + 사람들이 시도하지 않은 말을하지 특히​​ 초기이 컴퓨터 만 + +112 +00:07:28,629 --> 00:07:32,719 + 나는 당신이 생각하는 경우 그들에게 전화하고 싶습니다 이러한 명시 적 접근이 있었다 + +113 +00:07:32,720 --> 00:07:37,240 + 내가 말할 수 없습니다에 대한 괜찮 그는 당신이 작은 귀를 찾아 만나고 싶을 것입니다 + +114 +00:07:37,240 --> 00:07:40,910 + 우리가 무엇을 할 거 야 그래서 개는 우리가 울트라 iso는 뜻을 가장자리 모든 가장자리를 감지 할 수 있습니다입니다 + +115 +00:07:40,910 --> 00:07:45,380 + 가장자리의 서로 다른 특성을 분류하고 그들의 접합 당신이 알고 만듭니다 + +116 +00:07:45,379 --> 00:07:48,350 + 우리가 보면 시즌 라이브러리는 자신의 준비를 찾으려고 할 것이다 + +117 +00:07:48,350 --> 00:07:52,150 + 우리는 몇 가지의 특정 질감을 볼 고양이를 검출 할 것 같은 것을 + +118 +00:07:52,149 --> 00:07:55,899 + 당신이 어떤 규칙을 가지고 올 수있는 특정 주파수는 고양이를 공격합니다 + +119 +00:07:55,899 --> 00:07:59,870 + 하지만 문제는 내가 좋아 말해 일단 내가 사실을 인식하고 싶습니다이다 + +120 +00:07:59,870 --> 00:08:03,569 + 보트 지금 또는 당신이 좋아처럼 다시 드로잉 보드에 아직 갈 사람 + +121 +00:08:03,569 --> 00:08:06,719 + 어떤 원본 페이지를 잘 완전히의 정확히 보트를 만든다 + +122 +00:08:06,720 --> 00:08:11,590 + 이 클래스를 삭제 압력으로 기소로 확장 접근과 + +123 +00:08:11,589 --> 00:08:16,699 + 우리가에 원하는 데이터 중심의 접근 방식으로 매우 잘 작동 방식 + +124 +00:08:16,699 --> 00:08:20,170 + 기계 학습의 프레임 워크는 단지 지적 것과 실제로 이러한 일 + +125 +00:08:20,170 --> 00:08:23,840 + 초기에 그들은이에 데이터 때문에 사용의 사치를하지 않았다 + +126 +00:08:23,839 --> 00:08:27,060 + 포인트는 시간에 당신은 매우 낮은 해상도의 당신의 그레이 스케일 이미지를 가지고있어 + +127 +00:08:27,060 --> 00:08:30,250 + 당신이 분명히 작동하지 않을거야 것을 인식하려고의 이미지 + +128 +00:08:30,250 --> 00:08:33,769 + 하지만 데이터의 인터넷 엄청난 양의 가용성과 나는를 검색 할 수 있습니다 + +129 +00:08:33,769 --> 00:08:38,460 + Google에서 고양이 예를 들어 나는 모든 곳에서 고양이를 많이 얻고 우리는 알고 + +130 +00:08:38,460 --> 00:08:42,840 + 거기에 많은 그래서 
이런 웹 페이지의 주변 텍스트 덕분에 그것이 고양이 사진임을 알 수 있으니까요.
+
+131
+00:08:42,840 --> 00:08:46,060
+ 그래서 이제 데이터가 중심이 되는 그림은 이렇습니다. 먼저 훈련 단계가 있어서
+
+132
+00:08:46,059 --> 00:08:49,079
+ 여러분이 저에게 고양이의 훈련 예시를 잔뜩 주면서
+
+133
+00:08:49,080 --> 00:08:52,900
+ 이것들이 고양이라고 알려 주고, 관심 있는
+
+134
+00:08:52,899 --> 00:08:54,230
+ 다른 모든 범주에 대해서도 예시를 많이 줍니다.
+
+135
+00:08:54,230 --> 00:08:59,920
+ 그러면 저는 그 데이터로 각 클래스의 모델을 훈련시키고, 그
+
+136
+00:08:59,919 --> 00:09:04,250
+ 모델로 실제로 새로운 데이터를 분류합니다. 새 이미지가 주어지면
+
+137
+00:09:04,250 --> 00:09:07,500
+ 제 훈련 데이터를 참조해서, 패턴
+
+138
+00:09:07,500 --> 00:09:13,759
+ 매칭이나 통계에 기반한 무언가를 할 수 있는 것이죠. 이 틀 안에서 동작하는 간단한 예로
+
+139
+00:09:13,759 --> 00:09:17,279
+ 최근접 이웃(nearest neighbor) 분류기를 생각해 봅시다. 이 분류기가
+
+140
+00:09:17,279 --> 00:09:20,939
+ 동작하는 방식은 사실상 이렇습니다. 훈련 세트가 주어지면
+
+141
+00:09:20,940 --> 00:09:23,970
+ 훈련 시간에는 그냥 모든 훈련 데이터를 통째로 기억합니다.
+
+142
+00:09:23,970 --> 00:09:27,820
+ 훈련 데이터가 들어오면 전부 기억해 두었다가, 테스트
+
+143
+00:09:27,820 --> 00:09:32,060
+ 이미지를 주면, 그 테스트 이미지를 우리가 가진 모든
+
+144
+00:09:32,059 --> 00:09:36,729
+ 훈련 이미지와 하나하나 비교해서, 라벨을 그대로 가져옵니다.
+
+145
+00:09:36,730 --> 00:09:41,149
+ 즉 모든 훈련 이미지를 훑어보고 가장 비슷한 것의 라벨을 전달하는 식이죠.
+
+146
+00:09:41,149 --> 00:09:43,740
+ 이것을 최대한 구체적으로 만들어 보겠습니다.
+
+147
+00:09:43,740 --> 00:09:47,740
+ 오늘은 CIFAR-10이라는 특정 데이터셋을 예로 들겠습니다.
+
+148
+00:09:47,740 --> 00:09:53,129
+ 이 데이터셋에는 라벨이 붙은 50,000장의 훈련 이미지가 있고
+
+149
+00:09:53,129 --> 00:09:57,159
+ 분류기가 얼마나 잘 동작하는지 평가할 10,000장의 테스트 세트가 있습니다.
+
+150
+00:09:57,159 --> 00:10:00,669
+ 이 이미지들은 아주 작습니다.
+
+151
+00:10:00,669 --> 00:10:05,009
+ 32×32 크기의 작은 썸네일 이미지들이죠. 최근접 이웃 분류기는
+
+152
+00:10:05,009 --> 00:10:07,809
+ 주어진 훈련 이미지 전부,
+
+153
+00:10:07,809 --> 00:10:12,589
+ 즉 5만 장을 그냥 기억해 둡니다. 여기 10개의 서로 다른 테스트 예시가 있다고 해 보죠.
+
+154
+00:10:12,590 --> 00:10:15,920
+ 첫 번째 열을 따라 놓인 것이 테스트 이미지들이고, 우리가 할 일은
+
+155
+00:10:15,919 --> 00:10:19,909
+ 훈련 세트에서 가장 가까운 이웃, 즉 가장 비슷한 것들을 찾아보는 것입니다.
+
+156
+00:10:19,909 --> 00:10:24,139
+ 그러면 각 테스트 이미지에 대해 독립적으로,
+
+157
+00:10:24,139 --> 00:10:30,220
+ 그 10장 각각과 가장 유사한 훈련 이미지들의 순위 목록을 보게 됩니다.
+
+158
+00:10:30,220 --> 00:10:32,700
+ 저기 첫 번째 행을 보시면
+
+159
+00:10:32,700 --> 00:10:36,230
+ 테스트 이미지는 제 생각에 트럭인데, 가장 유사하다고 검색된 이미지들을 보면
+
+160
+00:10:36,230 --> 00:10:40,490
+ 언뜻 꽤 비슷해 보입니다. 그런데 자세히 보면
+
+161
+00:10:40,490 --> 00:10:44,269
+ 첫 번째로 검색된 결과는 사실 트럭이 아니라 말입니다.
+
+162
+00:10:44,269 --> 00:10:48,289
+ 파란 하늘이 배경에 깔려 있다는 점이 비슷했을 뿐이죠. 그래서
+
+163
+00:10:48,289 --> 00:10:52,480
+ 이 방식이 그리 잘 동작하지 않을 수 있음을 알 수 있습니다. 그렇다면 비교를 어떻게 정의할까요?
+
+164
+00:10:52,480 --> 00:10:55,470
+ 유사도를 실제로 측정하는 방법은 여러 가지가 있는데
+
+165
+00:10:55,470 --> 00:10:59,940
+ 간단한 방법 하나는 맨해튼(Manhattan) 거리, 다시 말해
+
+166
+00:10:59,940 --> 00:11:01,180
+ L1 거리입니다.
+
+167
+00:11:01,179 --> 00:11:04,429
+ 두 용어는 서로 바꿔 쓸 수 있습니다. 하는 일은 단순합니다. 분류하려는
+
+168
+00:11:04,429 --> 00:11:07,639
+ 테스트 이미지와, 비교하려는 훈련 이미지 한 장이 있을 때
+
+169
+00:11:07,639 --> 00:11:11,919
+ 우리가 할 일은 두 이미지를 원소 단위로 전부 비교하는 것입니다.
+
+170
+00:11:11,919 --> 00:11:15,959
+ 각 위치에서 차이의 절댓값을 구한 다음
+
+171
+00:11:15,960 --> 00:11:20,040
+ 모든 위치에 걸쳐 전부 더합니다. 각 위치에서 값을 빼 보고
+
+172
+00:11:20,039 --> 00:11:24,139
+ 어느 위치에서든 값이 어긋나 있으면 그 차이가 그대로 더해지는 것이죠.
+
+173
+00:11:24,139 --> 00:11:30,169
+ 그 총합이 우리의 유사도(거리)입니다. 그래서 이 예시의 두 이미지는 456만큼 다릅니다.
+
+174
+00:11:30,169 --> 00:11:33,809
+ 두 이미지가 완전히 같았다면 0이 나왔겠죠. 코드를 보여 드리자면
+
+175
+00:11:33,809 --> 00:11:36,959
+ 구체적으로 전체 구현은 다음과 같은 모습이 됩니다.
+
+176
+00:11:36,960 --> 00:11:42,930 + 가장 가까운 이웃 분류와 나는 두 사람의 실제 몸에 충전 곳 + +177 +00:11:42,929 --> 00:11:46,799 + 에 대해 이야기하고 우리가주는 것 같이 우리는 훈련 시간에 여기에 무엇을 + +178 +00:11:46,799 --> 00:11:52,709 + 레이블 그래서 용서 일반적으로 노트 모든 레이블 집합 X와 Y + +179 +00:11:52,710 --> 00:11:56,530 + 우리는 그냥 그냥 기억 클래스의 인스턴스 메소드에 할당 할 + +180 +00:11:56,529 --> 00:12:01,439 + 데이터 아무것도 우리가 여기서 무슨 일을하는지는 비록 내가 시간을 예측 수행되고 있지 + +181 +00:12:01,440 --> 00:12:06,080 + 우리는 이미지의 X의 영원 테스트 세트를 받고있어 나는 전체를 통과하지 않을거야 + +182 +00:12:06,080 --> 00:12:09,320 + 자세한하지만 당신은 모든 단일 테스트 이미지를 통해 for 루프 거기에 볼 수 있습니다 + +183 +00:12:09,320 --> 00:12:13,020 + 독립적으로 우리는 매일 훈련 이미지의 거리를 받고있어 + +184 +00:12:13,019 --> 00:12:18,360 + 그리고 그 하나의 벡터 라인 내가 그렇게 파이썬 코드를 사용하여 만의 통지 + +185 +00:12:18,360 --> 00:12:21,750 + 단 한 줄의 코드는 매 훈련이 테스트 이미지를 비교 하​​였다 + +186 +00:12:21,750 --> 00:12:26,370 + 내가 생각하고 이전 슬라이드에서이 거리를 계산 데이터베이스의 이미지 + +187 +00:12:26,370 --> 00:12:30,720 + 그 위기 코드는 그래서 우리가 그 4 루프를 소비하지 않았다 모두 그 + +188 +00:12:30,720 --> 00:12:35,860 + 처리 시스템에 참여하고 우리는 인스턴스를 계산하는 + +189 +00:12:35,860 --> 00:12:40,659 + 가장 가까운 그래서 우리는 색인에 가지고있는 교육의 인덱스를을 받고있어 + +190 +00:12:40,659 --> 00:12:45,719 + 가장 낮은 거리와 그 다음 우리는 단지이 이미지 레이블에 대한 예측됩니다 + +191 +00:12:45,720 --> 00:12:51,210 + 그래서 여기에 어떤 것은 가장 가까운 이웃 분류의 관점에서 당신을위한 질문 + +192 +00:12:51,210 --> 00:12:56,639 + 어떻게 그 속도는 무슨 일하는 것은 인 훈련 데이터의 크기에 따라 달라 않습니다 + +193 +00:12:56,639 --> 00:13:02,779 + 느린 훈련 장비를 확장 + +194 +00:13:02,779 --> 00:13:07,789 + 내가 만약 내가 그냥 가지고 있기 때문에 예는 실제로는 실제로 아주 천천히 맞아입니다 + +195 +00:13:07,789 --> 00:13:12,129 + 천천히 아래로 조금 그래서 독립적으로 하나 하나 훈련 샘플을 비교 + +196 +00:13:12,129 --> 00:13:16,370 + 우리는 클래스를 진행하면서이 거꾸로 실제로는 것을 실제로 이동 + +197 +00:13:16,370 --> 00:13:19,590 + 우리가 관심을 우리가 정말 가장 실용적인 응용 프로그램에 대한 관심이 무엇 때문에 + +198 +00:13:19,590 --> 00:13:23,330 + 이 분류의 시험 시간 성능에 대해 그것은 우리가 원하는 것을 의미합니다 + +199 +00:13:23,330 --> 00:13:27,240 + 클래스는이 시점에서 매우 효율적 그래서 정말 사이의 트레이드 오프있을 수 있습니다 + +200 +00:13:27,240 --> 00:13:30,419 + 우리는 기차 방식에 넣고 우리는 좋은에서 얼마나 넣을까요 얼마나 많은 컴퓨터 + +201 +00:13:30,419 --> 00:13:35,240 + 가장 가까운 이웃 기차 인스턴트하지만 그것은 우리가 거​​ 같은 시험 비싼 및 + +202 +00:13:35,240 --> 00:13:38,570 + 곧 볼이 실제로 주변이 완전히 다른 방법으로 플립있어 올 + +203 +00:13:38,570 --> 00:13:41,510 + 우리가 컴퓨팅의 엄청난 양의 훈련을 할 것이다 기차 시간을 볼 것이다 + +204 +00:13:41,509 --> 00:13:45,409 + 상업용 네트워크의 시스템 성능은 것 실제로 매우 효율적일 것이다 + +205 +00:13:45,409 --> 00:13:49,589 + 상수와 하나 하나 테스트 이미지 컴퓨팅의 일정 금액을 수 + +206 +00:13:49,590 --> 00:13:53,149 + 당신 만 수십억 또는 수조가있는 경우에 상관없이 계산의 양 + +207 +00:13:53,149 --> 00:13:57,669 + 난 그냥 해요 훈련은 내가 조 조 조 조를 그냥 가지고 싶습니다 + +208 +00:13:57,669 --> 00:14:01,579 + 아무리 크거나 무역 적자가 전체 사용자의 컴퓨터 작업을 수행하는 방법 + +209 +00:14:01,580 --> 00:14:05,250 + 즉 실질적으로 말하기 아주 좋은, 그래서 단일 테스트 샘플을 분류 + +210 +00:14:05,250 --> 00:14:10,370 + 지금은 그냥 세이버 여기 가속화하는 방법이 있다는 것을 지적하고자합니다 + +211 +00:14:10,370 --> 00:14:13,669 + 이 대략 가까운 이웃 방법 거기 분류는 같은 계획 + +212 +00:14:13,669 --> 00:14:16,879 + 사람들이 그 연습을 위해 사용 예제 라이브러리는 속도를 할 수 있습니다 + +213 +00:14:16,879 --> 00:14:22,909 + 가까운 이웃이 과정은 일치하지만 확인 그냥 보조 노트의 + +214 +00:14:22,909 --> 00:14:27,490 + 그래서 우리는 우리가 정의한 것을보고 분류의 디자인으로 돌아 가자 + +215 +00:14:27,490 --> 00:14:32,200 + 내가 임의로 선택이 거리는 당신에게 맨해튼 거리를 표시 할 + +216 +00:14:32,200 --> 00:14:35,720 + 할 수있는 많은 방법이 사실상 존재 절대 값의 차이를 비교 + +217 +00:14:35,720 --> 00:14:38,879 + 측정 거리를 공식화 등의 다양한 선택이있다 + +218 +00:14:38,879 --> 00:14:42,700 + 우리는이 비교를 정확히 어떻게 다른 사람들에게 다른 선택의 여지가 심 + +219 +00:14:42,700 --> 00:14:46,000 + 실제로 사용하려면 우리가 유클리드 울트라 거리입니다 부르는하다 + +220 +00:14:46,000 --> 00:14:49,850 + 대신에 이러한 차이의 제곱합의 차이를 요약 + +221 +00:14:49,850 --> 00:14:55,690 + 이미지 등이 선택 사이 + +222 +00:14:55,690 --> 00:15:02,730 + 그 
이상이 다시 사람 + +223 +00:15:02,730 --> 00:15:07,850 + 확인 그래서 어떤 방법을 정확하게 컴퓨터 거리의이 선택은 이산 선택이다 + +224 +00:15:07,850 --> 00:15:11,769 + 우리는 우리가 차 하이퍼라는 뭔가를 제어 할 수 있는지 정말 아니에요 + +225 +00:15:11,769 --> 00:15:14,990 + 분명 당신이 그것을 설정하는 방법은 우리가 나중에 결정해야 하이퍼 매개 변수의 + +226 +00:15:14,990 --> 00:15:19,120 + 정확히 그들이 얘기하자 하이브리드 차의이 어떻게 든 다른 종류를 설정하는 방법 + +227 +00:15:19,120 --> 00:15:22,828 + 우리가 가지고 가까운 이웃을 일반화하는 경우에 대한 분류의 맥락에서 + +228 +00:15:22,828 --> 00:15:26,159 + 우리는 가장 가까운 이웃 분류 AK 전화 무엇 케 horas 이웃에 너무 + +229 +00:15:26,159 --> 00:15:29,328 + 모든 테스트를 위해 검색하는 분류는 가장 가까운 하나의 일치 + +230 +00:15:29,328 --> 00:15:33,958 + 사실 몇 가지 예를 후퇴합니다 예를 양성하고 새로운있을 것이다 + +231 +00:15:33,958 --> 00:15:37,069 + 가장 가까운 이상의 다수결은 실제로 모든 테스트 인스턴스를 분류합니다 + +232 +00:15:37,070 --> 00:15:41,829 + 그래서 이웃이 우리가에있는 5 가장 유사한 이미지를 검색하는 것 말 + +233 +00:15:41,828 --> 00:15:45,528 + 훈련 데이터와 레이블의 과반수 투표를하고 여기에 간단 + +234 +00:15:45,528 --> 00:15:48,970 + 그래서 여기에 요점을 설명하기 위해 설정이 차원 데이터 우리는 세 클래스가 + +235 +00:15:48,970 --> 00:15:53,430 + 데이터 세트 및 2D와 여기에 우리가 결정 영역이 부르는 그리기입니다 + +236 +00:15:53,429 --> 00:15:57,429 + 여기에 가장 가까운 이웃 분류는 정말 훈련을받은되는 의미 + +237 +00:15:57,429 --> 00:16:02,838 + 우리는 전체를 색칠하고 거기 무슨 수업이 가까운에 의해 비행기에서 내리다하기 + +238 +00:16:02,839 --> 00:16:05,430 + 모든 단일 지점은 그렇지 가정 기호 이웃 분류 + +239 +00:16:05,429 --> 00:16:08,698 + 당신은 단지이 것이라고 말하는 것보다 더 여기에 몇 가지 테스트 예를 들어 있다고 가정 + +240 +00:16:08,698 --> 00:16:12,549 + 당신이 개인 얻는 가장 가까운 이웃을 기반으로 푸른 클래스로 분류 된 + +241 +00:16:12,549 --> 00:16:16,708 + 점에 유의 푸른 클러스터 내부의 녹색 지점 인 점이며, + +242 +00:16:16,708 --> 00:16:19,708 + 그것이 많은의 분류했을 클래스 자체의 작은 영역을 갖는다 + +243 +00:16:19,708 --> 00:16:23,750 + 무엇이든 자신보다 그에게 경우 때문에 테스트는 녹색으로 주위에 배치 + +244 +00:16:23,750 --> 00:16:27,879 + 가장 가까운 이웃의 녹색 점은 이제 애에 대한 높은 숫자로 이동할 때 + +245 +00:16:27,879 --> 00:16:30,809 + 당신이 무엇을 발견 같은 오년 이웃 분류기는 경계입니다 + +246 +00:16:30,809 --> 00:16:36,619 + 한 거기 심지어 어디 좋은 효과 그것의 종류 부드럽게 시작 + +247 +00:16:36,620 --> 00:16:37,339 + 포인트 + +248 +00:16:37,339 --> 00:16:41,550 + 가지 무작위 실제로 아니다 푸른 클러스터의 소음과 아웃 라이어로 + +249 +00:16:41,549 --> 00:16:44,539 + 우리는 항상 다섯을 치료하기 때문에 너무 많은 예측을 채용 + +250 +00:16:44,539 --> 00:16:49,679 + 가장 가까운 이웃은 그래서 그들은 실제로 그렇게 그린 포인트를 압도 얻을 + +251 +00:16:49,679 --> 00:16:53,088 + 당신이를 찾을 수 있습니다 일반적으로 여름 분류는 더 나은 제공 할 수 있습니다 + +252 +00:16:53,089 --> 00:16:58,180 + 이제 US시 성능 그러나 다시 K의 선택은 다시 경계 인 하이퍼 + +253 +00:16:58,179 --> 00:17:03,088 + 내가 조금이 다시 올 거 바로 그래서 당신이보기의 예를 보여 + +254 +00:17:03,089 --> 00:17:06,169 + 여기처럼 나는 그들에 의해 위를 기록하고 열 가장 유사한 예를 반환하고있어 자신의 + +255 +00:17:06,169 --> 00:17:08,939 + 거리와 실제로 이러한 훈련을 통해 과반수 투표를 할 것 + +256 +00:17:08,939 --> 00:17:13,089 + 여기 예제는 여기에 모든 테스트 예제를 분류합니다 + +257 +00:17:13,088 --> 00:17:20,649 + 확인 그래서 여기에 단지의 정확성이 무엇인지 고려의 질문의 비트를하자 + +258 +00:17:20,650 --> 00:17:24,259 + 우리는 유클리드을 사용하는 훈련 데이터에 대한 분류의 북쪽 + +259 +00:17:24,259 --> 00:17:29,700 + 나는 우리의 테스트 세트는 정확히 훈련 데이터입니다 가정, 우리가있어 너무 거리 + +260 +00:17:29,700 --> 00:17:32,580 + 우리는 얼마나 자주를 얻을 얼마나 많은 즉 정확성을 찾기 위해 노력 + +261 +00:17:32,579 --> 00:17:34,750 + 정답 + +262 +00:17:34,750 --> 00:17:44,808 + 많은 중 확인을 백 퍼센트의 좋은 예 맞습니다 그래서 우리는 항상 찾을 수있어 + +263 +00:17:44,808 --> 00:17:48,450 + 가 자신이 수행하고 정확하게 테스트의 상단에 기차 예 + +264 +00:17:48,450 --> 00:17:52,870 + 다음과 같은 우리가 맨해튼을 사용하는 경우 무엇을 통해 전송 될 것 + +265 +00:17:52,869 --> 00:18:00,949 + 거리가 + +266 +00:18:00,950 --> 00:18:04,680 + 맨해튼 거리가 약간의 절대 값은 당신에게 있습니다 제곱의 합을 필요로하지 않는다 + +267 +00:18:04,680 --> 00:18:12,110 + 차이에서 그것은 단지 문제는 좋은 같은 것 것 것 + +268 +00:18:12,109 --> 00:18:14,169 + 여름이나 유지 + +269 +00:18:14,170 --> 00:18:18,820 + 관심을 확인 이웃 분류가 훈련 왕의 정확성은 무엇인가 + +270 +00:18:18,819 --> 00:18:25,339 + 반드시 때문에하지 
백 %가 그것을 경우 케이블 장소입니다 + +271 +00:18:25,339 --> 00:18:29,230 + 기본적으로 주변의 포인트는 당신이 당신의 최선을 압도 할 수 + +272 +00:18:29,230 --> 00:18:35,269 + 예를 들어 유리 떨어져 실제로 괜찮 그래서 우리는 서로 다른 두 가지 선택을 논의했습니다 + +273 +00:18:35,269 --> 00:18:39,740 + 전제 우리가이 경우에 릭에게 높은 압력을 만난 우리는 어떻게 확실하지 않다 + +274 +00:18:39,740 --> 00:18:45,160 + 그렇게 우리는 이러한 설정하는 방법을 정확하게 확실하지에 1 23 10 이렇게해야 설정 + +275 +00:18:45,160 --> 00:18:48,750 + 사실에 따라 자신의 문제는 당신이 지속적으로 찾을 수 없습니다 찾을 수 있습니다 + +276 +00:18:48,750 --> 00:18:52,250 + 일부 경우에 볼 수있는 몇 가지 응용 프로그램에서 이러한 높은 전제를위한 최선의 선택 + +277 +00:18:52,250 --> 00:18:56,930 + 우리가 이것을하도록 설정하는 방법을 정말 확실하지 않도록 다른 응용 프로그램보다 더 나은 + +278 +00:18:56,930 --> 00:19:00,799 + 여기에 우리가 기본적으로 난 그래서 다른 프라이머의 제비를 시도 할 생각이다 + +279 +00:19:00,799 --> 00:19:05,649 + 거 내가 나의 기차 데이터를 데리고 갈거야 다음 내가 많이 시도거야과 같이 + +280 +00:19:05,650 --> 00:19:11,550 + 다른 매개 변수의 그래서 난 그냥 죽을 수와 나는 케이블 123456 2800 I을 시도 + +281 +00:19:11,549 --> 00:19:14,529 + 즉 가장 적합한 무엇이든 모든 피고인 메트릭을 시도하고는 내가 정액의 + +282 +00:19:14,529 --> 00:19:26,670 + 그래서 괜찮 때문에 좋은 생각에 아주 잘 오른쪽 거짓말을 작동 걸릴 + +283 +00:19:26,670 --> 00:19:36,170 + 그래서 기본적으로 그래서 기본적으로 네 그래서 테스트 데이터에 대한 프록시 당신의 + +284 +00:19:36,170 --> 00:19:40,039 + 신뢰하지 않아야 주문 그들의 일반화해야 테스트 데이터에 + +285 +00:19:40,039 --> 00:19:43,509 + 당신은 당신이 이제까지 STATA에있는 잊지해야 사실은 그래서주는 일을했다 + +286 +00:19:43,509 --> 00:19:46,079 + 데이터 집합은 항상 당신이 그것을 필요가 없습니다 척 유언자를 따로 + +287 +00:19:46,079 --> 00:19:50,129 + 그게 당신을 말하고 어떻게 것입니다 보이지 않는 데이터 점에 일반화 당신의 장기와 + +288 +00:19:50,130 --> 00:19:52,730 + 당신이 당신의 알고리즘을 개발하기 위해 노력하고 있기 때문에 중요합니다 그리고 당신은있어 + +289 +00:19:52,730 --> 00:19:56,120 + 결국 지구와 몇 가지 설정을 할 희망 당신은 이해 좋아 + +290 +00:19:56,119 --> 00:20:01,159 + 정확히 할 것입니다 어떻게 연습 오른쪽 작업이 예상 그래서 당신은 볼 수 있습니다 + +291 +00:20:01,160 --> 00:20:03,830 + 예를 들어 때때로 당신은 매우 매우에 대해 잘 구성 수행 할 수있는 + +292 +00:20:03,829 --> 00:20:05,579 + 아주 잘 일반화 그것을 테스트하지 + +293 +00:20:05,579 --> 00:20:08,659 + 당신은 28 요구 사항 29에 의해 사람이 많이 지나친 것 + +294 +00:20:08,660 --> 00:20:11,750 + 이 클래스는, 그래서 당신은 가장이 질병에 매우 잘 알고 있어야합니다 + +295 +00:20:11,750 --> 00:20:16,519 + 범위이 당신을 위해 좀 더 많은 개요 그러나 기본적으로이 테스트입니다 + +296 +00:20:16,519 --> 00:20:20,940 + 데이터는 우리가 할 대신 무엇을 잊지 매우 드물게 사용된다 + +297 +00:20:20,940 --> 00:20:25,930 + 우리가 안전하게를 사용으로 구분하기 때문에 우리가 주름을 부르는에 우리의 훈련 데이터를 분리 + +298 +00:20:25,930 --> 00:20:29,900 + 우리는 같은 트레이닝 데이터의 20 %를 사용할 수 있도록 5 배 검증 + +299 +00:20:29,900 --> 00:20:35,120 + 데이터 및 그것의 우리는 교육 부분을 상상하고 우리는 우리의 테스트 + +300 +00:20:35,119 --> 00:20:39,279 + 그냥 내가 갈거야 그렇게 설정이 검증에 주로 적용되는 두 가지 선택이 + +301 +00:20:39,279 --> 00:20:42,569 + 내 전화에 훈련과 다른 경우 일부의 모든 앞을 시도 + +302 +00:20:42,569 --> 00:20:45,329 + 성직자와 아직 대략 가까운 이웃를 사용하는 경우 다른 어떤 + +303 +00:20:45,329 --> 00:20:48,750 + 당신이 그것을 시도 많은 다른 선택은 밖으로 그 검증 데이터에 가장 적합한 참조 + +304 +00:20:48,750 --> 00:20:51,859 + 당신이 불편하게 느끼는 경우는 거의 훈련 데이터 포인트를 가지고 있기 때문에 + +305 +00:20:51,859 --> 00:20:54,939 + 당신이 실제로 얻을 곳 사람들은 때때로 교차 유효성 검사를 사용 + +306 +00:20:54,940 --> 00:20:58,640 + 테스트 검증의 선택이 이러한 선택을 통해 뽑아 평가하기 + +307 +00:20:58,640 --> 00:21:03,840 + 그래서 내가 먼저 내 훈련 (124)에 사용할 수 있습니다 나는 다섯에 시도하고 + +308 +00:21:03,839 --> 00:21:07,519 + 검증의 선택이 모든 다섯 선택과 I에서 뽑아 순환 + +309 +00:21:07,519 --> 00:21:11,789 + 내 테스트 배의 모든 가능한 선택을 통해 가장 적합한보고 + +310 +00:21:11,789 --> 00:21:14,839 + 그때는 모든 가능한 시나리오를 통해 가장 적합한 무엇이든 취할 + +311 +00:21:14,839 --> 00:21:19,039 + 그건 선두 주자 교차 검증 고정 나사 검증의 연습 + +312 +00:21:19,039 --> 00:21:21,769 + 이 그들과 같을 것이다 방법은 가장 가까운 이웃을 위해 K에 대한 크로스 건물이었다 + +313 +00:21:21,769 --> 00:21:26,049 + 분류 우리는 K의 다른 값을 시도하고 이것이 우리입니다 + +314 +00:21:26,049 --> 00:21:31,690 + 겹의 다섯 가지 선택에서 성능이 그래서 당신은 모든에 대한 것을 볼 수 있습니다 + +315 +00:21:31,690 --> 
00:21:35,759 + 하나의 경우 우리가 5 개의 데이터 지점을 가지고 다음이 정밀도가 그렇다 + +316 +00:21:35,759 --> 00:21:40,240 + 높은 좋은 내가 평균 분석가 숀 Arce에 대한 통하여 선을 음모를 꾸미고 있어요 + +317 +00:21:40,240 --> 00:21:44,190 + 표준 편차는 그래서 우리는 여기에서 볼 성능이 위로가는 것입니다 + +318 +00:21:44,190 --> 00:21:49,240 + 이 여론 조사에서 당신은 가서 같이하지만, 어떤 점에서 스타는 말했다 이것을 그래서 나도 몰라 + +319 +00:21:49,240 --> 00:21:53,460 + 특정 데이터 세트는 그게 그래서 7과 동일 K 최고의 선택이 될 것 같다 무엇 + +320 +00:21:53,460 --> 00:21:58,440 + 나는 대칭과 내가 할에 모든 내 hyperemesis에 대해이 작업을 수행 할 수 있습니다 내 + +321 +00:21:58,440 --> 00:22:03,650 + 내가 약속 교차 검증은 내가 그들을 시험에서 하나의 시간을 평가하고 수정했다 + +322 +00:22:03,650 --> 00:22:07,800 + 내가 그에 도착 사이트와 어떤 수는 내가 여덟 정확도로보고 무엇을 + +323 +00:22:07,799 --> 00:22:11,490 + 왕이나의 종이로가는 무슨이 데이터 세트에 대한 몇 가지 분류 + +324 +00:22:11,490 --> 00:22:15,539 + 무엇의 최종 일반화 결과만큼 우리의 최종 보고서로 전환 + +325 +00:22:15,539 --> 00:22:16,519 + 무슨 짓을했는지 + +326 +00:22:16,519 --> 00:22:36,048 + 이에 대한 질문은 기본적으로는 분포의 통계에 관하여 + +327 +00:22:36,048 --> 00:22:42,378 + 라벨에 이러한 데이터 포인트 당신의 얼굴에 그래서 때때로의 그것은 하드의 + +328 +00:22:42,378 --> 00:22:47,769 + 이 그림 반면 얻을처럼 당신으로 발생하는 약을 확인 말 + +329 +00:22:47,769 --> 00:22:52,209 + 더 청결과 더 많은 경우를 얻을 수 있으며 얼마나 clunkier 데이터에 의존 + +330 +00:22:52,209 --> 00:22:55,129 + 이 내려 오는 것을 정말 서비스는 어떻게 + +331 +00:22:55,128 --> 00:23:01,569 + 로비를하거나 그것을 어떻게 특정 그게 매우 편리 대답은 알고 있지만 그건 + +332 +00:23:01,569 --> 00:23:04,769 + 그 때문에 다른 데이터 세트는 다른 것에 와서 무엇을 대략 무엇 + +333 +00:23:04,769 --> 00:23:27,230 + 지금 우리를 클릭 + +334 +00:23:27,230 --> 00:23:31,769 + 때문에 + +335 +00:23:31,769 --> 00:23:37,308 + 다른 다른 데이터 세트는 다른 선택을 필요로해야합니다 + +336 +00:23:37,308 --> 00:23:40,629 + 실제로 다른 알고리즘을 시도하는 경우에 작동하는 무슨이 가장 참조 + +337 +00:23:40,630 --> 00:23:43,580 + 당신은 당신의 데이터의 선택에서 가장 잘 작동하는 무슨 일이 일어나고 있는지 확실하지 않은 당신의 + +338 +00:23:43,579 --> 00:23:47,699 + 당신이 어떤 작품 그냥 확실하지 않도록하기 위해 하이퍼 망치 같은 종류의도 + +339 +00:23:47,700 --> 00:23:52,019 + 다른 접근 방법이 다를 수 있습니다 + +340 +00:23:52,019 --> 00:23:55,190 + 일반화 경계는 서로 다른 모양과 일부 데이터가 설정하는 + +341 +00:23:55,190 --> 00:23:58,330 + 다른 것보다 앞 구조 몇 가지 다른 사람보다 더 잘 작동 + +342 +00:23:58,329 --> 00:24:05,298 + 그냥 난 그냥 그 왕 또는 더 나쁜 뭔가를 가리 키도록 좋아 좋아 밖으로 시도 실행 + +343 +00:24:05,298 --> 00:24:09,389 + 아무도 기본적으로 그냥 사용하지 않는이 통과하는이 일요일를 사용하지입니다 + +344 +00:24:09,390 --> 00:24:12,480 + 이 훈련은 단지 분할 등 작동 정말 방법이 방법 + +345 +00:24:12,480 --> 00:24:13,450 + ...에 + +346 +00:24:13,450 --> 00:24:17,610 + 그 이유는 이것이 우선 매우 비효율적이기 때문으로 사​​용하지 않고, + +347 +00:24:17,609 --> 00:24:21,139 + 모든이의 두 번째 내 트랙 차원 높은 모든 이미지입니다 + +348 +00:24:21,140 --> 00:24:28,179 + 개체는 그들은 내가에서 촬영 한 적이 매우 부자연스럽고 직관적 인 방법을 행동 + +349 +00:24:28,179 --> 00:24:32,370 + 순서는 제한하고 나는 세 가지 방법으로 변경하지만이 모든 세 + +350 +00:24:32,369 --> 00:24:37,168 + 여기에 서로 다른 이미지는 L 실제로이 하나의 동일한 거리를 + +351 +00:24:37,169 --> 00:24:42,100 + 유클리드 감각에 난 그냥 여기 사람이 약간에 이동에 대해 생각하는 + +352 +00:24:42,099 --> 00:24:46,359 + 그것은 약간 떨어있어 그것은이 여기의 왼쪽 이유로 인해 완전히 다른 + +353 +00:24:46,359 --> 00:24:49,329 + 이러한 픽셀은 정확 하 게 일치하지 않습니다 그것은 모든 모든을 소개하는 것 + +354 +00:24:49,329 --> 00:24:53,109 + 당신은 작은을 얻을 수 있도록이 하나가 약간 어둡게하여 점점 거리의 오류 + +355 +00:24:53,109 --> 00:24:57,629 + 모든 특별 행사를 통해 델타이 하나의 손길이 닿지 않은 60 거리 ERES입니다 + +356 +00:24:57,630 --> 00:25:01,650 + 사방에서 저기 그 위치를 제외하고는 촬영 + +357 +00:25:01,650 --> 00:25:05,900 + 아웃 임계 이미지 조각과 가장 가까운 이웃 분류를하지 않는다 + +358 +00:25:05,900 --> 00:25:08,030 + 정말 이러한 설정의 차이를 말할 수 없습니다 + +359 +00:25:08,029 --> 00:25:11,230 + 이이 거리를 기반으로하기 때문에 그 정말이 아주 잘 작동하지 않는다 + +360 +00:25:11,230 --> 00:25:16,009 + 당신은 매우에 거리를 던져하려고 할 때 경우 이렇게 아주 직관적 일이 일어날 + +361 +00:25:16,009 --> 00:25:21,349 + 우리가 지금까지 요약에 그렇게 존재하지 않는 이유를 부분적으로의 높은 차원 객체 + +362 +00:25:21,349 --> 00:25:26,230 + 우리는 이러한 
분류에서 서로 다른 두를 포함하는 특정한 경우를 찾고 + +363 +00:25:26,230 --> 00:25:29,679 + 나중에 엔지니어 이웃 분류의 클래스와 아이디어의 설정 + +364 +00:25:29,679 --> 00:25:33,110 + 최대 데이터를 다른 분할을 갖고 우리는 이러한 고압 호스를 그 + +365 +00:25:33,109 --> 00:25:37,240 + 선택해야하고 우리는이 일반적으로 대부분의 크로스 기반을 사용합니다 + +366 +00:25:37,240 --> 00:25:39,909 + 시간 사람들은 단지 하나가 실제로 전체 교차 유효성 검사를 수행 + +367 +00:25:39,909 --> 00:25:40,519 + 확인 + +368 +00:25:40,519 --> 00:25:43,778 + 그들은 높은 측면에서 가장 적합한 어떤 검증 세트에 시도 + +369 +00:25:43,778 --> 00:25:47,999 + 전제하고 가장이 예비 선거를 일단 당신은 하나에 리드를 + +370 +00:25:47,999 --> 00:25:54,569 + 세입자는 그냥 분류에 갈하지만 질문에있어 그렇게 말했다 + +371 +00:25:54,569 --> 00:26:04,229 + 나는 우리가 텔레 노어 분류 보는거야 좋은 볼이 시점이 인 + +372 +00:26:04,229 --> 00:26:07,649 + 우리는이를 알 수있을 것입니다 상업 네트워크를 향해 작업을 시작하는 지점 + +373 +00:26:07,648 --> 00:26:11,148 + 강의 시리즈는 최대 만들 것이다 분류를 혼란 것 + +374 +00:26:11,148 --> 00:26:15,888 + 전체 상용 네트워크 분석 이미지는 그냥 그 동기를 말씀 드리고 + +375 +00:26:15,888 --> 00:26:20,178 + 작업 별보기에서 클래스 어제이 클래스는 컴퓨터 비전 클래스입니다 + +376 +00:26:20,179 --> 00:26:25,489 + 기계 사이트 될 것이 클래스 동기를 부여하기 위해 다른 방법을 제공에 관심 + +377 +00:26:25,489 --> 00:26:29,409 + 어떤 의미에서보기의 모델 기반 관점에서 그 우리는 너희들을 제공하고 + +378 +00:26:29,409 --> 00:26:34,339 + 배관 및 전기에 대한 사람 보는이 멋진 알고리즘은 + +379 +00:26:34,338 --> 00:26:38,178 + 당신은 단지 일부, 특히 위에 다양한 요구에 적용 할 수있는 + +380 +00:26:38,179 --> 00:26:42,469 + 지난 몇 년 동안 우리는 신경 네트워크는 그 무엇입니까 볼 수없는 것을보고 + +381 +00:26:42,469 --> 00:26:46,479 + 이 클래스에 대해 많은 것을 배울 것이다 그러나 그는 또한 여기에 꽤있다 + +382 +00:26:46,479 --> 00:26:50,828 + 당신이 휴대 전화로 이야기 할 때 음성 인식은 지금 그들이 할 수있는 작동하지 않습니다 + +383 +00:26:50,828 --> 00:26:56,678 + 또한 그래서 여기에 기계 번역을 당신은 세트의 신경 네트워크를 먹이 + +384 +00:26:56,679 --> 00:27:00,700 + 영어 하나 신경망로 단어 하나는 번역을 생산 + +385 +00:27:00,700 --> 00:27:05,328 + 인쇄 또는 어떤 다른 대상 언어에 그렇게 제어를 수행 할 필요가 + +386 +00:27:05,328 --> 00:27:09,308 + 우리는 당신의 네트워크 응용 프로그램을 볼 수 및 로봇 조작에 조작 한 + +387 +00:27:09,308 --> 00:27:14,209 + 및 파티 이익의 직장에서 재생하면 확인하여 세 가지 게임을 바로 재생하는 방법 + +388 +00:27:14,209 --> 00:27:18,089 + 로켓은 화면을 설정하고 우리는 매우 성공적인 것으로 보인다 + +389 +00:27:18,088 --> 00:27:23,878 + 도메인의 다양성과 여기에 조금보다 더 우리는 정확히 확실치 + +390 +00:27:23,878 --> 00:27:27,988 + 어디이 우리를 취할 것 그리고, 나는 또한 우리가 탐구하고 있다는 말을하고 싶습니다 + +391 +00:27:27,989 --> 00:27:31,749 + 이것은 매우 헨리 VIII이라고 생각합니까 가사에 대한 방법은 소망 적 사고입니다 + +392 +00:27:31,749 --> 00:27:35,700 + 하지만 어쩌면 그들은뿐만 아니라 그렇게 할 수있는 몇 가지 힌트가있다 + +393 +00:27:35,700 --> 00:27:39,479 + 그들이 연주하는 재미 모듈 일이기 때문에 신경 네트워크는 아주 좋은입니다 + +394 +00:27:39,479 --> 00:27:42,450 + 나는이 사진의 자신의 네트워크와 I 종류의 작업에 대해 생각할 때와 + +395 +00:27:42,450 --> 00:27:46,548 + 여기에 나를 위해 마음에 오는 우리는 신경 네트워크 개업이 그녀입니다 + +396 +00:27:46,548 --> 00:27:51,519 + 보이는 것을 만드는 것은이 시점에서 대략 10 층이 될 수 있습니다 + +397 +00:27:51,519 --> 00:27:55,269 + 정말 자신의 외모와 함께 연주에 대해 생각하는 가장 좋은 방법은 매우 재미 + +398 +00:27:55,269 --> 00:27:58,619 + 레고 블록처럼 우리가이 작은 기능 조각을 구축하는 것을 볼 수 있습니다 + +399 +00:27:58,619 --> 00:28:02,579 + 당신은 많은 그래서 우리는 다음 전체 아키텍처를 만들어 함께 붙어 수 있습니다 봐 + +400 +00:28:02,579 --> 00:28:06,309 + 아주 쉽게 서로 이야기하고 그래서 우리는 이러한 모듈을 만들 수 있습니다 + +401 +00:28:06,309 --> 00:28:11,519 + 스톡턴 함께 내가 생각이 아주 쉽게 승리 작업 플레이 + +402 +00:28:11,519 --> 00:28:16,039 + 예시이 내 숙제입니다 대략 년 전에 그렇게로부터의 자막에 + +403 +00:28:16,039 --> 00:28:20,289 + 여기에 작업의 이미지를 촬영했다 당신은에 일을 얻기 위해 노력하고 + +404 +00:28:20,289 --> 00:28:23,639 + 예를 들어 상단이 왼쪽 있도록 이미지의 문장 설명을 생산 + +405 +00:28:23,640 --> 00:28:27,810 + 예술가는 결과이 많은 검은 셔츠는 기타를 연주 한 것을 말할 것입니다 설정 + +406 +00:28:27,809 --> 00:28:32,480 + 또는 오렌지 시티 웨스트에서 건설 노동자 등등 그래서 도로에 노력하고있다 + +407 +00:28:32,480 --> 00:28:36,670 + 그들은 영상을보고 하나 하나 이미지의이 설명을 만들 수 있습니다 + +408 +00:28:36,670 --> 00:28:41,100 + 이 모델의 세부 사항에 길을 갈 때 이것은 우리가 복용하고있다 작품 + +409 
+00:28:41,099 --> 00:28:45,079 + 우리가 알고있는 길쌈 신경망 그래서 여기에 두 개의 모듈에있다 + +410 +00:28:45,079 --> 00:28:49,480 + 우리가 달성 할 수있는 촬상 모델이 계통도하여 + +411 +00:28:49,480 --> 00:28:52,880 + 우리가 알고있는 네트워크는 우리가 재발 성 신경 네트워크를 복용하고 볼 수있는 + +412 +00:28:52,880 --> 00:28:56,150 + 우리는이 경우 시퀀스에 아주 좋은 및 모델링 시퀀스 알고 + +413 +00:28:56,150 --> 00:28:59,720 + 이미지를 설명하는 것 그리고 우리가 가지고 노는 것처럼 말 + +414 +00:28:59,720 --> 00:29:02,930 + 레고 우리는 그 두 가지를 가지고 우리는 함께 그에 대응을 스틱 + +415 +00:29:02,930 --> 00:29:06,560 + 이러한 네트워크에있는 두 개의 모듈 사이에서 여기 화살표에 배운 + +416 +00:29:06,559 --> 00:29:10,639 + 이러한 이미지를 설명하기 위해 서로와 노력의 과정에서 대화 + +417 +00:29:10,640 --> 00:29:13,110 + 그라디언트는 휴대 전화에서 작동 코미디 쇼를 통해 비행한다 + +418 +00:29:13,109 --> 00:29:16,689 + 시스템은 더하기 위해 이미지를보고 자신을 조정하는 것 + +419 +00:29:16,690 --> 00:29:20,200 + 말을 설명하고 그래서이 모든 시스템은 하나 같이 함께 작동합니다 + +420 +00:29:20,200 --> 00:29:24,920 + 그래서 우리는 실제로이 클래스에 올 것이다이 모델을 위해 노력 할 것이다 것 + +421 +00:29:24,920 --> 00:29:28,279 + 바로이 부​​분이 부분에 대해 모두 떨어져 전체 이해를 + +422 +00:29:28,279 --> 00:29:31,849 + 중간 과정을 통해 대략 당신은 어떻게 교육 모델을 볼 수 있습니다 + +423 +00:29:31,849 --> 00:29:34,909 + 그 정말 우리가 구축하고있는 것에 대해 그냥 의욕이다 제외한 작동 + +424 +00:29:34,910 --> 00:29:40,290 + 당신이 확인 작업을하지만 지금은 다시 410을보고 정말 좋은 모델처럼하고있어 + +425 +00:29:40,289 --> 00:29:43,159 + 모든 분류 + +426 +00:29:43,160 --> 00:29:47,930 + 당신이이 데이터 집합 2000 작업 저스틴 라벨된다 나게 우리는있어 + +427 +00:29:47,930 --> 00:29:50,960 + 당신의 분류에 접근하려고하는 것은 우리가 파라 메트릭 방식 부르는에서입니다 + +428 +00:29:50,960 --> 00:29:55,079 + 우리가 지금 논의 된 것을 기억하는 것은 무엇인가 우리가 부르는의 인스턴스 + +429 +00:29:55,079 --> 00:29:57,439 + 비모수 적 접근 방법은 우리가 될거야 매개 변수가 없습니다 + +430 +00:29:57,440 --> 00:30:02,430 + 이러한 구분을 통해 최적화하는 것은 명확하게 인간은 또한의 뜻 + +431 +00:30:02,430 --> 00:30:04,240 + 우리가하고있는 프로젝트에 명백한 가치가있다 + +432 +00:30:04,240 --> 00:30:09,089 + 이미지를 가져와을 생성하는 기능을 구성에 대해 생각 + +433 +00:30:09,089 --> 00:30:12,769 + 클래스에 대한 점수는 바로이 어떤 이미지를 수행해야 할 우리가해야 할 것입니다 + +434 +00:30:12,769 --> 00:30:17,109 + 우리는 열 중 어느 하나를 파악하고 싶습니다 플러스 그렇게 우리가 쓰고 싶은된다 + +435 +00:30:17,109 --> 00:30:21,169 + 이미지를 소요하고 당신에게 그 두 가지를 제공하는 기능과 표현 아래로 + +436 +00:30:21,170 --> 00:30:24,529 + 숫자 만 표현은 매우뿐만 아니라 그 이미지의 기능 만입니다 + +437 +00:30:24,529 --> 00:30:28,339 + 때때로 병 W라고 이들 파라미터의 함수일 + +438 +00:30:28,339 --> 00:30:33,189 + 또한 무게라고 그래서 정말은 3072 번호로 이동하는 기능입니다 + +439 +00:30:33,190 --> 00:30:37,308 + 10 숫자에이 이미지를 우리는 우리가 정의하고 무슨 일을하는지있어 그 구성하는 + +440 +00:30:37,308 --> 00:30:42,049 + 기능 그리고 우리는이이 기능의 몇 가지 선택을 통해 이동합니다 + +441 +00:30:42,049 --> 00:30:45,589 + 첫 번째 경우는 나중에 기능을보고 한 후 작동을 제어하도록 확장됩니다 + +442 +00:30:45,589 --> 00:30:49,579 + 그리고, 우리는 상업 네트워크 그러나 직관적으로 무엇을 얻기 위해 그 연장합니다 + +443 +00:30:49,579 --> 00:30:53,379 + 우리를 구축하고 우리를 통해이 이미지를 둘 때 우리가 원하는 것은이다 + +444 +00:30:53,380 --> 00:30:57,690 + 우리의 기능 우리는 10의 점수에 해당하는 10 숫자를 싶습니다 + +445 +00:30:57,690 --> 00:31:01,150 + 가장 가까운 높은 것으로 고양이 클래스에 해당하는 번호를 싶습니다 + +446 +00:31:01,150 --> 00:31:06,330 + 다른 모든 숫자는 낮은하고있을 것이다 우리는 X를 통해 선택의 여지가 없어 + +447 +00:31:06,329 --> 00:31:11,428 + 즉, 사용자가 설정하는 무료입니다 (W) 이상 선택의 여지가 주어진 것 우리의 이미지의 역할 + +448 +00:31:11,429 --> 00:31:15,179 + 어떤을 제외하고 우리가 원하는 우리는이 기능을 허용하도록 설정하는 것이 좋습니다 싶어 + +449 +00:31:15,179 --> 00:31:19,050 + 의 우리의 훈련 데이터의 모든 하나의 이미지에 대한 우리에게 정답을 제공합니다 + +450 +00:31:19,049 --> 00:31:23,230 + 우리는 간단한 사용하는 것이 우리 가정하는 방향으로 구축하고 거의 접근 + +451 +00:31:23,230 --> 00:31:29,789 + X 그래서 여기에 간단한 단지 선형 분류는 우리의 이미지입니다 + +452 +00:31:29,789 --> 00:31:34,200 + 이 경우 잘못은 내가 고양이를 구성하는이 영상이 배열을 취하고로 + +453 +00:31:34,200 --> 00:31:38,750 + 나는 거대한 컬럼에 해당 이미지의 모든 픽셀 뻗어있어 + +454 +00:31:38,750 --> 00:31:46,920 + 3072 번호 등의 열 벡터가되도록 벡터 당신이 알고있는 경우 + +455 +00:31:46,920 --> 00:31:52,100 + 당신이이을위한 전제 
조건입니다한다 행렬 벡터 연산 + +456 +00:31:52,099 --> 00:31:55,149 + 잘 알고 있어야합니다 단지 행렬 곱셈이 있다는 클래스 + +457 +00:31:55,150 --> 00:32:00,100 + 와 기본적으로 우리는 우리가있어 3072 근육의 열 벡터이다 X를 취하고있어 + +458 +00:32:00,099 --> 00:32:03,569 + (10) 번호를 얻으려고 노력하고 당신은 뒤로 더 이상 기능을 갈 수 있도록 + +459 +00:32:03,569 --> 00:32:08,399 + 이 w의 크기는 3072 그래서 거기에 기본적으로 10입니다 파악 + +460 +00:32:08,400 --> 00:32:14,370 + 즉 W로 전환하고 30,000 772 202 번호는 우리가 통제 할 수있는 무엇 + +461 +00:32:14,369 --> 00:32:16,658 + 그것은 우리가 조정할 및 작동 찾을 필요가 무엇 + +462 +00:32:16,659 --> 00:32:21,710 + 그래서 사람들은 내가 밖으로 떠날거야이 특정한 경우에 매개 변수가되고 있습니다 + +463 +00:32:21,710 --> 00:32:26,919 + 또한 끝에 추가 된 거기에 + 이러한 편견은 편견을 가지고, 그래서 때로는 수 + +464 +00:32:26,919 --> 00:32:31,999 + 10 개 이상의 매개 변수에 대해 우리는 또한 보통 사람들을 찾을 수있다 + +465 +00:32:31,999 --> 00:32:36,098 + 선형 분류 우리가 가장 잘 작동 정확히 찾아 가지고 WNB을 가지고이 + +466 +00:32:36,098 --> 00:32:39,950 + 아기는 온에 그냥 독립적 인 대기의 이미지의 함수가 아니다 + +467 +00:32:39,950 --> 00:32:44,989 + 가능성을 그 중 하나는 당신의 질문에 다시 갈 수 있습니다 당신이 경우 + +468 +00:32:44,989 --> 00:32:50,239 + 어쩌면 당신은 대부분의 고양이 있지만 일부 개를위한 매우 불균형 데이터 집합을 + +469 +00:32:50,239 --> 00:32:54,710 + 또는 그런 일이 당신은 기대할 수있는 고양이에 대한 편견이 + +470 +00:32:54,710 --> 00:32:58,200 + 한번 분류를 기본 때문에 촉매는 약간 높을 수 있습니다 + +471 +00:32:58,200 --> 00:33:04,009 + 뭔가에 다른 뭔가를 제공하지 않는 한 촉매를 예측하는 + +472 +00:33:04,009 --> 00:33:08,069 + 하나님의 형상이, 그렇지 않으면 나는 내가 단지에 같은보다 구체적인 생각 + +473 +00:33:08,069 --> 00:33:11,398 + 그것을 분해하지만 물론 나는 그것이 매우 명시 적으로 3072 폭 시각화 할 수 없습니다 + +474 +00:33:11,398 --> 00:33:17,459 + 숫자는 그래서 우리의 입력 이미지 1024 픽셀 및 그래서 더 많은 사진을 상상 상상 + +475 +00:33:17,460 --> 00:33:21,419 + 또한 열 X에 스트레스를 우리는 세 가지 클래스 정도가 상상 + +476 +00:33:21,419 --> 00:33:27,109 + 적색, 녹색, 청색의 비용이나이 경우 W에서 매우 고양이 입양 처리합니다 + +477 +00:33:27,108 --> 00:33:30,868 + 단지 매트릭스와 우리가 여기서 일을하는지에 대한 의해 세 가지가 우리가하려는하다 + +478 +00:33:30,868 --> 00:33:36,398 + 그래서이 주요 응용 프로그램은 여기 주요 행위의 점수를 계산하는 일이야 + +479 +00:33:36,398 --> 00:33:40,608 + 우리에게 우리가 세 가지 점수를 가지고이 과정은 경로의 출력을 제공합니다 + +480 +00:33:40,608 --> 00:33:45,348 + 세 가지 다른 클래스 그래서 이것은 단지 실행 동료 w 임의 설정까지입니다 + +481 +00:33:45,348 --> 00:33:50,739 + 여기 우리는 약간의 점수가 일부 특히이 이것과 저것을 볼 수있는거야 + +482 +00:33:50,739 --> 00:33:55,639 + 때문에 마켓 w이 설정을 최대로 아주 좋은 옳지 않다 승 설정 + +483 +00:33:55,638 --> 00:34:00,449 + 96의 점수가 다른 클래스들보다 훨씬 적은 바로 그래서이 아니었다 + +484 +00:34:00,450 --> 00:34:04,720 + 그것은 매우 좋지 않다 그래서 올바르게 교육 이미지 분류 + +485 +00:34:04,720 --> 00:34:07,220 + 분류는 그래서 우리는 다른 배를 변경하려면 + +486 +00:34:07,220 --> 00:34:10,250 + 그 스코어가 다른 것보다 더 위로 오도록 다른 W를 사용할 + +487 +00:34:10,250 --> 00:34:14,409 + 사람은 그러나 우리는 전체 교육 등의 예에서 일관되게 그렇게해야 + +488 +00:34:14,409 --> 00:34:20,389 + 하지만 한 가지가 기본적으로 W뿐만 아니라 여기주의 사항 + +489 +00:34:20,389 --> 00:34:25,700 + 그것은 모든 세입자 분류 평가를 병렬로이 함수이야 + +490 +00:34:25,699 --> 00:34:28,230 + 하지만 정말 열 독립적 인 분류가 있습니다 + +491 +00:34:28,230 --> 00:34:32,210 + 여기에 어느 정도 이러한 분류의 모든 일에 고양이 말을 좋아한다 + +492 +00:34:32,210 --> 00:34:36,918 + 분류기 여기 오른쪽 첫 번째 행과 첫 번째의 W 단지 최초의 행 + +493 +00:34:36,918 --> 00:34:41,789 + 바이어스는 득점 할 수 있습니다 및 개 분류는 두 번째 행의 W하고있다 + +494 +00:34:41,789 --> 00:34:46,840 + 선박의 분기 배 + 500 WW 행렬은 모든 다른 분류가 + +495 +00:34:46,840 --> 00:34:50,889 + 스택과 장미 그리고 그들은 모든 제품을 도킹 및 이미지와 되 고있어 + +496 +00:34:50,889 --> 00:34:56,269 + 그래서 여기에 당신이 과정을 제공 선형을 무엇을 당신을 위해 질문 + +497 +00:34:56,269 --> 00:35:02,599 + 분류 영어로 할 우리는 함수 형태의 부착이를하고있다 보았다 + +498 +00:35:02,599 --> 00:35:07,589 + 정말 어떻게 든 어떤이를 해석하고 영어 있었는지가 재미 작업 + +499 +00:35:07,590 --> 00:35:28,640 + 하고있다 + +500 +00:35:28,639 --> 00:35:39,048 + X되는 높은 차원 데이터 포인트와 W는 정말 통해 평야를두고있다 + +501 +00:35:39,048 --> 00:35:43,038 + 사이트는 그 해석에 돌아오고 있지만, 어느 쪽이든은 할 수 + +502 +00:35:43,039 --> 
00:35:59,420 + 우리는이 팀 방법에 대해 생각 어디 W의 이러한 행의 모든​​ 하나 하나 + +503 +00:35:59,420 --> 00:36:03,630 + 효율적으로 우리가 이미지와 I과 이야기하지 않는이 템플릿처럼 + +504 +00:36:03,630 --> 00:36:08,608 + 내적 정말하는 방법입니다 같은 얼라이언스 무엇을 얻을 것을보고까지 자연 + +505 +00:36:08,608 --> 00:36:17,960 + 어떤 다른 방법으로 + +506 +00:36:17,960 --> 00:36:42,088 + 우리가 할 수있는 것은 공간의 위치 인덱스의 경우 일부이기 때문에 두 위치 + +507 +00:36:42,088 --> 00:36:44,838 + 우리는 분류가 될 것 제로 가중치가 + +508 +00:36:44,838 --> 00:36:50,329 + 여기이 부분은 다음 아무것도 이미지 때문에 50 대기의 일부에 무슨 상관하지 않는다 + +509 +00:36:50,329 --> 00:36:53,389 + 영향을하지만 당신의 이미지의 다른 부분에 대한 양 또는 음이 + +510 +00:36:53,389 --> 00:36:58,118 + 무게는 뭔가거야이 일어날 등의 점수에 기여 + +511 +00:36:58,119 --> 00:37:23,200 + 라벨 공간의 공간을 기술하는 방법 + +512 +00:37:23,199 --> 00:37:33,009 + 그래서 질문이 우리가 가진 입체 지형 등 때문에 이미지 + +513 +00:37:33,010 --> 00:37:37,369 + 그냥 들것이 의심이 모든 채널을 모든 당신은 그것을 스트레칭 + +514 +00:37:37,369 --> 00:37:41,849 + 당신이 좋아하는 어떤 방법은 녹색 빨간색과 파란색 부분을 나란히 시작하는 말 + +515 +00:37:41,849 --> 00:37:46,030 + 단지 당신은 당신이 좋아하는 어떤 방법으로하지만, 일관된 방법으로 그것을 스트레칭 + +516 +00:37:46,030 --> 00:37:49,930 + 모든 이미지는 당신이 읽을 수있는 방법으로 직렬화하는 방법을 알아낼 + +517 +00:37:49,929 --> 00:37:55,779 + 또한 그에게 전화하는 데 사용되는 사진 오프 + +518 +00:37:55,780 --> 00:38:05,060 + 확인 확인 그래서이 끔찍한 저희는 픽셀 그레이 스케일 이미지의가 있다고 가정하자 + +519 +00:38:05,059 --> 00:38:09,420 + 예를 들어 당신은 내가 싶어 사람들, 특히 때문에 혼동하지 말아 그것을 생각 + +520 +00:38:09,420 --> 00:38:12,539 + 내가 적색, 녹색, 청색이이 그림을 만든 후 사람이 나중에 나에게 지적 + +521 +00:38:12,539 --> 00:38:15,150 + 가장 가까운 두 개의 색상 채널 그러나 여기 적색, 녹색, 청색의 과정에 있습니다 + +522 +00:38:15,150 --> 00:38:21,380 + 내가하지 색상 채널 그냥 사과 있도록이 내 부분에 완전한 나사 - 최대 + +523 +00:38:21,380 --> 00:38:33,769 + 그 괜찮에 대한 세 가지 다른 색깔의 가장 가까운 죄송합니다 + +524 +00:38:33,769 --> 00:38:47,309 + 큰 정확히 우리가 모두 하나의 크기의 열 벡터가 될 수 있도록 어떻게 + +525 +00:38:47,309 --> 00:38:52,369 + 대답은 항상 항상 기본적으로 같은 크기의 우리로 이미지 크기를 조정하다 + +526 +00:38:52,369 --> 00:38:56,190 + 쉽게 우리가 들어갈 수있는 그냥 주말보다 다른 크기를 처리 할 수​​ 없습니다 + +527 +00:38:56,190 --> 00:38:59,789 + 나중에 그러나 가장 간단한 것은 단지 하나 하나의 크기를 조정라고 생각합니다 + +528 +00:38:59,789 --> 00:39:04,460 + 우리가 모든 것을 보장하기 원하기 때문에 이미지가 간단한 것은 동일 크기를 정확하기 + +529 +00:39:04,460 --> 00:39:08,470 + 그들의 종류의 우리가 이것들을 할 수 있도록 동일한 물건의 필적 + +530 +00:39:08,469 --> 00:39:12,049 + 열과 우리는 공간에 정렬 학교 패턴을 분석 할 수 있습니다 + +531 +00:39:12,050 --> 00:39:18,380 + 당해 수집기 사실 상태가 실제로 작동하는 방법이있다 + +532 +00:39:18,380 --> 00:39:21,650 + 하나의 사각형 이미지가 매우 긴을 가지고있는 경우 이러한 방법 것 때문에 + +533 +00:39:21,650 --> 00:39:25,480 + 그들 중 많은 사람들이 그들이 할 것은 그 무엇을의 IT 스쿼시하기 때문에 실제로 나쁜 일 + +534 +00:39:25,480 --> 00:39:30,789 + 우리는 내가 단지에 노력 파노라마처럼 매우 긴 느낌 때문에 여전히 잘 공정하게 작동 할 + +535 +00:39:30,789 --> 00:39:34,059 + 일부 온라인 서비스의 기회가 더 내 일처럼 어딘가에 넣어 + +536 +00:39:34,059 --> 00:39:36,679 + 그들은 아마도 그들이 그것을 할 것을 온을 통해 그것을 넣을 것이기 때문에 + +537 +00:39:36,679 --> 00:39:41,129 + 이러한 의견은 항상 사각형에서 작동하기 때문에 광장 당신은 그들이 일을 할 수 있습니다 + +538 +00:39:41,130 --> 00:39:45,490 + 그건 그냥 일반적으로 다른 질문이 무슨 연습이야 아무것도하지만에 + +539 +00:39:45,489 --> 00:39:58,199 + 젖꼭지 승을 해석하는 것은 그래 각각의 영상이 통과 + +540 +00:39:58,199 --> 00:40:04,109 + 하고 싶은 다른 사람이 해석하는 정도 다른 방법은 실제로 그것을 하나를 넣어 + +541 +00:40:04,110 --> 00:40:07,150 + 방법 나는 듣지 않았다 그러나 그것은 또한이다보고의 좋은 방법 있음 + +542 +00:40:07,150 --> 00:40:12,769 + 기본적으로 모든 단일 스코어가 모든 화소 값들의 단순한 가중 합이고 + +543 +00:40:12,769 --> 00:40:16,489 + 이미지와 이러한 비율은 우리가 결국 그 선택에 도착하지만 난 그냥 + +544 +00:40:16,489 --> 00:40:20,559 + 거대한 가중 합은 정말 그것이 바로 색상을오고있다하고있어 전부 + +545 +00:40:20,559 --> 00:40:25,779 + 그렇게 하나의 방법 한 가지 방법으로 서로 다른 공간 위치에서 색상을오고 + +546 +00:40:25,780 --> 00:40:29,500 + 그 우리가 분류 콘크리트 승이 해석 할 수있는 방법의 측면에서 자랐다 + +547 +00:40:29,500 --> 00:40:33,170 + 그렇게 여기에 템플릿 매칭 것 같은 비트가 무엇을의 가지처럼이다 + +548 
+00:40:33,170 --> 00:40:37,059 + 나는 분류를 훈련 한 적이 그리고 당신이 그것을 수행하는 방법 쇼가 아직 있지만 + +549 +00:40:37,059 --> 00:40:41,920 + 내 가중치 행렬을 훈련 한 다음 다시 두 번째로 돌아와 I 밖으로 가지고있어 모든 + +550 +00:40:41,920 --> 00:40:45,010 + 우리가 난 모든 단일 분류를 배운 그 행을 하나 하나 + +551 +00:40:45,010 --> 00:40:46,599 + 끝으로 다시 재편 + +552 +00:40:46,599 --> 00:40:51,809 + 나는 그것을 시각화 할 수 있도록 그래서 나는 원래 그냥 거대한 블로우 업 3072을 데려 갈거야 + +553 +00:40:51,809 --> 00:40:55,650 + 우리는 왜곡을 취소 할 이미지로 다시 발송 번호 내가 수행하고 + +554 +00:40:55,650 --> 00:40:59,660 + 나는이 모든 템플릿이 때문에 예를 들어 당신이 여기에 참조하는 것입니다 + +555 +00:40:59,659 --> 00:41:04,659 + 면 그것은 당신이 파란색 얼룩을 볼 수있는 이유는 여기에 파란색 얼룩 같은가요 당신 경우 + +556 +00:41:04,659 --> 00:41:08,278 + 당신은에있는 것을 볼이 비행기 템플릿의 색상 채널에서 보았다 + +557 +00:41:08,278 --> 00:41:11,440 + 파랑 채널 당신은 긍정적 인 무게의 제비가 그 양의 무게 때문에 + +558 +00:41:11,440 --> 00:41:15,479 + 그럼 그들이 나에게 값을 볼 경우 그들은 사람들과 상호 작용하고 그들은 조금 얻을 + +559 +00:41:15,478 --> 00:41:19,338 + 점수에 기여 그래서이 비행기 분류 정말 그냥 계산 + +560 +00:41:19,338 --> 00:41:23,159 + 모든 특별 행사에서하고있는 경우 이미지의 파란색 물건의 양 + +561 +00:41:23,159 --> 00:41:26,368 + 당신이를 찾을 수있는 평면 분류의 빨간색과 녹색 채널을보고 + +562 +00:41:26,369 --> 00:41:30,499 + 0 값 또는 음의 값 바로 그 계획의 분류입니다 + +563 +00:41:30,498 --> 00:41:35,098 + 가격이 모든 다른 이미지가 개구리 말을하는 당신은 거의 템플릿을 볼 수 있습니다 + +564 +00:41:35,099 --> 00:41:38,900 + 프라하의 그것 권리는 녹색 물건이 일부 녹색 불가사리를 찾고 + +565 +00:41:38,900 --> 00:41:42,849 + 긍정적 여기에 무게와 그 다음 우리는 측면에 약간의 갈색 불가사리 사물을 + +566 +00:41:42,849 --> 00:41:49,599 + 그 이미지와 내적 위에 엉덩이를 얻을 경우 그래서 높은 점수를 얻을 것이다 + +567 +00:41:49,599 --> 00:41:51,430 + 여기에서 주목해야 할 것은 이것 좀입니다 + +568 +00:41:51,429 --> 00:41:56,588 + 또한 듣고 자동차의 아주 같은 좋은 템플릿 아닙니다 차 분류 + +569 +00:41:56,588 --> 00:42:01,679 + 말은 즉, 상기 거짓말을 찾고 차가이었다까지 무슨 조금 이상한 보인다 + +570 +00:42:01,679 --> 00:42:11,048 + 보고 말 이상한 그래 기본적으로 그가에 무슨 일이 일어나고 있는지의 예 + +571 +00:42:11,048 --> 00:42:14,998 + 데이터는 말 누군가가 어딘가에 오른쪽이 분류 왼쪽 직면 + +572 +00:42:14,998 --> 00:42:19,028 + 실제로 매우 강력한 분류가 아니고,이있는 두 가지 모드를 결합하는 + +573 +00:42:19,028 --> 00:42:22,179 + 두 향하고 말에 우리와 함께 머물 동시에 두 일을하는 + +574 +00:42:22,179 --> 00:42:25,879 + 거기에이 결과는 아마 더있을 바로 그 때 당신은 실제로 그런 말을 할 수 있습니다 + +575 +00:42:25,880 --> 00:42:30,599 + 강한 그들은 또한이기 때문에 오른쪽에있는 항구에서 말을 왼쪽에 직면 + +576 +00:42:30,599 --> 00:42:35,219 + 자동차의 권리를 위해 우리는 왼쪽이나 오른쪽 또는 전방 45도 같은 차를 가질 수있다 + +577 +00:42:35,219 --> 00:42:40,588 + 여기에이 분류 모든 병합 같은에서 혼합하는 최적의 방법입니다 + +578 +00:42:40,588 --> 00:42:43,608 + 그 때문에 하나의 템플릿에 해당 모드가 여기서 할 그것을 강요 + +579 +00:42:43,608 --> 00:42:46,900 + 그들은이없는 우리가 실제로 그건 일을하는지와 신경망 + +580 +00:42:46,900 --> 00:42:50,239 + 실제로 원칙적으로 할 수 있습니다 단점은 그들에 대한 템플릿을 가질 수있다 + +581 +00:42:50,239 --> 00:42:53,338 + 카드가 그들에게 더 많은 힘을주고 그들을 통해 결합 곧이 차 + +582 +00:42:53,338 --> 00:42:56,478 + 실제로 더 적절하지만 지금이 분류를 수행하는 + +583 +00:42:56,478 --> 00:42:57,808 + 우리는이 제약된다 + +584 +00:42:57,809 --> 00:43:08,239 + 문제 + +585 +00:43:08,239 --> 00:43:18,389 + 예 뭔가 그래서 우리가 정확하게 수행되지 않을 것이다 기차 시간이 될 것인가 + +586 +00:43:18,389 --> 00:43:21,349 + 그들이 그들을 훔쳐을 스트레칭과 우리가 모든 것을 바꾸어됩니다 생성 + +587 +00:43:21,349 --> 00:43:25,979 + 즉, 그래 난 것 때문에 아주 잘 작동 점점의 큰 부분이 될 것 + +588 +00:43:25,978 --> 00:43:30,038 + 우리가 가고 있다는 것을 변경됩니다 모두를위한 그 물건의 엄청난 금액을 수행 할 수 + +589 +00:43:30,039 --> 00:43:33,469 + 회전하기 때문에 선박의 다른 많은 교육 사례를 규명하고, + +590 +00:43:33,469 --> 00:43:47,009 + 스튜와 이러한 템플릿 체인 평균을 복용하는 방법이 훨씬 더 잘 작동 + +591 +00:43:47,009 --> 00:43:56,969 + 사람 당신은 방법 당신의 세트를 명시 적으로 템플릿을 설정하고 싶은 있도록 + +592 +00:43:56,969 --> 00:44:01,068 + 템플릿은 모든 이미지에 걸쳐 평균이고, 그 템플릿된다 + +593 +00:44:01,068 --> 00:44:13,918 + 그래 그래서이 분류는 그들이 내가 추측하는 것입니다 비슷한 할 것이다 결합 + +594 +00:44:13,918 --> 00:44:18,489 + 분류 당신은 마이클 볼 때 있기 때문에 더 작동합니다 + +595 
+00:44:18,489 --> 00:44:22,028 + 이전에 그것을 위해 최적화 무엇 나는 그가 최소있을 것입니다 생각하지 않습니다 + +596 +00:44:22,028 --> 00:44:26,179 + 당신이 이미지의 단지 분에 설명하지만 직관적 같은 것 + +597 +00:44:26,179 --> 00:44:30,079 + 에 리 괜찮은 발견 아마도 그 초기화 또는 분할에 기다립니다 + +598 +00:44:30,079 --> 00:44:34,239 + 그것은 어떤 관련 + +599 +00:44:34,239 --> 00:44:40,349 + 그래하지만 우리는 내가 그들의 몇 가지로 돌아갈 수있을거야 해당 갈 수 있습니다 + +600 +00:44:40,349 --> 00:44:43,980 + 몇 가지 + +601 +00:44:43,980 --> 00:45:06,650 + 에 아마 빨간색 자동차가 있다는 것을 말하고 다른 색상의 빨간색 + +602 +00:45:06,650 --> 00:45:11,750 + 데이터 세트는과 노란색 카드는이를 위해 수 있습니다 실제로 당신을 위해 작동하지 않을 수 있습니다 + +603 +00:45:11,750 --> 00:45:16,909 + 시간은 그래서이 일을 그냥 이유입니다이 모든 것을 할 수있는 능력이 없습니다 + +604 +00:45:16,909 --> 00:45:19,989 + 제대로 그래서 모든 다른 모드를 캡처 할 수있는 충분히 강력한 + +605 +00:45:19,989 --> 00:45:23,689 + 이것은 단지 어디있어 그 이상의 빨간색 차를 거기에 숫자 후 이동합니다 + +606 +00:45:23,690 --> 00:45:28,389 + 이 그레이 스케일 인 경우 그 그가거야 더 잘 작동한다면 잘 모르겠어요 갈 것입니다 + +607 +00:45:28,389 --> 00:45:40,368 + 내가 불균형에 대해 언급 한 바와 같이 실제로 당신이 예상 다시 그에게 올 + +608 +00:45:40,369 --> 00:45:42,190 + 당신이 기대하는 어​​떤 데이터 세트 + +609 +00:45:42,190 --> 00:45:49,150 + 정확히 당신이 고양이를 많이 기대하는 것은 고양이 바이어스가 될 것입니다 + +610 +00:45:49,150 --> 00:45:53,750 + 높은 때문에이 분류는 단지 큰 숫자에 사용되는이 클래스 + +611 +00:45:53,750 --> 00:45:57,980 + 손실에 기초하지만 우리는 볼을 정확하게에 손실 함수로 가야 어떻게 + +612 +00:45:57,980 --> 00:46:01,929 + 그것은 지금 말할 하드 그래서 밖으로 재생됩니다 + +613 +00:46:01,929 --> 00:46:05,960 + 또한 다른 사람이 지적 분류의 또 다른 해석 + +614 +00:46:05,960 --> 00:46:09,869 + 내가 지적하고 싶은 당신은 매우 높은 차원으로 이러한 이미지 생각할 수있다 + +615 +00:46:09,869 --> 00:46:17,619 + 바로 3072 픽셀 공간 공간마다 이미지로 3072 차원 공간에서 점 + +616 +00:46:17,619 --> 00:46:22,130 + 점이며, 이러한 선형 분류 걸쳐 이러한 그라데이션을 설명하는 + +617 +00:46:22,130 --> 00:46:25,070 + 이 점수이있는 삼천 뭔가 2 차원 + +618 +00:46:25,070 --> 00:46:28,580 + 영역 및 공간에서 일부 주류 방향에 따른 긍정적 부정적 + +619 +00:46:28,579 --> 00:46:33,670 + 그래서 여기에 예를 들어 분류에 대한 I는 W의 첫 번째 행을 데려 갈거야 + +620 +00:46:33,670 --> 00:46:37,750 + 자동차 클래스와 여기에 라인에가의 제로 레벨 세트를 표시한다 + +621 +00:46:37,750 --> 00:46:42,739 + 자동차 분류 그 라인을 오랫동안 즉 분류가 0 점수가 + +622 +00:46:42,739 --> 00:46:46,849 + 그래서 차 분류기는 20을 표시하고 화살표가 그들의 갖는다 + +623 +00:46:46,849 --> 00:46:51,730 + 더 많은 공간으로 착색되는 방향을 따라 + +624 +00:46:51,730 --> 00:46:56,400 + 우리는이 예에서 세 가지 분류가 점수 유사 활용 + +625 +00:46:56,400 --> 00:46:59,900 + 특정 레벨 설정과 그들이 이러한 기울기에 반응하고 + +626 +00:46:59,900 --> 00:47:05,650 + 그들은 기본적으로 그들은 공간에있는 모든 끼 경우에 이동하려는 및 + +627 +00:47:05,650 --> 00:47:08,970 + 우리는 다음 초기화 이들 지역의 공급 업체가 임의로이 차 분류는 것 보았다 + +628 +00:47:08,969 --> 00:47:11,969 + 그 수준이 무작위로 설정되어 우리가 실제로 작업을 수행 할 때 당신은 볼 수 있습니다 + +629 +00:47:11,969 --> 00:47:16,449 + 우리가 최적화로 최적화이 당신의 시프트 차례 동물성 단백질을 시작합니다 + +630 +00:47:16,449 --> 00:47:20,239 + 자동차 클래스를 분리하고이 분류를 보는 재미를 좋아합니다 + +631 +00:47:20,239 --> 00:47:25,038 + 이 박사 지킬과 의지를 건너 차에 스냅됩니다 회전하기 때문에 훈련 + +632 +00:47:25,039 --> 00:47:28,528 + 그건 물론 모든 지키는에서 모든 차량을 분리하고자 시도 + +633 +00:47:28,528 --> 00:47:33,289 + 보고 정말 재미 그래서 그 확인을 해석하는 또 다른 방법입니다 + +634 +00:47:33,289 --> 00:47:37,130 + 여기에 당신이 모든 해석은 매우 될 것 주어진에 대한 질문입니다 + +635 +00:47:37,130 --> 00:47:43,028 + 이러한 젖꼭지 하드 당신이 정말로 정말로 일을 기대하는 것이 무엇 작동 + +636 +00:47:43,028 --> 00:47:51,909 + 잘 선형 분류와 + +637 +00:47:51,909 --> 00:48:05,230 + 동시 원은 우리의 가장 가까운 참조 나는 그래서 당신이있어 볼 정확히 어떻게 수업은 + +638 +00:48:05,230 --> 00:48:10,349 + 설명을 찾아 이미지에 공간이 해석에 + +639 +00:48:10,349 --> 00:48:15,630 + 하나의 클래스에 그렇게 난 주위에 같은 다른 클래스 다음 얼룩에와 것 + +640 +00:48:15,630 --> 00:48:19,880 + 즉, 예를 경우 실제로 공간 만 같을 것이다 정확하게 확실하지 + +641 +00:48:19,880 --> 00:48:22,869 + 당신은 내가 그를 분리 할 수​​ 없습니다 굉장이 경우 병원에 맞아 + +642 +00:48:22,869 --> 00:48:26,920 + 하지만 이미지는 것 당신처럼 대해 같은 측면에서 보일 것입니다 무슨 + +643 
+00:48:26,920 --> 00:48:31,079 + 스튜디오 설치 이미지를 보면 분명히 나중에 분류 아마 것이라고 + +644 +00:48:31,079 --> 00:49:02,380 + 나중에있어 여기에 아주 잘하지 + +645 +00:49:02,380 --> 00:49:39,210 + 훈련을 분류하고 나는 그것을 그것의 부정적인 이미지를 부정 할 것을 + +646 +00:49:39,210 --> 00:49:42,699 + 당신은 여전히​​ 가장자리를 참조 분류하고 괜찮 그 비행기 말할 수 있습니다 + +647 +00:49:42,699 --> 00:49:45,710 + 분명히 모양 대대 분류 모든 색상이 될 것이다 + +648 +00:49:45,710 --> 00:49:49,760 + 정확히 잘못 때문에 비용이 그 비행기를 싫어 + +649 +00:49:49,760 --> 00:50:02,330 + 예 + +650 +00:50:02,329 --> 00:50:12,630 + 개는 개, 오른쪽에 하나의 가장 가까운 개를 개 및 해당 될 것이라고 생각 + +651 +00:50:12,630 --> 00:50:27,090 + 문제의 권리 + +652 +00:50:27,090 --> 00:50:32,829 + 문제가 될 것입니다 흰색 배경이나 뭔가가 문제의 I되지 않을 것 + +653 +00:50:32,829 --> 00:50:37,059 + 문제가되지 않을 것입니다 + +654 +00:50:37,059 --> 00:50:52,570 + 변환 + +655 +00:50:52,570 --> 00:50:56,789 + 당신이 더 어려울 수 있습니다 말하고있는 일이 될 것이다 당신의 개 우리의 작업을하는 경우 + +656 +00:50:56,789 --> 00:51:00,309 + 어떤면에서는 클래스에 따라 왜 당신이 만약 실제로 문제가되지 않을 것입니다 + +657 +00:51:00,309 --> 00:51:04,279 + 실제로이없는 오른쪽 중앙에 뭔가 일을 + +658 +00:51:04,280 --> 00:51:08,840 + 그에 특히 최대 이해하는 것은 실제로 오른쪽이 될 것이다 발견 + +659 +00:51:08,840 --> 00:51:15,769 + 당신이 중간에 긍정적 인 가중치를해야하기 때문에 상대적으로 쉽게 + +660 +00:51:15,769 --> 00:51:25,219 + 그래 + +661 +00:51:25,219 --> 00:51:34,348 + 그것은 여기 정말이가 무엇을하고 있는지 무엇을하고 있는지 예 그래서 이것은 정말 정말 + +662 +00:51:34,349 --> 00:51:38,619 + 그것은 놨어요 색상 및 특수 위치 아무것도오고 카운트 업이야 + +663 +00:51:38,619 --> 00:51:41,800 + 이건 정말 힘들 것입니다 당신이 있던 경우에 실제로 지점으로 돌아갑니다 + +664 +00:51:41,800 --> 00:51:44,300 + 작동하는 방법과 설정 그레이 스케일 데이터 + +665 +00:51:44,300 --> 00:51:48,070 + 아니 아주 잘 당신이 지금까지 볼 수 있다면 우리의 고객은 아마 작동하지 않습니다와 함께 + +666 +00:51:48,070 --> 00:51:53,250 + 10 당신은 제조 또는 그레이 스케일은 동일한 분류 그레이 스케일을 수행 + +667 +00:51:53,250 --> 00:51:56,059 + 당신이에서 선택할 수 없기 때문에 이미지는 아마 정말 끔찍하게 작동합니다 + +668 +00:51:56,059 --> 00:52:00,739 + 색상은 이제 이러한 질감과 미세한 세부 사항을 데리러해야하고 + +669 +00:52:00,739 --> 00:52:03,848 + 그들이 할 수 있기 때문에 그냥 아주 위치가 없습니다를 지역화 할 수 없습니다 + +670 +00:52:03,849 --> 00:52:08,400 + 일관되게 재해의 종류 것 건너 와서 + +671 +00:52:08,400 --> 00:52:11,660 + 당신이 모든 말을이있는 경우 또 다른 예는 서로 다른 질감을 것입니다 당신의 + +672 +00:52:11,659 --> 00:52:16,989 + 텍스트는 파란색하지만이 정말하지 않습니다이 텍스트는 다른 종류의 수 + +673 +00:52:16,989 --> 00:52:20,799 + 같은 이러한 두 가지 유형의 말을하지만 그들은 공간적으로 불변 일 수있다 + +674 +00:52:20,800 --> 00:52:29,740 + 그, 그래서 그냥 내가 거의가 생각하는 당신을 생각 나게 얻을 끔찍한 끔찍한 것 + +675 +00:52:29,739 --> 00:52:35,269 + 우리가보고있는 특정 케이스와 W에 있도록이 기능을 찾을 것입니다 + +676 +00:52:35,269 --> 00:52:38,588 + 몇 가지 테스트 이미지는 우리가 밖으로 약간의 점수를 받고 그냥 기대하고 + +677 +00:52:38,588 --> 00:52:43,070 + 우리는 지금 향하고 약간의 모든 일부 점수를 얻기 위해 w를 설정 함께있어 + +678 +00:52:43,070 --> 00:52:47,470 + 이미지 등이 이미지에서 우리가보고있는 w는이 설정을 위로 예를 들어, + +679 +00:52:47,469 --> 00:52:51,319 + 고양이 점수는 2.9하지만 나는 높은 점수있어 몇 가지 클래스가 있음 + +680 +00:52:51,320 --> 00:52:54,588 + 이들은 매우 좋은 권리는 아니지만 일부 클래스가 부정적인 점수를 그래서 개 같은 + +681 +00:52:54,588 --> 00:52:59,909 + 이 종류의이 대기에 대한 중간 결과입니다 그래서이 이미지의 선한 + +682 +00:52:59,909 --> 00:53:04,199 + 여기에서이 이미지를 우리는이 자신에 대한 그 차 클래스 단지 올바른 참조 + +683 +00:53:04,199 --> 00:53:08,439 + 이 이미지에 너무 잘 W 작업을 방문 그렇게 쓸 것입니다 가장 높은 점수 + +684 +00:53:08,440 --> 00:53:14,940 + 여기에 우리는 클래스가 너무 끔찍 그에 우리가있어, 그래서 매우 낮은 점수 것을 볼 + +685 +00:53:14,940 --> 00:53:19,990 + 지금 향했다 우리는 우리가 손실의 기능이 손실을 부르는 정의려고하고있다 + +686 +00:53:19,989 --> 00:53:23,899 + 함수는 우리가 지금 좋은 또는 나쁜 생각 무엇을이 직관을 정량화한다 + +687 +00:53:23,900 --> 00:53:26,440 + 우리가이 숫자를 째려하는 것은 무슨 무슨 좋은 말 + +688 +00:53:26,440 --> 00:53:29,490 + 실제로 우리에게 수식을 기록하는 + +689 +00:53:29,489 --> 00:53:35,949 + 바로 이러한 우리의 테스트에서 w를 설정처럼 나쁜 12.5 또는 1220 무엇이든 + +690 +00:53:35,949 --> 00:53:40,469 + 우리가 구체적으로 정의한 후, 일단 우리가 될 것하고 있기 때문에 나쁜 또는 
110 나쁜 + +691 +00:53:40,469 --> 00:53:44,318 + 그 forw 찾고 손실을 최소화하고는 같은 방법으로 설정한다는 + +692 +00:53:44,318 --> 00:53:48,500 + 다음도 제로 말처럼 당신은 매우 낮은 숫자의 손실이있을 때 + +693 +00:53:48,500 --> 00:53:53,760 + 정확하게 모든 이미지를 분류하지만 당신은 매우 높은 손실이있는 경우 + +694 +00:53:53,760 --> 00:53:56,970 + 모든 것이 W에 엉망이 전혀 우리가 많이 찾을거야 좋지 않다 + +695 +00:53:56,969 --> 00:54:01,059 + 실제로 모두에서 매우 잘 수행하는 것이 야 승 조치는 다음 다른 찾아 + +696 +00:54:01,059 --> 00:54:03,469 + 그렇게 그 대략 무엇을오고있어 + +697 +00:54:03,469 --> 00:54:09,108 + A는 정량화 방법이다 잘 정의 손실 함수는 HW가 얼마나 나쁜 정할 + +698 +00:54:09,108 --> 00:54:13,328 + 우리의 데이터 세트에 전체 학습 집합의 함수로 손실 함수와 + +699 +00:54:13,329 --> 00:54:19,900 + 당신의 속도는 우리는 잡초 제어의 전송을 제어 할 수 없습니다 + +700 +00:54:19,900 --> 00:54:22,960 + 우리는 어떻게 효율적를 찾기 위해 최적화하는 과정에서 볼거야 + +701 +00:54:22,960 --> 00:54:27,420 + 모든 이미지에서 작동 우리에게 매우 낮은을 제공합니다 가중치 w의 세트 + +702 +00:54:27,420 --> 00:54:30,940 + 손실은 결국 우리가 무엇을 할 거 야 우리가 가서 이것 좀 봐 것입니다 + +703 +00:54:30,940 --> 00:54:34,250 + 우리가 보았던 식 분류 우리는 함께 간섭 시작하는거야 + +704 +00:54:34,250 --> 00:54:38,260 + 여기에 기능 그래서 우리는 간단하지 노력을 소비거야 당신의 + +705 +00:54:38,260 --> 00:54:41,349 + 표현하지만 우리는 조금 더 복잡한 운동을 얻을 수 있도록거야 + +706 +00:54:41,349 --> 00:54:44,630 + 그리고, 우리는 조금 더 복잡하고 운동 연합을 얻을 수 있습니다 + +707 +00:54:44,630 --> 00:54:48,789 + 하지만, 그 전체 프레임 워크는 모든 시간이 될 것입니다 변하지 남아있을 것입니다 + +708 +00:54:48,789 --> 00:54:52,389 + 경쟁이 과정 역기능 형식이 변경 될 수 있지만, 우리는 거 야 + +709 +00:54:52,389 --> 00:54:56,909 + 어떤 종류의 코스 일부 기능을 통해 더 정교하게 만들 것 + +710 +00:54:56,909 --> 00:55:01,179 + 초과 근무 후 우리는 약간의 손실 함수를 식별하고 우리가보고있는 것을 + +711 +00:55:01,179 --> 00:55:04,449 + 예비 선거는 매우 낮은 손실을 부여하고이 설정이 될 것입니다 무엇을 기다립니다 + +712 +00:55:04,449 --> 00:55:09,710 + 다음 손실 기능에 모양 앞으로 그래서 다음 수업을가는 작업 + +713 +00:55:09,710 --> 00:55:13,730 + 우리는 그래서 이것이 나의 마지막 빛 추측있어 그 아스날 에미리트 소득에 갈거야 + +714 +00:55:13,730 --> 00:55:23,920 + 그래서 어떤 마지막 질문에 걸릴 수 있고, + +715 +00:55:23,920 --> 00:55:36,068 + 죄송합니다 죄송합니다 죄송합니다 나는 듣지 않았다 + +716 +00:55:36,068 --> 00:55:41,969 + 프로젝트 최적화 야당 설정하면 작동 할 수 있습니다에 때때로 + +717 +00:55:41,969 --> 00:55:45,429 + 이러한 혁신적인 접근은 기본적으로이 거 우리를 작동 방법입니다 + +718 +00:55:45,429 --> 00:55:49,598 + 우리는 항상 랜덤 W로 시작합니다 참조하면 그것은 우리에게 손실을 줄 것이다 있도록 + +719 +00:55:49,599 --> 00:55:53,249 + 그리고, 우리 우리의 바로 최고의 세트를 찾는 과정이 없습니다 + +720 +00:55:53,248 --> 00:55:57,509 + 무게는하지만 우리는 반복적으로 약간을 개선하는 방법을해야합니까 + +721 +00:55:57,509 --> 00:56:01,309 + 무게가 너무 작은 우리가 손실 함수에서 보면보고 ​​그라데이션을 찾을 수 + +722 +00:56:01,309 --> 00:56:06,380 + 공간과 우리가하는 방법을 알고 무엇을 약간 우리를 어떻게되어 아래로 행진한다 + +723 +00:56:06,380 --> 00:56:09,890 + 우리가 단지 구입의 문제를 수행하는 방법을 모르는 가중치의 세트를 향상 + +724 +00:56:09,889 --> 00:56:12,858 + 바로 통해 가장 좋은 방법은 우리가 그렇게하는 방법을 모르겠어요 특히 때문에 + +725 +00:56:12,858 --> 00:56:17,108 + 이러한 기능은 매우 복잡 할 때 거대한 풍경의 인터콤을 좋아한다 + +726 +00:56:17,108 --> 00:56:31,038 + 그 단지 매우 다루기 힘든 문제 귀하의 질문에 내가 어떻게 잘 모르겠어요 것입니다 + +727 +00:56:31,039 --> 00:56:40,170 + 우리가 여기에 너무 너무 좋아 색상 문제를 처리 할 우리는 선형 것을보고 + +728 +00:56:40,170 --> 00:56:44,809 + 자동차에 대한 분류는 기본적으로 자동차와 신경망이 빨간색 템플릿했다 + +729 +00:56:44,809 --> 00:56:47,619 + 우리가 할 거 야 우리가 당신이있을 때 당신이 적층로 볼 수 있습니다 것 만날입니다 + +730 +00:56:47,619 --> 00:56:50,818 + 어느 정도 분류 그래서 그것이 모든 것입니다 일을 결국 무슨 + +731 +00:56:50,818 --> 00:56:55,748 + 이 길을가는 정말 임대 자동차 자동차 자동차 자동차에 대한이 작은 템플릿 또는 + +732 +00:56:55,748 --> 00:56:58,248 + 그 방법 또는 그 방법으로는 기술에 할당됩니다 거기에 모든 사람의 + +733 +00:56:58,248 --> 00:57:01,399 + 이러한 다양한 모드는 다음 그들은 두 번째에 그에서 결합됩니다 + +734 +00:57:01,400 --> 00:57:04,739 + 이러한 서로 다른 종류의 코스를 찾고있다 그래서 기본적 층 + +735 +00:57:04,739 --> 00:57:08,588 + 다음에 내년 난 그냥 경우를 말할 수있는 방법을 바로 확인처럼 될 것입니다 너희들 + +736 +00:57:08,588 --> 00:57:13,548 + 일이나 이상 동작, 그리고, 우리는 모든 모드에서 차를 검출 할 수있다 + 
+737
+00:57:13,548 --> 00:57:17,498
+ 자신의 위치의 대략 숙제의 의미가 있습니다
+
diff --git a/captions/Ko/Lecture3_ko.srt b/captions/Ko/Lecture3_ko.srt
new file mode 100644
index 00000000..0ab59137
--- /dev/null
+++ b/captions/Ko/Lecture3_ko.srt
@@ -0,0 +1,3600 @@
+1
+00:00:00,000 --> 00:00:05,400
+오늘 수업 내용인 Loss function과 Optimization을 시작하기에 앞서서
+
+2
+00:00:05,400 --> 00:00:09,429
+몇가지 공지할 사항들이 있습니다.
+
+3
+00:00:09,429 --> 00:00:12,859
+첫번째 숙제 기한이 다음주 수요일까지입니다.
+
+4
+00:00:12,859 --> 00:00:18,100
+약 9일정도 남아있구요 다음주 월요일은 휴일이기 때문에
+
+5
+00:00:18,100 --> 00:00:23,050
+수업과 오피스 아워가 없습니다. 이에 맞춰서 계획해서
+
+6
+00:00:23,050 --> 00:00:25,920
+숙제를 제 시간 안에 끝내길 바라구요.
+
+7
+00:00:25,920 --> 00:00:29,960
+Late day 룰을 숙제 기한들에 맞춰 잘 사용하길 바랍니다.
+
+8
+00:00:29,960 --> 00:00:35,149
+이제 수업을 시작합시다. 첫번째로 어디까지 진행했는지 보면..
+
+9
+00:00:35,149 --> 00:00:39,100
+저번 시간에 이 시각 인식(Visual Recognition) 문제 중
+
+10
+00:00:39,100 --> 00:00:42,950
+이미지 분류(Image classification)를 보았고, 이 문제가 실제로는
+
+11
+00:00:42,950 --> 00:00:45,780
+매우 어려운 문제였습니다. 모든 변화들의 조합(cross product)을 다루는 것이
+
+12
+00:00:45,780 --> 00:00:54,198
+고양이와 같은 카테고리를 Robust하게 분류하는 데 필요했던 것을 고려한다면
+
+13
+00:00:54,198 --> 00:00:58,049
+풀기에 매우 어려운 문제처럼 보였지만, 이제는
+
+15
+00:00:58,049 --> 00:01:02,108
+수천개의 카테고리를 위한 문제도 풀 수 있고
+
+16
+00:01:02,109 --> 00:01:05,859
+최신의 기법들은 거의 사람의 정확도와 비슷하거나
+
+17
+00:01:05,859 --> 00:01:11,829
+심지어 더 좋은 경우도 있습니다. 그리고 거의 실시간(nearly real-time)으로
+
+18
+00:01:11,829 --> 00:01:16,539
+전화기 수준의 기기에서 동작합니다. 이 모든 일들은 지난 3년간 이루어졌고
+
+19
+00:01:16,540 --> 00:01:19,790
+이 코스 후에는 학생들은 모두 이 기술에 대한 전문가가 될 것입니다.
+
+20
+00:01:19,790 --> 00:01:23,609
+정말 멋지고 기대되는 일입니다. OK.
+
+21
+00:01:23,609 --> 00:01:27,140
+이것은 이미지 인식 분류문제입니다. 우리는 데이터 기반 접근법(Data-driven approach)에
+
+22
+00:01:27,140 --> 00:01:30,450
+대해서 이야기했습니다. 이 분류기(Classifier)는 명시적으로 Hard-code할 수 없기 때문에
+
+23
+00:01:30,450 --> 00:01:34,100
+데이터를 이용해서 분류기(Classifier)를 학습시켜야 합니다. 그래서 다른 트레이닝 데이터와
+
+24
+00:01:34,099 --> 00:01:37,188
+Hyperparameter를 테스트할 수 있는 검증 데이터를 갖는 방법들, 그리고
+
+25
+00:01:37,188 --> 00:01:41,408
+많이 건드릴 일이 없는 테스트 셋에 대해서 보았습니다.
+
+26
+00:01:41,409 --> 00:01:44,810
+구체적으로 Nearest Neighbor Classifier와
+
+27
+00:01:44,810 --> 00:01:48,618
+몇몇의 K-NN Classifier의 예를 보았습니다.
+
+28
+00:01:48,618 --> 00:01:52,938
+그리고 수업시간에 이야기했던 CIFAR-10 데이터 셋에 대해서 이야기했습니다.
+
+29
+00:01:52,938 --> 00:01:58,438
+이후에 Parametric Approach라고 붙인 접근법의 아이디어를 이야기했습니다.
+
+30
+00:01:58,438 --> 00:02:03,639
+단순히 원래의 이미지로부터 Score를 그대로 가져오는 f()를 만듭니다.
+10개의 클래스가 있으면 10개의 스코어를 가져옵니다.
+
+31
+00:02:03,640 --> 00:02:07,618
+그래서 이 Parametric form은 우선 Linear한 것처럼 보이게 됩니다.
+
+32
+00:02:07,618 --> 00:02:11,520
+그래서 F=Wx를 갖게 됩니다. 그리고 이 Linear Classifier에 대해서
+분석을 이전에 이야기했었는데.
+
+33
+00:02:11,520 --> 00:02:12,850
+실제로 Linear Classifier를 Matching Template으로 분석해도 되고
+
+34
+00:02:12,849 --> 00:02:16,039
+아니면 고차원 공간에 있는 이미지들을 상상하고
+
+35
+00:02:16,039 --> 00:02:18,449
+Linear Classifier가 이 공간 안에 들어가서
+
+36
+00:02:18,449 --> 00:02:23,560
+Class score에 맞춰서 색칠한다고 생각해도 됩니다.
+
+37
+00:02:23,560 --> 00:02:28,740
+음.. 그래서 이전 수업 마지막에 이 사진들까지 왔습니다.
+
+38
+00:02:28,740 --> 00:02:32,240
+Training data set에서 이 세 장의 사진을 위와 같은 열과 함께
+
+39
+00:02:32,240 --> 00:02:36,530
+CIFAR-10 내의 10개의 Class를 가지고 있다고 봅시다.
+
+40
+00:02:36,530 --> 00:02:40,740
+기본적으로 이 함수 f()는 모든 한 장 한 장의 이미지에 Score를 주게 됩니다.
+
+41
+00:02:40,740 --> 00:02:44,510
+무작위로 선정된 몇개의 Weight 세팅들과 함께요.
+
+42
+00:02:44,509 --> 00:02:47,939
+그러면 몇개의 좋고 나쁜 score들을 얻게 됩니다.
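(참고) 위 캡션들에서 복습한 Score 함수 f(x, W) = Wx + b 와, 이 강의에서 이어서 정의하는 Multiclass SVM loss 및 Softmax loss를 NumPy로 옮기면 대략 아래와 같은 스케치가 됩니다. x, W, b의 값과 y=3이라는 라벨, 그리고 svm_loss_i / softmax_loss_i 라는 함수 이름은 원문에 없는, 설명을 위해 임의로 정한 가정입니다.

~~~python
import numpy as np

np.random.seed(0)
x = np.random.randn(3072)               # 펼친 32x32x3 이미지 한 장 (임의의 예시 값)
W = 0.0001 * np.random.randn(10, 3072)  # 10개 클래스의 가중치 행렬 (임의의 예시 값)
b = np.zeros(10)                        # bias

scores = W.dot(x) + b                   # f(x, W) = Wx + b : 클래스별 점수 10개

def svm_loss_i(scores, y, delta=1.0):
    # 한 예제에 대한 Multiclass SVM loss:
    # L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + delta)
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0                    # 정답 클래스(j = y_i)는 합에서 제외
    return np.sum(margins)

def softmax_loss_i(scores, y):
    # 한 예제에 대한 Softmax loss: 정답 클래스의 negative log likelihood.
    s = scores - np.max(scores)         # 수치 안정성을 위한 이동 (확률 값은 동일)
    p = np.exp(s) / np.sum(np.exp(s))   # 점수를 확률로 정규화
    return -np.log(p[y])

# W를 아주 작은 값으로 초기화하면 점수가 모두 0에 가까우므로,
# SVM loss는 약 (클래스 수 - 1) * delta = 9,
# Softmax loss는 약 -log(1/10) ≈ 2.3 이 되어야 합니다 (sanity check).
print(svm_loss_i(scores, y=3), softmax_loss_i(scores, y=3))
~~~

전체 학습 데이터에 대한 손실은 이 L_i들을 모든 예제에 대해 평균한 값에, W에 대한 regularization 항 R(W)를 더한 것으로 정의됩니다.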
+ +43 +00:02:47,939 --> 00:02:51,419 +첫 번째 이미지를 예를들면 + +44 +00:02:51,419 --> 00:02:55,509 +올바른 Class인 고양이 Class는 애매한 2.9점을 받았고 + +45 +00:02:55,509 --> 00:03:00,060 +몇몇 Class들이 고양이 Class보다 더 높은 점수를 받았습니다. +(높으면 원래 안되는거죠?) + +46 +00:03:00,060 --> 00:03:03,289 +그리고 몇몇 Class들은 고양이에 비해 많이 낮은 점수를 받았습니다. +(이건 특정 이미지들에게 좋은 징후입니다.) + +47 +00:03:03,289 --> 00:03:09,019 +두번쨰 사진인 자동차는 아주 잘 분류되었습니다. +다른 이미지들에 비해서 자동차 점수가 아주 높죠? + +48 +00:03:09,020 --> 00:03:12,980 +세번째인 사진인 개구리는 분류에 실패했습니다. 그렇죠? + +49 +00:03:12,979 --> 00:03:18,199 +이처럼 다른 Weight들은 여러 이미지들에게 +좋게 적용될수도 있고 나쁘게 적용될수 있다는걸 알았습니다. + +50 +00:03:18,199 --> 00:03:21,389 +그리고 알다시피 우리가 찾고자하는 것은 +모든 Ground Truth Label들과 일치하는 점수를 주는 + +51 +00:03:21,389 --> 00:03:26,209 +모든 라벨과 데이터들을 잘 분류할수 있는 Weight입니다. + +52 +00:03:26,210 --> 00:03:30,490 + 데이터 그래서 우리가 지금 할 거 야하는에서만 지금까지입니다 내가 무엇을 믿는 I + +53 +00:03:30,490 --> 00:03:33,590 + 이 좋은 그 등등 그리 좋은 및하지만 우리가 같은 단지 설명 + +54 +00:03:33,590 --> 00:03:34,900 + 실제로에게 총을주고 + +55 +00:03:34,900 --> 00:03:38,710 + 실제로이 개념을 정량화 우리는 말을 그 무게의이 특별한 세트 + +56 +00:03:38,710 --> 00:03:44,189 + 우리가이 손실 함수를 일단 다음 나쁜 12 1.5 나쁜이든과 같은 WSA + +57 +00:03:44,189 --> 00:03:47,710 + 우리는 우리가 가장 낮은를 가져옵니다 W를 찾을거야, 그래서 우리는 그것을 최소화하기 위해거야 + +58 +00:03:47,710 --> 00:03:50,830 + 손실은 그리고 우리는 우리가 특별히 볼거야 오늘 조사거야 + +59 +00:03:50,830 --> 00:03:55,830 + 그런 다음에이 불행 측정 손실 함수를 정의 할 수있는 방법 + +60 +00:03:55,830 --> 00:04:00,030 + 우리는 실제로 두 개의 서로 다른 경우 보스턴 소프트 최대 비용 보는거야 + +61 +00:04:00,030 --> 00:04:04,840 + 비용과 우리는 어떻게되는 프로세스 최적화로 보는거야 + +62 +00:04:04,840 --> 00:04:08,000 + 이러한 임의의 감사로 시작하는 방법 당신은 실제로 아주 아주 찾을 수 있습니까 + +63 +00:04:08,000 --> 00:04:13,110 + 체중을 잘 관찰을 충분히 그래서이 예제를 소형화거야 그 + +64 +00:04:13,110 --> 00:04:16,620 + 우리는 좋은 작업 예를 가정하는 작업을해야 우리는 세 가지 클래스가 있었다 + +65 +00:04:16,620 --> 00:04:18,030 + 당신이 알고있는 물건 + +66 +00:04:18,029 --> 00:04:22,009 + 수만 우리는이 세 가지 이미지가 이러한 우리의 점수입니다 + +67 +00:04:22,009 --> 00:04:23,360 + 일부 설치 W에 대한 + +68 +00:04:23,360 --> 00:04:27,949 + 우리는 지금이 결과를 정확히 우리의 불행을 작성하려고거야 + +69 +00:04:27,949 --> 00:04:32,680 + 첫 번째 손실 우리는 멀티 클래스 SVM 손실이라고한다 그것으로 볼거야 + +70 +00:04:32,680 --> 00:04:36,629 + 이것은 당신이 가질 수있는 소수 서포트 벡터 머신의 일반화이다 + +71 +00:04:36,629 --> 00:04:42,379 + 나뿐만 아니라 9 커버 사이에 생각하고 그렇게 설정이 여기에 가장 가까운을 통해 본 + +72 +00:04:42,379 --> 00:04:47,710 + 우리는 라코스테의 벡터 물론 이러한 우리의있는 권리 있도록 핵심 기능 그리워 것을 + +73 +00:04:47,709 --> 00:04:50,948 + 사찰단 특정 용어는 여기에있다 + +74 +00:04:50,949 --> 00:04:55,348 + 손실 동일 스튜 물건과 나는 부활절 지금이 손실을 해석하는거야 그 + +75 +00:04:55,348 --> 00:04:59,978 + 우리는 왜 이런 식의 구체적인 예를 통해 볼거야 + +76 +00:04:59,978 --> 00:05:06,158 + 초과 효과적으로 무엇 SVM 손실 같은 것은 모두에서 뭔가 있다는 것입니다 + +77 +00:05:06,158 --> 00:05:11,399 + 모든 잘못된 과정에 걸쳐 모두 모두 동일하므로 잘못된 예 + +78 +00:05:11,399 --> 00:05:17,209 + 클래스는 하나 하나 예를 들어 우리는 그 손실을 그래서 그것을 가로 질러오고있다 + +79 +00:05:17,209 --> 00:05:20,769 + 모든 잘못된 클래스와는 코어 클래스에서 점수를 비교하는 것 + +80 +00:05:20,769 --> 00:05:25,209 + 잘못된 클래스 영수증 제인은 마이너스 이유 다 것을이 법원에 접수 + +81 +00:05:25,209 --> 00:05:31,269 + 나는 왜 이렇게 무엇의 제로 난다 다음 올바른 레이블 더하기 하나 인 + +82 +00:05:31,269 --> 00:05:35,838 + 우리는이 과정이의 차이를 비교하는 여기에 계속 + +83 +00:05:35,838 --> 00:05:40,338 + 특히이 같은 내가 올바른 점수가 높은 것으로 싶어 할뿐만 손실 + +84 +00:05:40,338 --> 00:05:43,918 + 잘못된 점수보다하지만 우리는 퍼팅 안전 마진은 실제로있다 + +85 +00:05:43,918 --> 00:05:46,079 + 안전 마진을 사용하고 넣어 것입니다에 + +86 +00:05:46,079 --> 00:05:53,198 + 정확히 하나의 우리는 반대로 사용하는 하나의 의미가 왜에 갈거야 + +87 +00:05:53,199 --> 00:05:56,900 + 우리가 자신을 선택해야하고 직관적으로 당신이 할 수있는 다른 하이퍼 차 + +88 +00:05:56,899 --> 00:06:00,508 + 훨씬 더 엄격한 유도에 대한 메모를 들여다 정확히 왜 하나 + +89 +00:06:00,509 --> 00:06:04,278 + 중요하지만 이것에 대해 생각하는 다음 너무 일찍 우리의 종류를 강조하지 않습니다 + +90 
+00:06:04,278 --> 00:06:08,500 + 스케일이없는 내가 IWI을 탈지 할 수 있기 때문에 그것이 더 크거나 작게 만들 수 있으며있어 + +91 +00:06:08,500 --> 00:06:12,490 + 크거나 작은 코스를 얻기 위하여려고하는 것은 그래서 정말이 미리 차 떨어져있다 + +92 +00:06:12,490 --> 00:06:16,550 + 담론 방법 크거나 그들이 그렇게 할 수있는 작은이에 연결하는 방법 크거나 + +93 +00:06:16,550 --> 00:06:19,930 + 무게는 크기에 등 사용하므로 이러한 창녀의 종류 임의적 + +94 +00:06:19,930 --> 00:06:25,269 + 하나는 확인 그래서 구체적으로 어떻게 볼 수 있도록 어느 정도 그냥 임의의 선택 + +95 +00:06:25,269 --> 00:06:29,128 + 이 표현은 내가 평가하기 위하여려고하고 그래서 여기에 구체적인 예와 함께 작동 + +96 +00:06:29,129 --> 00:06:33,899 + 첫 번째 예를 들어 그 손실은 그래서 여기에 우리는이에 연결하기 위해 경쟁하고 + +97 +00:06:33,899 --> 00:06:35,949 + 물론 그래서 우리는 우리가 비교하는 것을 볼 수 + +98 +00:06:35,949 --> 00:06:40,829 + 올바른 클래스 자동차가있는 점수 우리가 1-3 점에서 당신의 차를 가지고있다하고, + +99 +00:06:40,829 --> 00:06:45,219 + 다음 하나는 최대 0의 우리의 안전 마진을 추가하고 정말 무슨이다 + +100 +00:06:45,220 --> 00:06:48,770 + 그것은 값 (80)을 체결 할 것입니다하고있어 바로 우리가 부정적인 얻는 경우에 이렇게 + +101 +00:06:48,769 --> 00:06:53,759 + 당신이에 대한 두 번째 클래스를 참조하면 결과는 우리가 그렇게 VAT 0을 제외 할거야 + +102 +00:06:53,759 --> 00:06:55,089 + 잘못된 플라자 개구리 + +103 +00:06:55,089 --> 00:06:59,699 + 1.7 안전 마진에서 3.2에서 차감 우리는거야 포인트 구에 도착 + +104 +00:06:59,699 --> 00:07:03,629 + 당신이 당신을 통해이 작업 할 때 다음 2.9의 손실을 가져 + +105 +00:07:03,629 --> 00:07:07,209 + 직관적으로 무엇을 당신이 밖으로 일을하는 방식이 직관적으로 여기 볼 수 있습니다 + +106 +00:07:07,209 --> 00:07:12,930 + 고양이 점수는 3.2 그래서 ESPN 로스에 따라 우리는 우리가 이상적으로 IS 싶은 것 + +107 +00:07:12,930 --> 00:07:16,100 + 모든 클래스에 대한 점수는 최대 것을 가장 + +108 +00:07:16,100 --> 00:07:21,370 + 2.2 그러나 자동차 클래스는 실제로 한 것보다 훨씬 더 훨씬 더 높은 점수를했고, + +109 +00:07:21,370 --> 00:07:24,620 + 우리가 어떤 좋아하는 것 무엇의 차이는 2.2 실제로 무엇인가 + +110 +00:07:24,620 --> 00:07:30,939 + 단지 11처럼 일하면 얼마나 나쁜 2.9의 바로이 차이 + +111 +00:07:30,939 --> 00:07:36,129 + 결과를 점수이 였고, 사기 경우에 다른 경우에 당신은 시저를 볼 수 있습니다 + +112 +00:07:36,129 --> 00:07:40,139 + 점수는 낮은 2.2보다 상당히 낮은 다음 밖으로 작동하므로 방법이었다 + +113 +00:07:40,139 --> 00:07:43,289 + 수학은 당신이 비교할 때 음수를 받고 끝낼 것입니다 + +114 +00:07:43,290 --> 00:07:48,110 + 물론 다음 최대 2000은 특정 부분에 대한 공헌을 잃었다 + +115 +00:07:48,110 --> 00:07:54,439 + 즉,이 최초의 주요의 손실 그래서 당신은 2.9 확인의 손실로 끝날 + +116 +00:07:54,439 --> 00:07:57,050 + 두 번째 이미지는 우리는 다시 같은 일을 할거야 + +117 +00:07:57,050 --> 00:08:01,689 + 고양이는 우리가 얻을 그래서 차 점수를 가지고 비교 한 숫자를 연결 내 + +118 +00:08:01,689 --> 00:08:07,329 + 안전 마진과 다른 클래스의 동일한 19부터 3 개월간 포인트 + +119 +00:08:07,329 --> 00:08:11,659 + 당신이에 연결하면하므로 실제로 0 제로의 많은 손실과 끝 + +120 +00:08:11,660 --> 00:08:17,280 + 여기에 자동차 점수이기 때문에 직관적으로는 자동차 점수 인 것은 사실이다 + +121 +00:08:17,279 --> 00:08:22,479 + 의 적어도 하나의 권리로 해당 이미지에 대한 모든 다른 코스보다 더 높은 + +122 +00:08:22,480 --> 00:08:27,490 + 우리가 가진 이유 제로 점수 0은 너무 제약이 만족하고 일부이었다입니다 손실 + +123 +00:08:27,490 --> 00:08:31,310 + 자신의 손실 때문에 우리가 물론 아주 나쁜 손실 끝이 경우 + +124 +00:08:31,310 --> 00:08:34,470 + 개구리 클래스는 매우 낮은 점수를 받았지만 다른 클래스는 아주 수신 + +125 +00:08:34,470 --> 00:08:39,349 + 우리 경우 고등학교 그래서 이것은 지금 10.9의 불행까지 추가하고 + +126 +00:08:39,349 --> 00:08:42,520 + 실제로 우리가 가고있는 하나의 손실 함수에이 모든 것을 결합하려는 + +127 +00:08:42,519 --> 00:08:45,929 + 우리가 단지을 여기에 상대적으로 직관적 인 변환을 수행하는 + +128 +00:08:45,929 --> 00:08:48,049 + 우리가 얻을 수있는 모든 손실에 걸쳐 평균 + +129 +00:08:48,049 --> 00:08:51,458 + 트레이닝 세트 권한을 부여하고 그래서 말을 그 말에 손실 때를 + +130 +00:08:51,458 --> 00:08:56,369 + 4.6 그래서이 특정 설정은이 훈련에 승까지이며이 숫자를 평균 + +131 +00:08:56,370 --> 00:09:01,320 + 데이터는 우리에게 우리가 손실 함수에 연결 몇 가지 과정을 제공하고 우리는 준 + +132 +00:09:01,320 --> 00:09:06,170 + 당신에게 부탁하지 않을 수 있도록 확인이 결과 4 점 섹스 대한 실망 + +133 +00:09:06,169 --> 00:09:08,939 + 질문의 시리즈는 종류의이 어떻게 작동하는지에 대해 조금 이해를 테스트 + +134 +00:09:08,940 --> 00:09:12,390 + 나는 나를 그냥 내 친구 마이클의 질문을 제기 할 수 있도록 약간의 질문에 얻을 것이다 + +135 +00:09:12,389 --> 00:09:20,230 + 우선 그 일부 전반적으로 잘못 인 저기 
무슨 경우 + +136 +00:09:20,230 --> 00:09:25,560 + 제인의 거상 그 의미대로 일부 전반적으로 가장 가까운 그뿐만 아니라 + +137 +00:09:25,559 --> 00:09:29,799 + 잘못된 사람은 그래서 우리는 J 내가 왜 실제로 I 오전 이유에 동일 할 수 있다면 무엇을 + +138 +00:09:29,799 --> 00:09:39,149 + 사실이 네 그래서 여름에 그 작은 제약 조건을 추가하는 어떤 것 + +139 +00:09:39,149 --> 00:09:43,139 + 일어난 우리는 I 허용 것처럼 I에 이유 더 나은 gnite 동일 + +140 +00:09:43,139 --> 00:09:46,539 + 나는 답장을 취소 이유의 점수 + +141 +00:09:46,539 --> 00:09:49,828 + 당신은 0으로 끝날 정말 당신이하는 일은 당신이 상수를 추가하는 것입니다 + +142 +00:09:49,828 --> 00:09:53,549 + 런던의 그 누군가가이 과정은 정말 어쩌면 그냥 다음 전체 있도록 인 경우 + +143 +00:09:53,549 --> 00:09:59,250 + 그 두 번째 이유이다 (10)의 일정으로 손실을 완료 만약에 + +144 +00:09:59,250 --> 00:10:03,940 + 나는이 모든 이상 합산하고있어, 그래서 우리는 갑자기 오른쪽 대신 평균을 사용 + +145 +00:10:03,940 --> 00:10:10,500 + 내가 평균 사용하고처럼 의미로 사용되는 경우 어떤 제약 실제로 평균합니다 + +146 +00:10:10,500 --> 00:10:13,389 + 나는이 과정을 통해 평균을 사용하는 경우 어떤 모든 예제에 대한 모든 손실을 통해 + +147 +00:10:13,389 --> 00:10:28,000 + 당신이 그에 맞아 있도록 점수 문제는 너무 많은 수업이 있었다 + +148 +00:10:28,000 --> 00:10:33,870 + 손실의 절대 값이 낮은 것 + +149 +00:10:33,870 --> 00:10:37,879 + 일정한 인자 이유 + +150 +00:10:37,879 --> 00:10:52,689 + 실제로 여기에 평균을 했는가 클래스의 수에 걸쳐 평균 될 것이다 + +151 +00:10:52,690 --> 00:10:56,220 + 여기하지만 클래스의 상수가 특정 세 말의 + +152 +00:10:56,220 --> 00:10:56,889 + 예 + +153 +00:10:56,889 --> 00:11:01,000 + 손실 앞의 3 분의 1의 상수를 넣어 금액 우리는에 있기 때문에 + +154 +00:11:01,000 --> 00:11:04,450 + 항상 결국 그래서 당신이 지적처럼 낮은 로스를 만들 것입니다하지만, + +155 +00:11:04,450 --> 00:11:07,820 + 결국 우리는 항상 우리가 이상 아를 최소화거야로 관심 + +156 +00:11:07,820 --> 00:11:12,470 + 그 손실은 그래서 만약 당신이 하나를 분실하거나 당신이 그것을 확장하는 경우 이동하고 + +157 +00:11:12,470 --> 00:11:15,350 + 일정은 당사의 솔루션을 변경하지 않습니다 실제로 있지만 여전히 갈거야 + +158 +00:11:15,350 --> 00:11:19,420 + 그래서 이러한 선택이 가지 기본적으로 무료입니다 (W) 같은 최적의에서 결국 + +159 +00:11:19,419 --> 00:11:23,169 + 매개 변수가 나는 Y와 동일하지 않습니다 추​​가 해요 편의를 위해 그렇게 중요하지 않습니다 + +160 +00:11:23,169 --> 00:11:26,299 + 나는 실제로이 같은 일을하고 비록 의미 촬영 아니에요 + +161 +00:11:26,299 --> 00:11:33,329 + 같은 확인 우리가 예에서 일부 평균 여부를 우리에 간다 + +162 +00:11:33,330 --> 00:11:38,410 + 우리가 대신 거기 제제하지만하지 사용되는 경우 어떤 다음 질문 + +163 +00:11:38,409 --> 00:11:42,669 + 매우 유사 인플레이션을 찾고 있지만, 마지막에 제곱 추가있다 + +164 +00:11:42,669 --> 00:11:47,809 + 그래서 우리는 물론 더하기 하나는 아침과의 차이를 취하고있어 + +165 +00:11:47,809 --> 00:11:54,509 + 당신은 우리를 생각할 때 우리는 동일하거나 상이 손실을 얻을 않는 것이 제곱했다 + +166 +00:11:54,509 --> 00:11:57,710 + 어떤 의미에서 동일하거나 상이 손실을받는 것이을 최적화한다면 및 + +167 +00:11:57,710 --> 00:12:05,759 + 우리가 같은 결과를 얻는 가장 좋은 W를 찾을 여부 + +168 +00:12:05,759 --> 00:12:20,340 + 네, 사실 다른이 볼 등 명확하지의 손실 그러나 한 가지 방법을 얻을 + +169 +00:12:20,340 --> 00:12:26,639 + 우리는 분명히 단지 명확 로스를 확장하지 확장하지 않는 것과 그것을 볼 수 있습니다 + +170 +00:12:26,639 --> 00:12:30,710 + 위 또는 일정하거나 우리가 실제로 변화하고 일정하여 이동 아래로 + +171 +00:12:30,710 --> 00:12:35,580 + 차이점은 우리는 방법의 측면에서 비선형 장단점을 변경 + +172 +00:12:35,580 --> 00:12:38,920 + SVM 지원 벡터 기계는 가서 모든 다른 무역 것 + +173 +00:12:38,919 --> 00:12:43,519 + 다른 예에서 여백을 점수하지만보고 분명 아니지만, 기본적으로 + +174 +00:12:43,519 --> 00:12:46,829 + 그것은 매우 분명하지 않다 그러나 나는이 손실에 대한 모든 변경 사항을 설명 할 + +175 +00:12:46,830 --> 00:12:53,320 + 완전하고 여기에 두 번째 권한은 사실 우리가 전화 뭔가 + +176 +00:12:53,320 --> 00:12:57,530 + 당신이 할 수있는 힌지 손실을 불러 상단에 대신 하나의 제곱 힌지 손실 + +177 +00:12:57,529 --> 00:13:01,480 + 주로 20 당신이 볼 가장 자주 사용하는 하이퍼 두 가지 다른 종류를 사용 + +178 +00:13:01,480 --> 00:13:04,750 + 우리는 대부분의 시간을 사용하지만 때로는 당신이 할 수있는 무엇을 먼저 수립 + +179 +00:13:04,750 --> 00:13:07,950 + 제곱 인치 손실이 자산을보고 더 나은 그래서 뭔가 당신입니다 + +180 +00:13:07,950 --> 00:13:12,550 + 그 정말 하이퍼 프라이머이다하지만 가장 자주 처음에 사용 플레이 + +181 +00:13:12,549 --> 00:13:18,919 + 이 손실의 규모가 최소 및 최대 가능 손실이었다에 대해의도 생각해 봅시다 + +182 +00:13:18,919 --> 00:13:23,149 + 당신은 당신의 전체 데이터 세트에 다중 클래스 SVM을 달성 할 
수 + +183 +00:13:23,149 --> 00:13:26,759 + 작은 말리 무엇인가 + +184 +00:13:26,759 --> 00:13:35,029 + 점수를 임의로 될 수 기본적 있도록 가장 높은 값이 무엇인지 0 좋은 + +185 +00:13:35,029 --> 00:13:39,870 + 올바른 예 끔찍한 당신이 로그인 그래서 만약 점수는 매우 매우 작은 + +186 +00:13:39,870 --> 00:13:45,230 + 당신은 당신 무한대로가는 손실과 한 번 더 질문을받을거야하는 + +187 +00:13:45,230 --> 00:13:49,480 + 우리가 때 일반적으로 최적화를 수행 시작할 때 가지 중요 + +188 +00:13:49,480 --> 00:13:53,200 + 실제로, 우리는 초기화 AW와 시동이 손실 함수를 최적화 + +189 +00:13:53,200 --> 00:13:56,430 + 아주 작은 무게가 있기 때문에 무슨 일이 끝나는 것은 그에서 점수 + +190 +00:13:56,429 --> 00:14:00,819 + 최적화의 처음에 가까운 검은 색이 모두 제로 대략있다 + +191 +00:14:00,820 --> 00:14:05,650 + 제로 근처의 작은 숫자는 그렇게 모든이의 새로운 시대 때 손실 무엇인가 + +192 +00:14:05,649 --> 00:14:12,329 + 이 모든 과정이 있다면 바로 클래스의 수를 뺀 10의 특별한 경우 + +193 +00:14:12,330 --> 00:14:16,639 + 제로 그는이 특정 손실 나는 평균을하고 여기에 의해 아래로 둘 것 + +194 +00:14:16,639 --> 00:14:21,269 + 이 아주없는 이러한 방식을 통해 우리는 확인이의 손실을 달성 한 것 + +195 +00:14:21,269 --> 00:14:24,429 + 당신이 실제로 시작 때 중요한 무엇이 중요한 안전 점검을위한 + +196 +00:14:24,429 --> 00:14:28,399 + 최적화는 당신은 W 매우 작은 숫자로 시작하고 당신은 인쇄 + +197 +00:14:28,399 --> 00:14:31,389 + 당신이 있는지 확인하려면 당신이 이전에 대해 얘기로 첫 손실 + +198 +00:14:31,389 --> 00:14:34,279 + 당신은 종류의 기능 양식을 이해하고이를 생각할 수 있음 + +199 +00:14:34,279 --> 00:14:38,929 + 수 있는지 여부를 당신은 내가이 경우에 볼 수있어 너무 의미를 얻을 + +200 +00:14:38,929 --> 00:14:42,799 + 그때는 더 손실이 올바르게 %로 구현 될 수 있음을 행복 해요 + +201 +00:14:42,799 --> 00:14:46,990 + 확실하지만 곧 확실히 잘못된 것은 바로이 그래서가 없습니다 + +202 +00:14:46,990 --> 00:14:51,730 + 나는 작은이 손실에 더 갈거야이 생각하는 재미 + +203 +00:14:51,730 --> 00:14:55,950 + 비트하지만 지금 슬라이드의 관점에서 질문으로 + +204 +00:14:55,950 --> 00:15:10,870 + 질문 나는 질문했다 + +205 +00:15:10,870 --> 00:15:15,029 + 실제로이 제약 기쁨이없는 것이 효율적이지 왜 그것 때문에 + +206 +00:15:15,029 --> 00:15:19,049 + 만드는 것이 더 어려워 실제로이 쉽게 더 눈 구현을 할 수 + +207 +00:15:19,049 --> 00:15:23,799 + 이 손실 구현의 실제로 일부 내 옆에 슬라이드를 예측할 수 있도록 + +208 +00:15:23,799 --> 00:15:27,459 + 정도는 그렇게 나를 바로 알아 할리우드의 코드에 의해 언젠가 여기에 말을하려고하자 + +209 +00:15:27,460 --> 00:15:33,290 + 여기에 같은이 손실 함수에 우리는 지금 침대에서 거짓말을 평가하고 + +210 +00:15:33,289 --> 00:15:37,759 + 하나의 열 벡터 빛이기 때문에 우리는 행동 때문에 여기에 하나의 예를 받고있어 + +211 +00:15:37,759 --> 00:15:42,279 + 정수 레이블을 지정하고 W는 우리가 우리가 할 그래서 우리의 가중치 행렬입니다 + +212 +00:15:42,279 --> 00:15:45,799 + 단지 몇 시간의 X는 그리고 우리는 이러한 계산이다이 과정을 확인 + +213 +00:15:45,799 --> 00:15:50,179 + 우리가 획득 과정 올바른 간의 차이 마진 + +214 +00:15:50,179 --> 00:15:55,569 + 점수 + 10이 0에서 무엇이든 다음이 접시를 볼 사이의 번호는 + +215 +00:15:55,570 --> 00:16:03,360 + 온라인 여백 Y가 0 YZ 그와 같 + +216 +00:16:03,360 --> 00:16:07,320 + 그래 정확히 그래서 기본적으로 나는이 효율적인 배경 수입을하고있어 어떤 + +217 +00:16:07,320 --> 00:16:11,209 + 당신의 포인트로 이동 한 다음 내가이기 때문에이 그 여백을 수용 할 + +218 +00:16:11,208 --> 00:16:15,569 + 현재 하나를 가지고 있는데이 팽창하지 않는 이유 이익률은 말했다 어떤 것을 내 + +219 +00:16:15,570 --> 00:16:18,360 + 점수 그리고 나는 20로 설정합니다 + +220 +00:16:18,360 --> 00:16:27,269 + 그래 나는 우리가 경우에 최적화 할 수 있도록뿐만 아니라 말을 뺄 수도있을 것 같군요 우리 + +221 +00:16:27,269 --> 00:16:31,200 + 원하지만 우리는 너무 많은 당신이 할 경우, 당신이 할 경우이에 대해 생각하지 않을거야 + +222 +00:16:31,200 --> 00:16:35,050 + 극단적 인 처벌에 대한 매우 환영의 일부가되었다 과제 + +223 +00:16:35,049 --> 00:16:40,859 + 그 시장 그리고 우리는 더 이상 다시 사이트에 질문에가는 길을 잃었다 + +224 +00:16:40,860 --> 00:16:45,320 + 이 제제에 대해 당신이 만들고 싶어하는 경우 방법이 제제에 의해 + +225 +00:16:45,320 --> 00:16:49,430 + 당신은 실제로 두 가장 가까운 당신이 그것을 볼 수 있습니다 그것을 아래로 작성하는 경우 + +226 +00:16:49,429 --> 00:16:57,229 + 우리가 다른를 볼 수 있도록 확인 잃은 작은 서포트 벡터 머신에 감소 + +227 +00:16:57,230 --> 00:17:00,190 + 기능은 곧 다음 우리는뿐만 아니라 이들의 비교에서 볼거야 + +228 +00:17:00,190 --> 00:17:05,400 + 하지만 지금은 실제로 우리가 가지고있는이 시점에서 우리가 이것을 가지고있다 + +229 +00:17:05,400 --> 00:17:08,699 + 그 과정을 마무리하고 우리는하지 않은이 손실 함수를 + +230 +00:17:08,699 --> 00:17:11,870 + 
써 우리는이 사이에 이러한 차이가 자사의 전체 형태 + +231 +00:17:11,869 --> 00:17:18,178 + 물론 한 그녀의 가장 가까운과 태양과 홀드 예에서 평균의 일부 + +232 +00:17:18,179 --> 00:17:21,309 + 즉 지금 손실 함수를 그건 바로 그래서 내가 당신을 설득하고 싶습니다 + +233 +00:17:21,308 --> 00:17:25,149 + 내가하고 싶은 경우 즉이 손실 함수에 버그가 실제로있다 + +234 +00:17:25,150 --> 00:17:31,798 + 나는 매우 좋은하지 속성을 얻을 수 있습니다 연습과 일요일이 손실을 사용 + +235 +00:17:31,798 --> 00:17:36,589 + 이이 경우는 내 전화를 사용하고있는 유일한이었고, 경우 확인이 아니에요 + +236 +00:17:36,589 --> 00:17:39,709 + 정확히 문제가 너무 무엇인지보고 완전히 분명 내가 너희들을 줄 것이다 + +237 +00:17:39,710 --> 00:17:43,620 + 특히 힌트는 우리가 W를 발견한다고 가정 + +238 +00:17:43,619 --> 00:17:55,058 + 뭔가에 제로 손실을 확인 받고 이제 문제는 것은이 w 고유하거나 + +239 +00:17:55,058 --> 00:18:00,329 + 다른 방법에 직면 당신이 내게 줄 수 앗 그 또한 다를 수 있지만 것 + +240 +00:18:00,329 --> 00:18:04,210 + 확실히 다시 제로 손실을 달성 + +241 +00:18:04,210 --> 00:18:12,410 + 맞아 그래서 당신은 우리가 어떤 상수와 그것을 확장 할 수있는 말을하는지 + +242 +00:18:12,410 --> 00:18:20,009 + 특히 모든 형식은 아마의 만남을 원하는 제약 조건을 기반으로 + +243 +00:18:20,009 --> 00:18:24,259 + 젊은 내가 변경할 수있는 내가 할 수있는 권리 그래서 기본적으로 1보다 큰 + +244 +00:18:24,259 --> 00:18:28,119 + 내 무게와 내가 일을 할 수있는 모든 난 그냥 해요입니다 그들이 더 크고 더 크게 만들 + +245 +00:18:28,119 --> 00:18:31,639 + 내가 바로 승 등장하면서 점수 차이가 크고 큰 만들기 만들 + +246 +00:18:31,640 --> 00:18:35,890 + 여기 그래서 기본적으로 주류 법 스포츠의 그것은 매우 바람직하지 때문에 + +247 +00:18:35,890 --> 00:18:40,370 + 부동산 우리는 최적의 및 모든 인 W의 전체 부분 공간을 가지고 있기 때문에 + +248 +00:18:40,369 --> 00:18:44,319 + 그것들이 손실 함수에 따라되는 완전히 동일하지만, 직감적 + +249 +00:18:44,319 --> 00:18:48,019 + 그게 내가 전달하는 속성으로 구울 수 있고, 그래서 그냥이를 볼 게 아니에요 + +250 +00:18:48,019 --> 00:18:51,920 + 미국이 내가이 예를 복용하는 경우가 있음을 자신을 설득 + +251 +00:18:51,920 --> 00:18:58,480 + 나는 두 번 내 말은 IWI을 가정 해 우리가 전에이 이전에 0 손실을 달성 + +252 +00:18:58,480 --> 00:19:02,360 + 여기 아주 간단한 수학이다 일어나고 있지만, 기본적으로 내가 충돌 할 수 또는 것 + +253 +00:19:02,359 --> 00:19:07,000 + 내 점수 두 배 그래서 그 차이는 매우 커진다 것 + +254 +00:19:07,000 --> 00:19:11,019 + 최대 50 아니라 내부의 모든 점수 차이 이미 부정적인 경우 + +255 +00:19:11,019 --> 00:19:14,389 + 이 점점 더 부정적이 될 것 그래서 당신은 더 큰 끝낼 것 + +256 +00:19:14,390 --> 00:19:18,040 + 더 큰 음의 값 접근을 그들에게 내부와 단지 제로 모든 시간이 될 + +257 +00:19:18,039 --> 00:19:32,159 + 그러나 스케일 팩터는 1보다 크게 할 것이기 때문에 + +258 +00:19:32,160 --> 00:19:56,940 + 단순성에 대한 또 다른 질문하지만 그래 기본적으로 점수는 WX가 + 그래서 그렇게 될 수 있습니다 + +259 +00:19:56,940 --> 00:19:58,309 + 당신은 아직이야 + +260 +00:19:58,309 --> 00:20:06,589 + W 일부 단지 어때을 구입하는 것을 잊지 자신이 문제를 해결하는 방법이 직관적 그래서 확인 + +261 +00:20:06,589 --> 00:20:10,250 + 우리는이 전체 지하철 몇 W의를 가지고 모든이에 따라 동일하게 작동 + +262 +00:20:10,250 --> 00:20:13,269 + 손실 함수와 우리가 환경 설정을 통해이하고 싶은대로 우리가하고 싶습니다 + +263 +00:20:13,269 --> 00:20:17,170 + 일부 W의 이상 다른 사람이 단지 고유에 따라 당신은 우리가 무엇을 알고 + +264 +00:20:17,170 --> 00:20:21,430 + 데이터를 잊지 같이하는 W의 욕망에 좋은 일이 무엇 것입니다 + +265 +00:20:21,430 --> 00:20:26,110 + 일이 그래서 이것은 우리가가는거야 정규화의 개념을 소개합니다 + +266 +00:20:26,109 --> 00:20:29,319 + 우리가하는 추가 용어를 그래서 우리의 손실 함수에 참석 + +267 +00:20:29,319 --> 00:20:33,309 + W의 정규화 기능과 정규화 작동 시간을 착륙 + +268 +00:20:33,309 --> 00:20:37,500 + 확인을 W의 쾌적을 측정하고 그래서 우리는 단지 데이터에 맞게 싶지 않아 + +269 +00:20:37,500 --> 00:20:43,279 + 그러나 우리는 또한 좋은 것으로 W를 원하고 우리는 프레임의 몇 가지 방법을 보게 될 것입니다 그 + +270 +00:20:43,279 --> 00:20:47,549 + 정확히 왜 그들이 이해와 정규화로가는로하는 방법이있다 + +271 +00:20:47,549 --> 00:20:52,509 + 훈련 떨어져 거래는 훈련 손실 및 일반화 행동 + +272 +00:20:52,509 --> 00:20:56,589 + 그래서 직관적으로 설정 테스트에 손실은 기술 곳의 세트를 정규화 + +273 +00:20:56,589 --> 00:21:00,899 + 우리는이 사람과 싸우게 될 것이다 손실에 목표를 추가하고 그래서 + +274 +00:21:00,900 --> 00:21:04,560 + 이 사람은 당신의 훈련 데이터에 맞게 원하는 한 번 W 그 사람은 몇 가지를보고 + +275 +00:21:04,559 --> 00:21:07,879 + 특정 방식 그래서 그들은 당신의 목적에 때로는 서로 싸우고있어 + +276 +00:21:07,880 --> 00:21:11,730 + 우리는 동시에 모두 달성하고자하지만 밝혀 때문에 + +277 
+00:21:11,730 --> 00:21:14,470 + 이러한 정규화 기술을 추가하는 것은 그것을 만드는 경우에도 귀하 + +278 +00:21:14,470 --> 00:21:18,319 + 교육 에러가 악화 그래서 우리는 제대로 예를 분류하지 않는 한 그 + +279 +00:21:18,319 --> 00:21:21,599 + 주의는 테스트 세트 성능과 더 나은 뭔가 우리가를 볼 수 있다는 것입니다 + +280 +00:21:21,599 --> 00:21:26,089 + 그 내용은 다음 지금 난 그냥 원하는 것을 실제로 할 수있는 이유의 예 + +281 +00:21:26,089 --> 00:21:29,109 + 다음 빛을 지적하지만 지금 난 그냥 가장 지적하고 싶은 + +282 +00:21:29,109 --> 00:21:33,019 + 실현의 일반적인 형태는 우리가 정규화 또는 중량에 전화를 무엇 + +283 +00:21:33,019 --> 00:21:37,539 + 부패와 정말 우리가이 경우 W 생각되는 일을하는지 그래서 2 차원 행렬 + +284 +00:21:37,539 --> 00:21:42,230 + 좀 뵈르 가야 엘에 정말있는 행과 열을했다 + +285 +00:21:42,230 --> 00:21:44,230 + 제곱 W 현명한 요소 + +286 +00:21:44,230 --> 00:21:48,019 + 우리는 로스 확인되므로이이 특정에 그들 모두를 가하고있어 + +287 +00:21:48,019 --> 00:21:55,069 + 이 승 좋아하는 규정은 모든 09 실현 행복 WS을 때 바로 그렇게 공을 수있어하지만 + +288 +00:21:55,069 --> 00:21:58,649 + 물론 당신은 당신은 그래서이 사람들이 의지가없는 사람을 분류 할 수 있기 때문에 + +289 +00:21:58,650 --> 00:22:03,140 + 서로 싸울 다른과 정규화의 다른 형태가있다 + +290 +00:22:03,140 --> 00:22:08,570 + 홍콩의 클래스에 훨씬 나중에 그들 중 일부에 가서 그냥 것 접근 + +291 +00:22:08,569 --> 00:22:12,548 + 2 중위 정규화가 가장 흔한 형태이며, 그처럼 당신은 무엇을거야 + +292 +00:22:12,548 --> 00:22:17,569 + 이 클래스에서 자주 사용뿐만 아니라 내가 원하는 당신을 설득 같지 + +293 +00:22:17,569 --> 00:22:20,529 + 당신을 설득하면이 승 그것에서 할 수있는 합리적인 것입니다 + +294 +00:22:20,529 --> 00:22:25,779 + 그래서이 매우 간단 위로 요리 예를 고려의 무게가 작은 것을 + +295 +00:22:25,779 --> 00:22:30,149 + 우리는 네 가지 차원에서 어디에 직관은 우리가 예를 들어 있다고 가정하세요 + +296 +00:22:30,150 --> 00:22:32,370 + 우리는이 분류를하고있는 우리는 심지어이 공간 + +297 +00:22:32,369 --> 00:22:36,139 + 그냥 한꺼번에 X 나을 지금 생각 우리는이 두 후보가 + +298 +00:22:36,140 --> 00:22:37,880 + 체중 행렬 또는 대기 + +299 +00:22:37,880 --> 00:22:44,780 + I 지금까지 가정 단일 음성 때문에 그 중 하나 (100)이고 다른 하나는 25 + +300 +00:22:44,779 --> 00:22:49,200 + 우리는 당신의 손실 함수에있는 사방 이후 자신의 효과를 볼 수 있습니다 + +301 +00:22:49,200 --> 00:22:55,080 + 같은 그래서 기본적으로 선도적 득점이 문서 제품과 있도록 WX입니다 가지고 있습니다 + +302 +00:22:55,079 --> 00:22:59,109 + 예는이 두 그러나 이러한 담론의 모두에 대해 동일 + +303 +00:22:59,109 --> 00:23:03,469 + 엄격와 정규화는 다른 통해 이들 중 하나를 선호하는 하나 + +304 +00:23:03,470 --> 00:23:07,720 + 정규화 선정 호의 그 효과는 동일 할지라도 + +305 +00:23:07,720 --> 00:23:13,548 + 하나는 실현의 관점에서 그래서 두 번째 오른쪽 낫다 + +306 +00:23:13,548 --> 00:23:15,740 + 정규화는 동일한을 달성하는 경우에도 당신을 말할 것 + +307 +00:23:15,740 --> 00:23:19,109 + 도로 실제로 우리 다운 데이터 손실 분류면에서 효과 + +308 +00:23:19,109 --> 00:23:22,629 + 크게 두 번째가에 대한 더 나은 무엇 두 번째를 선호 + +309 +00:23:22,630 --> 00:23:27,340 + 좋은 생각 가지고하는 것이 + +310 +00:23:27,339 --> 00:23:38,230 + 그는 내가 가장 좋아하는 하나의 해석이 잘 있어요 잘 맞습니다 + +311 +00:23:38,230 --> 00:23:43,549 + 그것은 바로 그래서 당신의 X 팩터에서 고려 사물의 가장 많은 무엇이 + +312 +00:23:43,549 --> 00:23:47,859 + 이 델타 실현하고 싶어 가능한 한 많은 당신의 WSUS를 확산하는 것입니다 + +313 +00:23:47,859 --> 00:23:51,169 + 당신이 고려하고 있도록 모든 입력 기능은 공감이다 + +314 +00:23:51,170 --> 00:23:55,900 + 소스와는 자사을 좋아하는만큼 많은 다른 차원을 사용하고 싶어 + +315 +00:23:55,900 --> 00:23:57,600 + 동일한 효과를 부정 + +316 +00:23:57,599 --> 00:24:01,439 + 직관적으로 말하기, 그래서 그것은 단지 하나에 집중보다 낫다 + +317 +00:24:01,440 --> 00:24:06,990 + 차원은 종종 기본적으로 실제로 작동 뭔가 그냥 좋은 + +318 +00:24:06,990 --> 00:24:11,880 + 다만 방법의 일이며 가장 큰 배열 속성 그들이 + +319 +00:24:11,880 --> 00:24:17,230 + 일반적으로 정규화 좋은 아이디어에 대한 질문이 해결해야 + +320 +00:24:17,230 --> 00:24:22,130 + 모든 사람이 어떤 기본적으로 우리의 손실이 항상이 포럼 곳이있을 것이다 판매했다 + +321 +00:24:22,130 --> 00:24:25,350 + 우리는 저녁 식사 손실을 가지고 있고 또한 매우이다 정규화를해야합니다 + +322 +00:24:25,349 --> 00:24:29,529 + 실제로 가지고 일반적인 것은 좋아 내가 두 번째로 이동하지 않을거야 + +323 +00:24:29,529 --> 00:24:34,629 + 분류 젖꼭지와 우리는 미국과 지원 사이에 약간의 차이를 볼 수 있습니다 + +324 +00:24:34,630 --> 00:24:38,070 + 벡터 머신과 실천이 부드러운 마스크 분류 이러한 종류의 수 있습니다 + +325 +00:24:38,069 --> 00:24:41,369 + 이 두 가지 선택처럼 
당신이 가장 좋아 스팸 또는 뭔가를 가질 수 + +326 +00:24:41,369 --> 00:24:47,629 + 선호로 일반적으로 지금까지 자주 당신이 나타납니다 선형 분류를 사용 + +327 +00:24:47,630 --> 00:24:51,480 + 나는 정확히 모르겠어요 왜 같은 난에 대해 작업 보통 말까지 있기 때문에 + +328 +00:24:51,480 --> 00:24:54,420 + 다만이 또한 때때로라고 멀티 그냥 것을 말씀 드리고 + +329 +00:24:54,420 --> 00:24:57,019 + 침략 당신은 로지스틱 회귀에 대해 잘 알고 있다면이 그냥 그래서 + +330 +00:24:57,019 --> 00:25:00,190 + 여러 차원으로 또는이 경우 여러에 그것의 일반화 + +331 +00:25:00,190 --> 00:25:12,009 + 연기의 구름은 저쪽에 의문을 제기하는 것처럼 + +332 +00:25:12,009 --> 00:25:32,150 + 왜 우리는 우리가 어떤 식 으로든 내가 그들 사이에서 선택하려는 경우 사용하려는 + +333 +00:25:32,150 --> 00:25:36,820 + 우리가 선택하는 합리적인 방법입니다 (W) 우리가 갈 생각은 한 가지 낮다 + +334 +00:25:36,819 --> 00:25:42,700 + 남자와 울트라 오른쪽 호의 확산 중 여기이 경우처럼 W 및 + +335 +00:25:42,700 --> 00:25:47,900 + 나는 피치 시도 할 수있는 직관적 인 방법 중 하나는 왜이 좋은 생각입니다 + +336 +00:25:47,900 --> 00:25:54,290 + 그 확산 가중치는 기본적으로 하나의 승를 확인 완전히 입력을 무시 + +337 +00:25:54,289 --> 00:25:58,220 + 셋, 넷하지만 W이 오른쪽 방식 때문에의 입력을 모두 사용 + +338 +00:25:58,220 --> 00:26:04,480 + 완화 등 직관적으로 이것은 단지 보통 테스트에서 더 나은 작업을 끝낼 수 있습니다 + +339 +00:26:04,480 --> 00:26:10,150 + 더 많은 증거가 대신 축적하고 결정되고 있기 때문에 난 + +340 +00:26:10,150 --> 00:26:21,470 + 단 하나의 증거 하나의 기능으로 맞아 + +341 +00:26:21,470 --> 00:26:28,140 + 맞아 맞아 그래서 아이디어는 여기입니다의 두 110 W에 w 그 + +342 +00:26:28,140 --> 00:26:32,630 + 동일한 효과를 얻기 때문에이 데이터 손실은 기본적으로 있다고 가정 + +343 +00:26:32,630 --> 00:26:35,650 + 두 정규화 그러나 사이에 상관하지 않는 환경 설정을 표시 + +344 +00:26:35,650 --> 00:26:39,169 + 우리가 어떤 목표를 가지고 있었고, 때문에 우리는 최적화 끝날거야 그들과 + +345 +00:26:39,169 --> 00:26:42,240 + 이 손실을 통해 기능은 동시에 W를 찾을거야 + +346 +00:26:42,240 --> 00:26:46,659 + 그 모두를 수행 그래서 우리는 제대로 분류되지 않은 아우를 종료 + +347 +00:26:46,659 --> 00:26:50,360 + 그러나 우리는 또한 실제로 싶었다 추가 환경 설정을 가지고 우리는 원 + +348 +00:26:50,359 --> 00:27:05,668 + 또한 무관심 L의 하나가 될 수있는 가능한 한 많이 확산 될 것은 멋진이 + +349 +00:27:05,669 --> 00:27:09,240 + 나는에 가고 싶지 않아 속성 지금 우리는 나중에 떨어졌다 덮을 수 있습니다 + +350 +00:27:09,240 --> 00:27:16,579 + 하나는 당신이 끝날 경우 어떤 속성을 유도 희소성과 같은 몇 가지 특성을 가지고 + +351 +00:27:16,579 --> 00:27:20,240 + 당신의 목표에서 점심을 먹고는 W의 많은이 끝나게 것을 확인할 수 있습니다 + +352 +00:27:20,240 --> 00:27:25,329 + 정확히 제로 우리가 노동에 갈 수도하고 때로는 같다 이유 + +353 +00:27:25,329 --> 00:27:30,629 + 기능 선택은 거의 그리고 나는 하나는 우리가 수도 또 다른 대안 인 것이다 + +354 +00:27:30,630 --> 00:27:45,760 + 더 이상 조금로 이동 + +355 +00:27:45,759 --> 00:27:54,220 + 즉, 기능을 무시하고 그냥 사용하고 좋은 일이 될 수 없습니다 + +356 +00:27:54,220 --> 00:28:02,960 + 실현 좋은 생각 나는 이유 중 하나는 그래 많은 기술적 인 이유가있다 + +357 +00:28:02,960 --> 00:28:09,090 + 당신은 단지 기본적인 직관을주고 갔다 그래서 어쩌면 어쩌면하지만 내가 생각하는 그들에게 그 말 + +358 +00:28:09,089 --> 00:28:59,740 + 그게 내가 좋은 수익을 만약 내가 일부를 무시되어야 할 것이다 공정한 점이다 + +359 +00:28:59,740 --> 00:29:25,980 + 때때로보고 이론을 학습하고 229에 그 중 일부를보고 + +360 +00:29:25,980 --> 00:29:29,710 + 흰색 정규화에 대한 몇 가지 결과는 거기에서 좋은 사례가되고있다 + +361 +00:29:29,710 --> 00:29:33,650 + 그 지역과 나는거야 생각하지 않습니다이 넘어도 그에 가서 소금 + +362 +00:29:33,650 --> 00:29:37,610 + 이 클래스의 범위 지금까지이 클래스는 것 우리의 국가를 변경하여 + +363 +00:29:37,609 --> 00:29:44,139 + 테스트 오류 나은 사람은 어떤 알 수 만족시키기 위해 이동 + +364 +00:29:44,140 --> 00:29:49,309 + 방법에 대한 로지스틱 회귀 분석의 일반화는이 같은 작동 방식 + +365 +00:29:49,308 --> 00:29:53,049 + 손실이 위에 지정된 방법에 대한 그냥 다른 함수 형태입니다 + +366 +00:29:53,049 --> 00:29:58,539 + 분류가 박았 물론 일부 특정 이러한 해석이있다 + +367 +00:29:58,539 --> 00:30:02,170 + 이 과정의 상단이는 어떤 임의의 점수가 아니며, 우리가 원하는 + +368 +00:30:02,170 --> 00:30:05,769 + 마진이 충족되어야합니다하지만 우리는 어쩌면 더 구체적인 해석이 + +369 +00:30:05,769 --> 00:30:10,549 + 보기의 문제에서 그 시점 어디에 실제로 우리의 원칙 종류 + +370 +00:30:10,549 --> 00:30:14,490 + 그냥 여백을 의미하지만, 이러한는 이러한 것들을로하지 이러한 과정을 해석 + +371 +00:30:14,490 --> 00:30:17,880 + 에 할당 된 실제 정규화 된 잠금 확률 + +372 +00:30:17,880 --> 00:30:23,140 + 다른 클래스는 확인 그래서 우리는이 조금 
의미 정확히 무엇에 갈거야 + +373 +00:30:23,140 --> 00:30:28,880 + 이러한 모든 회 주어진 이미지의 정규화 로크 확률의 + +374 +00:30:28,880 --> 00:30:34,490 + 즉 우리는 점수가보다 문제의 평화와 달리 것을 가정합니다 + +375 +00:30:34,490 --> 00:30:38,799 + 사스케와 같은 가장 가까운 확률을 얻을 수있는 방법은 우리가이을 것입니다 + +376 +00:30:38,799 --> 00:30:39,690 + 점수 + +377 +00:30:39,690 --> 00:30:45,029 + 변칙 확률을 얻기 위해 그들 모두를 기하 급수적으로 우리는 정상화 + +378 +00:30:45,029 --> 00:30:48,849 + 우리가 합으로 나눈 있도록 그들에게 그들이 확률 정상화를 얻을 수 + +379 +00:30:48,849 --> 00:30:54,209 + 모든 지수의 과정을 통해 그 우리가 실제로 얻을 방법 + +380 +00:30:54,210 --> 00:30:58,240 + 클래스의 확률에 대한 표현은 이미지 등이 기능을 부여 + +381 +00:30:58,240 --> 00:31:02,880 + 당신이 누군가가 그들에게 먹을 경우 참조하는 경우 여기에 부드러운 최대 함수를 호출한다 + +382 +00:31:02,880 --> 00:31:07,840 + 요소가 합 전반적인 비용 매로 나눈 현재 관심 + +383 +00:31:07,839 --> 00:31:11,918 + 우리는이 문제에 있다면이 기본적으로 작동 할 방법의 과정이다 + +384 +00:31:11,919 --> 00:31:13,040 + premark 우리는 정말 운이 좋다 + +385 +00:31:13,039 --> 00:31:16,869 + 우리는 이것이 다른 클래스의 확률 것을 결정하고 있다는 것을 + +386 +00:31:16,869 --> 00:31:19,619 + 당신이 정말이 설정에서 수행 할 작업을 어떤 측면에서 의미가 있습니다 + +387 +00:31:19,619 --> 00:31:23,809 + 것입니다 아마 다른 클래스를 통해 이들 중 하나는 우리가 원 정확 + +388 +00:31:23,809 --> 00:31:25,429 + 로그 우도를 최대화 + +389 +00:31:25,430 --> 00:31:32,900 + 손실 함수 등등을 위해 우리는 실제의 로그 우도를 최대화 할 + +390 +00:31:32,900 --> 00:31:38,140 + 클래스와 우리가 손실 함수를 실행하고 있기 때문에 우리는을 최소화하려면 + +391 +00:31:38,140 --> 00:31:42,980 + 진정한 클래스의 음의 로그 우도 확인이 일련의와 끝까지 그렇게 + +392 +00:31:42,980 --> 00:31:46,599 + 로그-가능성을 원하는대로 여기 표현은 정말 기능을 잃게됩니다 + +393 +00:31:46,599 --> 00:31:51,169 + 올바른 클래스는 너무 부정적인 높은 낮은 싶어하는 + +394 +00:31:51,170 --> 00:31:54,820 + 로그인 가능성이 코스의 일부 확장되어 이제 구체적인 예를 살펴 보자 + +395 +00:31:54,819 --> 00:32:00,599 + 실제로 있도록 표현 뭔가처럼 나중에 여기에 내가이 더 만들려​​면 + +396 +00:32:00,599 --> 00:32:04,839 + 이 표현의 방법이 살펴 보겠습니다하는 로스 음의 로그입니다 + +397 +00:32:04,839 --> 00:32:07,859 + 표현은 작품과 내가 아는 당신에게 더 나은 직관을주지 생각을 정확히 + +398 +00:32:07,859 --> 00:32:12,009 + 이것은 우리가이 점수를하지 않은 여기에 가정 그래서 컴퓨팅 있는지 등을하고있다 + +399 +00:32:12,009 --> 00:32:16,379 + 즉, 우리의 신경 네트워크 또는 우리의 이전 젖꼭지에서 나온 이들이다 + +400 +00:32:16,380 --> 00:32:19,780 + 내가 언급 한 바와 같이 그렇게 잠금 해제 문제의 평화 우리는 기하 급수적으로 그들을 원하는 + +401 +00:32:19,779 --> 00:32:22,879 + 첫 번째 때문에 우리에게 정규화를 제공이 해석에서 + +402 +00:32:22,880 --> 00:32:28,150 + 확률과 우리가 두 가지의 합으로 나눈 값이 지금은 항상 일부 (21)의 + +403 +00:32:28,150 --> 00:32:33,310 + 이 모든 그래서 우리는이 사람을 추가하고 우리가 실제로 아마 아웃을 얻기 위해 분할 + +404 +00:32:33,309 --> 00:32:37,609 + 이 해석에서 우리는 변환 어떤 세트를 수행 한 + +405 +00:32:37,609 --> 00:32:41,219 + 말하는이이 해석은 확률 할당 된 것입니다 + +406 +00:32:41,220 --> 00:32:47,029 + 고양이가되는이 이미지에 13 %의 차량이 87 % 진행 매우 가능성이 0 %입니다 + +407 +00:32:47,029 --> 00:32:51,399 + 이러한 확률이며,하지 일반적으로 설정하는 당신이 원하는 + +408 +00:32:51,400 --> 00:32:54,960 + 그냥 바위를 극대화 밝혀 때문에 잠금 확률을 극대화 + +409 +00:32:54,960 --> 00:32:58,049 + 확률은 수학적으로 볼 그래서 외로운으로 좋은하지 않습니다 + +410 +00:32:58,049 --> 00:33:03,460 + 당신이 확률을 최소화 할 수 있도록 다음 행운의 확률을 극대화 + +411 +00:33:03,460 --> 00:33:08,850 + 그래서 여기에 올바른 클래스는 13 %의 확률을 가지고있다 고양이는 + +412 +00:33:08,849 --> 00:33:14,679 + 포인트 13 앤더슨 오해 때문에 음의 로그가 우리를 얻을 수 89 등 + +413 +00:33:14,680 --> 00:33:21,180 + 그것은 우리가 아래에 여기에이 클래스 달성 할 손실을 찾을 수있는 마지막이다 + +414 +00:33:21,180 --> 00:33:25,529 + 분류기의이 해석 때문에 29 + +415 +00:33:25,529 --> 00:33:32,869 + 의 몇 가지 예에 시도하는 지금이 관련된 몇 가지 질문을했다 통해 가자 + +416 +00:33:32,869 --> 00:33:34,219 + 이 작품 정확히 어떻게 해석 + +417 +00:33:34,220 --> 00:33:38,519 + 처음 나는 그래서이 손실 기능을 잃은 분 가능한 최대이었다 + +418 +00:33:38,519 --> 00:33:44,460 + 손실 함수 어떤 작은 말리이며, 가장 높은 몸은 생각한다 + +419 +00:33:44,460 --> 00:33:49,809 + 이것은 우리가 싼 제로하고 어떻게 일어날 수있는 가장 작은 값 것입니다 + +420 +00:33:49,809 --> 00:33:57,220 + 당신은 우리가 하나가 올바른 클래스 아마지고 있다면 나는 그렇게 얻을 수 
있습니다 + +421 +00:33:57,220 --> 00:34:02,890 + 하나는 법에 회신하고 우리는 (110)과의 음의 로그를 받고있어 + +422 +00:34:02,890 --> 00:34:09,030 + 그래서 그냥뿐만 아니라 우리는 같은 공을 받고 있었다으로 가장 높은 손실이 최소 + +423 +00:34:09,030 --> 00:34:14,250 + 무한 당신이주는 끝날 경우 최대 그래서 유아 손실을 달성 할 것입니다 + +424 +00:34:14,250 --> 00:34:18,769 + 0 당신이 부정적인 제공의 고양이는 아주 작은 확률을 득점 한 후 로그인 + +425 +00:34:18,769 --> 00:34:24,679 + 그래서 그래 그래서 같은 균형으로 바로 무한 무한 그래서 음 + +426 +00:34:24,679 --> 00:34:28,159 + 오후이 질문 + +427 +00:34:28,159 --> 00:34:33,440 + 우리는 대략 작은 작은 무게와 W를 초기화 할 때 일반적으로 바람 + +428 +00:34:33,440 --> 00:34:37,550 + 모든 자동차는 거의이 경우 손실 될 수있을 테니까요 무엇 제로되어 + +429 +00:34:37,550 --> 00:34:40,419 + 당신이보고 기대 무슨 최적화의 시작 부분에 체크 + +430 +00:34:40,418 --> 00:34:47,000 + 첫 번째 손실 + +431 +00:34:47,000 --> 00:34:59,449 + 나이가 점점 될 수 있도록 클래스의 수에 하나 당신이 모든에 도착 여기 + +432 +00:34:59,449 --> 00:35:04,139 + 한 다음 여기 그래서 여기에 클래스의 수에 하나입니다 그리고 그들은 얻을 + +433 +00:35:04,139 --> 00:35:07,599 + 에 대한 블로그를 자신 할 때마다 대한 그래서 실제로 최종 멋진 뭔가 + +434 +00:35:07,599 --> 00:35:11,569 + 나는 때때로 클래스와 I의 내 번호 주목을 내 스테이션을 실행 + +435 +00:35:11,570 --> 00:35:14,970 + 클래스의 숫자 중 하나를 부정 로그 평가하고 내가 무엇을보고 노력하고있어 + +436 +00:35:14,969 --> 00:35:18,429 + 잃어버린 내 첫 시작은 기대와 내 결정을 시작할 때 그래서 나는 확인 + +437 +00:35:18,429 --> 00:35:21,159 + 확실히 내가 아는 다른 몇 가지가있을 수 있음을 대략 얻고 있음 + +438 +00:35:21,159 --> 00:35:24,399 + 약간 떨어져 순서에 뭔가를 얻을 것으로 예상 + +439 +00:35:24,400 --> 00:35:28,630 + 또한 최적화 나는 20 그게 전부에서 가서 제가 보는 경우 기대 + +440 +00:35:28,630 --> 00:35:31,039 + 음수는 내가 함수 형태 알고 뭔가 매우 + +441 +00:35:31,039 --> 00:35:32,590 + 이상은 오른쪽에있는 것입니다 + +442 +00:35:32,590 --> 00:35:37,070 + 실제로 예상하지거야이 폭행 최대 손실에서 번호를 부여 + +443 +00:35:37,070 --> 00:35:40,630 + 몇 가지 질문 단지를 반복하는 당신에게 또 하나의 슬라이드 아무것도 표시되지 것입니다 + +444 +00:35:40,630 --> 00:35:44,599 + 그들과 정말 그들이 우리가 점수를 가지고있는 모습의 차이 + +445 +00:35:44,599 --> 00:35:48,909 + 차이 지금 우리가 배우의 우리의 점수를 얻을 아​​와를 제공 기능 + +446 +00:35:48,909 --> 00:35:54,420 + 그들은이 함수에서 나오는 이러한 과정이 무엇인지 해석하는 방법을 그냥 + +447 +00:35:54,420 --> 00:35:58,500 + 그래서 난 그냥 무엇이든지 우리가 원하는 어떤 해석의 과정을 실행하지 + +448 +00:35:58,500 --> 00:36:02,710 + 더 큰 점수 정확한 점수를 많이 위의 몇 가지 한계가 될 것을 + +449 +00:36:02,710 --> 00:36:07,240 + 잘못된 과정이나 해석은 로트 확률과하지 않는 한이 될 수 있습니다 + +450 +00:36:07,239 --> 00:36:10,569 + 이 프레임 워크에서 우리는 먼저 확률을 가져 갔고, 우리는 원하는 + +451 +00:36:10,570 --> 00:36:14,450 + 균열 손실의 공개 또는 이들의 로그를 극대화하고 그래서 그 끝 + +452 +00:36:14,449 --> 00:36:19,250 + 우리에게 손실 함수 또는 무언가를주는 동일한 방법에서 시작하도록하지만, + +453 +00:36:19,250 --> 00:36:22,780 + 그들은 단지 우리가 가고있는 차이가 적은 결과를 얻을 수 있었 + +454 +00:36:22,780 --> 00:36:31,150 + 정확히 차이가 조금에서 어떤 질문이 있습니다 + +455 +00:36:31,150 --> 00:36:41,579 + 그들은 대부분을 평가하는 순간 근처로 분류을 + +456 +00:36:41,579 --> 00:36:45,949 + 작품은 회선에서 수행되고, 그렇게 볼 것이다 분류 및 + +457 +00:36:45,949 --> 00:36:51,629 + 물론 남한 최대의 거의 같은 특히 손실이 수반 일부 XP 및 + +458 +00:36:51,630 --> 00:36:56,200 + 등등 그래서이 작업은 아마도하지만 보통을 약간 더 비싸다 + +459 +00:36:56,199 --> 00:36:57,439 + 완전히 씻어 + +460 +00:36:57,440 --> 00:36:59,320 + 당신이하는 걱정 다른 모든 것들에 비해 모든입니다 + +461 +00:36:59,320 --> 00:37:15,260 + 하나님의 형상을 통해 대회 + +462 +00:37:15,260 --> 00:37:32,600 + 아마도 + +463 +00:37:32,599 --> 00:37:42,210 + 동일한 문제 등 특성을 극대화하고 지역을 극대화 + +464 +00:37:42,210 --> 00:37:46,119 + 경기의 모든 나오는 당신에게 동일한 결과 있지만 측면에서 제공 + +465 +00:37:46,119 --> 00:37:49,279 + 너무 멋진 당신이 실제로이 많이 넣으면 찾고 있지만 정확한이다 + +466 +00:37:49,280 --> 00:37:51,310 + 동일한 최적화 문제 + +467 +00:37:51,309 --> 00:37:56,539 + 확인의 그들이 차이가 정확히 어떻게 이러한 둘의 해석을 얻을 수 있습니다 + +468 +00:37:56,539 --> 00:38:01,230 + SEM 대 최대 당신에게 아이디어를 제공하기 위해 노력에 대한 하나의 속성 실제로 + +469 +00:38:01,230 --> 00:38:03,559 + 둘 사이의 상당히 다른 + +470 +00:38:03,559 --> 00:38:08,059 + 우리가이 세 가지 예를 다음 
두 가지 기능 분석 팀 + +471 +00:38:08,059 --> 00:38:12,710 + 세 가지 예제와 세 가까운 세 가지 예제가있는 가정 + +472 +00:38:12,710 --> 00:38:15,980 + 이들은 이러한 예제 하나 하나에 대한이 예제의 담론이다 + +473 +00:38:15,980 --> 00:38:19,659 + 여기에 첫 번째 클래스는 올바른 클래스 그래서 10이 올바른 수준의 점수입니다 + +474 +00:38:19,659 --> 00:38:24,509 + 다른 점수는이 사람들 중 첫 번째 두 번째 또는 세 번째는 + +475 +00:38:24,510 --> 00:38:30,970 + 지금은 단지 이러한 손실이 얼마나 바람직한에 대해 얘기하는 방법을 생각 + +476 +00:38:30,969 --> 00:38:36,480 + 결과는 그것에 대해 생각하는 특정 하나의 방법으로 승 그 측면에 + +477 +00:38:36,480 --> 00:38:39,530 + 예를 들어 내가이 데이터는 백의 세 번째 십분의 일을 가리키는 생각한다고 가정한다 + +478 +00:38:39,530 --> 00:38:44,700 + 그리고 팔백 내가 조금 내 입력 주위를 이동 가볍게 흔들다 가정 + +479 +00:38:44,699 --> 00:38:58,159 + 내가 그렇게 같은 공간 손실에 무슨 일이 일어나고 + +480 +00:38:58,159 --> 00:39:03,339 + 나는 그들이 둘 다 증가 할들이 증가 및 감소 내가 주위에 이동하는 것처럼 그렇게 나 + +481 +00:39:03,340 --> 00:39:10,050 + 나 예를 들어 제 약속 줄이고 동일하게 유지 + +482 +00:39:10,050 --> 00:39:13,740 + 정확한 이유는 마진이 그렇게 엄청난 금액으로 이어졌습니다 때문에이 점이다 + +483 +00:39:13,739 --> 00:39:17,659 + 나는 주위의 시트에 하루를 찍을 때 그냥 견고성이 추가됩니다 + +484 +00:39:17,659 --> 00:39:22,379 + 우리가 원하는 알고에 의해 여백이 충족 되었기 때문에 SVM은 이미 매우 행복하다 + +485 +00:39:22,380 --> 00:39:27,809 + 여기에 하나의 마진 우리는 이백의 여유를 가지고 거대한 여백이있다 + +486 +00:39:27,809 --> 00:39:32,299 + 이 과정이 올 곳 ESPN은 이러한 예를 통해 환경 설정을 표현하지 않습니다 + +487 +00:39:32,300 --> 00:39:37,010 + 이상 내가 부정되고 싶어하지 않는다 추가 환경 설정 매우 부정적인 광고 아웃 + +488 +00:39:37,010 --> 00:39:43,890 + 2009 200,000 PSP 및 상처 치료 만의 그러나 남부 최대 수 항상 당신을보고 + +489 +00:39:43,889 --> 00:39:46,659 + 항상 옳다 그래서 소프트 맥스가 뭔가에 대한 개선을 얻을 것이다 + +490 +00:39:46,659 --> 00:39:49,480 + 기능에 대해 부정적인 것으로 모든 사람의 요구에 대한 선호를 표현 + +491 +00:39:49,480 --> 00:39:53,590 + 이백이나 오백 또는 이들의 천명은 더 나은 손실 권리를 줄 것이다 + +492 +00:39:53,590 --> 00:39:58,530 + 다른 예는 내가 알고하지 않으면하지만이 시점에서 SVM은 상관하지 않는다 + +493 +00:39:58,530 --> 00:40:03,320 + 연방 수사 국 (FBI)의 명확한 구분이 권한을 한 번에 견고성을 결정했다대로입니다 + +494 +00:40:03,320 --> 00:40:07,120 + 이 마진이 충족되어야합니다하지만 그 이상은 코스 어디 세세한하지 않습니다 + +495 +00:40:07,119 --> 00:40:11,400 + 소프트 맥스는 항상 당신이 모든 것을 아무것도 알 수 평화 과정을 원할 것입니다 + +496 +00:40:11,400 --> 00:40:15,300 + 그래서 둘 사이의 차이가 명확 1 종 존재하고있어 + +497 +00:40:15,300 --> 00:40:20,548 + 이 질문했다 + +498 +00:40:20,548 --> 00:40:28,568 + 예 하나의 이익률은 그게 하이퍼 차 아니다 매우 간략하게 언급 + +499 +00:40:28,568 --> 00:40:34,528 + 즉 그리스 코스입니다 당신은의 친절 하나의 이유가 고칠 수 + +500 +00:40:34,528 --> 00:40:40,048 + 그 과정의 절대 값을 가지 정말 때문에 중요하지 않습니다되어 내 + +501 +00:40:40,048 --> 00:40:45,088 + WI 그것이 더 크거나 작게 만들 수 있고, 내가 다른 크기의 과정을 달성 할 수 있으며, + +502 +00:40:45,088 --> 00:40:49,759 + 그래서 하나는 잘 작동 밝혀과 노트에 나는 더 긴 기간이 갈이 + +503 +00:40:49,759 --> 00:40:54,699 + 세부 사항을 정확히 이유 하나 때문에 선택이 참조하는 것이 안전하지만 난 싶지 해달라고 + +504 +00:40:54,699 --> 00:41:03,239 + 당신이 20 싶어하는 경우 문제가있을 것 같은 20에서 시간을 보내고는 것 + +505 +00:41:03,239 --> 00:41:07,358 + 양수를 사용할 수 있습니다 그가 0 인 경우 그것은 당신에게 좋은 오후를 줄 것이다 + +506 +00:41:07,358 --> 00:41:14,328 + 그 다르게 보일 것입니다 + +507 +00:41:14,329 --> 00:41:18,259 + 예를 들어이 경우 실제로 하나의 속성을 제공가 일정 추가 + +508 +00:41:18,259 --> 00:41:21,920 + CST (29)의 엉덩이 오후의 수학적 분석의 좋아하는을 통해로 이동 + +509 +00:41:21,920 --> 00:41:26,269 + 최고는 에스키모 재생 여백 속성을 의심 것을 당신은 볼 수 있습니다 + +510 +00:41:26,268 --> 00:41:29,698 + 최고의 마진 실제로 플러스을 때 + +511 +00:41:29,699 --> 00:41:33,539 + 상수는 길에 제단 정규화와 결합 된 아주의 자신의 + +512 +00:41:33,539 --> 00:41:38,499 + 특정 마진을 충족뿐만 아니라 당신이 아주 좋은 혼합을 제공 작은 무게 + +513 +00:41:38,498 --> 00:41:42,259 + 난 정말이 강의에서이에서로 이동하지 않은 여백 재산권 + +514 +00:41:42,259 --> 00:41:46,818 + 지금 그러나 나는 기본적 그렇지 않으면이 양수를 일을 할 싶어 + +515 +00:41:46,818 --> 00:41:51,480 + 단절 + +516 +00:41:51,480 --> 00:42:14,780 + 실수하고 우리는 종류의 무료 아웃이 과정에서 얻을 수있는 좋은 방법입니다 번호 + +517 +00:42:14,780 --> 00:42:18,200 + 바로 
우리가 수 있으며 해석을 부여하도록 당신의 + +518 +00:42:18,199 --> 00:42:21,669 + 거기에이 특정한 경우에 다른 손실은 내가 당신에게 가장 가까운시를 보였다 + +519 +00:42:21,670 --> 00:42:25,180 + 멀티 클래스 SVM의 여러 버전은 정확히 주위에 헤엄 수 있습니다 + +520 +00:42:25,179 --> 00:42:30,750 + 우리는이에 넣을 수있는 해석의 하나 일 로스 식 + +521 +00:42:30,750 --> 00:42:34,510 + 코스는 아마도 그들이 할 수없는 말을 몇 가지 표준화 된 블록이있을 것 + +522 +00:42:34,510 --> 00:42:37,590 + 그들은 단지 온 정규화 때문에 우리는 더이 있기 때문에 명시 적으로해야 + +523 +00:42:37,590 --> 00:42:42,180 + 함수의 출력이 정상화 될 것이라고 제약 조건과 그들이 + +524 +00:42:42,179 --> 00:42:45,579 + 당신은 단지 자신의 실수에 그 밖으로이기 때문에 아마 캠프를해야 + +525 +00:42:45,579 --> 00:42:51,309 + 즉 양 또는 음이 될 수 있도록 우리가 문제를 평화로를 해석하고 + +526 +00:42:51,309 --> 00:42:52,699 + 및 수행 + +527 +00:42:52,699 --> 00:42:58,329 + 우리를 필요로하는 것은 그들에게 그것을 설명 매우 나쁜 종류를 취급하지만 난 생각하는 + +528 +00:42:58,329 --> 00:43:05,889 + 그는 있어요 + +529 +00:43:05,889 --> 00:43:57,139 + 에너지와 당신이있어 무엇에 대한 모든 동등한 종류의 같은 손실 + +530 +00:43:57,139 --> 00:44:05,690 + 나는 주위에이를 봤 경우 말을 여기 여기 하나 봐 말 + +531 +00:44:05,690 --> 00:44:09,460 + 아무것도 나는 차이가 확실히 손실 것이라고 생각 변화없는 것 + +532 +00:44:09,460 --> 00:44:12,800 + 가 많이 변경하지 않을 경우에도 최대에 대한 변경하지만 난 확실히 것 + +533 +00:44:12,800 --> 00:44:16,660 + 환경 설정을 표현하는 제목을 변경 오후 반면 당신이 동일 제로 같아요 + +534 +00:44:16,659 --> 00:44:27,339 + 다른 선호도하지만, 기본적으로 실제로는 매우 큰 실수를하지 않을 것이다 + +535 +00:44:27,340 --> 00:44:32,720 + 이 구별 당신에게 노력의 상호 작용은 SPM이있다이다 + +536 +00:44:32,719 --> 00:44:38,469 + 공간의 매우 로컬 부분이 관심이 있다고 분류 미숙에 대한과 + +537 +00:44:38,469 --> 00:44:40,279 + 그 이후 + +538 +00:44:40,280 --> 00:44:43,700 + 환경 및 전체 데이터 구름 물리적 작용 소프트 맥스 종류 + +539 +00:44:43,699 --> 00:44:48,129 + 그것은 당신의 데이터를 클라우드에 대한 모든 사항을 관심에 대해 그냥 당신을하지 관심 + +540 +00:44:48,130 --> 00:44:50,590 + 당신으로부터 분리하려는 것은 여기에서 작은 클래스처럼 거기에 알고 + +541 +00:44:50,590 --> 00:44:51,410 + 다른 모든 것들 + +542 +00:44:51,409 --> 00:44:55,659 + 해당 전체 데이터 옷장의 공격 맥스웰 종류의 비행기를 얻고 + +543 +00:44:55,659 --> 00:44:59,059 + SPM 단지의 직접적인 부분에서 작은 조각 것을 구분합니다 + +544 +00:44:59,059 --> 00:45:04,219 + 실제로 같은 데이터 구름이 실제로 국가가 제공 할 수 있습니다 실행할 때 + +545 +00:45:04,219 --> 00:45:09,569 + 내가 노력 아니에요에 거의 동일한 결과는 거의 항상 정말하려고 할 때 + +546 +00:45:09,570 --> 00:45:12,640 + 하나의 피치하거나 그냥 시도하고 다른 하나는 당신이 개념을주고 그 + +547 +00:45:12,639 --> 00:45:16,809 + 당신은 당신이 밖으로 약간의 점수를 얻을 손실 기능을 담당하고있어, 당신은 할 수 + +548 +00:45:16,809 --> 00:45:19,199 + 거의 모든 수식을 적어 + +549 +00:45:19,199 --> 00:45:23,279 + 당신이 당신의 점수처럼되고 싶은에 미분하고있다 + +550 +00:45:23,280 --> 00:45:26,619 + 다른 사실이 수립의 방법과입니다 실제로 두 가지 예 + +551 +00:45:26,619 --> 00:45:30,579 + 연습을보기 위해 오는 그러나 실제로 우리는 무엇을 어떤 손실을 넣을 수 있습니다 + +552 +00:45:30,579 --> 00:45:34,619 + 점수이 원하는 우리가 최적화 할 수 있기 때문에 그것은 아주 좋은 사진입니다 + +553 +00:45:34,619 --> 00:45:46,700 + 전반적인 날이 시점에서 당신에게 대화 형 웹을 보여 드리겠습니다 + +554 +00:45:46,699 --> 00:45:54,289 + 이것은 당신이 할 수있는 대화 형 세미나 클래스 페이지가 그래서 확실히이 참조 + +555 +00:45:54,289 --> 00:45:58,409 + 이 URL에서 찾을 작년를 쓴 나는 너희들 모두에게 보여해야 + +556 +00:45:58,409 --> 00:46:04,279 + 확인 개발에 지출 하루 정당화하지만, 일부 그 마지막을합니다 + +557 +00:46:04,280 --> 00:46:12,440 + 올해 너무 많은 사람들이 차량에서 발견하지 그래서 우리는 내 삶의 하루가있다 + +558 +00:46:12,440 --> 00:46:18,000 + 여기에 세 가지 클래스와 이차원 문제는 내가 여기에 세 가지를 보여주는거야 + +559 +00:46:18,000 --> 00:46:22,139 + 클래스 각각은 여기에 두 개의 차원을 통해 세 가지 예를 가지고 있는데 보여주는거야 + +560 +00:46:22,139 --> 00:46:24,969 + 여기에 세 가지 분류는 수준은 예를 빨간색 방치하기 때문에 + +561 +00:46:24,969 --> 00:46:29,659 + 분류는 선을 따라 0의 점수로하고 그때의 화살표를 보여주는거야 + +562 +00:46:29,659 --> 00:46:35,509 + 이는 당신이 W 매트릭스 기억으로, 그래서 여기 증가 점수 RW 행렬이다 + +563 +00:46:35,510 --> 00:46:38,609 + 우리는이 그래서 w 행렬의 두 행은 서로 다른 분류입니다 + +564 +00:46:38,608 --> 00:46:42,289 + 파란색 분류 빨간색과 녹색의 분류와 브렛 분류 및 우리가 모두 + +565 +00:46:42,289 --> 00:46:47,349 + 여기에 우리가 다음 
X & Y 구성 요소 또한 바이어스 모두에 대한 가중치 + +566 +00:46:47,349 --> 00:46:50,609 + 데이터는 우리가 모든 데이터 포인트의 X 및 Y 좌표를 상기 그래서 + +567 +00:46:50,608 --> 00:46:55,779 + 올바른 라벨 따라서 과정뿐만 아니라 모든 데이터에 의해 달성 손실 + +568 +00:46:55,780 --> 00:46:59,769 + 이 w 설정하고 그래서 내가 데려 갈거야 것을 볼 수 있습니다로 지금 포인트 + +569 +00:46:59,769 --> 00:47:04,568 + 우리의 데이터 손실이 너무 좋아 지금은 2.77 정규화 손실이 전체 손실을 의미 + +570 +00:47:04,568 --> 00:47:08,509 + 이 w 3.5과 이야기 안녕 6.27이다 + +571 +00:47:08,510 --> 00:47:14,810 + 그래서 기본적으로 내가 내 W를 변경할 수 있도록 그래서 당신이 할 수있는이 주위에 바이올린 수 있습니다 + +572 +00:47:14,809 --> 00:47:19,328 + 내 W 더 큰 WC 중 하나 만들고있어 여기에 볼 당신은 그 무엇을 볼 수 있습니다 + +573 +00:47:19,329 --> 00:47:25,940 + 순서 바이어스에서 당신은 바이어스는 기본적으로 이러한 높은 평야를 종료 볼 수 있습니다 + +574 +00:47:25,940 --> 00:47:32,639 + 우리가 할 수있는 일 후 좋아하고 우리는 우리가 이런 종류의 작업을 진행하고 있습니다입니다 + +575 +00:47:32,639 --> 00:47:35,848 + 무슨 일이 일어나고 있는지의 미리보기 우리가 여기 손실을 얻고 일어나고 있었다하기 + +576 +00:47:35,849 --> 00:47:38,829 + 전파를 다시하기 위하여려고하는 것은 우리에게 우리가 원하는 방법을 통해 기울기를 부여하고있다있는 + +577 +00:47:38,829 --> 00:47:44,359 + 이 법이 작고 만들기 위해의 W 이러한 조정 그래서 우리는 할거야 + +578 +00:47:44,358 --> 00:47:48,838 + 이것은 우리가이 w로 시작 상태를 반복한다하지만 지금 난을 향상시킬 수 있습니다 + +579 +00:47:48,838 --> 00:47:54,460 + W의이 세트를 향상 그래서 주변 업데이트를 수행 할 때이 실제로 수 + +580 +00:47:54,460 --> 00:47:57,568 + 지금 바로 여기에 표시됩니다 이러한 그라디언트를 사용하여 실제로입니다 + +581 +00:47:57,568 --> 00:47:59,900 + 작은 변화 모두를 만들기 + +582 +00:47:59,900 --> 00:48:03,088 + 바로 그래서 내가 할로이 경사에 따라 + +583 +00:48:03,088 --> 00:48:07,699 + 차 업데이트는 여기 손실이 총 특별한 감소하고 있음을 알 수 + +584 +00:48:07,699 --> 00:48:11,338 + 여기 손실 잃어버린 내가 할로 단지 더 좋아 계속 차 날짜 때문에 + +585 +00:48:11,338 --> 00:48:16,639 + 그래서 이것은 우리가 조금에 들어갈거야 최적화의 과정이다 + +586 +00:48:16,639 --> 00:48:20,989 + 또한 반복되는 업데이트를 시작하고 기본적으로 우리는이 w를 개선 유지 + +587 +00:48:20,989 --> 00:48:24,808 + 이상 우리의 손실 때까지 진형을 통해 대략 세 또는 무언가였습니다 + +588 +00:48:24,809 --> 00:48:29,579 + 당신은 데이터에 대한 평균 손실은 같은 하나의 포인트입니다 그리고 우리가 제대로이야 + +589 +00:48:29,579 --> 00:48:39,068 + 나는 또한 그래서 그냥 (W) 무작위 랜덤 수 있도록 여기에 모든 버튼을 분류 + +590 +00:48:39,068 --> 00:48:41,980 + 종류의 그것을 노크하고 거기에 항상 이러한 행동 포인트를 수렴 + +591 +00:48:41,980 --> 00:48:47,650 + 공정 최적화를 통해 당신은 정규화으로 여기 재생할 수 있습니다 + +592 +00:48:47,650 --> 00:48:51,730 + 하나는 내가 지금 당신을 보여 있도록 잘 손실의 다른 형태가되어 있습니다 + +593 +00:48:51,730 --> 00:48:55,990 + 합의 오후 제제는 몇 가지 더 SPM 제제가 그리고 거기에 있었다 + +594 +00:48:55,989 --> 00:49:01,098 + 또한 여기에 소프트 맥스는 내가 우리의 손실이 스위셔 소프트 최대 손실을 볼 때 것을 볼 수 있습니다 + +595 +00:49:01,099 --> 00:49:06,670 + 다른하고 있지만 솔루션은 I 스위치 그렇게 할 때 거의 같은되고있다 + +596 +00:49:06,670 --> 00:49:10,700 + 다시 그에게 당신은 작은 조각 주위에 이동 플레이어의 유형을 알고 있지만 정말이야 + +597 +00:49:10,699 --> 00:49:21,558 + 그것은 대부분 동일합니다이 얼마나 얼마나 큰 단계 그래서 그래서 이것은 단지 크기 + +598 +00:49:21,559 --> 00:49:25,650 + 우리는 너무 많은 약속 일을 개선하는 방법에 그라데이션을 얻을 때 우리는하고 있습니다 + +599 +00:49:25,650 --> 00:49:29,119 + 우리는 장면이 킥킥 웃고하려고하는 매우 큰 상승을 시작한다 + +600 +00:49:29,119 --> 00:49:32,309 + 이러한 데이터 포인트를 분리 한 후 시간이 지남에 따라 우리는에서 일을 할거야 + +601 +00:49:32,309 --> 00:49:36,430 + 우리가 우리의 업데이 트의 눈을 감소거야로 위치와이 일을하거나 + +602 +00:49:36,429 --> 00:49:43,298 + 천천히 우리는 결국 원하는 그래서 그래서 당신은 재생할 수있는 전제 수렴 + +603 +00:49:43,298 --> 00:49:47,170 + 우리와 함께 당신은 그가 점수는 손실이 나는 경우를 주위에 가서 어떤 방법을 볼 수 있습니다 + +604 +00:49:47,170 --> 00:49:53,358 + 당신이 이러한 점을 드래그 할 수 있습니다 반복 갱신을 중지하지만 맥 그것을 생각 + +605 +00:49:53,358 --> 00:49:58,598 + 내가 그렇게 좋은 사라이 점을 드래그하려고 그렇게 작동하지 않습니다 + +606 +00:49:58,599 --> 00:50:02,479 + 하지만 바탕 화면에서 작동 그래서 내가 가서 무슨 일이 있었 정확히 파악하지 + +607 +00:50:02,478 --> 00:50:14,480 + 그러나이 거기 재생할 수 있습니다 + +608 +00:50:14,480 --> 00:50:30,840 + 우리는 이것이 다른 한 도면이다 데이터 플러스 정규화 이상으로 평균 손실이 + +609 +00:50:30,840 --> 00:50:35,240 + 나는 그것이 아주 좋은도 생각하지 않습니다처럼이 어떻게 
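The captions around this point describe random search as a (deliberately bad) baseline optimizer: sample many random weight matrices, keep whichever gives the lowest loss, and on CIFAR-10 this lands near 15.5% accuracy versus a 10% chance baseline. A sketch under the assumption that you supply your own `loss_fn` and training data:

~~~python
import numpy as np

# Random search over W: no gradients, just sampling and keeping the best.
def random_search(loss_fn, X_train, y_train, num_tries=1000):
    best_loss, best_W = float('inf'), None
    for _ in range(num_tries):
        W = np.random.randn(10, 3073) * 1e-4   # CIFAR-10-like shape
        loss = loss_fn(X_train, y_train, W)
        if loss < best_loss:                   # track the best W seen
            best_loss, best_W = loss, W
    return best_W, best_loss
~~~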
생겼는지 한 방법을 보여 + +610 +00:50:35,239 --> 00:50:38,858 + 내가 작년 기억할 수없는 그것에 대해 혼란 거기에 뭔가하지만, + +611 +00:50:38,858 --> 00:50:45,269 + 기본적으로이 데이터가 왜 이미지 레이블 및 W있다 + +612 +00:50:45,269 --> 00:50:49,719 + 이 과정을 유지하고 소송을 얻고 정규화 손실 + +613 +00:50:49,719 --> 00:50:54,939 + 우리가 지금 무엇을 원하는가 아닌 데이터와 아저씨의 무게의 기능 + +614 +00:50:54,940 --> 00:50:58,608 + 우리는 우리에게 주어진 것 바로 데이터 세트 우리가 제어 할 수없는된다 + +615 +00:50:58,608 --> 00:51:04,130 + 그 w를 제어하고 우리가 손실 W 변경할로 다를 수 있습니다 어떤을 위해 그렇게 + +616 +00:51:04,130 --> 00:51:08,340 + W 내가 손실을 계산할 수 나를 포기하고 그 손실은 우리가있어 얼마나 잘 연결되어 있습니다 + +617 +00:51:08,340 --> 00:51:12,730 + 우리의 모든 예제를 분류하는 것은 낮은 손실은 세계 최고 수준의 발견을 의미 한 가지도록 + +618 +00:51:12,730 --> 00:51:15,880 + 그들을 아주 잘의 훈련 데이터, 그리고, 우리는 우리의 손가락이 교차하고 + +619 +00:51:15,880 --> 00:51:20,809 + 또한 우리가 여기 보지 못했다 테스트 데이터에서 작동하는 하나의 전략이다 + +620 +00:51:20,809 --> 00:51:26,139 + 우리가 어떤에 대한 손실을 평가할 수 있기 때문에 있도록 최적화는 임의의 검색입니다 + +621 +00:51:26,139 --> 00:51:30,500 + 임의 W는 때 나는 어떻게 감당할 수와 내가 메신저를 통해 이동 해달라고 있는지 확실하지 않습니다 + +622 +00:51:30,500 --> 00:51:34,480 + 이 전체 상세히하지만 효과적으로 나는 무작위로 샘플링 나는 확인할 수 있습니다 자신의 + +623 +00:51:34,480 --> 00:51:37,460 + 손실 난 그냥 가장 적합한 W 추적 할 수 있습니다 + +624 +00:51:37,460 --> 00:51:43,090 + 좋아, 그래서 점점 점검과 최적화의 놀라운 과정이다 + +625 +00:51:43,090 --> 00:51:46,760 + 이 작업을 수행 할 경우, 나는이 작업을 수행 할 경우 내가 이천 번 시도 생각 밝혀 + +626 +00:51:46,760 --> 00:51:50,970 + 천 번과 최고의 W는 무작위로 발견 걸릴 당신은 당신의 좌석에서 실행 + +627 +00:51:50,969 --> 00:51:56,108 + 당신이 약 15.5 %의 정확도로 결국 그냥 만든 데이터를 바텐더와 + +628 +00:51:56,108 --> 00:52:01,150 + 그들이 행동하고 있기 때문에 클래스는 10 %의 확률로 평균 기준은 + +629 +00:52:01,150 --> 00:52:06,559 + 성능 때문에 15.5이 실제로 특히와 예술의 매우 상태 일부 신호 + +630 +00:52:06,559 --> 00:52:10,219 + 아흔다섯 공통입니다 그래서 우리는 몇 가지를 통해 너무 가까이 가지고 있다는 점이다 + +631 +00:52:10,219 --> 00:52:10,980 + 다음 + +632 +00:52:10,980 --> 00:52:17,670 + 이 슬라이드 하나에 그냥 있기 때문에 2 주 정도 그래서 이것은 그래서이를 사용하지 않는있다 + +633 +00:52:17,670 --> 00:52:21,659 + 이 프로세스 최적화처럼 정확히 어떤이의 해석은 보인다 + +634 +00:52:21,659 --> 00:52:25,399 + 우리가이 손실 풍경을 가지고 바로이 손실 풍경이 높은에 + +635 +00:52:25,400 --> 00:52:32,619 + 우리는 그 다음 차원에서 앉아서 당신의 손실 높이 여기도록 차원 W 공간 + +636 +00:52:32,619 --> 00:52:38,369 + 당신은 단지 2 W의이 경우가 있고 당신은 여기 그리고 당신 (W) 눈을 가리고있어 + +637 +00:52:38,369 --> 00:52:42,269 + 계곡이 있지만 당신은 당신이있는 한 낮은 손실을 찾기 위해 노력하고 위치를 볼 수 있습니다 + +638 +00:52:42,269 --> 00:52:45,699 + 눈을 가린 당신은 고도 측정기를 가지고 있고 그래서 당신은 무엇을 말할 수 있습니다 + +639 +00:52:45,699 --> 00:52:49,029 + 단일 지점에서의 손실과 당신의 하단에 도착하기 위해 노력하고 + +640 +00:52:49,030 --> 00:52:55,430 + 계곡 오른쪽 그래서 정말 최적화하는 과정이고 우리가했습니다 + +641 +00:52:55,429 --> 00:52:59,399 + 당신은 것 도시 실제로 지금까지 당신이 순간 이동이 임의의 최적화로 + +642 +00:52:59,400 --> 00:53:03,309 + 주위에 당신은 단지 우리가있어 너무 좋아 너무 좋은 생각을 당신의 고도를하지 확인 + +643 +00:53:03,309 --> 00:53:06,940 + 우리가 그라데이션으로 내가 참조 무​​엇을 사용하려고하고있다 대신 할 예정이나 + +644 +00:53:06,940 --> 00:53:12,800 + 정말 우리가 너무 난 모든 단일 방향으로 가로 질러 기울기를 계산하고 + +645 +00:53:12,800 --> 00:53:17,990 + 기울기를 계산하기 위해 노력하고 우리가있어 그래서 내리막 확인 갈거야 + +646 +00:53:17,989 --> 00:53:21,289 + 나는이 있지만, 너무 많은 세부 사항으로 갈 않을거야 기울기를 다음 + +647 +00:53:21,289 --> 00:53:24,779 + 기본적으로 그렇게 정의 된 그라데이션 표현이있다 + +648 +00:53:24,780 --> 00:53:31,859 + 파생 포퓰리즘 (101) 정의와 여러 차원 경우가있다 + +649 +00:53:31,858 --> 00:53:35,409 + 당신은 그라데이션 권리라고있어 파생 상품의 이사가 + +650 +00:53:35,409 --> 00:53:39,589 + 우리의 승 여러 차원을 여러가 그렇게 때문에 우리는 그라데이션 벡터가 + +651 +00:53:39,590 --> 00:53:45,660 + 확인 그래서 이것은 표현이며, 실제로 우리는 수치를 평가할 수 있습니다 + +652 +00:53:45,659 --> 00:53:48,769 + 식 그게 보일 것 무엇을 표시하는 방법 논어에 가기 전에 + +653 +00:53:48,769 --> 00:53:54,190 + 일부 W의 그라데이션 우리는 약간의 현재 W를 가지고 우리가있어 가정 평가하려면 + +654 +00:53:54,190 --> 00:53:58,500 + 우리는 경사에 대한 아이디어를 얻을 싶지 않아하고 싶은 일부 손실이 
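The next stretch of captions evaluates the gradient numerically, one dimension at a time, using the finite-difference formula (f(x + h) - f(x)) / h. A sketch of that procedure; `f` is assumed to be a scalar loss function of the parameter array `x`:

~~~python
import numpy as np

# Finite-difference numerical gradient: nudge each dimension by a small h
# and measure how much the loss moves.
def eval_numerical_gradient(f, x, h=1e-5):
    grad = np.zeros_like(x)
    fx = f(x)                          # loss at the current point
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h                # step along one dimension
        fxh = f(x)
        x[ix] = old                    # restore the original value
        grad[ix] = (fxh - fx) / h      # slope in that dimension
        it.iternext()
    return grad
~~~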
좋아지고 + +655 +00:53:58,500 --> 00:54:03,239 + 그래서이 시점에서 우리는 기본적으로이 공식에서 볼거야 그리고 우리는있어 + +656 +00:54:03,239 --> 00:54:07,329 + 그냥 첫 번째 차원에 갈거야 그래서 평가에 가서 내가 갈거야 + +657 +00:54:07,329 --> 00:54:11,840 + 정말 어떻게이 수행 할 수 말하고있는 것은 폭발 고도를 평가입니다 + +658 +00:54:11,840 --> 00:54:15,590 + 크리스마스 H에 H에 의해 FFX 및 분할에서 제외 + +659 +00:54:15,590 --> 00:54:19,800 + 어떤 날의 작은 단계를 복용이 풍경 것으로 해당 응답 + +660 +00:54:19,800 --> 00:54:23,130 + 어떤 방향으로할지 여부를보고 내 발은 올라 갔다 아래로 + +661 +00:54:23,130 --> 00:54:27,340 + 바로 그 때문에 나는 작은 단계를 데리고 어떤 기울기가 말해가요 + +662 +00:54:27,340 --> 00:54:32,150 + 잃어버린 1.25 그때 유한 차이가 그 공식을 사용할 수있다 + +663 +00:54:32,150 --> 00:54:36,230 + 근사 우리는 작은 H이 실제로 파생 검토가 여기에 그라데이션 + +664 +00:54:36,230 --> 00:54:41,199 + 마이너스 2.5로 기울기는 아래로 그래서 나는 단계에게 손실을 가져다가 이렇게 감소 + +665 +00:54:41,199 --> 00:54:45,480 + 점에서 손실 함수의 관점 그래서 마이너스 2.5 하향 경 + +666 +00:54:45,480 --> 00:54:49,369 + 특히 치수는 그래서 독립적으로 모든 단일 차원에 대해이 작업을 수행 할 수 있습니다 + +667 +00:54:49,369 --> 00:54:53,210 + 나는에 단계 그래서 바로 그래서 내가 작은 금액을 추가 두 번째 차원으로 이동 + +668 +00:54:53,210 --> 00:54:56,869 + 나는 손실에 무슨 일이 있었는지를 보면 다른 방향 나는 공식 것을 사용 + +669 +00:54:56,869 --> 00:55:00,969 + 및 기울기가 2.6 인 그라데이션 내가 세 번째에 해당 할 수 있다고 말해됩니다 + +670 +00:55:00,969 --> 00:55:06,429 + 기본적으로 내가 여기서 말하는 겁니다 치수와 나는 확인 너무 슬퍼하세요 + +671 +00:55:06,429 --> 00:55:11,149 + 척추의 차를 사용하는 성분 수치를 평가 + +672 +00:55:11,150 --> 00:55:14,539 + 모든 단일 차원에 대해 독립적으로 내가 걸릴 수 있습니다 근사 + +673 +00:55:14,539 --> 00:55:18,500 + 작은 손실에 단계 그리고 나에게 느린 지시가 위쪽으로 갈거나 + +674 +00:55:18,500 --> 00:55:23,829 + 아래 이러한 매개 변수의 모든 하나 하나 등이 미국이다 + +675 +00:55:23,829 --> 00:55:28,500 + 그것은 추한 보이는 여기를 피하는이 그것과 같을 것이다 방법을 사기 펑크가된다 그라데이션 + +676 +00:55:28,500 --> 00:55:32,630 + 그것이 나오는 것에 있기 때문에 반복 약간 까다로운 모든 W의 만 + +677 +00:55:32,630 --> 00:55:36,780 + 기본적으로 우리는 단지 두 가지 효과를 비교로 나누어 나이에 찾고 + +678 +00:55:36,780 --> 00:55:41,200 + 우리가 계약을 받고있어 나이 사용할 경우 지금의 문제입니다 + +679 +00:55:41,199 --> 00:55:44,960 + 물론 수치 그라데이션 이벤트는 우리가 매일이 작업을 수행해야 + +680 +00:55:44,960 --> 00:55:47,949 + 차원은 어떤 위대한 스피 이러한 노력 단일 차원의 감각을 얻을 수 + +681 +00:55:47,949 --> 00:55:53,079 + 당신은 코멘트를 때 오른쪽 당신은 매개 변수의 수백만의 수백 + +682 +00:55:53,079 --> 00:55:58,139 + 바로 그래서 우리는 실제로 수억의 손실을 확인 할 여유가 없다 + +683 +00:55:58,139 --> 00:56:02,920 + 예비 선거의 우리는 한 단계 우리가하려고 것 때문에이 방법을하기 전에 + +684 +00:56:02,920 --> 00:56:06,869 + 우리는 유한 사용하고 있기 때문에 평가 그라데이션 수치 대략적인 + +685 +00:56:06,869 --> 00:56:11,119 + 내가 할 필요가 있기 때문에 차분 근사 둘째도 매우 느립니다 + +686 +00:56:11,119 --> 00:56:15,460 + 아이콘의 손실 함수에 만 확인 내가 무엇을 알기도 전에 것이 + +687 +00:56:15,460 --> 00:56:20,519 + 그라데이션 나는 매우 느린 대략적인 회전 수 있도록 차 업데이트를 취할 수 없습니다 + +688 +00:56:20,519 --> 00:56:26,730 + 이 때문에 바보 권리는 모든 것을 밖으로 때문에 W의 함수로 손실 + +689 +00:56:26,730 --> 00:56:29,800 + 우리는 그것에 대해 서면으로 작성했습니다 정말 우리가 원하는 것을 우리는의 기울기를 원하는됩니다 + +690 +00:56:29,800 --> 00:56:33,220 + 각각 11 마지막 운 좋게 우리는 단지를 쓸 수 있습니다 + +691 +00:56:33,219 --> 00:56:42,598 + 이 녀석 덕분에 실제로 당신이 바로 그 사람이하고있는 사람을 알고 + +692 +00:56:42,599 --> 00:56:49,400 + 단지 모양이 매우 비슷하지만, 기본적으로 얻을 수있는 것입니다 알고 + +693 +00:56:49,400 --> 00:56:54,289 + 미적분학의 발명자에 이런 일이 실제로 논란이있다 + +694 +00:56:54,289 --> 00:56:59,429 + 이상 사람 정말 발명 한 미적분시키고이 사람이 서로를하지만, + +695 +00:56:59,429 --> 00:57:03,799 + 기본적으로 수학이 강력한 망치이며, 우리가 할 수있는 일이 아닌 것입니다 + +696 +00:57:03,800 --> 00:57:06,440 + 어리석은 일을의 우리는 우리가 할 수있는 수치 그라데이션을 평가하고 + +697 +00:57:06,440 --> 00:57:10,230 + 실제로 수학을 사용하고 우리는 어떤 기울기에 대한 표현을 분해 할 수 있습니다 + +698 +00:57:10,230 --> 00:57:14,880 + 공백의 손실 함수 떨어져 그래서 기본적으로 대신 멍청이 + +699 +00:57:14,880 --> 00:57:18,289 + 주변이 작업은 최대 것입니다 아니면 손실 I을 선택하여 추락 + +700 +00:57:18,289 --> 00:57:22,509 + 그냥이의 그라데이션을 표현을하고 난 간단하게 동기화 할 수 있습니다 + +701 
+00:57:22,510 --> 00:57:26,500 + 전체 물질이 실제로있는 유일한 방법을 실행할 수 있다는 것입니다 무엇 평가 + +702 +00:57:26,500 --> 00:57:30,159 + 년 동안이 연습의 권리를 우리는 할 수 있습니다 만 표현 우리가 할 수있는 그라데이션 + +703 +00:57:30,159 --> 00:57:35,149 + 그래서 중지하려면 어떻게 요약 기본적으로 숫자 그라데이션 대략에게에 너무 + +704 +00:57:35,150 --> 00:57:39,800 + 느리지 만 그냥이 매우 간단한 일을하고 있기 때문에 쓰기 아주 쉽게 + +705 +00:57:39,800 --> 00:57:44,190 + 손해 나 손실에 기능에 대한 처리 난에 대한 그라데이션 벡터를 얻을 수 있습니다 + +706 +00:57:44,190 --> 00:57:47,659 + 당신이 실제로 할 것입니다 구배는 정확한에는 유한을 수학하지 + +707 +00:57:47,659 --> 00:57:52,210 + 포고문 매우 빨리하지만 오류가 발생하기 쉬운 당신이 실제로에 있기 때문이다 + +708 +00:57:52,210 --> 00:57:57,300 + 실제로 바로 그래서 수학을 당신이 무엇을보고 우리는 항상 그라데이션을 많이 사용 + +709 +00:57:57,300 --> 00:58:01,380 + 우리는 항상 우리가 그라데이션해야 알아낼 수학을하지만, + +710 +00:58:01,380 --> 00:58:04,789 + 그것의 언급으로 미국의 그라데이션 체크를 사용하여 구현을 확인 + +711 +00:58:04,789 --> 00:58:10,480 + 그래서 나는 내가 작성해야합니다 나는 손실 함수에 대한 관심을 모두 수행 할 수 있습니다 + +712 +00:58:10,480 --> 00:58:15,500 + 내 코드에서 평가 된 그라데이션 표현은 그래서 휴가를 얻을 + +713 +00:58:15,500 --> 00:58:18,769 + 인사는 다음 나는 또한 측면에와 있음을 수치 그라데이션의 리드가 + +714 +00:58:18,769 --> 00:58:22,280 + 잠시 필요하지만 당신은 당신에게 더 편리 리드가 성숙하면 확인 + +715 +00:58:22,280 --> 00:58:25,890 + 그 두 가지가 동일하고 우리는 당신이 녹색을 통과라고 확인 + +716 +00:58:25,889 --> 00:58:29,500 + 그래서 트럭 확인 당신이 개발을 시도 할 때마다 당신이 실제로 무엇을보고있어 + +717 +00:58:29,500 --> 00:58:32,519 + 내부 네트워크에 대한 새로운 모듈은 바로 내가 할 수있는 권리를 잃은 것하고 + +718 +00:58:32,519 --> 00:58:35,759 + 당신이 확인해야 다음 그라데이션 완전하고 대한 후방 패스 + +719 +00:58:35,760 --> 00:58:40,250 + 그라데이션은 당신의 수학이 정확한지 확인해 확인하고 I + +720 +00:58:40,250 --> 00:58:43,980 + 이미 우리는에 잘 보았다 최적화이 과정 함 + +721 +00:58:43,980 --> 00:58:45,838 + 우리가이 웹 데모 + +722 +00:58:45,838 --> 00:58:49,548 + 루프는 우리는 당신의 손실에 어디서 단순히 밸리 그라데이션을 최적화 할 때 + +723 +00:58:49,548 --> 00:58:53,759 + 때 그라데이션을 아는 기능, 그리고, 우리는 프라이머 업데이트를 수행 할 수 있습니다 + +724 +00:58:53,759 --> 00:58:58,509 + 우리가 부정적으로 업데이트 할 특히 WBI 작은 양을 변경할 + +725 +00:58:58,509 --> 00:59:04,509 + 스텝 사이즈 배 네거티브 기울기 때문에 존재 그래디언트 + +726 +00:59:04,509 --> 00:59:07,478 + 그것은 당신을 알려줍니다 가장 큰 증가의 방향을 알려줍니다 끝까지 + +727 +00:59:07,478 --> 00:59:10,848 + 손실이 증가하고 부정적인 그것은 어디 인을 최소화하려면 + +728 +00:59:10,849 --> 00:59:14,298 + 어디에서 오는 가서 여기에 음의 판독 방향 스텝 크기로 + +729 +00:59:14,298 --> 00:59:17,818 + 당신에게있는 두통의 스텝 크기의 거대한 양의가 발생할 하이퍼 차 + +730 +00:59:17,818 --> 00:59:23,298 + 비율이 기본적으로 걱정하는 가장 중요한 매개 변수입니다 학습 + +731 +00:59:23,298 --> 00:59:27,778 + 정말 당신은 스텝 크기 또는 대부분에 대해 걱정할 필요가 두 개의있다 + +732 +00:59:27,778 --> 00:59:31,539 + 속도를 학습하고 정규화 강도 레임덕이있다 그 + +733 +00:59:31,539 --> 00:59:35,180 + 우리가 이미 보았던 그 두 개의 매개 변수 정말이 가장 큰 두통이며, + +734 +00:59:35,179 --> 00:59:45,219 + 그것은 우리가 몸 도버에 대한 질문이었다 교차 무엇 일반적이다하지만 그 아니다 + +735 +00:59:45,219 --> 00:59:50,849 + 큰 단지 위대하고 그것은 당신에게 경사 매 방향으로 다음을 알려줍니다 + +736 +00:59:50,849 --> 00:59:56,109 + 우리는 단지 단계 그래서 무게에 반대의 과정에 의한 조치를 취할 + +737 +00:59:56,108 --> 01:00:00,768 + 공간은 W에 어딘가에 당신이 당신의 그라데이션 어떠한 월 일부 금액을 얻을 수있다 + +738 +01:00:00,768 --> 01:00:05,228 + 그러나 그라디언트의 방향으로 당신은 얼마나 알고하지 않도록하는 단계 + +739 +01:00:05,228 --> 01:00:08,449 + 크기와 내가 데모 일에 스텝 크기를 증가시 더라 + +740 +01:00:08,449 --> 01:00:11,248 + 꽤 많은 주위에 기뻐 생성 바로 에너지없이 많이 있었다 + +741 +01:00:11,248 --> 01:00:15,449 + 나는 거대한 복용했기 때문에의 시스템은 모든 그래서 여기에이 기본 이상 및 점프 + +742 +01:00:15,449 --> 01:00:19,578 + 손실 함수가 파란색 부분에 최소이며 보고서에서 높은입니다 + +743 +01:00:19,579 --> 01:00:23,920 + 그래서 우리는 분지의 일부로서 그들에게 싶어이 손실로 실제로 + +744 +01:00:23,920 --> 01:00:28,579 + 기능은 공주 오후 모양 또는 재량 그래서 우리의 복잡한 문제입니다 + +745 +01:00:28,579 --> 01:00:31,729 + 정말 그냥 그릇 그리고 우리는 그것의 바닥하지만이 그릇에 도착하기 위해 노력하고 + +746 +01:00:31,728 --> 01:00:35,009 + 잠시 걸리는 이유 등 30,000 차원 그래서입니다 + +747 +01:00:35,010 --> 01:00:39,640 + 
확인 그래서 우리는 조치를 취할 우리는 우리가 그라데이션을 평가하고 이상이 반복 + +748 +01:00:39,639 --> 01:00:44,980 + 이상 실제로 우리가하지 않는 위치를 언급 원이 추가 부분이있다 + +749 +01:00:44,980 --> 01:00:49,860 + 실제로 전체 훈련에 대한 손실은 우리가하는 모든 일이 실제로 한 평가 + +750 +01:00:49,860 --> 01:00:53,370 + 우리는 단지 일을 읽고 나에게 다시 전화 무엇을 사용합니다. 우리는이 곳 + +751 +01:00:53,369 --> 01:00:58,670 + 말처럼 우리는 판매를 샘플링 때문에 전체 데이터 세트는하지만 우리는 그것에서 일괄 샘플 + +752 +01:00:58,670 --> 01:01:02,300 + 삼십로서는 내 트레이닝 데이터로부터 I 그래디언트의 손실을 평가할 + +753 +01:01:02,300 --> 01:01:05,940 + 다음 32의이 배치에 내 차 업데이 트를 알고 나는이 일을 계속 + +754 +01:01:05,940 --> 01:01:09,619 + 계속해서 또 다시하고 있는지 확인 무슨 일이 끝나는 것은 경우에만 샘플입니다 + +755 +01:01:09,619 --> 01:01:14,699 + 다음, 트레이닝 데이터의 기울기중인 추정치 거의 데이터 포인트 + +756 +01:01:14,699 --> 01:01:18,109 + 당신은 단지이기 때문에 전체 학습 집합을 통해 코스 종류의 잡음이 + +757 +01:01:18,110 --> 01:01:21,970 + 데이터의 작은 하위 집합을 기반으로하지만 좀 더 단계 수 추정하는 + +758 +01:01:21,969 --> 01:01:25,689 + 그래서 당신은 대략 그라데이션 더 많은 단계를 수행하거나 몇 가지 작업을 수행 할 수 있습니다 + +759 +01:01:25,690 --> 01:01:30,179 + 저를 사용하면 더 나은 작업을 끝 무엇 정확한 그라데이션과 연습 단계 + +760 +01:01:30,179 --> 01:01:35,049 + 다시 그것을 훨씬 효율적 물론이고 실제로는 비실용적이다 + +761 +01:01:35,050 --> 01:01:41,550 + 풀백 그라데이션 하강는 많은 크기 32 64 128 256이 아닌 오는가 + +762 +01:01:41,550 --> 01:01:45,940 + 일반적으로 하이퍼 주로에 따라 너무 많은 보통 정착 걱정대로 + +763 +01:01:45,940 --> 01:01:49,380 + 우리는 비트에서 BP의 이야기 될 것하지만하고 당신의 GPU에 맞는 사람들 + +764 +01:01:49,380 --> 01:01:53,030 + 메모리의 유한 한 양이 6기가바이트 등에 대해 말하거나 그것의 좋은 이야기 + +765 +01:01:53,030 --> 01:01:58,030 + GPU 일반적으로 후면을 선택 같은 날 작은 다시 예에서 뱉어 그 + +766 +01:01:58,030 --> 01:02:01,150 + 당신의 기억은 그래서 그 용어의 방법 일반적으로 그리고 그것은 기본이 아니다 그 + +767 +01:02:01,150 --> 01:02:09,570 + 실제로 많은 및 최적화 감각을 중요 + +768 +01:02:09,570 --> 01:02:14,789 + 우리는 약간의 모멘텀을받을거야하지만 당신은 그 기세를 사용하려는 경우 + +769 +01:02:14,789 --> 01:02:18,969 + 이것은 우리가 항상 모멘텀이 매우 일반적인를 보낼 수 있지만 노력에 대해하고 괜찮아요 + +770 +01:02:18,969 --> 01:02:23,799 + 그래서 그냥 할 당신이 연습하는 경우에 어떻게 보일까의 아이디어를 제공합니다 + +771 +01:02:23,800 --> 01:02:28,510 + 나는 최적화 초과 근무를 실행하고 있는데 난 그냥 평가 로스 찾고 있어요 + +772 +01:02:28,510 --> 01:02:32,700 + 작은 많은 데이터를 배치하고 당신은 기본적으로 내 손실이 아래로가는 것을 볼 수 있습니다 + +773 +01:02:32,699 --> 01:02:37,309 + 학습 데이터에서 이러한 여러 일괄 처리에 시간이 지남에 그래서 난 최적화로 + +774 +01:02:37,309 --> 01:02:42,119 + 나는이 때문에 그라데이션 하강 주가 하락을하는 경우 지금은 물론 내리막 것 + +775 +01:02:42,119 --> 01:02:44,839 + 그냥 날 다시 데이터의 샘플 당신은 많은 소음을 기대하지 않을 것이다되지 않았습니다 + +776 +01:02:44,840 --> 01:02:48,550 + 당신은 단지 우리가 저를 사용하기 때문에하지만 내려갑니다이가 정렬 될 것으로 예상 + +777 +01:02:48,550 --> 01:02:51,730 + 당신에 대해 뭔가 더 나은 있기 때문에 당신이이 노이즈를 얻을 다시하는 경우 + +778 +01:02:51,730 --> 01:03:01,980 + 다른 사람보다하지만 시간이지나면서 그들은 모두가 질문을 아래로 갈 수있다 + +779 +01:03:01,980 --> 01:03:07,539 + 예 선생님 당신은 당신이 사용하는이 손실 함수의 형태에 대해 궁금 + +780 +01:03:07,539 --> 01:03:11,420 + 어쩌면 더 빠른 개선을보고 바로 이러한 손실 함수에 와서 있습니다 + +781 +01:03:11,420 --> 01:03:17,079 + 정말 그것이 반드시 그렇지 않다 달려있다, 그래서 다른 모양은 크기 + +782 +01:03:17,079 --> 01:03:21,940 + 그 손실 함수는 처음에 있지만 때로는 매우 날카로운 봐야한다 그들이 + +783 +01:03:21,940 --> 01:03:25,929 + 그들은 그것은 또한 당신에 중요한 예를 들어 다른 모양을 가지고 수행 + +784 +01:03:25,929 --> 01:03:29,618 + 내 초기화 조심 해요 경우 초기화 내가 덜 기대 + +785 +01:03:29,619 --> 01:03:34,990 + 점프하지만 매우 잘못 초기화하면 당신은 그 기대 + +786 +01:03:34,989 --> 01:03:38,649 + 그것은 우리가받을거야 최적화에 매우 초기에 해결 될 것 + +787 +01:03:38,650 --> 01:03:43,309 + 그 부분의 일부 나중에 나는 또한 당신에게 많이 보여주고 싶은 많은 생각 + +788 +01:03:43,309 --> 01:03:49,710 + 하여 손실 함수와 계속 학습의 학습 속도의 영향 + +789 +01:03:49,710 --> 01:03:53,820 + 속도는 기본적으로는 스텝 사이즈의 학습 속도 또는 스텝 사이즈하면 매우 높다 + +790 +01:03:53,820 --> 01:03:59,240 + 당신의 W 공간에서 주위에 돌진 시작하고 그래서 난 수렴 해달라고 또는 당신은 당신이 경우 폭발 + +791 +01:03:59,239 --> 01:04:02,618 + 당신은 거의 또한 업데이트를하고있어 다음 매우 낮은 학습 속도가 + +792 
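The captions here describe mini-batch gradient descent: sample a batch of around 32 training examples, estimate the gradient on that batch, and step against it, accepting a noisier but much cheaper loss signal. A hedged sketch; `loss_and_grad`, the batch size of 32, and the step size are assumptions:

~~~python
import numpy as np

# Mini-batch gradient descent loop: estimate the gradient on a small
# random batch and take a downhill step.
def train(loss_and_grad, data, labels, W, step_size=1e-3, iters=1000):
    for _ in range(iters):
        idx = np.random.choice(len(data), 32, replace=False)
        loss, grad = loss_and_grad(data[idx], labels[idx], W)
        W += -step_size * grad         # parameter update: negative gradient
    return W
~~~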
+01:04:02,619 --> 01:04:07,869 + 실제로 수렴 시간이 오래 걸리고는 높은 교육이 있다면 + +793 +01:04:07,869 --> 01:04:11,150 + 요금은 때때로 당신은 기본적으로 나쁜 위치에 붙어의 종류를 얻을 수 있습니다 + +794 +01:04:11,150 --> 01:04:14,950 + 당신이 손실 함수의 종류 때문에 손실이 그렇다면 최소한으로 내려받을 필요 + +795 +01:04:14,949 --> 01:04:17,929 + 당신이 없을 때 당신은 너무 빨리 당신의 스타킹에 너무 많은 에너지를 가지고 + +796 +01:04:17,929 --> 01:04:21,679 + 당신은 당신의 문제가 가지 작은 로컬 최소값에 정착하는 것을 허용하지 않습니다 + +797 +01:04:21,679 --> 01:04:25,480 + 당신은 신경 네트워크 및 최적화에 대해 일반적으로 당신의 목표를 이야기 할 때 + +798 +01:04:25,480 --> 01:04:28,320 + 즉 우리가 통신 할 수있는 유일한 방법이기 때문에 당신은 손을 흔들며을 많이 볼 수 있습니다 + +799 +01:04:28,320 --> 01:04:32,350 + 이러한 손실과 거리가 그래서 그냥 큰 손실의 분지와 같은 상상 + +800 +01:04:32,349 --> 01:04:36,069 + 당신은 탈곡하는 경우 이러한 작은 손실 등의 작은 주머니 같은있다 + +801 +01:04:36,070 --> 01:04:39,480 + 주위에 당신은 작은 손실 부품 컨버터에 정착 할 수 + +802 +01:04:39,480 --> 01:04:43,730 + 그 때문에 그 이유는 학습 속도 너무 좋은 그래서 올바른을 찾을 필요 + +803 +01:04:43,730 --> 01:04:47,150 + 속도를 배우는 것은 많은 두통의 원인과 사람들은 대부분 무엇을 할 + +804 +01:04:47,150 --> 01:04:49,970 + 시간은 때때로 우리가 어떤 혜택을받을 높은 학습 속도로 시작된다 + +805 +01:04:49,969 --> 01:04:55,319 + 다음 높은 함께 시작하는 데 시간이 지남에 그것을 UDK 그리고, 우리는 학습 타락 + +806 +01:04:55,320 --> 01:05:00,780 + 우리는 좋은 해결책에 정착하고 시간이 지남에 읽고 나는 또한 원하는 + +807 +01:05:00,780 --> 01:05:03,550 + 훨씬 더 자세하게거야하지만 방법은 내가 일을 해요 누가 지적 + +808 +01:05:03,550 --> 01:05:07,890 + 실제로 W을 수정 구배를 사용하는 방법이다 여기서 업데이트 + +809 +01:05:07,889 --> 01:05:12,789 + 그 일을 여러 가지 형태가 업데이트 펌웨어 업데이트라고 + +810 +01:05:12,789 --> 01:05:14,869 + 그것은이 있었다 가장 간단한 방법입니다 + +811 +01:05:14,869 --> 01:05:20,299 + 다만 STD 간단한 사용자 지정 인사말 %가되지만, 많은 공식이있다 + +812 +01:05:20,300 --> 01:05:23,740 + 당신이있어 이미 모멘텀에 언급 된 모멘텀은 기본적으로 상상 + +813 +01:05:23,739 --> 01:05:27,949 + 이 최적화를 수행하면은 그래서이 블로그 도시의 트랙을 유지 상상 + +814 +01:05:27,949 --> 01:05:31,389 + 나는 긍정적를보고 계속 그래서 만약 또한 내 속도의 트랙을 유지하고 스테핑 해요 + +815 +01:05:31,389 --> 01:05:35,519 + 내가 그 방향으로 속도를 축적 몇 가지 방향을 읽고 그래서 난 몰라 + +816 +01:05:35,519 --> 01:05:39,550 + 러시아에서 빨리 갈 사람이 필요하고 그래서 로스에서 몇 가지가 있습니다 것 + +817 +01:05:39,550 --> 01:05:46,100 + 보고 일반적으로 곧 클래스하지만 토마스 소품 아담 또는 그래서 그냥에 사용 + +818 +01:05:46,099 --> 01:05:50,569 + 이러한 서로 다른 선택의 모양을 그들이 할 수있는 것을 보여 + +819 +01:05:50,570 --> 01:05:56,760 + 당신의 손실 함수이 우리가 손실을 가지고, 그래서 여기에 알렉에서 그림입니다 + +820 +01:05:56,760 --> 01:06:02,390 + 기능과 이러한 낮은 수준의 점원 그리고 우리는 저기 반대를 시작 + +821 +01:06:02,389 --> 01:06:06,920 + 우리는 당신을 줄 것이다 유역과 다른 업데이트 공식에 도착하기 위해 노력하고 + +822 +01:06:06,920 --> 01:06:10,670 + 당신은 예를 들어이 서로 다른 문제에 좋든 나쁘 든 수렴을 볼 수 있도록 + +823 +01:06:10,670 --> 01:06:15,369 + 녹색 모멘텀이 내려 갔다으로는 모멘텀을 구축 한 다음은 오버 슈팅과 + +824 +01:06:15,369 --> 01:06:19,259 + 다음은 가지 돌아가 다시 돌아가 UD 등이 읽을 수 수렴 영원히 소요 + +825 +01:06:19,260 --> 01:06:23,370 + 그것은 그녀가 등장하고 있습니다 영원히 걸립니다 내가 지금까지 당신을 제시 무엇 + +826 +01:06:23,369 --> 01:06:27,489 + 실제로이 차 위로가 수행하는 다른 방법은 더 많거나 적은 + +827 +01:06:27,489 --> 01:06:35,259 + 현대화 효율적인 내가 또한에서 언급하고 싶었이 훨씬 더 볼 수 있습니다 + +828 +01:06:35,260 --> 01:06:39,950 + 확률이이 점은 예 내가 분명히 설명 해요로 약간 가고 싶어 + +829 +01:06:39,949 --> 01:06:43,049 + 당신의 분류처럼 우리는 우리가 알고있는 문제를 설정하는 방법을 알고 + +830 +01:06:43,050 --> 01:06:47,070 + 다른 손실이 날 우리가 가지에서 할 수 있도록 그들을 최적화하는 방법을 알고 기능 + +831 +01:06:47,070 --> 01:06:51,050 + I에서이 점은 내가 당신의 감각을 부여 할 것을 언급 원하는 것을 + +832 +01:06:51,050 --> 01:06:53,710 + 댓글에 대한 그래서 당신은을 가지고 오기 전에 컴퓨터 비전처럼 보였다 + +833 +01:06:53,710 --> 01:06:57,920 + 역사적 관점의 비트 우리는 선형 분류 모든 시간을 사용하기 때문에 + +834 +01:06:57,920 --> 01:07:01,019 + 하지만 물론 당신은 도로 원본 이미지에없는 보통 클래식 자동차를 수행 + +835 +01:07:01,019 --> 01:07:06,759 + 즉 당신이 믿는하려는 모든 때문에 우리는 당신 같은 그것으로 문제를 해결 + +836 +01:07:06,760 --> 01:07:10,250 + 나는 그들이에 사용 된 경찰이해야 할 생각에 모든 모드 등을 포함해야 + +837 +01:07:10,250 --> 
01:07:14,380 + 모든 이미지의 서로 다른 기능 유형을 계산하고 당신은 볼 수 있습니다 + +838 +01:07:14,380 --> 01:07:17,160 + 다른 기능 유형에서 다른 설명하면 다음을 얻을 + +839 +01:07:17,159 --> 01:07:22,049 + 이미지가 주파수처럼 어떤 모습의 통계 요약 + +840 +01:07:22,050 --> 01:07:26,160 + 등등, 그리고, 우리는 capitated 모든 큰 벡터에 그와 우리는 넣을 수 있습니다 + +841 +01:07:26,159 --> 01:07:27,710 + 선형 분류에 그 + +842 +01:07:27,710 --> 01:07:32,050 + 까지 간 다음, 그들 모두 연결된 등 다양한 기능 유형 + +843 +01:07:32,050 --> 01:07:35,369 + 일반적으로 파이프 라인이었다 당신의 분류, 그래서 그냥 당신의 아이디어를 제공합니다 + +844 +01:07:35,369 --> 01:07:39,088 + 정말 무슨 회담 당신이 수도 한 아주 간단한 기능 유형 같았다 + +845 +01:07:39,088 --> 01:07:43,269 + 나는 모든 이미지의 픽셀을 통해 갈 수 있도록 단지 컬러 히스토그램 상상 + +846 +01:07:43,269 --> 01:07:47,449 + 나는 그들 거라고하고 따라 다른 색상이 얼마나 많은 밴드 대답 + +847 +01:07:47,449 --> 01:07:50,750 + 당신이 상상할 수있는 색상의 색조에 종류의 일처럼 + +848 +01:07:50,750 --> 01:07:54,250 + 이미지에 무엇의 통계 요약 색상 각 단지 숫자입니다 + +849 +01:07:54,250 --> 01:07:57,400 + 그래서 이것은 내가 결국 될 것입니다 선생님의 하나가 될 것입니다되었습니다 + +850 +01:07:57,400 --> 01:08:03,440 + 다양한 기능의 유형과 친밀의 다른 종류의 절단 + +851 +01:08:03,440 --> 01:08:06,530 + 분류는 당신이 그것에 대해 생각하면 선형 분류는 이러한 기능을 사용할 수 있습니다 + +852 +01:08:06,530 --> 01:08:09,690 + 선형 분류 좋아 할 수 있기 때문에 실제로 분류를 수행하는 + +853 +01:08:09,690 --> 01:08:14,320 + 또는 양 또는과 이미지에 다른 색상을 많이보고 싫어 + +854 +01:08:14,320 --> 01:08:17,930 + 부정적인 무엇 매우 일반적인 기능은 또한 우리가 부르는 등이 포함됩니다 + +855 +01:08:17,930 --> 01:08:22,440 + 610 매 기능은 기본적으로 이러한 당신은 현지 지역에 이동했다 + +856 +01:08:22,439 --> 01:08:26,539 + 발명과는 다른 방향의 제비가 있는지 여부를보고 + +857 +01:08:26,539 --> 01:08:30,588 + 그래서 수평 또는 수직 가장자리의 많은이 우리는 히스토그램을 구성 + +858 +01:08:30,588 --> 01:08:35,850 + 그 이상하고 그래서 당신은 가장자리의 종류 단지 요약 끝날 때 + +859 +01:08:35,850 --> 01:08:40,338 + 상기 이미지이고, 당신이 사람들은 모두 함께 있었다 계산할 수 있습니다 + +860 +01:08:40,338 --> 01:08:45,250 + 수년에 걸쳐까지 제안 된 우리의 다른 유형의 많은 단지 내가있을거야 + +861 +01:08:45,250 --> 01:08:50,359 + 측정하는 다른 방법을 많이에 과세 것들의 종류가 + +862 +01:08:50,359 --> 01:08:54,850 + 이미지와 그들의 통계, 그리고, 우리는이 파이프 라인은 다시 전화했다 + +863 +01:08:54,850 --> 01:08:59,660 + 내 장소를 통해 당신이 다른 점을보고 어디 + +864 +01:08:59,659 --> 01:09:04,250 + 당신은 같은 당신이 와서 뭔가 조금 로컬 패치를 설명 + +865 +01:09:04,250 --> 01:09:08,329 + 주파수보고는 색상을보고 또는 다음 무엇이든되고 우리 + +866 +01:09:08,329 --> 01:09:12,269 + 여기에 확인 이러한 사전 내놓았다 우리가 이미지를 볼 수있는 물건입니다 + +867 +01:09:12,270 --> 01:09:16,250 + 같은 파란색과 낮은 주파수 물건에 대한 고주파 정지 많이있다 + +868 +01:09:16,250 --> 01:09:16,699 + ...에 + +869 +01:09:16,699 --> 01:09:21,338 + 물건의 종류를 볼 수 무엇의 K-수단을 사용하여 무게 중심으로 끝낼 수 있습니다 + +870 +01:09:21,338 --> 01:09:25,818 + A의 바로 다음 우리는 얼마 동안 통계 등의 모든 하나의 이미지를 표현 + +871 +01:09:25,819 --> 01:09:29,660 + 예를 들어,이 이미지는 많이있다, 그래서 각각의 일의 우리는 이미지 참조 + +872 +01:09:29,659 --> 01:09:33,949 + 고주파 녹색 물건 당신은 기본적으로 몇 가지 특징 벡터가 표시 될 수 있도록 + +873 +01:09:33,949 --> 01:09:38,568 + 더 높은 가치와 높은 주파수와 녹색이있을 것이다, 그리고, 우리는 않았다 + +874 +01:09:38,569 --> 01:09:40,760 + 우리는 기본적으로 이러한 특징 벡터했다 + +875 +01:09:40,760 --> 01:09:45,210 + 그들에게 필요한 무엇을 위해 이렇게 정말 그들에 컨텍스트를 선형 분류를 넣어 + +876 +01:09:45,210 --> 01:09:49,090 + 그 이전에 대부분 컴퓨터 비전 속에서 다음과 같이 우리는 일을하는지 + +877 +01:09:49,090 --> 01:09:52,840 + 약 2012 당신이 당신의 이미지를 촬영하게되며 기능의 단계를 + +878 +01:09:52,840 --> 01:09:57,409 + 추출 우리는 당신이에 대해 알아야 할 중요한 일이 무엇인지 결정 곳 + +879 +01:09:57,409 --> 01:10:01,859 + 이미지 서로 다른 주파수 다른 텐트와 우리는 어떤 결정 + +880 +01:10:01,859 --> 01:10:05,109 + 흥미로운 기능은 당신이 사람들이 10 개의 서로 다른 기능 유형처럼 걸릴 참조 + +881 +01:10:05,109 --> 01:10:09,369 + 모든 종이와 그냥 일어 났는데 그냥 당신이 하나의 거대한 두 배로 할 수 있습니다 히트 그것의 모든 필요 + +882 +01:10:09,369 --> 01:10:12,640 + 기능 이미지를 통해 벡터 및 당신은 그 위에 선형 분류를 넣어 + +883 +01:10:12,640 --> 01:10:15,920 + 우리는 지금 그것을보고처럼 그래서 당신은에 당신의 엉덩이 오후에 기차 판매를 재생 + +884 +01:10:15,920 --> 01:10:20,109 + 이러한 모든 기능 
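The captions describe the classic color-histogram feature: bin a color channel over all pixels and use the normalized counts as a feature vector that a linear classifier can consume. A rough sketch; the 8-bin choice and the hue-in-[0, 1) input convention are assumptions:

~~~python
import numpy as np

# Color histogram feature: a statistical summary of which colors appear.
def color_histogram(img, bins=8):
    # img: H x W array of per-pixel hue values in [0, 1)
    counts, _ = np.histogram(img.ravel(), bins=bins, range=(0.0, 1.0))
    return counts.astype(float) / counts.sum()   # normalized bin counts
~~~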
유형의 상단과 우리가 우리가 발견 한 이후로 교체하고 + +885 +01:10:20,109 --> 01:10:24,869 + 그건 당신이 원시 이미지로 시작하는만큼 더 잘 작동하고 전체의 생각 + +886 +01:10:24,869 --> 01:10:28,979 + 당신이 생각하는 어떤 것은 당신의 분리에의 일부를 설계하지 않는 + +887 +01:10:28,979 --> 01:10:33,479 + 많은 시뮬레이션 할 수 있습니다 우리가 건축을 마련 좋은 아이디어 나하지 + +888 +01:10:33,479 --> 01:10:38,189 + 모든 것이 하나의 기능 우리 때문에 다른 특징은 그렇게 말하고합니다 + +889 +01:10:38,189 --> 01:10:41,879 + 단지 우리가 실제로 훈련 할 수의 기능의 상단에 상단하려고하지 않는다 + +890 +01:10:41,880 --> 01:10:45,400 + 모든 방법 화소에 이르기까지 우리는 우리의 기능 추출기를 훈련 할 수있다 + +891 +01:10:45,399 --> 01:10:49,989 + 효과적으로 있도록 큰 혁신이었다 당신이 접근 방법이 문제는 우리는 + +892 +01:10:49,989 --> 01:10:53,300 + 반면 설계 요소를 많이 가지고하려고하는 제거 시도 + +893 +01:10:53,300 --> 01:10:56,779 + 우리가 완벽하게 일을 끌어 훈련을 할 수 있도록 하나의 주요 얼룩 + +894 +01:10:56,779 --> 01:11:01,550 + 역사적으로이 오는하고 무엇을 그 바위 텍사스에서 시작 + +895 +01:11:01,550 --> 01:11:06,760 + 우리는 무엇을하고있을 것입니다 그래서 다음 마지막이에 구체적으로보고됩니다 + +896 +01:11:06,760 --> 01:11:10,520 + 우리의 문제는 분석 구배를 계산해야하고 그래서 우리는에 갈거야 + +897 +01:11:10,520 --> 01:11:14,860 + 분석 기울기를 계산하는 효율적인 방법은 역 전파 및 + +898 +01:11:14,859 --> 01:11:18,839 + 그래서 그 배경 그리고 당신은 잘 될거야 그리고 우리는거야 + +899 +01:11:18,840 --> 01:11:20,039 + 약간 작동 이동 + diff --git a/captions/Ko/Lecture4_ko.srt b/captions/Ko/Lecture4_ko.srt new file mode 100644 index 00000000..57959a1c --- /dev/null +++ b/captions/Ko/Lecture4_ko.srt @@ -0,0 +1,3936 @@ +1 +00:00:02,740 --> 00:00:07,000 + 확인 그래서 내가 어떤 관리자에 뛰어 보자 + +2 +00:00:07,000 --> 00:00:14,669 + 나는 그 과제를 호출 할 수 있도록 먼저 가서 한 다음 주 수요일 때문이다 + +3 +00:00:14,669 --> 00:00:19,050 + 그래하지만 백 오십 시간 왼쪽보다 거기 때문에 우리를 사용하는 + +4 +00:00:19,050 --> 00:00:23,320 + 일반적인 운명의 감각과 그가있을거야 그 시간의 세 번째 기억 + +5 +00:00:23,320 --> 00:00:29,278 + 의식이 그래서 당신은 많은 시간이 정말로 실행중인 것을 가지고 있고하지 않습니다 + +6 +00:00:29,278 --> 00:00:31,768 + 당신은 당신이 그래서 늦은 하루 일을 생각할 수 있습니다 알고 있지만 이러한 이미지를 얻을 + +7 +00:00:31,768 --> 00:00:38,640 + 시간이 지남에 열심히 그래서 당신은 그들을보고 싶어하고 그래서 그렇게 가능성이 지금 시작 + +8 +00:00:38,640 --> 00:00:43,109 + 월요일에는 근무 시간 또는 그런 아무것도 없다 내가 사무실을 만들 개최합니다 + +9 +00:00:43,109 --> 00:00:45,839 + 수요일에 시간이 나는 너희들이에 대해 나에게 이야기 할 수 있도록하려면 때문에 + +10 +00:00:45,840 --> 00:00:49,260 + 그래서 월요일부터 내 근무 시간 이동됩니다에 특별 프로젝트 등 + +11 +00:00:49,259 --> 00:00:52,820 + 수요일은 일반적으로 내 사무실은 오후 6시 시작했다 대신 나는 5시에를해야합니다 + +12 +00:00:52,820 --> 00:00:59,909 + 오후 일반적으로는 게이트 (260)를 생각하지만 지금은 39-1 그들 모두와 약혼 할 예 + +13 +00:00:59,909 --> 00:01:03,429 + 또한 당신이오고있어 중간에 대해 공부하려고 할 때주의해야 + +14 +00:01:03,429 --> 00:01:04,170 + 몇 주 + +15 +00:01:04,170 --> 00:01:07,109 + 당신이 정말로의 일부뿐만 아니라 강의 노트를 통해 이동해야합니다 + +16 +00:01:07,109 --> 00:01:09,819 + 이 클래스와 선택의 종류와 내가 가장 생각하는 것들 중 몇 가지를 선택 + +17 +00:01:09,819 --> 00:01:13,579 + 강의를 제공하는 귀중한하지만보다 재료의 꽤가있다 + +18 +00:01:13,579 --> 00:01:16,548 + 내가의 일부를오고 있어요에도 중기에 팝업을 생각하고 조심하는 + +19 +00:01:16,549 --> 00:01:19,610 + 그 강의를 통해 URI보다 일반적으로 더 큰 가장 중요한 물건 + +20 +00:01:19,609 --> 00:01:25,618 + 여배우에 자신의 무료 노트 등 재료의 재료가 될 수 + +21 +00:01:25,618 --> 00:01:32,269 + 강의 양쪽에서 그려진 그 확인하여 모든 우리가 갈거야, 상기 한 + +22 +00:01:32,269 --> 00:01:36,769 + 재료에 뛰어 그래서 우리는 단지 우리가 미리 알림으로 바로 지금 어디에 + +23 +00:01:36,769 --> 00:01:39,989 + 이 핵심 기능은 우리가 같은 SP의 손실로 여러 손실 함수를 보았다 + +24 +00:01:39,989 --> 00:01:44,359 + 기능 마지막으로 우리는 당신이 어떤을 위해 달성하는 것이 손실 전체를 보면 + +25 +00:01:44,359 --> 00:01:49,379 + 특정 훈련 데이터에에 가중치의 세트로 구성이 손실 + +26 +00:01:49,379 --> 00:01:53,509 + 두 가지 구성 요소 우리가 무엇을 원하는 정말이 데이터 손실 및 손실 맞아 및 + +27 +00:01:53,509 --> 00:01:57,200 + 우리가 지금에 대한 손실의 그라데이션 표현을하고 싶은 것입니다 + +28 +00:01:57,200 --> 00:02:01,118 + 무게는 그리고 우리는 우리가 실제로 최적화를 수행 할 수 있도록이 작업을 수행 할 수 + +29 +00:02:01,118 --> 00:02:07,069 + 우리가 선두에 반복 우리가 반대 의견에서하고있는 공정 최적화 과정 + +30 +00:02:07,069 --> 
00:02:11,030 + ㄱ 주 업데이트 중에 무게에 그라데이션과 그냥이 반복 + +31 +00:02:11,030 --> 00:02:14,259 + 또 다시 이렇게 수렴 된 그 + +32 +00:02:14,259 --> 00:02:17,929 + 저 그 손실 함수의 점과 우리가 손실에 도착했을 때의 + +33 +00:02:17,930 --> 00:02:20,799 + 이 측면에서 우리의 훈련 데이터에 대한 좋은 예측을하는 것과 + +34 +00:02:20,799 --> 00:02:25,030 + 지금 나오는 과정은 우리는 또한 수 있습니다 너무 종류의 폐기물이 평가 보았다 + +35 +00:02:25,030 --> 00:02:29,019 + 그라데이션이 미국의 기울기 그리고이 작성하는 매우 간단하지만, 그것의 + +36 +00:02:29,019 --> 00:02:32,840 + 극단적으로 평가 느리게하고있는이있는 비가 구배가있다 + +37 +00:02:32,840 --> 00:02:36,658 + 수학을 사용하여 얻을이 강의에 그에 갈 것 꽤 + +38 +00:02:36,658 --> 00:02:41,318 + 좀 더하고 그래서 큰하지만 그렇지 당신이 잘못 얻을 수있어 어떤 빠르고 정확한이다 + +39 +00:02:41,318 --> 00:02:45,969 + 때로는 그래서 우리는 항상 이미 검사에서 다음 주에 우리는 모든 쓰기 위치 + +40 +00:02:45,969 --> 00:02:48,639 + 표현은 분석 그라디언트를 완료 한 후 우리는 다시 확인의 + +41 +00:02:48,639 --> 00:02:51,828 + 수치 그라데이션 정확성 그리고 나는 당신이 볼 거라면 확실하지 않다 + +42 +00:02:51,829 --> 00:02:59,250 + 당신은 당신이 할 수있는 지금이 확실히 확인 할당 볼 거라고 + +43 +00:02:59,250 --> 00:03:04,378 + 이 설정을 볼 때 유혹 우리는 단지의 기울기를 구동 할 + +44 +00:03:04,378 --> 00:03:08,459 + 다시 당신은 단지에 유혹 될 수있는 가중치를 바로 알고에 손실 함수 + +45 +00:03:08,459 --> 00:03:11,709 + 전체 손실 밖으로 당신이 당신의 미적분을 본 바로 그라데이션을 시작 + +46 +00:03:11,709 --> 00:03:16,120 + 내가하고 싶습니다 클래스하지만 당신이 측면에서이 훨씬 더 생각해야한다는 것입니다 + +47 +00:03:16,120 --> 00:03:22,480 + 대신 하나의 거대한 표현의 단지 복용 생각의 계산 잔디의 + +48 +00:03:22,479 --> 00:03:25,369 + 당신은 펜과 용지를위한 표현에 만족 구동 할 거라고 + +49 +00:03:25,370 --> 00:03:27,549 + 기울기 및 그 이유 + +50 +00:03:27,549 --> 00:03:31,689 + 그래서 여기에 우리는 흐르는 이러한 값에 대한 흐름을 생각하고 + +51 +00:03:31,689 --> 00:03:35,509 + 원을 따라 이러한 작업 주위에 경쟁과 그들에 전달 + +52 +00:03:35,509 --> 00:03:38,979 + 모든 방법으로 입력을 변환 기본적으로 기능 조각 + +53 +00:03:38,979 --> 00:03:43,018 + 마지막에 손실 함수는 그래서 우리는 우리의 데이터와 우리의 매개 변수로로 시작 + +54 +00:03:43,019 --> 00:03:46,079 + 입력 그들은 단지 모든 인이 경쟁 그래프를 통해 공급 + +55 +00:03:46,079 --> 00:03:49,790 + 길을 따라와 말의 기능이 시리즈는 우리는 하나의 번호를 + +56 +00:03:49,789 --> 00:03:53,590 + 손실 난이 방법은 그것에 대해 생각하고자하는 이유는있다 + +57 +00:03:53,590 --> 00:03:57,069 + 이러한 표현은 지금 매우 작은 보면 당신은 할 수있을 수 있음 + +58 +00:03:57,068 --> 00:04:00,339 + 이러한 고충을 유도하지만, 이러한 표현은 경쟁 잔디되어 있습니다 + +59 +00:04:00,340 --> 00:04:04,250 + 매우 큰 얻을에 대한 그래서 예를 들어 길쌈 신경 네트워크는 것 + +60 +00:04:04,250 --> 00:04:08,829 + 수백 어쩌면 우리 모두가 이러한 이미지를해야하므로 작업의 수십입니다 + +61 +00:04:08,829 --> 00:04:12,939 + 우리의 손실을 얻기 위해 큰 계산 그래프처럼-을 통해 흐름 등이된다 + +62 +00:04:12,939 --> 00:04:16,858 + 바로 이러한 표현을 쓸 비현실적 및 상업 네트워크는 + +63 +00:04:16,858 --> 00:04:19,370 + 심지어 당신은 실제로 예를 들어 수행에 최악 시작되지 일단 + +64 +00:04:19,370 --> 00:04:23,509 + 마음 어디에서 용지 인 대체 광택이라는 것을 + +65 +00:04:23,509 --> 00:04:26,329 + 이것은 기본적으로 미분 튜링 기계 + +66 +00:04:26,329 --> 00:04:30,128 + 그래서 전부는 컴퓨터가하는 모든 절차 미분이다 + +67 +00:04:30,129 --> 00:04:33,590 + 테이프에 수행이 원활하게 이루어지고 미분 컴퓨터는 기본적으로 + +68 +00:04:33,589 --> 00:04:39,519 + 및 경쟁 그래픽이이이 명중되지 거대하고뿐만 아니라 + +69 +00:04:39,519 --> 00:04:42,478 + 당신이 일을 끝낼 우리가 재발 신경망과에가는거야 무엇 때문에 + +70 +00:04:42,478 --> 00:04:45,848 + 비트하지만 당신은 결국 일을 당신이이 그래프 그렇게 생각 제어 끝입니다 + +71 +00:04:45,848 --> 00:04:51,658 + 이 그래프는 시간 단계의 수백을 복사 그래서 당신은이 거대한 끝낼 + +72 +00:04:51,658 --> 00:04:56,379 + 수천 개의 노드와 약간의 계산 단위의 수백의 몬스터와 + +73 +00:04:56,379 --> 00:04:59,819 + 그래서 당신이 알고 쓰는 데 신경 튜링에 대한 손실 불가능이다 + +74 +00:04:59,819 --> 00:05:03,650 + 기계 그것은 페이지의 수십억 우리과 같이 걸릴 것 단지 불가능 + +75 +00:05:03,649 --> 00:05:07,068 + 더 구조의 관점 너무 작은 기능에 이것에 대해 생각해야 + +76 +00:05:07,069 --> 00:05:11,710 + 우리는거야 그래서 중간 변수를 변환하는 것은 바로 맨 끝에 분실하기 + +77 +00:05:11,709 --> 00:05:14,318 + 경쟁 그래프에서 구체적으로보고 할 우리는 어떻게 유도 할 수있다 + +78 +00:05:14,319 --> 00:05:20,560 + 맨 끝에 손실 함수에 대한 입력에 기울기 때문에 + +79 +00:05:20,560 --> 00:05:25,569 + 시작 - 무슨 간단하고 구체적인 
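The captions recommend a gradient check: write the analytic gradient, then verify it against the numerical one. A sketch that reuses the `eval_numerical_gradient` routine sketched earlier; the relative-error formula and the rough threshold are conventional choices, not quoted from the lecture:

~~~python
import numpy as np

# Gradient check: compare an analytic gradient to a finite-difference one.
def grad_check(f, x, analytic_grad, h=1e-5):
    num_grad = eval_numerical_gradient(f, x, h)
    rel_err = (np.abs(num_grad - analytic_grad)
               / np.maximum(1e-8, np.abs(num_grad) + np.abs(analytic_grad)))
    return rel_err.max()   # should be tiny (e.g. < 1e-5) if the two agree
~~~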
아주 작은 경쟁 그래프를 우리는 세 가지가 + +80 +00:05:25,569 --> 00:05:29,778 + 이 그래프 XY 및 Z에 대한 입력으로 스칼라 그들은 약이 특정에 걸릴 + +81 +00:05:29,778 --> 00:05:35,069 + 94의 마이너스 25의 예에서 이러한 우리는이 매우 작은 그래픽이 + +82 +00:05:35,069 --> 00:05:38,669 + 또는 당신이 날이 상호 교환 안녕을 참조 듣게 회로에 대한 그래프가 + +83 +00:05:38,668 --> 00:05:43,038 + 회로는 그래서 우리는 마지막에 음을 우리에게 이것을 제공하는 것이이 그래프를 + +84 +00:05:43,038 --> 00:05:47,288 + 12 그래서 여기에 확인 내가 무슨 짓을했는지하는 동안 깊은 리필 모습은를 호출까지입니다입니다 + +85 +00:05:47,288 --> 00:05:51,120 + 내가 입력을 설정 한 다음 나는 복장을 계산이 그래프의 전진 패스 + +86 +00:05:51,120 --> 00:05:56,288 + 그리고 나는 우리가에 식의 기울기를 구동하고 싶은대로 할 싶습니다 + +87 +00:05:56,288 --> 00:06:01,250 + 입력과 우리가 그 무엇을 할 거 야이 중간 변수를 도입 + +88 +00:06:01,250 --> 00:06:07,050 + 내가 그들에게 참조로 플러스 게이트 및 시간 게이트가 그래서 플러스 게이트 큐와 + +89 +00:06:07,050 --> 00:06:10,800 + 따라서이 컴퓨팅이 옷 큐를 얻고 수키는이 중간이었다합니다 + +90 +00:06:10,800 --> 00:06:14,788 + 내가 작성한 것을 다음 X 플러스 Y와의 결과는 f를 qnz의 곱셈이다 + +91 +00:06:14,788 --> 00:06:19,360 + 여기에서 우리가 원하는 것은 그라디언트 파생 뻣뻣한 생각입니다 경우 I + +92 +00:06:19,360 --> 00:06:25,598 + 내 원하는받을 수 있나요 나는 중간 로그인하시기 바랍니다 그라데이션을 작성했습니다 + +93 +00:06:25,598 --> 00:06:30,120 + 우리가 수행 한 개별적으로 지금이 두 표현 모두를위한 + +94 +00:06:30,120 --> 00:06:33,490 + 에서가 클래스는 왼쪽에서 오른쪽으로 지금 무엇을 할 것 인 것은 역을 유도합니다 + +95 +00:06:33,490 --> 00:06:35,699 + 패스 뒤쪽에서 이동합니다 + +96 +00:06:35,699 --> 00:06:39,300 + 앞으로 우리의 회로까지 모든 중간체의 그라디언트 경쟁 + +97 +00:06:39,300 --> 00:06:43,509 + 맨 끝에 우리는 그라디언트 입력에 그리고 우리 그것을 구축하는거야 + +98 +00:06:43,509 --> 00:06:47,680 + 맨 오른쪽이 재귀의 기본 케이스의 일종으로 시작 + +99 +00:06:47,680 --> 00:06:52,670 + 절차 우리는 각각의 기울기를 고려하고 그래서 이것은 단지입니다 + +100 +00:06:52,670 --> 00:06:56,020 + 식별 기능은 그래서 그것의 파생 무엇인가 + +101 +00:06:56,019 --> 00:07:06,240 + 그것은 정체성 바로 그래서 하나의 아이디어를 매핑 ID는 하나의 기울기가 + +102 +00:07:06,240 --> 00:07:10,329 + 그래서 우리가 하나를 시작하고 지금 우리가 갈거야 우리의 기본 사건 + +103 +00:07:10,329 --> 00:07:18,519 + 존경 그 너무 거꾸로이 그래프를 통해 우리는 그라데이션 할 + +104 +00:07:18,519 --> 00:07:27,089 + 이 경쟁 그래프에서 확인이 그래서 우리는 권리를 작성하지 않은 있다는 것입니다 + +105 +00:07:27,089 --> 00:07:32,879 + 여기에 무엇이 특정 예제의 핵심은 세 가지 바로 그라데이션이되도록있어 + +106 +00:07:32,879 --> 00:07:36,279 + 그에이에 따라 내가 바로 재료가 될거야 불과 3이 될 것이다 + +107 +00:07:36,279 --> 00:07:42,309 + 빨간색 선과 값 아래의 라인에 대한 녹색에 + +108 +00:07:42,310 --> 00:07:48,420 + 전면의 그라데이션은 하나가 아닌 그라데이션 발병 텔링으로 33 + +109 +00:07:48,420 --> 00:07:52,009 + 당신은 정말 직관적으로 그라데이션의 해석은 무엇을 염두에 두어야 + +110 +00:07:52,009 --> 00:07:58,459 + 즉 말하는 최종 값에 죽은의 영향이 긍정적이라고하고 + +111 +00:07:58,459 --> 00:08:02,859 + 세 코스의 종류와 그래서 소량의 팔에 의해 Z를 증가하는 경우 + +112 +00:08:02,860 --> 00:08:07,759 + 그 회로의 출력은 a를 증가 때문에 반응 + +113 +00:08:07,759 --> 00:08:13,009 + 긍정적 세 긍정적 발생합니다 세 이렇게 작은 변화로 증가 + +114 +00:08:13,009 --> 00:08:21,560 + 궁극적 인 변화는 이제이 경우 큐에 따라 기울기가 너무 신격화한다 + +115 +00:08:21,560 --> 00:08:30,860 + IQ는 그 어떤 것을 우리는 그 부분에 대한 음의 기울기를 얻을 수 있도록하기 전에 + +116 +00:08:30,860 --> 00:08:34,599 + 그 말과 회로 그리고 그가 인 경우 출력을 증가시키는 것입니다 + +117 +00:08:34,599 --> 00:08:39,740 + 회로가 좋아 감소 당신은 H 증가하는 경우가 회로까지 일 + +118 +00:08:39,740 --> 00:08:44,789 + 기울기의 네 나이가 감소하는 것은 지금 우리가 가고있는 확인에 대한 부정적 + +119 +00:08:44,789 --> 00:08:48,480 + 이 플러스 게이트를 통해이 과정을 계속이 일을 얻을 수있는 곳입니다하기 + +120 +00:08:48,480 --> 00:08:49,039 + 약간 + +121 +00:08:49,039 --> 00:08:54,328 + 나는 우리가 Y에 대한 이유에에 계약을 계산하고 싶습니다 가정 + +122 +00:08:54,328 --> 00:09:10,208 + 등 그라데이션 왜이 특정 그래프이 될 것이다 것 + +123 +00:09:10,208 --> 00:09:23,979 + 어느 쪽이든 내가 이것에 대해 생각하고 싶습니다 그것에 유리 학습 가능한 확인을 적용하는 것입니다 + +124 +00:09:23,980 --> 00:09:27,709 + 그래서 체인 규칙은 모든 사람의 기울기를 직접 할 것인지 말한다 + +125 +00:09:27,708 --> 00:09:33,208 + 왜 그때 내가 바로 그래서 우리 연방 수사 국 (FBI)의 DQ 시간에 큐브 이상적인 동일입니다 + +126 +00:09:33,208 --> 00:09:36,438 + 우리가 알고 이유 일 수 있습니다 특정 
IQ에서 그 표현을 모두 계산 + +127 +00:09:36,438 --> 00:09:42,519 + 음 정도 그 쿠폰의 영향의 효과가있어입니다 DFID의 Q입니다 + +128 +00:09:42,519 --> 00:09:46,619 + 부정적인 지금 우리가 지방을 알고는 로컬 영향을 알고 싶습니다 + +129 +00:09:46,619 --> 00:09:52,449 + 왜 Q에 빛의 로컬 영향 쿠바에 그의 것은 지역 주민을의 하나입니다 + +130 +00:09:52,448 --> 00:09:58,969 + 전립선에 대한 Y의 로컬 유도체 등 일반으로 참조 + +131 +00:09:58,970 --> 00:10:02,019 + 정확한 것은이 두 그라디언트에게 지역을 변경 할 것을 우리에게 알려줍니다 + +132 +00:10:02,019 --> 00:10:06,139 + 그라데이션 끔찍한 왜 당신과의 Q의 글로벌 그라데이션의 종류하지 + +133 +00:10:06,139 --> 00:10:10,948 + 우리가 네 번 만든거야 그래서 회로의 업데이트를 곱하는 것입니다 + +134 +00:10:10,948 --> 00:10:14,588 + 그래서 그녀를 다시 전파의 요점 이런 종류의이 매우이다 작동 + +135 +00:10:14,589 --> 00:10:18,209 + 중요한 우리는 우리가 계속 적어도 두 가지를 한 것으로 여기에 이​​해하기 + +136 +00:10:18,208 --> 00:10:24,289 + 우리가 일반적으로 수행하는 경우를 통해 곱 우리는 X 플러스 Y와를 계산 한 + +137 +00:10:24,289 --> 00:10:29,379 + 그 하나의 표현에 대한 미분 X & Y는 하나 하나 그렇게 계속하다 + +138 +00:10:29,379 --> 00:10:32,749 + 말하는 그라디언트의 마음의 해석에 X & Y가있을 것입니다 + +139 +00:10:32,749 --> 00:10:38,509 + (10)의 기울기 H X가 증가함에 따라 큐에 긍정적 인 영향 + +140 +00:10:38,509 --> 00:10:44,548 + H에 의해 큐 증가하고 결국 같은처럼 우리는 빛의 영향을하고 싶습니다 + +141 +00:10:44,548 --> 00:10:49,980 + 최종 밖으로하지만, 회로 등 길에이를 최대 작업은 걸릴 것입니다 + +142 +00:10:49,980 --> 00:10:53,480 + 의 영향을하고 우리는 최종 손실에 대한 Q의 영향을 알고 왜 + +143 +00:10:53,480 --> 00:10:57,058 + 인 우리가 반복적으로이 그래프를 통해 여기 컴퓨팅 무엇을하고 + +144 +00:10:57,058 --> 00:11:00,350 + 할 수있는 올바른 것은 우리가 (10)의 별명으로 끝낼 수 있도록를 곱하는 것입니다 + +145 +00:11:00,350 --> 00:11:05,189 + 15 음과 그래서 이것은 밖으로 작동 방식은 기본적으로 이것이 무엇인가 + +146 +00:11:05,188 --> 00:11:08,649 + 속담 최종 출력 회로에 대한 이유의 영향을 부정하는 것이거나 + +147 +00:11:08,649 --> 00:11:14,649 + 왜 부정적인 네 배 앨범 회로를 감소시켜야 증가 + +148 +00:11:14,649 --> 00:11:18,230 + 가 왜 법 당신이 만든 변화와 운동을 끝낼 방법입니다 + +149 +00:11:18,230 --> 00:11:21,810 + 약간 비스듬히 증가 이유를 증가 Cuse에 긍정적 인 영향 + +150 +00:11:21,809 --> 00:11:27,959 + 어떤 체인의 규칙이 종류의 우리에게주는되도록 가능성이 회로 감소 + +151 +00:11:27,960 --> 00:11:29,120 + 일치 + +152 +00:11:29,120 --> 00:11:45,259 + 우리가이 많은 많은 많은 연결을 볼 수 있습니다이 당신에받을거야 및 + +153 +00:11:45,259 --> 00:11:48,889 + 모든 클래스의 말에 당신이 점을 드릴 당신이 그것을 이해는하지 않습니다 + +154 +00:11:48,889 --> 00:11:51,870 + 우리가 실제로이 편지를 완료하면 어디서나 어떤 상징적 인 표현이 + +155 +00:11:51,870 --> 00:11:54,639 + 이 구현 당신은이 이후에 그것의 구현을 볼 수 있습니다 + +156 +00:11:54,639 --> 00:11:57,009 + 항상있을 것입니다 이것은 단지 숫자 요인 + +157 +00:11:57,009 --> 00:12:02,230 + 로버트 숫자 확인 및 X보고 우리가 일을 어떻게하는 아주 똑똑한를하다 + +158 +00:12:02,230 --> 00:12:05,889 + 우리는 우리의 최종 목표이다 그 IDX 궁금 발생하지만 우리는 결합해야 + +159 +00:12:05,889 --> 00:12:09,799 + 그것은 우리가 접근이 무엇인지 무엇 예전 친구를 알고 당신을 듣고 당신에게 같은 장소를 물어 + +160 +00:12:09,799 --> 00:12:13,979 + 체인이 그렇게을 성장 될 수있을 테니까요 때문에 회로의 끝에서 + +161 +00:12:13,980 --> 00:12:19,240 + 음 네 번 당신이이 일반화 작동하는 방식 때문에 하나의 확인을주고 싶어 + +162 +00:12:19,240 --> 00:12:23,289 + 당신이 게이트입니다 다음과 같이이 예제와 방법에서 비트는이에 대해 생각하는 + +163 +00:12:23,289 --> 00:12:28,429 + 회로에 삽입이 매우 큰 계산 그래프 또는 회로이며 + +164 +00:12:28,429 --> 00:12:32,250 + 당신은 어떤 특정 번호 X & Y가 와서 몇 가지 템플릿을 수신하고, + +165 +00:12:32,250 --> 00:12:39,059 + 그들에 대한 몇 가지 작업을 수행하고 좋은 세트 Z를 계산하고 지금이 + +166 +00:12:39,059 --> 00:12:43,019 + 잡지는 경쟁 잔디로 전환 무언가가 일어나는하지만 당신은 그냥있어 + +167 +00:12:43,019 --> 00:12:46,169 + 너무 큰 회로에서 놀고 당신은이 아니라 무슨 확실하지 않다 + +168 +00:12:46,169 --> 00:12:50,939 + 우린 다음 회로의 끝은 손실을 계산하고 그 전진 패스 그리고 + +169 +00:12:50,940 --> 00:12:56,250 + 거꾸로 역순으로 반복적으로 진행하지만, 실제로 전 + +170 +00:12:56,250 --> 00:13:01,120 + 나는 X & Y 내가하고 싶은 일이 있다는 지적에 도착하면 바로 그 부분에 도착 + +171 +00:13:01,120 --> 00:13:05,279 + 전진 패스 동안이 게이트 있다면 당신은 당신의 값 X & Y 당신에 도착 + +172 +00:13:05,279 --> 00:13:08,500 + 컴퓨터 출력과 상기 다른 일이있다 할 수 있습니다 컴퓨터에 바로와 + +173 +00:13:08,500 --> 00:13:10,230 + 그 지역 그라디언트입니다 
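The running example in these captions is the tiny circuit f(x, y, z) = (x + y) * z with x = -2, y = 5, z = -4, a forward output of -12, and backpropagation by the chain rule through the intermediate q = x + y. The whole forward/backward pass fits in a few lines:

~~~python
# Tiny circuit: f(x, y, z) = (x + y) * z with the lecture's example values.
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y            # q = 3
f = q * z            # f = -12

# backward pass: chain rule from the output back to the inputs
df_df = 1.0          # base case: gradient of the output w.r.t. itself
df_dz = q * df_df    # multiply gate: local gradient of z is q -> 3
df_dq = z * df_df    # multiply gate: local gradient of q is z -> -4
df_dx = 1.0 * df_dq  # add gate distributes the gradient       -> -4
df_dy = 1.0 * df_dq  #                                         -> -4
print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
~~~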
+ +174 +00:13:10,230 --> 00:13:14,789 + X & Y 그래서 바로 그냥 게이트이기 때문에 사람들을 계산할 수 있습니다 내가 알고있는 + +175 +00:13:14,789 --> 00:13:18,009 + 내가 좋아하는 수행하고있어 추가 응용 프로그램은 내가 영향을 알고 말을 그 + +176 +00:13:18,009 --> 00:13:24,259 + X & Y 그래서 지금 당장하지만 그 사람들을 계산할 수 있습니다 내 밖으로 몸을 이겼다 + +177 +00:13:24,259 --> 00:13:25,389 + 무슨 일이야 + +178 +00:13:25,389 --> 00:13:29,769 + 끝 부분에서 계산 된 소송은 다른 결국 배울 뒤로 갈 수 있도록 + +179 +00:13:29,769 --> 00:13:32,499 + 내 영향에 무엇인가에 대한 + +180 +00:13:32,499 --> 00:13:37,839 + 회로의 최종 출력 DL은 이들에 의해 손쉽게 배울 수있는 손실 자신의 + +181 +00:13:37,839 --> 00:13:41,419 + 성분은 내게로 흘러 제가해야 할 일은 내가 그 변경해야 할 것입니다 것입니다 + +182 +00:13:41,418 --> 00:13:45,278 + 나는를 변경할 수 있는지 확인해야합니다 있도록이 재귀 경우를 통해 그라데이션 + +183 +00:13:45,278 --> 00:13:48,778 + 내 작업을 통해 그라데이션을 수행하고 정확한 것은 밝혀 + +184 +00:13:48,778 --> 00:13:52,068 + 여기에서이 말하는 정말 무슨 트라마돌을 구입하는 것은이다 할 올바른 일이 + +185 +00:13:52,068 --> 00:13:56,068 + 그라데이션없이 해당 지역의 그라데이션을 곱 것을 실제로 당신에게 제공하는 + +186 +00:13:56,068 --> 00:13:57,838 + DL IDX + +187 +00:13:57,839 --> 00:14:02,739 + 회로의 최종 출력에 X 오프 직원 그래서 정말 체인 규칙은 그냥 + +188 +00:14:02,739 --> 00:14:08,229 + 우리는이 글로벌 그라데이션이라고 무엇을 가지고이 추가 곱셈 + +189 +00:14:08,229 --> 00:14:12,669 + 의상에 게이트와 우리는 같은에서 로컬 그라데이션을 변경했습니다 + +190 +00:14:12,668 --> 00:14:18,509 + 그것은 그 사람 그라디언트의 단지 곱셈 그래서 것은 잠시 동안 간다 + +191 +00:14:18,509 --> 00:14:22,889 + 해당 지역의 그라데이션으로 당신은 게이트있어 다음 기억한다면 이들 X 년대와 Y의 + +192 +00:14:22,889 --> 00:14:27,229 + 당신이 저주로 끝날 바로 그래서 다른 상태에서이오고있다 + +193 +00:14:27,229 --> 00:14:31,899 + 추가 터키어 등이 게이트 전체 컵을 통해이 과정 + +194 +00:14:31,899 --> 00:14:36,808 + 다만 기본적 그들이 그렇게 마지막 손실에 서로 영향을 통신 + +195 +00:14:36,808 --> 00:14:39,688 + 이것은 당신이 긍정적있어 의미 긍정적 인 그라데이션 경우 서로 확인 말해 + +196 +00:14:39,688 --> 00:14:43,198 + 부정적인 부정적인 그라데이션 부정적인 영향의 손실에 영향을 미치는 + +197 +00:14:43,198 --> 00:14:46,788 + 손실에 영향을 미치는 그는 단지 거의 이들에 의해 회로를 통해 적용됩니다 + +198 +00:14:46,788 --> 00:14:51,019 + 지역 그라디언트와 함께 결국이 과정은 전파 다시 호출 + +199 +00:14:51,019 --> 00:14:54,489 + 그것은 연쇄 규칙 재귀 적용하여 계산하는 방법이다 + +200 +00:14:54,489 --> 00:14:58,399 + 경쟁을 통해 하나 하나 중간 값의 영향에 잡아 + +201 +00:14:58,399 --> 00:15:02,158 + 최종 손실 함수 등이 그래프는이 많은 예제를 볼 수 있습니다 + +202 +00:15:02,158 --> 00:15:06,918 + 트럭 내가 약간이 구체적인 예에​​ 갈거야 그녀처럼 + +203 +00:15:06,918 --> 00:15:11,298 + 더 큰 우리가 구체적으로 그것을 통해 작동합니다하지만 난에 자신의 질문을 해달라고 + +204 +00:15:11,298 --> 00:15:20,389 + 내가 좋아하는 것,이 점은 내가 당신에게 돌아올거야 앞서 물어 + +205 +00:15:20,389 --> 00:15:25,538 + Z가 사용되는 경우 있도록 등급에게 그라디언트에게인지 아담를 추가 + +206 +00:15:25,538 --> 00:15:29,928 + 서커스의 여러 장소에서 다시 도로는 그 뜻을 추가합니다 폐쇄 + +207 +00:15:29,928 --> 00:15:31,539 + 그 시점에 돌아온다 + +208 +00:15:31,539 --> 00:16:03,139 + 같은 우리가 그 문제의 전부를받을거야 그리고 우리는거야 당신이있어 나중에 참조 + +209 +00:16:03,139 --> 00:16:05,769 + 거야 우리가 그라데이션 문제를 추방 호출 것을 얻을 + +210 +00:16:05,769 --> 00:16:10,669 + 우리의이보다 구체적인 그렇게하기 위해 또 다른 예를 통해 풀어 볼 수 있습니다 + +211 +00:16:10,669 --> 00:16:14,318 + 여기에 우리가 그런 일이 다른 회로는 작은 두 개의 차원을 계산해야 할 + +212 +00:16:14,318 --> 00:16:18,179 + 이란에서하지만 지금은 그냥이 생각하는 그 해석에 대해 걱정하지 마십시오 + +213 +00:16:18,179 --> 00:16:22,849 + 그 표현 때문에 하나를 통해 하나의 플러스 키의 어떤 숫자의로 + +214 +00:16:22,850 --> 00:16:29,000 + 여기에 입력 앤드류 기능에 의해 그리고 우리는 저기 내가 단일 출력이 + +215 +00:16:29,000 --> 00:16:32,490 + 초안 형태 때문에이 대회에 수식 것을 번역 + +216 +00:16:32,490 --> 00:16:35,769 + 그래서 사람이 할 식으로 우리가 안에서부터 밖으로 재귀에있는 경쟁 + +217 +00:16:35,769 --> 00:16:42,129 + 모든 작은 W 시간에 액세스하고 우리는 그들 모두를 추가 한 다음 우리는을 + +218 +00:16:42,129 --> 00:16:46,129 + 그것의 부정적이고 우리는 기하 급수적으로 그들은했다 하나, 그리고, 우리 + +219 +00:16:46,129 --> 00:16:49,769 + 마지막으로 나누어 우리는 식의 결과를 얻을 그래서 우리가 할거야 + +220 +00:16:49,769 --> 00:16:52,409 + 지금 우리가 가고있는이 식을 통해 전파를 백업하는거야입니다 + +221 +00:16:52,409 --> 00:16:56,500 
+ 매 입력 값의 영향의 출력에 손쉽게 계산 + +222 +00:16:56,500 --> 00:17:07,230 + 여기 저하되어이 식 + +223 +00:17:07,230 --> 00:17:22,039 + 그래서 지금 미국은 플러스 전체 + 게이트 이진 그리고 우리는 플러스가 + +224 +00:17:22,039 --> 00:17:26,519 + 하나의 게이트 나는 그 자리에서이 문을 만들고있어 우리는 어떤이는 것을 볼 수 있습니다 + +225 +00:17:26,519 --> 00:17:31,519 + 게이트 또는 게이트가 당신에게 달려 가지이다 아니다를위한 그래서이 시점에 돌아온다 + +226 +00:17:31,519 --> 00:17:35,639 + 지금은 단지 우리가 우리가 전반에 걸쳐 그래서 사용하는 몇 가지 더 게이트가 좋아 + +227 +00:17:35,640 --> 00:17:38,650 + 난 그냥 우리가 이들 중 몇 가지 예를 통해 이동으로 쓰는 좋아 + +228 +00:17:38,650 --> 00:17:42,720 + 파생 상품 지수 그리고 우리는 모든 작은 지역의 게이트에 대해 알고있는이 + +229 +00:17:42,720 --> 00:17:49,048 + 지역 그라디언트를 잘 그래서 우리가 할 수있는되는 미적분을 사용하여 추가 세금이 너무과 + +230 +00:17:49,048 --> 00:17:52,900 + 그래서 이러한 있도록 모든 작업과 덧셈과 곱셈이다 + +231 +00:17:52,900 --> 00:17:56,040 + 나는 당신이 어떤 위대한 측면에서 기억했다고 믿고있어하는 + +232 +00:17:56,039 --> 00:17:58,970 + 그들은 회로의 끝에서 시작하는거야 같은 것들을 모양과 나는했습니다 + +233 +00:17:58,970 --> 00:18:03,450 + 이미 뒷면에 원 포인트 제로 제로 채워 그건 어떻게 항상 있기 때문에 + +234 +00:18:03,450 --> 00:18:04,860 + 이 재귀를 시작 + +235 +00:18:04,859 --> 00:18:10,519 + 1110 오른쪽하지만 그 신원 기능에 그라데이션 지금 이후 우리는거야 + +236 +00:18:10,519 --> 00:18:17,849 + 하나의 상대적 그래서 확인 X 작업을 통해이 일을 통해 전파를 백업하기 + +237 +00:18:17,849 --> 00:18:22,048 + 로컬 그라데이션이 X를 통해 음의 하나가 렉스의 그래서 아무도 제곱되지 난파 + +238 +00:18:22,048 --> 00:18:27,119 + 게이트는 앞으로 통과하는 동안 입력 1.37을 받고 즉시 중 하나가 + +239 +00:18:27,119 --> 00:18:30,759 + 그녀의 전 케이트 계산 한 수 로컬 변형은 지역 그라디언트 것이었다 + +240 +00:18:30,759 --> 00:18:35,048 + X를 통해 음의 하나는 제곱과 전파를 다시 주문 및 트라마돌을 구입한다 + +241 +00:18:35,048 --> 00:18:40,750 + 의 마지막에의 경사가 그 로컬 기울기를 곱할 + +242 +00:18:40,750 --> 00:18:44,789 + 쉽게 회로는 그렇게 무엇 인 끝을 될 일이 있기 때문에 + +243 +00:18:44,789 --> 00:18:51,349 + 뒷면에 대한 표현은 내 전 케이트 중 하나를 여기에 읽기 전파 + +244 +00:18:51,349 --> 00:18:59,829 + 하지만 그녀는 항상 두 가지 지역 그라데이션 배에서 또는에서 기울기가 + +245 +00:18:59,829 --> 00:19:18,069 + 이는 그 로컬 구배가되도록 기울기 DFID X입니다 + +246 +00:19:18,069 --> 00:19:23,480 + 3.7 이상 하나의 제곱 한 다음 하나 포인트 0으로 곱한 제공 + +247 +00:19:23,480 --> 00:19:27,940 + 있는 분해하는 것은 정말 우리가 시작했기 때문에 하나 때문에 적용입니다 + +248 +00:19:27,940 --> 00:19:34,850 + 일반적으로는 바로 여기에 다른 하나는 구배에있어 그 01534 음 + +249 +00:19:34,849 --> 00:19:38,798 + 이 계곡은 확인 불고 된 와이어의 조각은 그래서 음이 + +250 +00:19:38,798 --> 00:19:43,889 + 당신에게 당신이 있다면 바로 때문에 기대할 수있는 복장에 효과 + +251 +00:19:43,890 --> 00:19:47,850 + 이 값이 증가하고 그 후 X 위에 하나의 게이트를 통과 + +252 +00:19:47,849 --> 00:19:50,939 + 그 이유는 당신이 부정적인보고있는, 그래서 렉스의 증가 금액은 작아 + +253 +00:19:50,940 --> 00:19:55,620 + 그라데이션 속도는 우리는 다음 게이트 여기에 전파를 다시 계속거야 + +254 +00:19:55,619 --> 00:20:01,048 + 당신이 보면 회로에서 하나의 일정한 때문에 로컬 그라데이션을 추가하는 것 + +255 +00:20:01,048 --> 00:20:06,960 + 출구에 값으로 기울기를 일정을 추가하면 하나의 권리 + +256 +00:20:06,960 --> 00:20:13,169 + 우리에게 이야기하고 그래서 여기에 변화 구배 우리는 선을 따라 계속합니다 + +257 +00:20:13,169 --> 00:20:22,940 + 상기에서 그라데이션을 한 시간이 해당 지역의 그라데이션 될 것입니다 + +258 +00:20:22,940 --> 00:20:28,590 + 그냥 배운 게이트 부정적인 2013년 7월 23일가 함께 계속됩니다 + +259 +00:20:28,589 --> 00:20:34,709 + 방법 변경되지 않습니다 직관적 즉,이 값이 바로 때문에 의미가 있습니다 + +260 +00:20:34,710 --> 00:20:38,319 + 수레 그리고 마지막 회로에 어떤 영향을하고 있다면 당신이 있다면 + +261 +00:20:38,319 --> 00:20:42,798 + 그 영향력 후 최종쪽으로 기울기의 변화의 속도를 하나 추가 + +262 +00:20:42,798 --> 00:20:46,970 + 당신이 어떤 양만큼의 효과를이 증가하는 경우 값은 변경되지 않습니다 + +263 +00:20:46,970 --> 00:20:51,548 + 변화율이 1을 변경하지 않기 때문에 일단은 동일 할 것이다 + +264 +00:20:51,548 --> 00:20:56,859 + 게이는 일정한 장교의 기울기 때문에 여기에 혁신을 계속 + +265 +00:20:56,859 --> 00:21:01,599 + 우리가 수행하는거야 전파를 돌아올 수 있도록 도끼 도끼 + +266 +00:21:01,599 --> 00:21:05,000 + 음 하나의 게이트 입력 + +267 +00:21:05,000 --> 00:21:08,329 + 그것은 바로 로컬 그라데이션을 완료 할 수 지금은 것​​을 알고 + +268 +00:21:08,329 --> 00:21:12,259 + 위의 그라데이션이 세 가지 때문에 계속 역 전파에 
의해 음의 포인트입니다 + +269 +00:21:12,259 --> 00:21:20,000 + 여기에 체인 규칙을 적용하는 것 난 수사학 질문을 받았다 + +270 +00:21:20,000 --> 00:21:25,119 + 확실하지만,하지만, 기본적으로 전이 전 인 부정적인 하나의 각을하지 + +271 +00:21:25,119 --> 00:21:30,569 + 권리 세에 의해 지점이 전문가에 입력 8 배 체인 규칙 + +272 +00:21:30,569 --> 00:21:35,269 + 그래서 우리는 자신을 곱 계속 이렇게 나에 미치는 영향은 무엇이고 나는 무슨이 + +273 +00:21:35,269 --> 00:21:39,069 + 그 회로의 최종 끝에 효과는 항상 우리가 곱되고있다 + +274 +00:21:39,069 --> 00:21:46,859 + 그래서 지금이 시점에서 마이너스 22를 얻을 우리는 부정적인 하나의 게이트에 시간이 그래서 뭐 + +275 +00:21:46,859 --> 00:21:50,279 + 그것이 나를집니다 당신이 할 때 그라데이션 일어나는 일이 끝납니다 + +276 +00:21:50,279 --> 00:21:57,139 + 우리는 기본적으로 일정 입력을 가지고 있기 때문에 바로 주위에 다 입술에 달성 + +277 +00:21:57,140 --> 00:22:02,038 + 그래서 음의 음 하나 하나 시간을 일정하게 일어난 어느 + +278 +00:22:02,038 --> 00:22:05,548 + 시간 그들은 전진 패스로 우리에게 부정적인 하나를 제공 해달라고 그래서 지금 우리에게있다 + +279 +00:22:05,548 --> 00:22:09,569 + 인 밥에서 인사말 로컬 그라데이션 시간을의하는 곱 + +280 +00:22:09,569 --> 00:22:14,879 + 미세 너무 그래서 우리는 지금 그냥 긍정적으로 끝낼 전파를 다시 계속 + +281 +00:22:14,880 --> 00:22:21,110 + 전파 +이 플러스 작업은 여러 여기에 입력에 녹색이 + +282 +00:22:21,109 --> 00:22:25,599 + 하나는 10 버스 게이트 현지 그라데이션은 무엇 일어나고 끝 + +283 +00:22:25,599 --> 00:22:42,359 + 상단 구매자 따라 광택 흐름 + +284 +00:22:42,359 --> 00:22:48,089 + 지불 잉여는 모든 로컬 그라데이션이 항상 하나 때문에이됩니다 + +285 +00:22:48,089 --> 00:22:53,769 + 당신은 단지 기능이있는 경우 해당 기능에 대한 이유를 다음 전문가를 알고 + +286 +00:22:53,769 --> 00:22:58,109 + X 또는 Y 중 하나에 그라데이션은 하나이며, 그래서 당신은 점점 끝날 것입니다 + +287 +00:22:58,109 --> 00:23:03,619 + 한 시간은 2 시간에 그렇게 더하기 게이트에 대한 사실 항상 같은 사실을보고 참조 + +288 +00:23:03,619 --> 00:23:07,469 + 모든 입력의 로컬 그라데이션 하나 때문에 어디를 무엇을 등급 + +289 +00:23:07,470 --> 00:23:11,289 + 그냥 항상 모두에게 동등하게 그라데이션을 배포 이상에서 가져옵니다 + +290 +00:23:11,289 --> 00:23:14,339 + 그 입력은 체인 규칙 곱하지 않고 승산 때문에 + +291 +00:23:14,339 --> 00:23:18,129 + 10 일이 변경되지 않은 잉여는 같은 성분의이 종류를 얻을 남아 + +292 +00:23:18,130 --> 00:23:22,170 + 뭔가 반면 유통은 모든 단지 모든 퍼져 상단에서 유입 + +293 +00:23:22,170 --> 00:23:26,560 + 위대한 팀은 동등하게 모든 자식과 우리는 이미받은 + +294 +00:23:26,559 --> 00:23:32,139 + 입력 그라데이션 포인트 중 하나는 회로의 최종 출력에 매우 듣고 + +295 +00:23:32,140 --> 00:23:35,970 + 그래서이 직원의 애플리케이션 일련 완료 + +296 +00:23:35,970 --> 00:23:42,450 + 트레이너 길을 따라가 다른이었다 플러스 그 이상 등이 생략 얻을 + +297 +00:23:42,450 --> 00:23:47,090 + 모두 20.2이 공물의 당신 종류를 가리 동일하게 우리가 이미 수행 한 + +298 +00:23:47,089 --> 00:23:51,750 + 봉쇄하고있다 곱셈 거기 그래서 지금 우리는 다시거야 + +299 +00:23:51,750 --> 00:23:55,940 + 그 곱셈 연산을 통해 전파 등 지역 학년 때문에 + +300 +00:23:55,940 --> 00:24:06,450 + 기본적으로 40 저하됩니다 00w에 대한 그래서 무슨 일이 그라데이션이됩니다 + +301 +00:24:06,450 --> 00:24:19,059 + 2000 당신은 한 번 할 때 음수가 될 것이다 0시 반 (W) 될 것 W 하나에 갈 것 + +302 +00:24:19,059 --> 00:24:24,389 + 너무 좋은있을 것입니다 X 제로에 그라데이션 버그가 슬라이드에 떨어져 물린입니다 + +303 +00:24:24,390 --> 00:24:27,840 + 난 사실 또한 클래스를 작성하기 전에 나는 단지 몇 분처럼 발견하는 것이 + +304 +00:24:27,839 --> 00:24:34,289 + 당신이 볼 수 있도록 클래스에 시작 증가한다. 
39이 그것에 대한 포인트가 될한다 그 + +305 +00:24:34,289 --> 00:24:37,480 + 때문에 복음화의 버그 난 작은로를 절단하고 있습니다 때문에 + +306 +00:24:37,480 --> 00:24:41,190 + 숫자하지만 기본적으로 그 지적해야하거나 것을 얻는 방법 때문에 + +307 +00:24:41,190 --> 00:24:45,400 + 두 개의 시간은 내가 지금 거기 작성한처럼의 포인트를 얻을 수 지적 + +308 +00:24:45,400 --> 00:24:50,980 + 우리는을 전파했습니다 있도록 그가 어떤 기회를 괜찮아 + +309 +00:24:50,980 --> 00:24:55,190 + 여기에 회로 우리는이 표현을 통해 얻을 그래서 당신의 상상 + +310 +00:24:55,190 --> 00:24:59,289 + 실제 다운 스트림 데이터를해야합니다 응용 프로그램 및 모든 매개 변수 등이있다 + +311 +00:24:59,289 --> 00:25:03,450 + 끝 상단 입력 손실 함수는 앞으로있을 평가할 합격 + +312 +00:25:03,450 --> 00:25:06,440 + 손실 기능과 우리가 다시 것은 모든 조각을 통해 전파 + +313 +00:25:06,440 --> 00:25:10,450 + 경쟁은 우리가 길을 따라 한 적이과 웰벡이에 대한 모든 게이트를 통해 전파 + +314 +00:25:10,450 --> 00:25:14,150 + 우리의 수입을 얻고 백업 다시 단지 공급 체인 규칙 많은 많은 시간을 의미 + +315 +00:25:14,150 --> 00:25:21,720 + 우리는 그에서 구현하는 방법을 볼 수 있지만, 문제는 내가 메신저에가는 것 같아요 + +316 +00:25:21,720 --> 00:25:31,769 + 이 같은이기 때문에 다른 질문을 건너 뛸거야 것을 이동 + +317 +00:25:31,769 --> 00:25:45,869 + 그래서 전후 전파의 비용은 대략 거의 항상 끝 + +318 +00:25:45,869 --> 00:25:49,500 + 기본적으로 같다고까지 당신은 일반적으로 백업 약간 타이밍을 볼 때 + +319 +00:25:49,500 --> 00:25:58,710 + 느린 생각은 그래서 하나가 있다는 것입니다 내가 이전에 지적하고 싶은 한 가지를 보자 + +320 +00:25:58,710 --> 00:26:02,350 + 이 게이트 등이 게이트의 설정은 그래서 무엇을 할 수 내가 할 수있는 임의적 + +321 +00:26:02,349 --> 00:26:06,509 + 예를 들어 알고 당신 중 일부는 내가이 문을 축소 할 수 있습니다 이것을 알고있다 + +322 +00:26:06,509 --> 00:26:10,549 + 하나의 게이트에 뭔가, 예를 들어 시그 모이 드 함수를 호출하고 싶다면 + +323 +00:26:10,549 --> 00:26:14,069 + 이는 시그 모이 드 함수 특정 형태의 하나의 사실이있다 + +324 +00:26:14,069 --> 00:26:19,460 + 하나 플러스 또는 마이너스 세금을 통해 원 계산하고 그래서 난 것을 다시 한 수 + +325 +00:26:19,460 --> 00:26:22,650 + 표현은 내가 S 상을 만들어 그들 문을 모두 붕괴 캔트 + +326 +00:26:22,650 --> 00:26:27,769 + 단일 게이트에 게이트 등 시그 모이 나는이 할 수 있었다 여기에 도착하고 있어요 + +327 +00:26:27,769 --> 00:26:32,440 + 내가하고 싶어하는 경우해야 할 일을했을 것이다 때 하나의 종류의 갈 것을 + +328 +00:26:32,440 --> 00:26:37,980 + 그 게이트 나는이 그래서 무엇 방법에 대한 식을 계산하기 위해 필요로하는 + +329 +00:26:37,980 --> 00:26:41,670 + 기본적으로 얻을 S 상 로컬 그라데이션 그래서의 기울기 무엇인가 + +330 +00:26:41,670 --> 00:26:44,470 + 작은 입력에 게이트와 내가 않을거야 일부 수학을 통과했다 + +331 +00:26:44,470 --> 00:26:46,980 + 세부 사항으로 이동하지만 당신은 저기있는 식으로 끝날 + +332 +00:26:46,980 --> 00:26:51,750 + 이 지역의 기울기와 그 액세스의 1-6 다음 세그먼트 인 끝 + +333 +00:26:51,750 --> 00:26:55,450 + 나 경쟁 그래프로이 조각을 넣을 수 있습니다 내가 아는 한 번 때문에 + +334 +00:26:55,450 --> 00:26:58,819 + 다른 지역 그라데이션 모든 단지를 통해 정의되는 방법을 계산하는 방법 + +335 +00:26:58,819 --> 00:27:02,389 + 체인 규칙과 우리가 전파 백업 할 수 있도록 모든 것을 함께 곱 + +336 +00:27:02,390 --> 00:27:06,720 + S 상을 통해 내려와 같을 것이다 방법은에 입력되고, + +337 +00:27:06,720 --> 00:27:11,750 + 게이트 독감 게이트에 가서 무엇을 하나 포인트 제로이었고, 펑크 73은 밖으로 나갔습니다 + +338 +00:27:11,750 --> 00:27:18,759 + 그래서. 
7360 사실 좋아 그리고 우리는 우리가 본 것 같다 현지 그라데이션하려는 + +339 +00:27:18,759 --> 00:27:26,450 + 자신의 허리에 수학에서 당신은 1-23 곱 액세스 포인트 묘지를 얻을 수 있도록 + +340 +00:27:26,450 --> 00:27:31,170 + 즉, 로컬 그라데이션의 다음 번 우리가 마지막에 우연히 작동합니다 + +341 +00:27:31,170 --> 00:27:36,330 + 10도 작성 회로의 그렇게 시간은 그래서 우리는 12 물론 결국 우리 + +342 +00:27:36,329 --> 00:27:37,649 + 같은 답변을 얻을 + +343 +00:27:37,650 --> 00:27:42,220 + 수학이 있지만, 기본적으로 작동하기 때문에 우리가 12 전에받은 가리킨 우리 + +344 +00:27:42,220 --> 00:27:44,480 + 다운이 식을 부러 졌을 수 있으며, + +345 +00:27:44,480 --> 00:27:47,450 + 한 번에 조각 또는 우리는 단지 하나의 신호 게이트를 가질 수 그것은이다 + +346 +00:27:47,450 --> 00:27:51,569 + 종류의 어떤 수준까지 여기에 이​​러한 식을 깰 열쇠 우리에게 달려과과 + +347 +00:27:51,569 --> 00:27:52,339 + 그래서 당신은하고 싶습니다 + +348 +00:27:52,339 --> 00:27:55,829 + 그것은 매우 효율적인지 직관적으로 하나의 게이트에 이러한 식을 클러스터 + +349 +00:27:55,829 --> 00:28:06,819 + 그들은 당신의 조각 그렇게 될 수 있기 때문에 또는 쉽게 로컬 윤기를 연출하는 + +350 +00:28:06,819 --> 00:28:10,529 + 문제는 일반적으로 당신이 알고에 대해 나는 그들이 걱정 않도록해야합니까 라이브러리입니다 + +351 +00:28:10,529 --> 00:28:14,058 + 어떤 컴퓨터를 설득 쉽게 무엇을하고 대답은 '예 나는 것입니다 + +352 +00:28:14,058 --> 00:28:17,480 + 그래서 그래서 그는 당신을 통해 수행하려는 작업의 일부 조각이 있음을 지적 말 + +353 +00:28:17,480 --> 00:28:20,798 + 또 다시 그리고 그것은 매우 뭔가 아주 간단한 로컬 그라데이션이 + +354 +00:28:20,798 --> 00:28:24,900 + 실제로 단일 유닛을 만들 호소 우리는 그 중 일부를 볼 수 있습니다 + +355 +00:28:24,900 --> 00:28:30,230 + 예를 들면 실제로하지만 난 또한 지적하고 싶은 생각하면 한 번 + +356 +00:28:30,230 --> 00:28:32,490 + 나는이 조성 잔디에 대해 생각하는 좋아하는 이유는 정말 희망입니다 + +357 +00:28:32,490 --> 00:28:36,289 + 그렇지 않은 방법 욕심 느린 신경 네트워크에있는 당신의 직감에 대해 생각하는 + +358 +00:28:36,289 --> 00:28:39,369 + 당신이 당신이 이해 싶어 블랙 박스 싶지 않아 + +359 +00:28:39,369 --> 00:28:43,959 + 직관적으로 어떻게 이런 일이 발생하면의 잠시 후에 개발 시작 + +360 +00:28:43,960 --> 00:28:47,850 + 이 graybeards 흐름이 방법에 대한 자세한 그래프 직관보고 + +361 +00:28:47,849 --> 00:28:52,029 + 말은 성분 문제를 추방하기 위해 갈 것 같은 당신이 어떤 문제를 디버깅하는 데 도움이 될 수 + +362 +00:28:52,029 --> 00:28:55,950 + 그것은 무엇 최적화에 잘못된거야 정확히 이해하는 것이 훨씬 쉽게 + +363 +00:28:55,950 --> 00:28:59,250 + 당신이 도움이 될 것입니다 얼마나 욕심과 느린 네트워크를 이해한다면 이러한 디버깅 + +364 +00:28:59,250 --> 00:29:02,740 + 훨씬 더 효율적으로 네트워크와 우리는 이미 예를 들어, 그래서 몇 가지 정보 + +365 +00:29:02,740 --> 00:29:07,609 + 그것의 입력 그래서 모두에게 하나를 읽고 조금이 게이트에서 여덟 번째를 보았다 + +366 +00:29:07,609 --> 00:29:11,279 + 그것은 그것에 대해 생각하는 좋은 방법처럼 그냥 인사 대리점입니다 + +367 +00:29:11,279 --> 00:29:14,548 + 당신은 당신의 점수 기능 또는 어디 더하기 수술을 할 때마다 + +368 +00:29:14,548 --> 00:29:18,740 + 댓글을 다른 곳은 최대 케이트는 평가를 분산 있어요 + +369 +00:29:18,740 --> 00:29:23,009 + 당신이 표현 보면 대신이 작품 훌륭한 작가와 방법은 + +370 +00:29:23,009 --> 00:29:30,970 + 당신은 아주 간단한 바이너리가있는 경우 등 우리는 이러한 마커는 정말 대단 작동하지 않는 한 + +371 +00:29:30,970 --> 00:29:38,410 + 당신이 경우 맥심 XY의 표현은 그래서 이것은 온라인으로 다음 게이트 X의 기울기이다 + +372 +00:29:38,410 --> 00:29:42,570 + 더 큰 당신의 입력의 큰 일에 대해 녹색을 생각한다 + +373 +00:29:42,569 --> 00:29:46,389 + 그 사람에 그라데이션은 하나이며 모든 이것과 더 작은 하나의 인사말입니다 + +374 +00:29:46,390 --> 00:29:50,630 + 제로 직관적으로 이들 때문에 경우 하나는 더이 무엇보다 작은 것을 + +375 +00:29:50,630 --> 00:29:53,220 + 다른 사람의 큰 및 그건 무슨 일이 끝나는 때문에 출력에 영향을하지만, + +376 +00:29:53,220 --> 00:29:57,009 + 게이트를 통해 점점 당신은 하나의 구배로 끝날 수 있도록 + +377 +00:29:57,009 --> 00:30:03,140 + 입력 중 하나 크고 그래서 난 경우 그라데이션 작가로 왜 맥스 캐디의 + +378 +00:30:03,140 --> 00:30:06,420 + 실제로 내가받은 여러 입력 그들 중 하나의 가장 큰했다 + +379 +00:30:06,420 --> 00:30:09,550 + 그들 모두 그 내가 회로를 통해 전파되는 값이고 + +380 +00:30:09,549 --> 00:30:12,909 + 응용 프로그램 시간은 그냥 위에서 내 구배를받을거야 그리고 난 + +381 +00:30:12,910 --> 00:30:16,590 + 나의 가장 큰 충격이었다 누구에 기록하려고하면은 그라데이션 작가의 + +382 +00:30:16,589 --> 00:30:22,569 + 및 다중 게이트 그라데이션 스위처는 실제로 아주 좋은 생각하지 않습니다이다 + +383 +00:30:22,569 --> 00:30:26,960 + 방법은 그것을보고 할 수 있지만, 실제로는 아니에요 난 사실을 말하는 겁니다 + +384 +00:30:26,960 --> 00:30:39,150 + 
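The captions just covered the sigmoid gate, whose local gradient has the convenient closed form (1 - σ(x)) · σ(x); with the forward output of 0.73 quoted in the example, the backward step is one line:

~~~python
# Sigmoid gate backward pass, using its cached forward output.
out = 0.73                   # sigmoid output from the forward pass
dout = 1.0                   # gradient flowing in from above
dx = (1 - out) * out * dout  # (1 - 0.73) * 0.73 ~= 0.2
print(dx)
~~~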
신경 끄시 고 질문 그래서 그 부분에 대해 두 가지 경우 발생하는 것입니다 + +385 +00:30:39,150 --> 00:30:53,470 + 당신은 내가 그것을 생각하지 않습니다 무슨 일 최대 카데을 통과 할 때 입력은 동일하다 + +386 +00:30:53,470 --> 00:30:57,559 + 그들 모두에게 분배에 올바른 난 당신이 하나를 선택해야한다고 생각 + +387 +00:30:57,559 --> 00:31:07,990 + 즉, 기본적으로 결코 실제로 여기에 실제 연습 때문에 최대 구배를 발생하지 않습니다 + +388 +00:31:07,990 --> 00:31:13,019 + 예를 들어 여기에 너무 만이 영향에가있다 (W)보다 큰 것을이다가 + +389 +00:31:13,019 --> 00:31:16,839 + 이 최대 카데의 출력 바로 그렇게 할 때 최대 게이트로 두 흐름 및 + +390 +00:31:16,839 --> 00:31:20,879 + 읽어와 회로에 효과가 있으므로 W가 0 구배를 얻는다 도착 + +391 +00:31:20,880 --> 00:31:25,360 + 아무것도 제로가없는 당신이 변경할 때 중요하지 않습니다를 변경할 때 때문에 + +392 +00:31:25,359 --> 00:31:29,689 + 그것은 그 경쟁 경내 I 통과하는 큰 발리 없기 때문에 + +393 +00:31:29,690 --> 00:31:33,100 + 전파하는 우리는 이미 백업과 관련된 또 다른 메모가 + +394 +00:31:33,099 --> 00:31:36,490 + 난 그냥 간단히 정말 그것으로 지적하고 싶은 질문을 통해 해결 + +395 +00:31:36,490 --> 00:31:40,440 + 불운과 이러한 회로가있는 경우 때때로 당신은이 있는지 그림 + +396 +00:31:40,440 --> 00:31:43,330 + 값 회로에 지점 밖으로 그와의 여러 부분에 사용된다 + +397 +00:31:43,329 --> 00:31:47,179 + 정확한 것은 변수 체인 규칙에 의해 수행하는 회로는 사실이다 + +398 +00:31:47,180 --> 00:31:55,110 + 그라디언트 배경을 추가 할 수 있도록 동작에 기여를 추가 + +399 +00:31:55,109 --> 00:32:00,009 + 회로를 통해 거꾸로 그들이 이제까지이 역류에 유입하는 경우 + +400 +00:32:00,009 --> 00:32:04,879 + 바로 우리는 매우 간단한 구현 단지 몇으로 갈거야 + +401 +00:32:04,880 --> 00:32:05,700 + 질문 + +402 +00:32:05,700 --> 00:32:11,620 + 질문은 질문 해 지금까지 이들의 루프처럼이됩니다 감사합니다 + +403 +00:32:11,619 --> 00:32:15,839 + 당신이 생각 수있는 루프가 결코 그래서 외모 없을 것 그래프 + +404 +00:32:15,839 --> 00:32:18,589 + 당신은 재발 성 신경 네트워크를 사용하는 경우가 있음을 거기에 루프하지만, + +405 +00:32:18,589 --> 00:32:21,658 + 우리가 할 거 야하는 것이 있기 때문에 실제로는 더 우리는 재발 성 신경이 걸릴 거 있어요 + +406 +00:32:21,659 --> 00:32:26,230 + 네트워크 및 시간 단계를 통해 전개되며,이 모두가 될 것입니다 + +407 +00:32:26,230 --> 00:32:31,259 + 사진에 루프가 있음을 붙여 복사 할 수 결코 작은 조각 또는 시간 + +408 +00:32:31,259 --> 00:32:39,538 + 우리가 실제로 그것으로 얻을 때 당신은 더 많은 것을 볼 수 있습니다하지만 그는 항상보고 있어요 + +409 +00:32:39,538 --> 00:32:42,220 + 이것의 구현보고의 사실 실제로 구현하자 + +410 +00:32:42,220 --> 00:32:46,860 + 나는 우리가 항상 이러한 그래서뿐만 아니라이보다 구체적를하는 데 도움이됩니다 생각 + +411 +00:32:46,859 --> 00:32:52,038 + 그래프는 이러한 신경 네트워크를 구성에 대해 생각하는 가장 좋은 방법입니다 그래프 + +412 +00:32:52,038 --> 00:32:56,929 + 그래서 우리가 결국 어떻게이 모든 게이트가 약간 보일 거라고하지만, + +413 +00:32:56,929 --> 00:33:00,059 + 연결 구조를 유지할 필요가 무언가 게이트 위에 + +414 +00:33:00,058 --> 00:33:03,490 + 같은 단락의 내용 게이트 그래서 일반적으로 서로 연결되어 + +415 +00:33:03,490 --> 00:33:09,710 + 그 그래프에 의해 처리 또는 순 객체가 필요가 있는지에 일반적으로 객체의 + +416 +00:33:09,710 --> 00:33:13,679 + 두 가지 주요 부분 전후 평화이었고, 이것은 당신은 단지 인 + +417 +00:33:13,679 --> 00:33:19,929 + 이 코트는 실행되지만 기본적으로 거의 생각은 앞으로 패스이다 + +418 +00:33:19,929 --> 00:33:23,759 + 전체 그들이 위상으로 정렬하는 회로의 게이트를 거래 + +419 +00:33:23,759 --> 00:33:27,980 + 그게 무슨 뜻인지 주문하면 모든 입력이되기 전에 모든 노트에 와서해야한다는 것입니다 + +420 +00:33:27,980 --> 00:33:32,099 + 기회는 바로 왼쪽에서 오른쪽으로 주문하고 우리는 그냥있어 소모 된 + +421 +00:33:32,099 --> 00:33:35,969 + 탑승 우리가 반복 그래서 길을 따라 모든 단일 게이트 앞으로 나중에 호출 + +422 +00:33:35,970 --> 00:33:39,600 + 이 그래프를 통해 단지 하나 하나 조각 전진이 오브젝트 것 + +423 +00:33:39,599 --> 00:33:43,189 + 단지 있는지 확인하는 적절한 연결 패턴과 이전 버전에서 발생 + +424 +00:33:43,190 --> 00:33:46,620 + 우리는 정확한 역순으로거야 우리가 역에 전화하는거야 통과 + +425 +00:33:46,619 --> 00:33:49,709 + 모든 단일 게이트 및이 게이트는 각각 그라디언트를 전달 끝날 것 + +426 +00:33:49,710 --> 00:33:53,429 + 다른 및 이전 가져 오기 체인지업과 분석 그라디언트 그것을 다시 계산 + +427 +00:33:53,429 --> 00:33:57,860 + 그래서 진짜 목적은 모든 게이트 주위에 또는 매우 얇은 래퍼는 우리 + +428 +00:33:57,859 --> 00:34:01,879 + 자신의 차가운 레이어 층 또는 게이트 내가 같은 의미로 사용하는거야 볼 수 있습니다 + +429 +00:34:01,880 --> 00:34:05,700 + 그들은이 단지 매우 얇은 래퍼 서라운드 연결 구조있어 + +430 +00:34:05,700 --> 00:34:09,369 + 게이트는 그들에 순방향 및 역방향 함수를 호출 한 다음의가 살펴 보자 + +431 
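The captions describe the Net (graph) object as a thin wrapper over topologically sorted gates: call forward() on every gate left to right, then backward() on every gate in exact reverse order. A skeleton under the assumption that each gate exposes forward/backward methods:

~~~python
# Thin wrapper over a computational graph of gates.
class Net:
    def __init__(self, gates):
        self.gates = gates                  # assumed topologically sorted

    def forward(self):
        out = None
        for gate in self.gates:             # inputs come before consumers
            out = gate.forward()
        return out                          # the loss from the final gate

    def backward(self):
        for gate in reversed(self.gates):   # exact reverse order
            gate.backward()                 # chain rule, gate by gate
~~~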
+[431-477 | 00:34:09,369 --> 00:37:54,949]
+ This is actually not far from how gates look in real implementations, something you could run. Take a multiply gate: in this case it is just a binary multiply that receives two inputs, x and y, computes their product z = x * y in the forward pass, and returns it.
+ Every gate has to satisfy this forward/backward API: how it behaves in the forward pass, and how it behaves in the backward pass. In the backward pass the gate is eventually told the gradient of the final loss with respect to its output z, and what the gate is charged with computing is the gradients on its inputs, dx and dy. It computes that small piece of the chain rule, and then the surrounding structure makes sure these gradients get properly routed, and added, to whatever produced x and y.
+ For this gate: dx equals y times dz, and dy equals x times dz. An important point: we have to remember the values x and y, which is why we assign them to self in the forward pass; I need access to them in my backward pass.
+ In general, when we build these networks, every single gate must remember, at forward-pass time, whatever intermediate values its backward pass will need. So when you run these networks, keep in mind that the forward pass caches a huge amount of stuff in memory, because backpropagation needs access to those variables; your memory balloons up during the forward pass, and then it all gets consumed as we compute the backward pass.
+ Question: do you have to cache intermediates if you will not compete a backward pass? No: if you know you are only running the network forward, at test time, you can go into the code and make sure caching is off, and save the memory. Most implementations end up with a lot of logic dealing with memory for exactly this reason. In general you are only responsible for remembering what your gate's backward pass actually requires, and you can be clever about what you store; we will look at a concrete example of that.
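A runnable version of the multiply gate just described, close to the pseudocode on the slide; stashing the inputs on `self` is exactly the caching the transcript warns about.

~~~python
class MultiplyGate:
    def forward(self, x, y):
        # Remember the inputs: the backward pass needs them. This is
        # why forward passes cache memory as the lecture describes.
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # Chain rule for z = x * y:
        # dL/dx = y * dL/dz, and dL/dy = x * dL/dz.
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy
~~~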
+[477-523 | 00:37:54,949 --> 00:41:11,730]
+ Let's look at concrete examples in actual deep learning frameworks, which we will go into a bit more at the end of the class. Some of you may end up using Torch for your projects: if you go to the GitHub repo and look, it is basically just a giant collection of these layer objects, and these layers are exactly the same thing as our gates. A deep learning framework is really just all these layers, plus a very thin computational-graph thing that keeps track of a whole bunch of layers and all their connectivity.
+ So the image to keep in mind: these layers are your Lego blocks, and we build graphs out of them, putting them together in various ways depending on what you want to achieve. The way you work with any of these libraries is that your network is a whole set of layers to compute; every layer implements one small function piece, and that piece knows how to move forward and how to do the backward pass.
+ A concrete example: look at the MulConstant layer in Torch. It takes a tensor x, so not a scalar but an n-dimensional array of numbers, and scales it by a constant. You can see it is actually just a few sporty lines: some initialization stuff, where the constant a that you want to scale by is passed in (this is Lua, by the way, in case it looks foreign); then in the forward pass (they call it updateOutput) they just multiply x by the constant and return it; and in the backward pass (they call it updateGradInput) there are a few cases, but the really important line is the one where the gradient from above is multiplied by the same scalar and copied into the gradient on the input. That is the chain rule: the local gradient of a * x is just a, so you take the gradient from above and multiply it by a, and that is what the layer returns. Hundreds of layers in these frameworks look like this one.
+ You can also see examples in Caffe, another deep learning framework, aimed especially at images, that you may end up working with. If you go to the layers directory you see all these layers, and all of them implement the forward/backward API. To give you an example, the sigmoid layer: Caffe calls its tensors "blobs", so this layer takes a blob, which is just an n-dimensional array of numbers, and passes it element-wise through the sigmoid function...
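The real Torch layer is Lua; here is a NumPy sketch of the same idea, so you can see how little a typical layer contains. The class and method names mirror the description above but are our own, not Torch's API.

~~~python
import numpy as np

class MulConstant:
    """Scale-by-a-constant layer, mirroring the Torch layer described
    above. Works on tensors (n-dimensional arrays), not just scalars."""
    def __init__(self, a):
        self.a = a  # the scalar, fixed at construction time

    def forward(self, x):
        return self.a * x  # element-wise scaling of the whole tensor

    def backward(self, grad_output):
        # Local gradient of a*x with respect to x is just a, so the
        # backward pass is the incoming gradient scaled by the same a.
        return self.a * grad_output
~~~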
+[524-570 | 00:41:11,730 --> 00:45:14,630]
+ ...so in the forward pass we call the sigmoid function on the bottom blob (with pointers to the data and some boilerplate around it), and in the backward pass, which is where the magic happens, the really important line is where the chain rule gets computed: they compute bottom_diff as top_diff times the local gradient, which for a sigmoid with output s is s * (1 - s). The chain rule happens right there, through that one multiplication. So again: every single layer is just the forward/backward API, and then there is another object worrying about the connectivity on top. Any questions about these implementations?
+ Question (on why forward, backward and update are separated): when you do the backward pass you have the gradient, so you could update right away, changing the weights a little bit in the negative gradient direction. The way it is organized: forward computes the loss, backward computes the gradients, and then the update uses the gradients to nudge the weights. That loop of forward, backward, update, forward, backward, update is all that happens when you train a neural network.
+ Question (noticing the loops in the code): yes, those are loops; this is C++, they just iterate through everything. By the way, I should mention this is the CPU implementation; there is a second file implementing the same layer on the GPU, and that is CUDA code, a separate file, sigmoid-something.cu, which I am not showing you.
+ OK, great. What I would like to do now is work through how this looks when the things flowing along our graphs are not scalars. The whole picture stays the same; the only change is that x, y and z are now vectors. The local gradient, which before was just a scalar we could read off, is now, in general, a full Jacobian matrix: a two-dimensional matrix that basically tells you the influence of every single element of x on every single element of the output.
+ The gradient is the same expression as before, but dz/dx is now a full Jacobian matrix, and the product becomes a matrix-vector multiplication. (A note: the two factors should actually be in the other order, with the Jacobian on the left; the slide has it the wrong way around.)
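A small self-contained sketch of the sigmoid layer's two passes as described above, using Caffe's top/bottom naming in the comments; the Python itself is ours, since the real layer is C++.

~~~python
import numpy as np

def sigmoid_forward(x):
    # Forward: element-wise sigmoid of the input tensor.
    s = 1.0 / (1.0 + np.exp(-x))
    return s

def sigmoid_backward(s, top_diff):
    # The one line where the chain rule happens: the local gradient
    # of the sigmoid is s*(1-s), multiplied by the gradient from above
    # (Caffe's top_diff), giving the gradient on the input (bottom_diff).
    bottom_diff = top_diff * s * (1.0 - s)
    return bottom_diff
~~~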
+[571-617 | 00:45:14,630 --> 00:49:44,108]
+ I will come back to this point in a bit: in practice you never actually end up forming the Jacobian, and most of the time you will not actually multiply by it as a matrix. Let me show you why, on a concrete and fairly common example.
+ Suppose our operation is the max(0, x) nonlinearity, receiving a vector of 4096 numbers, which is a typical size, and producing 4096 numbers. You compute an element-wise threshold at zero: anything lower than zero gets clamped to zero, and the output has the same dimension. Here is the question I would like to ask: what is the size of the Jacobian matrix for this layer, in principle? It is 4096 by 4096: every single input number could, in principle, influence every single output number.
+ But that is not actually the case here, which is the second question. This Jacobian would be a huge matrix, sixteen million numbers, but you would never form it, because the matrix has special structure. What is that special structure? Since the operation is element-wise, the only nonzero elements of this giant 4096 by 4096 matrix are on the diagonal. And they are not all ones either: the elements whose input was below zero were clamped, so for those the diagonal entry is zero; for inputs that were above zero in the forward pass it is one. So the Jacobian is almost an identity matrix, with some of the diagonal entries zeroed out.
+ Since you would never actually form the full Jacobian, and it would be silly to carry out that matrix-vector multiplication, you take advantage of the special structure, and the gradient with respect to the input becomes very, very easy: look at which inputs were less than zero, and in those dimensions kill the gradient, setting it to zero; for the numbers that were above zero, just pass the gradient through. Then you can ask...
+ Question (on the output being a vector in general): we almost never actually run into that case, because we almost always end up with a single number at the end; we only care about a loss function, one scalar. If we had multiple outputs we would have to keep track of all of their gradients in parallel as we backpropagate, but we have scalar-valued loss functions, so do not worry about it.
+ I also want to point out that in practice it is even crazier than 4096: usually we use minibatches, say a hundred elements going through at the same time, so you end up with a hundred 4096-dimensional vectors coming in at once; but every example is processed independently of the others, so the Jacobian would really end up being 400,000 by 400,000, block-diagonal and far too large. You never form it. Basically, you have to take care to exploit the sparsity structure of the Jacobian, and you hand-code the operation inside every gate implementation; you do not actually write the fully general case.
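A runnable demonstration of the point above: because the ReLU Jacobian is diagonal, the matrix-vector product collapses to an element-wise mask. The sizes follow the 4096-dimensional example; the data is random for illustration.

~~~python
import numpy as np

x = np.random.randn(4096)      # input vector, as in the example
out = np.maximum(0, x)         # forward: element-wise threshold at zero

dout = np.random.randn(4096)   # gradient arriving from above
# The 4096x4096 Jacobian has ones on the diagonal where x > 0 and zeros
# elsewhere, so instead of forming it we just mask: kill the gradient
# wherever the input was negative, pass it through otherwise.
dx = dout * (x > 0)
~~~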
+[617-664 | 00:49:44,108 --> 00:53:36,619]
+ OK, before we move on, one more thing, since in the assignment you will be writing out SVM, Softmax and so on: I want to give you a hint on how to approach the design. Even though you are doing this for a classifier and an optimization, with no explicit graph structure, think of the problem as backpropagation: stage your computation into units whose local gradients you know, and then backpropagate through them. In the first assignment your gradient code will be inline, all straight-line code; there is no graph structure (in the second assignment you will actually build up the graph and implement your layers as objects, but not yet).
+ Concretely: compute your scores from W and x, compute these margins, which are max(0, score differences), compute the loss, and then backpropagate. In particular, I really recommend that you stage this with intermediates: create the intermediate matrices, compute the gradient on the scores, and only then the gradient on your weights. You may be tempted to chain-rule it all the way through on paper and just write dW as one equation; that is an unhealthy way to approach the problem. Make your computation explicit in stages and backpropagate through it; of course the intermediates will help you out too.
+ So, where we are so far: these neural networks end up being hopelessly large computational structures; all the intermediate nodes implement the forward/backward API; the graph structure on top is usually a very thin wrapper around all these layers and handles the communication between them; and that communication is always along vectors. When we write these implementations, what gets passed around between gates are n-dimensional tensors, which really just means n-dimensional arrays; those arrays go between the gates, and internally every gate knows what to do in its forward and backward pass.
+ OK, at this point that is the end of backpropagation, and I will go on to neural networks. Any questions before we move on?
+ Question (on the assignment): does it... pretty much, all of this has to be done efficiently enough with NumPy, so that is something it will give you practice with. (And to the other question: I do not think that would work, I am actually not sure, but that is up to you to design and think through.)
+ So, neural networks. This is exactly what it looks like (this is what happens when you search Google Images for "neural network"; this is the first result, something like that). Before we dive into neural networks I would actually like to do this first without all the brain stuff: forget that they are neural, forget that they have anything whatsoever to do with the brain. They do, but forget it for now; let's just look at score functions. Before, we had f = Wx; if you want to use a neural network, you change that equation to this: a two-layer neural network is f = W2 max(0, W1 x), and that is what it looks like. It is just a more complex mathematical expression in x.
+ What is happening is: you take your input x, you multiply it by a matrix, just as we did before; what comes next is a nonlinearity, or activation function (I will go into several choices you can make for these; in this case we use thresholding at zero as the activation function); so, matrix multiply, threshold everything below zero, then one more matrix multiply, and that gives us our scores. If I drop in the CIFAR-10 numbers: x is 3072 pixel values, and before we went with a single matrix multiply straight to the 10 class...
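A minimal sketch of the two-layer score function f = W2 max(0, W1 x) with the CIFAR-10 sizes quoted in the lecture; biases are omitted for brevity and the weights are random placeholders.

~~~python
import numpy as np

# Shapes follow the lecture's CIFAR-10 example: 3072 pixels in,
# 100 hidden units (a hyperparameter), 10 class scores out.
x  = np.random.randn(3072)
W1 = 0.01 * np.random.randn(100, 3072)
W2 = 0.01 * np.random.randn(10, 100)

h = np.maximum(0, W1.dot(x))   # hidden layer: matrix multiply, threshold
s = W2.dot(h)                  # class scores: one more matrix multiply
~~~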
+[664-712 | 00:53:36,619 --> 00:57:42,690]
+ ...scores; but now we go through an intermediate representation, a hidden state, the hidden layer: call it h, say a hundred numbers, where one hundred, the size of the network, is a hyperparameter. So we go through this intermediate representation: a matrix multiply gives us a hundred numbers, we threshold at zero, then one more matrix multiply gives the ten scores. Because we have more numbers in between, we have more wiggle room, and the function is more interesting.
+ One particular example of why it is more interesting, one interpretation you can make: go back to linear classification on CIFAR-10. We saw that the car class template had to merge all the modes of cars, all the different orientations, all the colors, across the whole space, into a single template; with one layer, one template had to go across all those modes, and it could not deal with, say, different colors; that was not very natural to do. But now we have a hundred numbers in this intermediate, so you can imagine, for example, that one of those numbers picks up just on a red car facing forward; another one finds a red car facing slightly to the left; another a red car facing slightly to the right. Those elements of h would become positive if they find the thing they are looking for in the image, and stay zero otherwise. Other elements of h could look for green cars, or yellow cars, at different orientations, whatever; we can have templates for all these different modes, and these neurons turn on or off depending on whether they find the specific type they are looking for.
+ Then W2 is a weighted sum across all those little car templates; say you have twenty car templates of what cars could look like. So the score classifier has an additional step: a weighted sum over them, and if any one of them turns on, then through my weighted sum, with somewhat positive weights, things add up and I get a higher score. So now we can have this multimodal car classifier, and through this additional hidden layer, for wavy reasons of this kind, these networks can do more interesting things.
+ (Question about doing something fun or extra for the assignment: yes, you get bonus points for interesting experiments, and that is a good candidate for something you might want to investigate, whether it works or not.)
+ Questions?
+[712-758 | 00:57:42,690 --> 01:01:18,840]
+ Question: will the network actually spread its hidden units across the different modes of the dataset like that? I do not have a good answer, because we are going to train this entirely with backpropagation. Thinking naively that there will be an exact, clean template for, say, a red car facing left: leave that aside; you will probably find that the intermediates end up as mixes and weird combinations, because the optimization finds its own optimal boundaries to cut up the data, and the tidy story definitely gets swept away; interpreting them cleanly is really hard. So that is the honest answer.
+ Question: how do I choose the size of the hidden layer? It is mostly a hyperparameter; I chose a hundred. We will see what people usually do, but generally you want them as large as possible, as large as your compute allows, and so on; bigger is better. I will get to that.
+ Question: do we always take max(0, ...)? We get to that about five slides away, so I will just go ahead, I guess; ask me at the end. By the way, there is a very simple way in which this extends to a three-layer neural network: we just keep the same pattern going, with another hidden intermediate node, so we can keep making our networks deeper and deeper, and you can compute more interesting functions, because you are giving yourself more time to compute something interesting, in stages.
+ One other slide I do not want to forget: training a two-layer neural network. The point is that it is actually pretty simple when it comes down to it. This is borrowed from a blog post: basically, roughly eleven lines of Python that implement a whole two-layer neural network doing binary classification on a toy dataset: a data matrix X (with three-dimensional examples) and binary labels y. syn0 and syn1 are your weight matrices; he calls them synapses, but they are weight matrices. And then there is this optimization loop here.
+ What you are seeing is: we compute the first-layer activations, but notice this is using the sigmoid nonlinearity, not the max(0, x) we just saw; we will go into the different nonlinearities in a bit, but it is one form or another of nonlinearity. The first layer, then the second layer right here; then it computes the backward pass: l2_delta, the gradient on the second layer, l1_delta, the gradient on the first layer; and then the weight update is happening right there, fused with the final piece of backprop: the gradients are added into W right here, for syn1 and syn0 both. Eleven lines to train a feedforward neural network.
+ The loss here may look slightly different from what you saw, and the reason is that this is the logistic regression loss: you saw a generalization of it (the softmax classifier) to multiple dimensions, but this is basically a logistic loss being updated here; you can go through it in more detail yourself. Otherwise there is nothing too crazy going on: very few lines of code suffice to actually train these networks.
+ Everything else is a plus: how do you make it efficient, the cross-validation pipeline, all the stuff you need to have on top; that is what turns these into big codebases. But the kernel is very simple: we compute these layers, forward pass, backward pass, do an update, and repeat. The first part, by the way, is creating the initial random...
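For reference, a reconstruction of the blog snippet the lecture refers to; this is the widely circulated eleven-line version (the tiny dataset is the blog's toy XOR-style task, shown here for illustration, not from the lecture itself).

~~~python
import numpy as np

X = np.array([[0,0,1],[0,1,1],[1,0,1],[1,1,1]])
y = np.array([[0],[1],[1],[0]])
syn0 = 2*np.random.random((3,4)) - 1     # "synapse" = weight matrix 1
syn1 = 2*np.random.random((4,1)) - 1     # weight matrix 2
for j in range(60000):
    l1 = 1/(1+np.exp(-X.dot(syn0)))      # first layer (sigmoid)
    l2 = 1/(1+np.exp(-l1.dot(syn1)))     # second layer (sigmoid)
    l2_delta = (y - l2) * (l2*(1-l2))    # gradient on layer 2
    l1_delta = l2_delta.dot(syn1.T) * (l1*(1-l1))  # backprop to layer 1
    syn1 += l1.T.dot(l2_delta)           # update fused with backprop
    syn0 += X.T.dot(l1_delta)
~~~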
+[759-805 | 01:01:18,840 --> 01:04:44,989]
+ ...weights: you have to start somewhere, so you generate a random W.
+ I will also mention that you will train a two-layer neural network in this class, so you will be doing something very similar to this; you will not use the logistic regression loss, and you may have different activation functions, but again, my advice on implementing it holds: compute intermediate results, and then do proper backpropagation on all the intermediates. You will have to compute the gradients on these weight matrices and also on the biases; do not be fooled, this snippet has no biases, but you will have biases here too. Compute the biases and weight matrices layer by layer, complete your loss, and then go backward: backprop into the second-layer weights, backprop into h1, then backprop through the nonlinearity and into the first weight matrix; run proper backpropagation here. Otherwise, if you try to write dW1 as just one single expression, it will be far too large and a series of headaches. Stage it, and backpropagate. That is just a hint.
+ OK. Now, that was the presentation of neural networks without all the brain stuff, and it looks pretty simple. Now we are going to make it slightly more insane, mostly by folding in all kinds of motivations, largely historical, about how this came about and how it relates to the brain at all. We have "neural" networks, and we have "neurons" inside these networks; this is what comes up when you search image search for a neuron: an actual biological neuron, which looks more like that.
+ Very briefly, just to give you an idea where this comes from: you have the cell body, or soma, and it has all these dendrites, which are connected to other neurons; there is a cluster of other neurons nearby, and the dendrites are really these appendages that listen to them: they are the inputs to the neuron. And the neuron has a single axon that comes out and carries the output of the computation it performs.
+ So the neuron usually receives inputs, many of them, over its dendrites, and then, if it likes what it sees, it can choose to spike: it sends an activation potential down its axon, which then connects to the dendrites of other neurons downstream; this one's axon connects to dendrites of other neurons. So neurons are basically connected through these synapses in between, with dendrites carrying the inputs to the cell body and the axon carrying the output. And you can come up with a very crude model of this...
+[806-853 | 01:04:44,989 --> 01:08:10,909]
+ ...that looks like this. Imagine an axon coming from another neuron, carrying a value x, connected to our neuron through a synapse. Each one of these synapses has a weight w associated with it, which you can read as how much this neuron likes that other neuron. In this crude, discrete model the synapse interacts multiplicatively: x and w multiply at the synapse, so w times x flows on to the soma; and that happens for many inputs at once. The cell body sums up all the w-times-x contributions, adds a bias, and then the result passes through an activation function, which actually computes the output of the neuron.
+ Historically people liked to use the sigmoid nonlinearity for the biological model, and the reason is that you get a number between 0 and 1, which you can interpret as a firing rate: the rate at which this neuron is spiking for that particular input, a rate between 0 and 1. So if the neuron sees something it likes, from the neurons connected to it, it starts spiking a lot, and that is described by the rate, the output of the activation function.
+ That is the crude model of a neuron; so if I wanted to implement it, a single neuron's forward pass would look something like this: it receives some inputs, a vector; the cell body computes a weighted sum; we put the sum through a sigmoid to get the firing rate, and return the firing rate; and then this can connect to other neurons, as you can imagine. You will notice this looks very similar to a linear classifier: we are forming a weighted linear sum and passing it through a nonlinearity; so every single neuron in this model is really like a small linear classifier, but these classifiers plug into each other, and they can work together to do interesting things.
+ Now, one note about neurons: they are very... biological neurons are super complex. If you go around saying that neural networks work like the brain, people will start frowning at you, and that is because neurons are complex dynamical systems: there are many different types of neurons and they function differently; the dendrites can perform a lot of interesting computation (there is a good review article, on dendritic computation, that is really worth reading); the synapses are complex dynamical systems too, not just a single weight; and we are not really sure the brain uses rate codes to communicate. So this is a very crude mathematical model; do not push the analogy too much; but it is good for, say, media pieces, that kind of thing.
+ That is why I keep coming back to this: we could explain it as "it works like your brain", but I am not going to go too deep into that; there is a whole set of caveats. OK, back to a question from before: the nonlinearities. There is a whole set we can choose from; historically the sigmoid has been used quite a bit, and we will go into more detail over what these nonlinearities are, what their trade-offs are, and why you would use one or another...
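A sketch of the single-neuron forward pass just described (the slide shows code along these lines; the exact class shape here is ours, and the weights and bias are assumed to be already set).

~~~python
import numpy as np

class Neuron:
    """Crude neuron model: weighted sum at the cell body, sigmoid
    interpreted as the firing rate."""
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def forward(self, inputs):
        # Cell body: sum of synapse-weighted inputs plus a bias.
        cell_body_sum = np.sum(inputs * self.weights) + self.bias
        # Activation function: sigmoid squashes the sum to (0, 1),
        # read as the rate at which the neuron fires.
        firing_rate = 1.0 / (1.0 + np.exp(-cell_body_sum))
        return firing_rate
~~~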
+[854-900 | 01:08:10,909 --> 01:11:33,909]
+ ...but for now, just to mention that there are many to choose from: historically people used the sigmoid quite a bit, and tanh; as of roughly 2012 the ReLU became popular, and it makes your networks train quite a bit faster, so right now, if you want a default choice of nonlinearity, the ReLU is the current default recommendation. And then there are a few more activation functions here: the Leaky ReLU, Maxout was proposed a few years ago and is fun, and very recently the ELU. You can come up with different activation functions, and people describe why one might work better or not; this is an active area of research, trying to come up with activation functions that have better properties in one way or another. We will go into more detail on them soon in the class.
+ So now that we have these neurons, with a choice of activation function, we want to run them in neural networks: we just connect them together so they can communicate with each other. Here is an example of what a two-layer, or a three-layer, neural network looks like. On counting layers: when you count the number of layers in a neural network, you count the layers that have weights; the input layer does not count as a layer, because those values do not actually do any computation, there are no weights there. So here we have two layers with weights: we call these fully-connected layers.
+ I showed you the single-neuron computation, the little weighted sum with the nonlinearity; the reason we arrange neurons into layers is that layers make the computation much more efficient. Instead of an amorphous blob of neurons, each of which would have to be computed independently, having layers allows us to use vectorized operations, so we can compute a whole set of neurons, one entire hidden layer, as just a single matrix multiply. That is why we arrange them in layers where the neurons within a layer can all be evaluated in exactly the same way, in parallel: it is a computational trick.
+ So this, sorted into layers, is a three-layer neural network, and this is how you would compute it: just a bunch of matrix multiplies, each followed element-wise by an activation function, then the next one, and so on.
+ Now I would like to show you a demo of how these neural networks work. It will date me slightly, but basically this is an example of a two-layer neural network on a binary classification task: two classes, red and green points, in two dimensions, and I am drawing the decision boundary computed by the neural network. What you will see is that when I train a network on this data with more hidden neurons, the more of them I have, the more wiggly, crazier functions it can compute.
+ The demo also shows the regularization strength, the penalty on large W: when you insist on a lot of regularization, your W are forced to be very small, and you end up with a very smooth function without much variance, not a lot of wiggle; and as you decrease the regularization these neural networks can do more and more complex things, so they can get in there and squeeze around individual points of the training data. So let me show you what this looks like during training...
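The vectorized three-layer forward pass described above, as a runnable sketch; the layer sizes and the final linear output are illustrative choices, and the sigmoid stands in for whichever activation you pick.

~~~python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))   # activation function (sigmoid)
x = np.random.randn(3, 1)                # a random input vector

# Illustrative sizes: each hidden layer is one matrix multiply, so an
# entire layer of neurons is evaluated at once instead of one by one.
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

h1  = f(W1.dot(x) + b1)     # first hidden layer
h2  = f(W2.dot(h1) + b2)    # second hidden layer
out = W3.dot(h2) + b3       # output scores (no nonlinearity at the end)
~~~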
+[901-947 | 01:11:33,909 --> 01:16:19,149]
+ First, to describe what is here (there are a few things you can play with; this is all JavaScript in the browser): we have six hidden neurons, and we are doing binary classification on circle data, a small cluster of green points surrounded by red points, and we train the neural network to separate them. If I restart the network, it starts from a random W and converges to a decision boundary that classifies the data.
+ The cool part is the panel on the right, which shows one interpretation: what the neural network is doing is using its hidden layer to transform your input data in such a way that the second layer, a linear classifier, can come in and classify it. The first layer warps the space so that the second layer, a linear classifier on top of the first, can put a plane through it; the space gets worked over so that the points become linearly separable, and you can see the plane. Let's look again so you can really see what happens to the space as it trains: this data, which was not separable early on, gets classified; people sometimes call this changing the data representation, into a space where it is linearly separable.
+ Now here is a question: we have six neurons here in the intermediate layer, and they allow this separation; you can see roughly these six lines, because each of those kinks corresponds to one of the neurons. The question: what is the minimum number of hidden neurons for which this dataset is separable by a neural network, so that it classifies it properly, at a minimum?
+ (Answers: three? four?) Look at what happens: one cut went this way, one that way, one around that way; the neurons are cutting the plane, and then there is an additional layer doing a weighted sum over them. The lowest number is in fact three: with three neurons, one plane, a second plane, a third plane, three linear functions with the nonlinearity, you can carve out the space with three lines, and the second layer just combines them.
+ What if the number is two? With two lines it will definitely break, because two lines are not enough; and here something very nice happens: it finds the optimal way of using its two lines; they make this kind of tunnel, and that is the best you can do.
+ (Question: what if you use many more?) I think if I used many more, you would think you would see sharper boundaries; what you actually get, because there are now more of them, is that in some of these parts more than one of the units is active, so you end up with really three lines, but some of the corners get smoothed out where several are active at once, and the weights end up kind of funky; you have to think it through. OK, so let's change this to, say, twenty, and let's look at a different dataset, like the spiral. You can see how this thing just goes in there and does its updates, and quickly figures out very simple data on its own; with the circle, likewise, it kind of goes and covers the green and the red. And now with fewer, let me break it: with a small number this starts working worse and worse, because you do not have enough capacity to separate this data. So do play with this in your free time.
+ To summarize: we arrange these neurons into neural networks as sequences of layers; we saw how that changes the computational graph; they are really not much like biological neurons; and, as we will see soon, bigger is better; we will go into that.
+[948-984 | 01:16:19,149 --> 01:19:31,238]
+ Before we move on, I will take questions; sorry, we have only about two more minutes. (Yes, thank you.)
+ Question: is it always better to have more neurons in a neural network? Answer: yes, more is always better; compute constraints aside, more is generally always better. It usually works better, but you have to be careful to regularize properly: the correct way to constrain your network so it does not overfit your data is not to make the network smaller; the correct way is to increase the regularization. So you always want to use as large a network as you can, but make sure to regularize it properly; most of the time we use smaller networks only for practical reasons, because we do not have time to wait forever for training to finish.
+ Question: does the same apply per layer, do you regularize different layers differently? Usually you simplify: what you most often see in practice is that trained networks are regularized the same way throughout, but you do not necessarily need to.
+ Question: do you ever use second-order optimizers on these networks? Sometimes, when your dataset is small, you can use things like L-BFGS, which I did not go into; it is a second-order method, usually, but it does not work very well when your dataset is really large, and L-BFGS is not very good with minibatches; so you cannot do L-BFGS on millions of examples, and you basically always have to fall back to SGD-like methods. Unfortunately I do not have a better answer for that.
+ Question: how do you trade off depth against width, where should I allocate capacity, deeper or wider? I do not have a very good answer; you want depth to be good, but maybe after, say, ten layers, for simple data, it may not add too much. Generally, especially with images, we find that more layers are critical; but sometimes, for simpler datasets, depth is not as important, so it is somewhat data-dependent.
+ Question: can different layers use different activation functions? Usually that is not done; you just pick one and go with it, and use it throughout; there is no real demonstrated benefit to switching them around, so people do not play with that much, though nothing prevents you in principle.
+ OK, we are at 4:20, so we will end here; we will see many more neural networks, so many of these questions will get answered as we go through them.
diff --git a/captions/Ko/Lecture5_ko.srt b/captions/Ko/Lecture5_ko.srt
new file mode 100644
index 00000000..c824f870
--- /dev/null
+++ b/captions/Ko/Lecture5_ko.srt
@@ -0,0 +1,4280 @@
+[1-8 | 00:00:00,000 --> 00:00:36,039]
+ ...most of you have done it, for some it will be unfinished, but OK; I can hold make-up office hours for that. For this class: the next assignment will be released tomorrow or the day after; we have not completely finished it, we are still working on it, because we are changing it from last year, so it is in the process of development; we hope to have it out as soon as possible, so that you can start on it as soon as it is out; and since this...
+[8-55 | 00:00:36,039 --> 00:03:35,000]
+ ...is the announcement: the deadline or something may get adjusted, since this assignment is slightly bigger, so some of these things may shuffle around; also, the way the materials are graded may change ad hoc, because we are still figuring the course out; it is relatively new and a lot of it is changing. So that is just a heads-up before we start.
+ By the way, the project proposal is due in about ten days, and I wanted to make a few points, because you will be thinking about your projects, about what makes a good or bad project, and some of you may have misconceptions. The most common one is probably that people are hesitant to work with small datasets, because they think ConvNets require huge amounts of training data: hundreds of millions of parameters come out and they need training. In fact, for the purposes of a project, that is not something you need to worry about much; it is OK to work with small datasets. The reason is something we will go into in more detail later in the course, called fine-tuning: in practice you almost never train these giant ConvNets from scratch; you almost always do this pretraining and fine-tuning process, and this works.
+ The way it almost always works: you take a ConvNet and train it on some large dataset, say ImageNet, which is a large amount of data; the dataset you are actually interested in is some other dataset over there; and you transfer. Here is a schematic of a ConvNet for images: we go through a series of layers down to a classifier (we have not talked about the specific layers yet, of course, but we will). You take that ImageNet-trained ConvNet, chop off the last layer, the classifier; take the whole remaining ConvNet as a fixed feature extractor; put that feature extractor on top of your new dataset; and you just swap in a different last layer that does the classification on top.
+ Then, depending on how much data you have, you either train only your own last layer, or you actually backpropagate into the network, that is, you fine-tune; and the more data you get, the deeper you will backpropagate into, and train through, the deep network.
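A self-contained sketch of the "fixed feature extractor" recipe above: pretend the chopped ConvNet has already mapped a small dataset to feature vectors (simulated here with random numbers, since the pretrained network itself is out of scope), and train only a new linear softmax classifier on top. All sizes, the step count, and the learning rate are illustrative.

~~~python
import numpy as np

N, D, C = 200, 4096, 10              # small dataset, fc-layer-sized features
feats = np.random.randn(N, D)        # stand-in for pretrained ConvNet features
y = np.random.randint(C, size=N)     # labels for the new task
W = 0.01 * np.random.randn(D, C)     # the only weights we train

for step in range(100):
    scores = feats.dot(W)
    scores -= scores.max(axis=1, keepdims=True)        # numeric stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    dscores = probs
    dscores[np.arange(N), y] -= 1    # softmax gradient on the scores
    dW = feats.T.dot(dscores) / N
    W -= 0.5 * dW                    # update only the new last layer
~~~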
+[55-103 | 00:03:35,000 --> 00:06:54,439]
+ And there is a huge line of people who do this part for you: these ConvNets take weeks to train on large datasets, and online, in the Caffe Model Zoo for example, people upload the weights of their ConvNets; these are all networks already pretrained on large datasets, with all the learned parameters, and you just swap out the final layer. So if you do not have a lot of data, that is fine: you take a pretrained network and do fine-tuning on your small dataset, and that will work. Do not be afraid of working with small datasets.
+ The second thing, where we saw some problems last time, is people who think they have infinite compute and propose something overly ambitious. Some of what you propose will not fit in the time you have: you do not have that many GPUs, and you will have hyperparameter optimization to worry about, so there are constraints to keep in mind. Last year we had people propose projects training on very large datasets, and you do not have the time for that; you will get a better sense of the computational constraints as we go through the class. OK, we will dive into the lecture; there may be one administrative thing left that you can ask me about later. OK, good; today we dive into quite a bit of material.
+ Just a reminder of where we are: training a neural network is basically a four-step process, and it is simple. One: you sample a batch of data from your dataset. Two: you forward-propagate it through the network to compute the loss. Three: you backpropagate to compute the gradients. Four: you do a parameter update, tweaking your weights slightly in the direction of the negative gradient. You repeat this process, and what it really comes down to is optimization: the problem is to converge, in weight space, to a region of the space with low loss, which means we classify our training set correctly.
+ We saw that these are very large: I flashed images of the graphs; these are basically huge computational graphs, and we need to backpropagate through them. So we talked about the intuition behind backpropagation, which is really just a recursive application of the chain rule from the back of the circuit to the front, through all the local operations; we saw implementations of this, both the forward/backward API on the computational graph and the same API implemented on the nodes for forward propagation and backpropagation; we saw concrete examples in Torch and Caffe; and I drew the analogy that these layers or gates are like Lego blocks, the small building blocks you construct networks from. Then we talked about neural networks, first without the brain stuff, as just more complex score functions, and then with the neuron analogy; and that is roughly where we are now. We will be talking about the process of training these networks for this class, so let's work through it properly.
+ Before I dive into the details, I wanted to zoom out and give you a bit of history of how this evolved over time, where it comes from, who first proposed what. You would probably go back to Frank Rosenblatt, around 1957, and something called the perceptron. The perceptron was implemented in hardware: the way we write code now was not how it worked; they actually built the thing out of circuits and electronics. The perceptron was roughly this function here, and it looks very similar to what we know, but the activation function used was a step function: either 1 or 0, a binary step...
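A minimal runnable sketch of the four-step loop described above, on a toy linear regression model so it stays self-contained; the data, model, batch size, and step size are all illustrative placeholders, not anything from the lecture.

~~~python
import numpy as np

X = np.random.randn(500, 10)                  # toy dataset
y = X.dot(np.random.randn(10)) + 0.1 * np.random.randn(500)
w = np.zeros(10)                              # weights to train

for it in range(1000):
    idx = np.random.choice(500, 32)           # 1. sample a batch of data
    xb, yb = X[idx], y[idx]
    pred = xb.dot(w)
    loss = np.mean((pred - yb) ** 2)          # 2. forward: compute the loss
    dw = 2 * xb.T.dot(pred - yb) / len(idx)   # 3. backprop the gradient
    w -= 0.05 * dw                            # 4. parameter update
~~~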
+[104-151 | 00:06:54,439 --> 00:10:04,370]
+ ...function. You will notice that this step function is not differentiable, so they could not backpropagate through it; backpropagation for training neural networks had to come much later. So with these binary step-function perceptrons, they came up with learning rules: ad-hoc, specially specified rules that jiggled the weights so that the desired outcome of the perceptron would match the true target. There was no notion of a loss function, and no notion of backpropagation; when you look at the rules they almost look like special cases of backprop, which is kind of funny, because the step function has no derivative.
+ Then, around 1960, with the advent of Adaline and Madaline by Widrow and Hoff, people started stacking these perceptrons into the first multilayer perceptron networks; this was still all done in electronics, actually built from hardware. But there was still no backpropagation; these were all rules of the sort: think about flipping a weight, try it, and see whether it works better or not; there was still nothing like backpropagation at this point.
+ So around 1960 people were very excited, building these circuits, and they thought you could go really far with this. Remember that at the time the notion of programming was very explicit: you write a series of instructions for a computer. This is one of the first times people were thinking about a data-driven approach, where you have a circuit of sorts that can learn; this was a big conceptual leap at the time. People got very excited, but these networks did not end up working that well: relative to the promises of the 1960s they got over-excited, over-promised and under-delivered, and so through the 1970s the field was very quiet and there was not much research.
+ The next boost came around 1986, with the influential paper by Rumelhart and colleagues. This is basically the first time you see backpropagation-like rules in a nicely presented form: they worked with multilayer perceptrons, and it is the first time that, when you go to the paper, you actually see something that looks like backpropagation as we know it; the idea of ad-hoc rules had been discarded, and there is talk of loss functions, backpropagation and gradient descent. So in 1986 people got excited again, because they now had a principled credit-assignment scheme, backpropagation, for multi-layer networks. Unfortunately, there were problems when they tried to scale these networks up and make them deeper: they did not work very well compared to some of the other tools in your machine learning toolkit; training would get stuck, and the results basically were not very good at the time, especially when you wanted big networks. And that was the case for almost twenty years: there was much less research into neural networks, because somehow it just did not work very well. Then, in 2006, research was revived...
+[151-198 | 00:10:08,440 --> 00:13:19,710]
+ ...once more, by a Science paper from Hinton and Salakhutdinov (whose name I can never quite say). What they showed here was, roughly, the first time we could actually train deep neural networks properly. What they did: instead of training all the layers, say ten layers, in a single pass of backpropagation, they used an unsupervised pre-training scheme based on what is called a Restricted Boltzmann Machine. The idea amounts to: you train the first layer using an unsupervised objective, then you train the second layer on top of it, then the third, then the fourth; and once you have trained all of these, you put them all together and only then start backpropagation, the fine-tuning step. So it was a two-step process: first step-wise through the layers, then backpropagation on top, and it worked. This is the first time backpropagation worked at depth, and it basically needed this initialization from the unsupervised pre-training; otherwise, from scratch, it would not work.
+ We will see in this lecture why it is tricky to get these networks to train from scratch using just backprop; you really have to think about it. It actually turned out later that you do not need the pre-training process: you can train with backprop right away, but you have to be very careful with the initialization; and they were using sigmoids at this point, and sigmoids, as we will see, are just not a good option. So, basically: backprop works, but you have to be careful how you use it.
+ After 2006 there was some more research, and the area came back, rebranded as deep learning; it is still synonymous with neural networks, but it is a better word, for the art of it. Basically, at this point things started working properly and people could actually train networks; still, not too many people were paying attention. People really started paying attention, I think, around 2010 and 2012. In 2010 there were the first really big results where neural networks did really well compared to everything else in your machine learning toolkit, and this was specifically in speech recognition: there was a GMM-HMM framework, and researchers swapped one part of it for a neural network, and that gave a big improvement, in 2010; this was work at Microsoft. So people started paying attention, because this was the first time a really big improvement came out of this work.
+ Then it unfolded even more dramatically in 2012, in the domain of visual recognition, computer vision: the 2012 network by Alex Krizhevsky and Geoff Hinton basically crushed all the feature-based competition, a really big improvement from these neural networks; that is when people really started paying attention, and then the field kind of exploded; there are a lot of research areas in it now.
+ As for why it started working around 2010, I will go into details a little later; it is a combination of things, but I think we figured out better ways to initialize, better activation functions, we had GPUs, and we had a lot more data. So the stuff before did not work partly just because of...
확실히 작동하지 않았다 + +199 +00:13:19,710 --> 00:13:26,028 + 컴퓨터 데이터와 아이디어의 일부에 불과 조정 등 그 거친 + +200 +00:13:26,028 --> 00:13:30,750 + 역사적 그래서 우리는 기본적 통해 유망 약자 이상에 걸쳐 갔다 + +201 +00:13:30,750 --> 00:13:34,700 + 처리 및 배달하고 지금은 일처럼 보인다는 실제로 작동하려고 + +202 +00:13:34,700 --> 00:13:37,028 + 정말 잘 우리는이 시점에서 어디 그래서이다 + +203 +00:13:37,028 --> 00:13:42,210 + 확인 나는 세부 사항에 뛰어거야 그리고 우리는 정확히 것입니다 실제로 볼 수 있습니다 + +204 +00:13:42,210 --> 00:13:45,550 + 작품을 알고 죽어가는 우리는의 개요 있도록 적절하게 훈련하는 방법 + +205 +00:13:45,549 --> 00:13:49,139 + 우리가 다음 해 강의의 과정을 통해 다루려고하는 것은 인 + +206 +00:13:49,139 --> 00:13:52,809 + 독립적 인 사물의 전체 무리 그래서 난 그냥 모든 당신을 peppering 될 수 있습니다 + +207 +00:13:52,809 --> 00:13:55,989 + 우리가 이해하고 사람들이에서 무엇을 볼 수있는이 작은 지역 + +208 +00:13:55,990 --> 00:13:59,409 + 경우 우리는 그들을 통해 방법을 모든 거래의 장단점을 갈거야 + +209 +00:13:59,409 --> 00:14:05,659 + 실제로 제대로에 신경 네트워크와 실제 데이터 세트를 훈련 + +210 +00:14:05,659 --> 00:14:06,730 + 먼저 우리가 얘기하는거야 + +211 +00:14:06,730 --> 00:14:14,450 + 활성화 기능은 내가 강의 그래서 전 그렇게이 생각 약속 + +212 +00:14:14,450 --> 00:14:19,320 + 자신의 정상 기능에 우리는 다른 많은 수 보았다 + +213 +00:14:19,320 --> 00:14:25,230 + 이 때문에 휴대폰은 무엇 이러한 활성화에 대한 모든 다른 제안입니다 + +214 +00:14:25,230 --> 00:14:28,450 + 그들이 어떤 감옥 통화 및 방법을 통해 갈 것 같은 기능을 볼 수 있습니다 + +215 +00:14:28,450 --> 00:14:31,459 + 의 바람직한 특성에가는 무엇을 어떻게 활성화에 대해 생각 + +216 +00:14:31,458 --> 00:14:35,289 + 활성화 기능을 가장 많이 사용 된 역사적으로 하나가 그래서 + +217 +00:14:35,289 --> 00:14:39,009 + 그것은 기본적 부수 그래서 다음과 같습니다 시그 모이 드 비선형 + +218 +00:14:39,009 --> 00:14:40,528 + 그것이 진정한 가치 번호를 취 기능 + +219 +00:14:40,528 --> 00:14:45,669 + 그래서 시그 모이와 첫 번째 문제는 0과 1 사이 과즙은가되게합니다 + +220 +00:14:45,669 --> 00:14:51,120 + 지적 된 바와 같이 몇 가지 강의 포화 문제가 거기에 갈 것을 + +221 +00:14:51,120 --> 00:14:55,839 + 어느 제로에 매우 가까이 또는 그 중 하나에 매우 가까운 뉴런 + +222 +00:14:55,839 --> 00:15:00,070 + 신경 세포는 다시 전파 동안 구배를 죽이고 그래서 난에 확장하려면 + +223 +00:15:00,070 --> 00:15:03,660 + 이 항목은 정확히 이것이 의미하는 어떤이는 우리가있어 뭔가에 기여 + +224 +00:15:03,659 --> 00:15:08,679 + 벤치 성분 문제를 호출하려고하는 것은 그래서이의 게이트를 살펴 보자 + +225 +00:15:08,679 --> 00:15:11,159 + 다시 회로의 일부를받을 + +226 +00:15:11,159 --> 00:15:16,149 + 그리고이 나오고 다시 아마 괜찮은 우리로 거래를 신호 + +227 +00:15:16,149 --> 00:15:19,940 + 우리가 가질 수 있도록 체인 규칙을 사용하여 제 2 게이트를 통해 드롭 백업하려면 + +228 +00:15:19,940 --> 00:15:24,089 + 당신이 체인 규칙을 통해 그것을 볼 수있는 끝에서 닥스에 의한 거래는 기본적으로 말했다 + +229 +00:15:24,089 --> 00:15:27,569 + 우리는이 두 수량을 곱 등 때 일어나는 일에 대해 생각하는 + +230 +00:15:27,568 --> 00:15:33,399 + 신호 게이트 받아 10 또는 20 또는 가치가 경쟁 (10)에 의해 연기 + +231 +00:15:33,399 --> 00:15:37,309 + 다음은 정상에서 약간의 그라데이션을지고있어 무엇은 그 방사에 미치는 영향 + +232 +00:15:37,309 --> 00:15:41,549 + 이러한 경우 중 하나의 회로를 통해 배경이 가능한 것입니다 + +233 +00:15:41,549 --> 00:15:56,578 + 당신은 그라데이션이 매우이라고 말을하는지 그래서, 그래서이 경우 일부에서 문제 + +234 +00:15:56,578 --> 00:16:01,919 + 텍사스 마이너스 10 또는 10이 기본적으로 우리가 있습니다보고 기다릴 때 낮은 + +235 +00:16:01,919 --> 00:16:05,659 + 이 그라데이션 곱됩니다 여기에이 지역의 그라데이션이 + +236 +00:16:05,659 --> 00:16:09,838 + 당신은 당신이 할 수있는 네거티브 (10)에있을 때 현지 구배 X bydy DOMA를 defund + +237 +00:16:09,839 --> 00:16:14,370 + 그래디언트 기본적 제로임을 알 때문에이 때의 기울기 제로 + +238 +00:16:14,370 --> 00:16:18,339 + 그라데이션 참석도 제로 근처에있을 것입니다 그래서 문제는 당신이 읽고있는 것입니다 + +239 +00:16:18,339 --> 00:16:24,220 + 여기에서 드롭하지만 당신은에있어 경우는 그래서 기본적으로 0이 있었다 포화됩니다 + +240 +00:16:24,220 --> 00:16:26,930 + 그 다음 원 그라데이션 살해한다 + +241 +00:16:26,929 --> 00:16:31,258 + 난 그냥 아주 작은 수를 곱한됩니다 큰 정보를 통해 중지 + +242 +00:16:31,259 --> 00:16:36,480 + 당신의 대규모 네트워크가있는 경우의 서명을 통해 그들 그래서 당신은 상상할 수 + +243 +00:16:36,480 --> 00:16:39,800 + 시그 모이 신경 세포와 그들 중 많은 사람들이 하나있어 포화 정권에 + +244 +00:16:39,799 --> 00:16:43,269 + 그들이 될 것이기 때문에 0 또는 1 성분은 다시 네트워크를 통해 전파 할 수 없습니다 + +245 +00:16:43,269 --> 00:16:48,230 + 당신이 당신의 사무실 또는 
+246 +00:16:48,230 --> 00:16:51,740 + 그래디언트는 우리가 활성 영역이라고 부르는 안전한 구간에 있을 때만 흐릅니다 + +247 +00:16:51,740 --> 00:16:57,049 + 이것이 시그모이드의 첫 번째 문제이고, 곧 더 자세히 보겠습니다 + +248 +00:16:57,049 --> 00:17:03,289 + 시그모이드의 또 다른 문제는 출력이 0을 중심으로 있지 않다는(not zero-centered) 것입니다 + +249 +00:17:03,289 --> 00:17:07,078 + 전처리 이야기에서 곧 다루겠지만, 여러분은 항상 + +250 +00:17:07,078 --> 00:17:10,578 + 데이터가 0을 중심으로 정렬되어 있기를 원합니다. 그런데 이 경우 + +251 +00:17:10,578 --> 00:17:14,658 + 시그모이드 층을 여러 개 쌓은 큰 네트워크를 가정하면, 출력들이 + +252 +00:17:14,659 --> 00:17:19,659 + 0과 1 사이, 0.5를 중심으로 분포하게 됩니다. 우리는 사실상 + +253 +00:17:19,659 --> 00:17:22,260 + 선형 분류기를 겹겹이 쌓아 올리고 있는 셈인데 + +254 +00:17:22,259 --> 00:17:26,078 + 0 중심이 아닌 것이 왜 문제인지, 간단히 + +255 +00:17:26,078 --> 00:17:31,169 + 무엇이 잘못되는지에 대한 직관을 드리겠습니다 + +256 +00:17:31,170 --> 00:17:36,480 + 뉴런 하나를 보죠. 뉴런은 가중 합에 비선형성을 적용한 함수를 계산합니다 + +257 +00:17:36,480 --> 00:17:40,589 + 그런데 입력 x가 모두 양수라면, 역전파 때 + +258 +00:17:40,589 --> 00:17:45,559 + W의 그래디언트에 대해 무엇을 말할 수 있을까요? 모든 입력이 + +259 +00:17:45,559 --> 00:17:49,259 + 0과 1 사이의 양수이고, 여러분이 네트워크 깊숙한 어딘가에 있다면 + +260 +00:17:49,259 --> 00:17:54,539 + 모든 입력이 양수일 때 가중치의 그래디언트에 대해 무엇을 말할 수 있을까요 + +261 +00:17:54,539 --> 00:18:00,960 + 어떨까요 + +262 +00:18:00,960 --> 00:18:13,970 + 답은, W의 그래디언트가 전부 양수이거나 전부 음수로 제한된다는 것입니다 + +263 +00:18:13,970 --> 00:18:17,730 + 위에서 그래디언트가 흘러 들어오는데, 생각해 보면 + +264 +00:18:17,730 --> 00:18:22,700 + 모든 W에 대한 그래디언트 표현식은 기본적으로 x 곱하기 위에서 온 그래디언트이므로 + +265 +00:18:22,700 --> 00:18:28,440 + 뉴런 위에서 내려온 그래디언트가 양수이면 모든 W의 + +266 +00:18:28,440 --> 00:18:32,308 + 그래디언트가 양수가 되고, 반대의 경우도 마찬가지입니다. 그래서 + +267 +00:18:32,308 --> 00:18:35,710 + 가중치가 두 개 있다고 하면, 첫 번째 가중치와 + +268 +00:18:35,710 --> 00:18:40,788 + 두 번째 가중치에 대해 결국 일어나는 일은, 업데이트가 + +269 +00:18:40,788 --> 00:18:45,099 + 항상 둘 다 양수이거나 둘 다 음수인 방향으로만 진행된다는 것입니다 + +270 +00:18:45,099 --> 00:18:49,509 + 문제는 업데이트의 방향이 이렇게 제한되면 + +271 +00:18:49,509 --> 00:18:53,609 + 바람직하지 않은 지그재그 경로를 밟게 될 수 있다는 것입니다 + +272 +00:18:53,609 --> 00:18:57,808 + 이 허용된 영역 밖의 어떤 지점에 도달하고 싶다면요 + +273 +00:18:57,808 --> 00:19:02,058 + 여기서는 다소 손을 흔드는(hand-wavy) 설명이지만 직관은 전달됩니다. 실제로 + +274 +00:19:02,058 --> 00:19:04,769 + 경험적으로, 0 중심이 아닌 값들로 훈련하면 + +275 +00:19:04,769 --> 00:19:09,319 + 더 느린 수렴이 관찰됩니다. 이것이 그 이유에 대한 다소 느슨한 설명이고요 + +276 +00:19:09,319 --> 00:19:13,220 + 실제로 훨씬 더 깊이 들어가고 싶다면 + +277 +00:19:13,220 --> 00:19:15,919 + 이를 수학적으로 다루는 사람들이 있습니다 + +278 +00:19:15,919 --> 00:19:19,350 + 자연 그래디언트(natural gradient) 같은 수학적 정식화로 가면 이야기가 + +279 +00:19:19,349 --> 00:19:22,959 + 이보다 복잡해지지만, 저는 여기서 직관만 드리고 싶습니다 + +280 +00:19:22,960 --> 00:19:25,950 + 입력이 0 중심이어야 + +281 +00:19:25,950 --> 00:19:30,450 + 훈련이 잘 진행된다는 것이죠. 그래서 이것이 두 번째 단점이고 + +282 +00:19:30,450 --> 00:19:35,569 + 세 번째는, 이 식 안에 들어 있는 exp() 함수가 + +283 +00:19:35,569 --> 00:19:39,099 + 다른 대안들에 비해 계산 비용이 다소 크다는 것입니다 + +284 +00:19:39,099 --> 00:19:45,199 + 이건 작은 세부 사항입니다. 실제로 + +285 +00:19:45,200 --> 00:19:48,028 + 이런 큰 합성곱 신경망을 훈련할 때 대부분의 시간은 + +286 +00:19:48,028 --> 00:19:53,148 + 합성곱과 내적 연산에 쓰이지 exp에 쓰이지 않으므로 + +287 +00:19:53,148 --> 00:19:55,509 + 기여도는 작지만, 그래도 다른 부분에 비하면 + +288 +00:19:55,509 --> 00:20:00,710 + 조금은 단점입니다. 그러면 질문을 하나 드리죠 + +289 +00:20:00,710 --> 00:20:04,230 + 특히 이런 문제들을 해결하려는 시도들이 있었는데 + +290 +00:20:04,230 --> 00:20:11,440 + 90년대, 1991년경에 얀 르쿤(Yann LeCun)이 + +291 +00:20:11,440 --> 00:20:13,450 + 네트워크 최적화 방법에 관한 아주 좋은 논문을 썼고 + +292 +00:20:13,450 --> 00:20:18,700 + 강의 자료에 링크해 두었습니다. 그 논문에서 권장된 것이 tanh인데
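위에서 설명한 포화 영역에서의 로컬 그래디언트는 간단한 numpy 스케치로 직접 확인해 볼 수 있습니다. 함수 이름과 입력 값은 설명을 위한 예시입니다.

~~~python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # 로컬 그래디언트: dσ/dx = σ(x) * (1 - σ(x))
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, 0.0, 10.0]:
    print(x, sigmoid(x), sigmoid_grad(x))
# x = ±10 에서 로컬 그래디언트가 사실상 0이 되어,
# 체인 룰로 곱해지는 상류 그래디언트를 "죽이게" 된다
~~~

+293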
+00:20:18,700 --> 00:20:22,350 + 기본적으로 영향을위한 단계의 두 세그먼트하지만 함께 같은 종류의 + +294 +00:20:22,349 --> 00:20:28,219 + 당신은 당신이 (40)과 함께있어 음 하나 하나 너무 사이에있는와 끝까지 + +295 +00:20:28,220 --> 00:20:32,139 + 중심하지만 그렇지 않으면 같은 다른 문제에서 무언가까지 여전히이 + +296 +00:20:32,138 --> 00:20:36,240 + 예를 들어, 당신은 당신이 포화받을 경우이 지역 그라디언트 흐름 없음을 + +297 +00:20:36,240 --> 00:20:41,829 + 그래서 우리가 정말이 시점에서 그 문제를 해결하지 않은하지만 너무 많은 그냥 생각 + +298 +00:20:41,829 --> 00:20:51,259 + 엄밀하게는 (10)을 제외한 모든 같은 문제가 있기 때문에 S 자형 선호 + +299 +00:20:51,259 --> 00:20:57,970 + 계속 어쩌면 우리가 신문에 그렇게 2012 주위에 더 많은 질문을 할 수 있습니다 + +300 +00:20:57,970 --> 00:21:01,038 + 오스카 제시카 이것은 우리가 제안하는 최초의 상용 네트워크 종이입니다 + +301 +00:21:01,038 --> 00:21:05,240 + 실제로 우리는 발견 당신이 맥 시스이란 X를 사용하여이 비선형 + +302 +00:21:05,240 --> 00:21:07,339 + 대신에 S 자형 또는 10 각의 + +303 +00:21:07,339 --> 00:21:10,849 + 단지 훨씬 빠르고 자신의 실험을 거의에서 확인 네트워크 변환을 + +304 +00:21:10,849 --> 00:21:17,699 + 제 6의 높이와 우리가 돌​​아가서 이유에 대해 생각하는 시도 할 수 있습니다이 무엇 인 + +305 +00:21:17,700 --> 00:21:20,450 + 종류의 당신이 실제로 잘 작동하는지 볼 수있는 것처럼 그것으로 읽고 있지만, + +306 +00:21:20,450 --> 00:21:25,580 + 항상 쉬운 몇 가지 이유를 듣고 있지 않는 것을 설명하는 것은 잠시 동안 기대 + +307 +00:21:25,579 --> 00:21:30,908 + 사람들은 그래서 한 가지이이 있다는 것입니다이 훨씬 더 잘 작동하는지 생각 + +308 +00:21:30,909 --> 00:21:35,570 + 자신의 역할과 적어도 그렇게 성소하지 적어도 긍정적 인 영역 + +309 +00:21:35,569 --> 00:21:38,859 + 여기서,이 지역에서 당신은 스페인 성분 문제가없는 당신의 + +310 +00:21:38,859 --> 00:21:42,019 + 광택은 종류의 사망하고이 문제가 어디 뉴런이 + +311 +00:21:42,019 --> 00:21:47,028 + 양측하지만 이들로부터 경계하는 작은 영역 만 활성화됩니다 + +312 +00:21:47,028 --> 00:21:50,519 + 뒷면의 의미에서 실제로 활성 뉴런이 제대로 여부를 전파 + +313 +00:21:50,519 --> 00:21:55,419 + 올바르게 그러나 적어도 그들은 그들의 지역의 80온스 절반 이상을 좋아하지 않는다 + +314 +00:21:55,419 --> 00:22:00,730 + 그들이있어 훨씬 더 그냥 잡고있어 효율적으로 계산하고 + +315 +00:22:00,730 --> 00:22:04,919 + 실험 당신은이 숫자를 너무 많이 더 빨리 그래서이를 볼 수 있습니다 + +316 +00:22:04,919 --> 00:22:08,929 + 근처 누린다 당신의 장치에있는 파일에 대한 호출은이 논문에서 지적했다 + +317 +00:22:08,929 --> 00:22:12,000 + 이 훨씬 더 잘 작동이이 같은 종류의 것을 처음으로 + +318 +00:22:12,000 --> 00:22:15,429 + 자세한 권장 사항은 무엇가 동시에이 시점에서 사용한다 + +319 +00:22:15,429 --> 00:22:18,990 + 이 판결이란 그래서 한 가지 몇 가지 문제가 다시 그것의 것을 알 수 있습니다 + +320 +00:22:18,990 --> 00:22:23,778 + 하지 제로는 그렇게 완전히 아마도 이상과 약간 아니다 업을 중심으로 + +321 +00:22:23,778 --> 00:22:26,130 + 집권이란의 성가심 + +322 +00:22:26,130 --> 00:22:31,120 + 이 때 발생하는 무엇에 대해 우리는 그것에 대해 이야기하고 생각할 수 있음 + +323 +00:22:31,119 --> 00:22:37,009 + 정말 전파 진행 상황에 대해 더 (10)는이란이 될하지 않는 경우 + +324 +00:22:37,009 --> 00:22:43,269 + 예측에 적극적으로 그들이 그것을 무엇을 활성 천둥 배경에 남아 + +325 +00:22:43,269 --> 00:22:47,289 + 물론이 있다는 것입니다의 그라데이션 등 방법을 죽이는 권리를 확인하기 죽이기 + +326 +00:22:47,289 --> 00:22:51,609 + 때 너무 부정적인 읽을 경우 동일한 그림이 현지보다 10 말과 + +327 +00:22:51,609 --> 00:22:55,119 + 이 때문에 더 그냥 제로 그라데이션이 그냥 0이됩니다 여기에 그라데이션 + +328 +00:22:55,119 --> 00:22:58,589 + 동일 그것은 단지 당신이 그것을 죽일 실제로 당신을 저하 뭉개 버려 아니에요 + +329 +00:22:58,589 --> 00:23:01,689 + 완전히 그렇게 작동하지 않는 사람은 그 전파되지 않습니다 + +330 +00:23:01,690 --> 00:23:06,039 + 아래의 무게가 업데이트 아무것도에서 그 아래에 발생하지 않습니다 + +331 +00:23:06,039 --> 00:23:13,970 + 기여와 전술에 대한 최소한 10입니다 지역 그라데이션이었다 + +332 +00:23:13,970 --> 00:23:19,940 + 하나는, 그래서 그냥 그라데이션을 통해 단지 게이트를 통과 할 경우 경우 경우 경우의 + +333 +00:23:19,940 --> 00:23:24,820 + 그 밖의 자산은 긍정적하고 그냥 통과 그렇지 통해 읽기 + +334 +00:23:24,819 --> 00:23:30,250 + 그것은 지금까지 좋은 게임 같은 종류의를 죽이고 그런데 무슨 일이 때 발생 + +335 +00:23:30,250 --> 00:23:38,569 + 실제로 그건 정의되지 않은 것 그 시점에서 당신의 그라데이션을 손쉽게 텍사스 0 + +336 +00:23:38,569 --> 00:23:42,169 + 오른쪽 녹색은 우리가 단지 내가 할 때마다 이야기 그 시점에서 존재하지 않는 + +337 +00:23:42,170 --> 00:23:45,789 + 그라데이션에 대한 이야기​​는 내가 항상 일부 기울기를 의미하는 것으로 가정 + +338 +00:23:45,789 --> 00:23:49,119 + 때때로에 미분없는 그라데이션이 기능의 일반화 + +339 
+00:23:49,119 --> 00:23:52,250 + 존재하지 않는 한계를들을 수 있지만, 일부 그라디언트의 전체 무리가있다 + +340 +00:23:52,250 --> 00:23:58,609 + 즉, 0 또는 1이 될 수 있고 그래서 우리가 연습이 보통 사용하는 무엇을 + +341 +00:23:58,609 --> 00:24:02,119 + 차이는 너무 많은 정말 중요하지 않습니다하지만 난에서 남쪽에 대해 이야기를하고 싶어 + +342 +00:24:02,119 --> 00:24:06,539 + 무슨 미라 케이트 X & Y 누군가에 의해의 경우는 질문을 + +343 +00:24:06,539 --> 00:24:12,629 + X & Y가 동일한 경우 그 경우 당신은 또한 기능에 꼬임을 가질 수 있으며, + +344 +00:24:12,630 --> 00:24:15,550 + 그들 취약하게하지만 실제로 이런 일들은 정말 상관 없어 + +345 +00:24:15,549 --> 00:24:20,329 + 하나를 선택 그래서 당신이 2011 년에 큰를 가질 수 있고, 일이 잘 작동하고 + +346 +00:24:20,329 --> 00:24:23,490 + 이 당신이 권리를 결국 매우 않을 경우이기 때문에 그 약이다 + +347 +00:24:23,490 --> 00:24:24,710 + 그곳에 + +348 +00:24:24,710 --> 00:24:28,519 + 확인 그래서 RELO의 문제는 대략 여기에 실제로 발생하는 문제입니다 그 + +349 +00:24:28,519 --> 00:24:32,799 + 당신은 당신이 인의 알고 있어야 이스라엘 단위와 한 가지로 시도 + +350 +00:24:32,799 --> 00:24:37,629 + 그들은 아무것도 넣지 않는 경우, 어떤 좋은 치과를 얻을하지 않습니다 이러한 뉴런 + +351 +00:24:37,630 --> 00:24:38,290 + 죽여 + +352 +00:24:38,289 --> 00:24:48,049 + 업데이트 등 문제가있을 것으로 예상되는 무슨 뭔가 일이 + +353 +00:24:48,049 --> 00:24:51,059 + 당신이있어 초기화 할 때 정말 당신이 비에를 초기화 할 수 있습니다 뉴런 + +354 +00:24:51,059 --> 00:24:57,000 + 아니 아주 운이 방법은 어떤 생각되는 일이 끝이 당신의 가이드 + +355 +00:24:57,000 --> 00:25:02,009 + 당신의 엘레 노어의 입력의 구름은 우리가 부르는 끝낼 수있는 네 소유 + +356 +00:25:02,009 --> 00:25:06,650 + 이 신경 세포는 지역의 활성화 경우 등 죽은 상대 죽은 벨소리 + +357 +00:25:06,650 --> 00:25:12,550 + 이 침대 트레일러에서 데이터 클라우드의 외부가 활성화 될하지 않습니다 및 + +358 +00:25:12,549 --> 00:25:15,889 + 다음은 업데이트하지 않습니다 때문에이 중 두 가지 방법 중 하나를 발생할 수 있습니다 + +359 +00:25:15,890 --> 00:25:19,090 + 초기화하는 동안 당신은 정말 운이 있었고, 당신은 샘플 일 + +360 +00:25:19,089 --> 00:25:22,959 + 그 신경이 켜지지 않을 것입니다 그런 방법으로 자신의 그녀의 역할을 기다립니다 + +361 +00:25:22,960 --> 00:25:27,549 + 이란이 경우 비가되지 않지만 더 자주 발생하는 중입니다 + +362 +00:25:27,549 --> 00:25:31,769 + 당신이 속도를 학습하는 경우 훈련은 이러한 뉴런 요청에 대해 생각 높다 + +363 +00:25:31,769 --> 00:25:35,339 + 우리는 기회가 종종 발생할 수 있습니다 주위에 그리고 그들은 단지 떨어져 기절있어 + +364 +00:25:35,339 --> 00:25:39,669 + 데이터 매니 폴드와 그 다음 일어날 때 그들은 다시 활성화하지 않고 얻을 않을 것 + +365 +00:25:39,670 --> 00:25:43,310 + 그들은 데이터 매니 폴드에 다시 오지 않을 당신은 거기에 볼 수 있습니다 + +366 +00:25:43,309 --> 00:25:48,039 + 실제로 때때로 당신이 대표단에 큰 신경 그물을 훈련 할 수처럼 연습 + +367 +00:25:48,039 --> 00:25:51,740 + 당신은 그것을 시도하고 그것을 잘 작동하는 것 같다 다음 당신이를 중지 할 것 + +368 +00:25:51,740 --> 00:25:54,279 + 교육 당신은 당신의 네트워크를 통해 전체 훈련 데이터 세트를 전달 + +369 +00:25:54,279 --> 00:25:59,460 + 당신은 모든 단일 신경 세포의 통계를 보면 어떤 일어날 수있는 것은 + +370 +00:25:59,460 --> 00:26:02,620 + 네트워크의만큼 10 등, 20 %가 죽었다는 + +371 +00:26:02,619 --> 00:26:06,319 + 그에 디자이너는 훈련 데이터의 어떤이에 대해 설정되지 않습니다 + +372 +00:26:06,319 --> 00:26:09,929 + 실제로 당신이 비율이 높았다 배우고 있기 때문이다 일반적으로 일어날 수 + +373 +00:26:09,930 --> 00:26:14,250 + 그래서 사람들은 네트워크의 죽은 부분처럼 당신은 파타키를 호출 할 수 있습니다 + +374 +00:26:14,250 --> 00:26:16,299 + 실제 등등 이러한 것들과 국유화에 대한 계획 + +375 +00:26:16,299 --> 00:26:21,569 + 사람들은 일반적으로 많이하지 않습니다하지만주의해야 할 뭔가 그리고 그것은이다 + +376 +00:26:21,569 --> 00:26:26,929 + 이 비선형으로 그래서 특히 때문에 초기화 문제 + +377 +00:26:26,930 --> 00:26:30,840 + 사람이 죽은 진짜 문제는 일반적으로 버스를 초기화하면된다 좋아 (10) + +378 +00:26:30,839 --> 00:26:35,289 + 이 생활에있는 사람들이 그 때문에 렉시 0101 약간 양수이었다 대신 + +379 +00:26:35,289 --> 00:26:40,389 + 수 그것은 가능성이 있음을 초기화 이러한 로마 숫자와 + +380 +00:26:40,390 --> 00:26:44,170 + 그것은 덜 것을 수 있도록 이전 업데이트를 얻을 것이다 신경 단지 않습니다 + +381 +00:26:44,170 --> 00:26:48,190 + 교육을 통해 지금까지 활성화 될하지만 난 생각 실제로하지 않습니다 + +382 +00:26:48,190 --> 00:26:51,350 + 이 가능성이 논쟁 점을 가지고있다 어떤 사람들은 주장 + +383 +00:26:51,349 --> 00:26:54,849 + 어떤 사람들은 실제로 모든 그래서 그냥 도움이되지 않는 말 섹시 도움 + +384 +00:26:54,849 --> 00:27:02,089 + 뭔가 우리가 들어갈 예정이 시점에서 질문에 대해 생각하는 + +385 +00:27:02,089 --> 00:27:08,839 + 
다른 지금의 사람들이 느슨한를 해결하기 위해 노력 같은 것들을 살펴 보자 확인 원 + +386 +00:27:08,839 --> 00:27:13,058 + 이러한 죽은 신경 세포가 아니기 때문에 그래서 친척 하나의 문제는, 그래서 여기에 하나를 이상적 + +387 +00:27:13,058 --> 00:27:18,349 + 누설 비와 정말 누출의 아이디어라고 제안 + +388 +00:27:18,349 --> 00:27:22,399 + 기본적으로 우리는이 꼬임을 원하고 우리는이 평화 마침내 RT 원하는 우리가 원하는 + +389 +00:27:22,400 --> 00:27:29,070 + 하지만 문제의 자족이 지역이 당신의 꿈이 너무 죽을 것입니다 + +390 +00:27:29,069 --> 00:27:32,379 + 대신의 약간 부정적으로 여기거나 약간 경 사진이를 만들어 보자 + +391 +00:27:32,380 --> 00:27:36,409 + 긍정적으로 나는이 지역에서 가정 경사 등이와 끝까지 + +392 +00:27:36,409 --> 00:27:41,260 + 기능과 그가 새는 그래서 어떤 사람들은 사람들이 있음을 보여주고있다라는 것 + +393 +00:27:41,259 --> 00:27:45,519 + 이것은 당신이 죽어가는 신경이 문제가 있지만하지 않는 약간 더 나은 작품 + +394 +00:27:45,519 --> 00:27:51,730 + 완전히이 항상 더 나은 다음 작동 설정되지 않은 생각 + +395 +00:27:51,730 --> 00:27:54,870 + 이 가지고 노는 어떤 사람들은 더욱 더 지금이 당신의 아파트 101 만 + +396 +00:27:54,869 --> 00:27:57,439 + 그 사실은 임의의 매개 변수가 될 수 있으며, 당신은 뭔가를 얻을 + +397 +00:27:57,440 --> 00:28:01,058 + 그는 파라 메트릭 정류기 또는 사람과 여기에 기본적으로 아이디어라고 + +398 +00:28:01,058 --> 00:28:07,519 + 이 네트워크의 매개 변수를 101 인이 할 수있는 소개한다 + +399 +00:28:07,519 --> 00:28:10,808 + 배울 당신은 그것으로 얻을 백업 할 수 있습니다 그래서 이러한 뉴런은 기본적으로 수 + +400 +00:28:10,808 --> 00:28:15,609 + 확인 자신의 고유 영역이 무엇을 사면 선택하고 그래서 그들은 될 수 있습니다 + +401 +00:28:15,609 --> 00:28:21,250 + 그들이 원하는 또는 그들이 누출 될 수 있습니다 또는 그들이 가질 수 있습니다 관련성이없는 경우 + +402 +00:28:21,250 --> 00:28:25,798 + 선택은 거의 모든 신경이 사람들이 노는 물건의 종류 + +403 +00:28:25,798 --> 00:28:40,950 + 그들은 단지 매우 일반적인 방법으로 너무 좋은 일을 설계하려 할 때 + +404 +00:28:40,950 --> 00:28:44,200 + 그것은 그것의 자신의 그 것처럼 경쟁은 모든 신경 세포가있을 것이다 나가서 + +405 +00:28:44,200 --> 00:28:46,659 + 바이어스 + +406 +00:28:46,659 --> 00:28:48,490 + 진행 + +407 +00:28:48,490 --> 00:29:00,370 + 나는거야 그건 아마 그래서 한 다음 ID를받을거야 발견 + +408 +00:29:00,369 --> 00:29:03,779 + 전파는 의미에서 원하는 무언가 그 정체성 아니었다면 그 + +409 +00:29:03,779 --> 00:29:06,819 + 당신이 아기를 예상 할 수 있도록 그 아주 경쟁에게 유용한이어야한다 + +410 +00:29:06,819 --> 00:29:09,939 + 다시 전파가 실제로 공간이 그 지역에 당신을 얻을 안되며, + +411 +00:29:09,940 --> 00:29:13,720 + 내가 제대로이 기억한다면 어쩌면 아마도 내가 실제로 생각하지 않습니다 + +412 +00:29:13,720 --> 00:29:17,069 + 특별한 일없는 곳에 너무 많이하지만 내가 할 수있는 것을 정말 걱정 명 + +413 +00:29:17,069 --> 00:29:20,529 + 잘못된 앞으로있을 이제 얼마 전 신문을 읽고 나는이 너무 많이 사용하지 않는 + +414 +00:29:20,529 --> 00:29:27,160 + 작동하고 그래서 하나의 문제는 여전히 우리가 그것을보고 이러한 서로 다른 방식이있다 + +415 +00:29:27,160 --> 00:29:30,759 + 이란의 난간 침대를 고정하는 경우에만 온 다른 사람들이있다 + +416 +00:29:30,759 --> 00:29:34,730 + 약 두 달 전, 예를 들면 밖으로 그래서 이것은 당신에게 방법의 감각을 준다 + +417 +00:29:34,730 --> 00:29:38,210 + 논문하려고 두 달 전에 새로운이 필드는이 나오고있다 + +418 +00:29:38,210 --> 00:29:42,850 + 하여 단위 중 하나를 지수로하는 새로운 기능을 활성화가 제안 + +419 +00:29:42,849 --> 00:29:46,799 + 이 모든이 시도에 당신에게 사람들이 즐기는 게임에 대한 아이디어를 제공 + +420 +00:29:46,799 --> 00:29:50,869 + 힘 입어 relew의 장점 인 비 - 제로의 이러한 단점을 없애 + +421 +00:29:50,869 --> 00:29:54,909 + 중심 그래서 그들은처럼 보이는 여기 푸른 함수와 끝까지 + +422 +00:29:54,910 --> 00:29:58,390 + 진짜 문제 만은 음의 영역에 그냥 제로에 갈하거나하지 않습니다 않습니다 + +423 +00:29:58,390 --> 00:30:02,700 + 누수로 내려가하지만이 재미있는 모양을 가지고 있으며, 수학의 두 페이지에 있습니다 + +424 +00:30:02,700 --> 00:30:03,480 + 종이 + +425 +00:30:03,480 --> 00:30:08,509 + 부분적으로 정당화 당신은 원하는 이유는 대략 당신과 함께이 말을 할 때 + +426 +00:30:08,509 --> 00:30:12,829 + 제로 평균 아울렛 그리고 그들은 그 균주 더 나은 주장 내가 있다고 생각 + +427 +00:30:12,829 --> 00:30:17,889 + 어떤이에 대한 논쟁과 우리는 기본적으로이 모든 것을 이해하려고 노력하고 있습니다 + +428 +00:30:17,890 --> 00:30:18,309 + 아웃 + +429 +00:30:18,308 --> 00:30:21,849 + 활성 연구의 영역과 우리가 오히려 아직 어떻게해야할지 확실하지 않다하지만 권리 + +430 +00:30:21,849 --> 00:30:26,719 + 당신이 조심 있다면 당신은 그 그래서 경우 ​​지금 안전 권고처럼 + +431 +00:30:26,720 --> 00:30:31,259 + 느슨한 하나 상대적이기 때문에 내가 언급을 주목하고 싶습니다 더 + +432 +00:30:31,259 --> 00:30:35,319 + 
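여기까지 소개된 시그모이드, tanh, ReLU 계열 활성화 함수들을 numpy로 정리하면 대략 다음과 같습니다. 기울기 a의 기본값 등은 예시로 가정한 것입니다.

~~~python
import numpy as np

# 강의에서 다룬 활성화 함수들의 정의 (a 값은 예시)
def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))   # 0~1로 포화, 0 중심 아님
def tanh(x):               return np.tanh(x)                  # 0 중심, 여전히 포화함
def relu(x):               return np.maximum(0, x)            # max(0, x)
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)   # 음수 영역에 작은 기울기
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1))

x = np.linspace(-5, 5, 11)
print(relu(x))
~~~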
일반적으로 당신이 작동에 대한 것은 자신의 출력이 최대입니다 읽으면 당신은 그것을 볼에 + +433 +00:30:35,319 --> 00:30:42,308 + 호텔에서 기본적으로는이란에서 매우 다른있어 그것을 그냥 아니에요 + +434 +00:30:42,308 --> 00:30:44,000 + 다른 보이는 활성화 함수 + +435 +00:30:44,000 --> 00:30:47,789 + 실제로 계산해 그냥이 양식이없는 방법이란 컴퓨터 내에서 변경 + +436 +00:30:47,789 --> 00:30:54,629 + WX의 실제로 두 개의 무게가 다음 W 난다는 X 박스를 바꾸어 계산 + +437 +00:30:54,630 --> 00:30:58,970 + 이러한 장소를 하이킹을 좋아하여 WSYX의 또 다른 세트가 끝을 위로해야 할 그 + +438 +00:30:58,970 --> 00:31:01,440 + 당신은을 통해 최대를 가지고 그건 컴퓨터 근처에 무엇을 + +439 +00:31:01,440 --> 00:31:04,298 + 이러한 활성 기능을 가지고 노는 여러 가지 방법이 있다는 것을 볼 수있다 + +440 +00:31:04,298 --> 00:31:09,339 + 그래서 이것은 이것의 단점 중 일부는 죽고 싶어해야하고하지 않습니다 + +441 +00:31:09,339 --> 00:31:13,128 + 여전히 구분 선형은 여전히​​ 효율적인하지만 하나 하나 신경 세포의 + +442 +00:31:13,128 --> 00:31:16,839 + 이 가중치를 가지고 있으며, 그래서 당신은 가지 매개 변수 초연의 수를 두 배로 + +443 +00:31:16,839 --> 00:31:21,689 + 그래서 아마 그 이상적인 아니에요에 그래서 어떤 사람들은 이것을 사용하지만 난 그게 생각 + +444 +00:31:21,690 --> 00:31:45,130 + 그것은 내가 도로가 여전히 가장 일반적인 것을 말할 것입니다 슈퍼 공통 아니다 + +445 +00:31:45,130 --> 00:31:57,870 + 그 바람에 상이 할 수 있고, 그래서 당신은 다른 가중치를 종료합니다 + +446 +00:31:57,869 --> 00:32:11,009 + 확실히 복잡 복잡 + +447 +00:32:11,009 --> 00:32:15,799 + 최적화 프로세스의 많은 단지 손실 함수에 대한되지 않지만 + +448 +00:32:15,799 --> 00:32:19,000 + 다만 채소의 역류의 역학에 대해 좋아하고 우리는 표시됩니다 + +449 +00:32:19,000 --> 00:32:22,250 + 다음 주에 그것에 대해 조금 당신이 정말로 그것에 대해 생각할 필요가있다 + +450 +00:32:22,250 --> 00:32:27,420 + 단지 잃어버린 풍경보다 동적으로 더 그렇게 너무 야 어떻게 + +451 +00:32:27,420 --> 00:32:32,410 + 복잡하고 또한 특별히 확률 그라데이션 하강하고있다 + +452 +00:32:32,410 --> 00:32:36,340 + 일부 자유가 멋지게 연주 좋네요 특정 형태와 뭔가 splaine + +453 +00:32:36,339 --> 00:32:41,039 + 최적화 업데이트를 연결되어 있다는 사실뿐만 아니라 모든이에 연결되어 + +454 +00:32:41,039 --> 00:32:45,519 + 모두가 함께 상호 작용의 유형 및 이러한 활성 기능의 선택과 같은 + +455 +00:32:45,519 --> 00:32:49,619 + 및 업데이트의 선택은 종류의 결합이 때 매우 불분명하고 있습니다 + +456 +00:32:49,619 --> 00:32:59,649 + 그들이 여기있는 동안이다, 그래서 당신은 실제로 복잡한 생각의 종류를 최적화 + +457 +00:32:59,650 --> 00:33:03,620 + 이 녀석을 시도 할 수 있습니다 당신은 사람이 그렇게하지 ​​너무 많은 기대한다 시도 할 수 + +458 +00:33:03,619 --> 00:33:06,669 + 사람들이 너무 많은 지금 사용하고 무시하지 않는 생각 기본적으로 인해 + +459 +00:33:06,670 --> 00:33:11,130 + 열 난 그냥 엄격하게 더 나은 당신은 사람들이 이제 더 이상 음성을 사용하여 볼 수 없습니다 + +460 +00:33:11,130 --> 00:33:14,350 + 물론 우리는 긴 단기 기억 단위 팔레스타인 등을 사용 + +461 +00:33:14,349 --> 00:33:17,129 + 누군가가 재발 신경망하지만 자신의 비트에 해당로 이동합니다 + +462 +00:33:17,130 --> 00:33:22,500 + 우리가 그들을 사용하고 수업 시간에 나중에 볼 이​​유 구체적인 이유와 + +463 +00:33:22,500 --> 00:33:26,700 + 그들은 우리가 같이 멀리에 너무 커버 한 내용과 다르게 사용하는 것입니다 + +464 +00:33:26,700 --> 00:33:32,670 + 그냥 완전히 연결 샌드위치 메이커 파티를 곱 누군가 단지를 갖는 + +465 +00:33:32,670 --> 00:33:35,720 + 기본 신경망 확인 그게 내가 말하고 싶어 모든 그래서하지만 + +466 +00:33:35,720 --> 00:33:39,410 + 활성화 기능은 기본적으로 이것을 우리가 걱정 주요 기능을했다 + +467 +00:33:39,410 --> 00:33:42,990 + 그것에 대해 본 연구에 대해 우리는 완전하게 파악하고있다하지 않은 + +468 +00:33:42,990 --> 00:33:46,640 + 그들 중 많은 몇 가지 장점과 단점 및 방법 그라데이션에 대해 생각에 내려와 + +469 +00:33:46,640 --> 00:33:50,690 + 네트워크를 통해 흐르는 아직 죽은 친척과 같은 이러한 문제를 논의 + +470 +00:33:50,690 --> 00:33:54,808 + 당신이 당신의 네트워크를 디버깅하려고하면 정말 그라데이션 흐름에 대해 알아야 할 사항 + +471 +00:33:54,808 --> 00:33:59,428 + 그리고는 가격에하자 모습에 무슨 일이 일어나고 있는지 이해하기 + +472 +00:33:59,429 --> 00:34:03,710 + 처리 매우 간단하므로 + +473 +00:34:03,710 --> 00:34:07,440 + 처리는 매우 간단 일반적으로 그냥 구름이 있다고 가정 + +474 +00:34:07,440 --> 00:34:11,829 + 원래의 데이터와 여기에 두 가지 차원 (20) 센터 데이터 있도록 매우 일반적인 + +475 +00:34:11,829 --> 00:34:15,230 + 그냥 하나 하나 그림이었다 함께 평균 사람들을 추적 할 수 있음을 의미합니다 + +476 +00:34:15,230 --> 00:34:18,889 + 당신은 기계 학습 문학을 통해 갈 때 때때로 시도 + +477 +00:34:18,889 --> 00:34:22,720 + 표준이 말을 당신이 정상화 모든 단일 차원 있도록 데이터를 정상화 + +478 +00:34:22,719 --> 00:34:23,759 + 일탈 + +479 
+00:34:23,760 --> 00:34:28,990 + 표준편차로 나누어 정규화해서 값들이 일정한 최소/최대 범위 안에 들어오게 할 수도 있습니다 + +480 +00:34:28,989 --> 00:34:33,098 + 하지만 이미지에서는 이런 것이 흔하지 않습니다. 왜냐하면 + +481 +00:34:33,099 --> 00:34:35,760 + 서로 다른 단위를 가질 수 있는 다양한 피처들을 따로 다룰 필요가 없고 + +482 +00:34:35,760 --> 00:34:39,619 + 모든 것이 픽셀이라 이미 0에서 255 사이로 범위가 정해져 있기 때문입니다 + +483 +00:34:39,619 --> 00:34:43,970 + 그래서 데이터 정규화는 흔하지 않지만 제로 센터링, 즉 평균을 빼는 것은 매우 일반적입니다 + +484 +00:34:43,969 --> 00:34:44,719 + 더 나아가서 + +485 +00:34:44,719 --> 00:34:48,730 + 일반적인 머신러닝에서는 학습 데이터에 공분산 구조가 있으므로 + +486 +00:34:48,730 --> 00:34:52,079 + 그 공분산 행렬이 + +487 +00:34:52,079 --> 00:34:55,740 + 대각 행렬이 되도록 예를 들어 PCA를 적용할 수도 있고, 더 나아가 + +488 +00:34:55,739 --> 00:35:00,309 + 데이터를 백색화(whitening)할 수도 있습니다. 백색화란 PCA 이후에 + +489 +00:35:00,309 --> 00:35:05,159 + 공분산 행렬이 단위 행렬이 되도록 분산까지 눌러 버리는 것입니다 + +490 +00:35:05,159 --> 00:35:08,699 + 이것이 사람들이 이야기하는 전처리의 또 다른 형태입니다 + +491 +00:35:08,699 --> 00:35:14,480 + 이 내용들은 수업 노트에서 더 자세히 볼 수 있는데, 여기서는 + +492 +00:35:14,480 --> 00:35:17,500 + 너무 깊이 들어가지 않겠습니다. 이미지에서는 + +493 +00:35:17,500 --> 00:35:20,960 + 머신러닝에서 흔한 이런 기법들을 실제로 잘 쓰지 않기 때문입니다 + +494 +00:35:20,960 --> 00:35:25,659 + 이미지에서 구체적으로 일반적인 것은 평균을 중심으로 맞추는 것뿐이고 + +495 +00:35:25,659 --> 00:35:28,519 + 실전에서는 그 센터링의 특정 변형이 조금 더 편리합니다 + +496 +00:35:28,519 --> 00:35:34,780 + 예를 들어 32 x 32 x 3 크기의 이미지들이 있다고 하면, 평균 중심화란 + +497 +00:35:34,780 --> 00:35:38,869 + 모든 픽셀 각각에 대해 훈련 데이터 전체에서 평균을 계산해 데이터를 센터링하는 것입니다 + +498 +00:35:38,869 --> 00:35:43,318 + 그렇게 하면 기본적으로 '평균 이미지'라는 것을 얻게 되는데 + +499 +00:35:43,318 --> 00:35:47,219 + 그것은 32 x 32 x 3 배열입니다. 예를 들어 평균 이미지를 떠올려 보면 + +500 +00:35:47,219 --> 00:35:51,409 + 모든 이미지에서 그 평균 이미지를 빼서 데이터를 센터링하면 + +501 +00:35:51,409 --> 00:35:56,000 + 훈련 역학(training dynamics)이 더 좋아집니다. 또 다른 형태도 있는데 + +502 +00:35:56,000 --> 00:36:00,818 + 약간 더 편리한 채널별(per-channel) 평균입니다. 즉 + +503 +00:36:00,818 --> 00:36:05,639 + 모든 공간 위치에 걸쳐 빨강, 초록, 파랑 채널 각각의 평균만 계산해서 + +504 +00:36:05,639 --> 00:36:07,289 + 기본적으로 숫자 세 개만 얻은 다음 + +505 +00:36:07,289 --> 00:36:11,029 + 빨강, 초록, 파랑 채널에서 그 값을 빼는 것입니다. 실전에서 일부 네트워크는 + +506 +00:36:11,030 --> 00:36:15,250 + 그 방식을 사용합니다. 이 둘이 일반적인 두 가지 방식이고, 채널별 평균이 더 편리한 이유는 + +507 +00:36:15,250 --> 00:36:17,519 + 신경 써야 할 숫자가 세 개뿐이라 + +508 +00:36:17,519 --> 00:36:20,670 + 여기저기 들고 다녀야 하는 거대한 평균 배열을 걱정할 필요가 없기 때문입니다 + +509 +00:36:20,670 --> 00:36:26,430 + 코드 어디에 넣을지도 간단해지죠. 전처리에 대해 말하고 싶은 것은 이 정도입니다 + +510 +00:36:26,429 --> 00:36:30,649 + 컴퓨터 비전 응용에서는 기본적으로 평균 빼기 정도만 하고 + +511 +00:36:30,650 --> 00:36:35,039 + 이보다 훨씬 복잡하게 만들지는 않습니다 + +512 +00:36:35,039 --> 00:36:38,860 + 일반적인 머신러닝 기법이 이미지에 잘 적용되지 않는 이유 하나는, 이미지가 + +513 +00:36:38,860 --> 00:36:43,559 + 픽셀이 아주 많은 매우 고차원 객체라서 공분산 행렬 같은 것이 + +514 +00:36:43,559 --> 00:36:47,789 + 거대해지기 때문입니다. 예전에는 사람들이 국소 백색화(local whitening) 같은 것을 시도해서 + +515 +00:36:47,789 --> 00:36:53,179 + 이미지 위로 백색화 필터를 슬라이딩하며 적용하기도 했지만 + +516 +00:36:53,179 --> 00:36:56,389 + 그것은 몇 년 전 이야기이고, 지금은 흔하지도 않고 크게 중요하지도 않은 것 같습니다
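위에서 설명한 두 가지 평균 빼기 방식(평균 이미지, 채널별 평균)은 numpy로 다음과 같이 스케치해 볼 수 있습니다. 배열 크기와 변수 이름은 설명을 위한 가정입니다.

~~~python
import numpy as np

# X: (N, 32, 32, 3) 형태의 학습 이미지라고 가정한 스케치
X = np.random.randint(0, 256, (50, 32, 32, 3)).astype(np.float64)

# 방법 1: 평균 이미지 빼기 (32x32x3 배열 하나를 기억해 두어야 함)
mean_image = X.mean(axis=0)
X_centered = X - mean_image

# 방법 2: 채널별 평균 빼기 (숫자 세 개만 기억하면 됨)
mean_rgb = X.mean(axis=(0, 1, 2))   # R, G, B 각각의 평균
X_centered2 = X - mean_rgb
~~~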
+517 +00:36:56,389 --> 00:37:01,809 + 좋습니다, 다음은 가중치 초기화입니다 + +518 +00:37:01,809 --> 00:37:06,539 + 아주 아주 중요한 주제입니다. 초기에 신경망이 + +519 +00:37:06,539 --> 00:37:09,409 + 잘 작동하지 않았던 이유 중 하나도 사람들이 여기에 충분히 주의하지 않았기 때문입니다 + +520 +00:37:09,409 --> 00:37:14,119 + 가장 먼저 볼 것은, 우선 하지 말아야 할 것입니다 + +521 +00:37:14,119 --> 00:37:18,170 + 특히 여러분은 이렇게 하고 싶은 + +522 +00:37:18,170 --> 00:37:23,619 + 유혹을 받을 수 있습니다. '그냥 모든 가중치를 0에서 시작하자'라고요 + +523 +00:37:23,619 --> 00:37:27,029 + 10층짜리 신경망이 있고 가중치를 전부 0으로 두었다고 합시다 + +524 +00:37:27,030 --> 00:37:37,320 + 왜 그것이 좋은 생각이 아닐까요? 왜 잘 되지 않을까요? + +525 +00:37:37,320 --> 00:37:41,410 + 기본적으로 모든 뉴런이 똑같은 것을 계산하게 되고, 역전파에서도 + +526 +00:37:41,409 --> 00:37:45,000 + 똑같이 행동해서 전부 같은 방식으로 업데이트됩니다. 그래서 + +527 +00:37:45,000 --> 00:37:50,360 + 대칭을 깨는(symmetry breaking) 것이 전혀 없습니다. 모두 같은 것을 계산하니 + +528 +00:37:50,360 --> 00:37:53,570 + 모두 같은 그래디언트를 받게 되고, 우리가 원하는 것이 아니죠. 그래서 가장 깔끔한 + +529 +00:37:53,570 --> 00:37:57,860 + 방법은 작은 임의의 숫자를 쓰는 것입니다. 한 가지 + +530 +00:37:57,860 --> 00:38:01,820 + 비교적 흔한 방법은, 예를 들어 + +531 +00:38:01,820 --> 00:38:07,410 + 평균 0, 표준편차 0.01인 가우시안에서 샘플링하는 것입니다. 그런 작은 임의의 수들로 + +532 +00:38:07,409 --> 00:38:11,299 + W 행렬을 초기화하는 것이죠 + +533 +00:38:11,300 --> 00:38:15,340 + 이 초기화는 작동하긴 하지만 문제가 있다는 것을 알게 됩니다 + +534 +00:38:15,340 --> 00:38:20,068 + 작은 네트워크에서는 괜찮지만, 점점 더 깊이 들어가기 시작하면 + +535 +00:38:20,068 --> 00:38:24,659 + 초기화에 대해 훨씬 더 조심해야 합니다. 이제 + +536 +00:38:24,659 --> 00:38:29,199 + 정확히 무엇이 어떻게 깨지는지 살펴보겠습니다 + +537 +00:38:29,199 --> 00:38:32,499 + 이 순진한(naive) 초기화 전략으로 깊은 네트워크를 만들어 보면 + +538 +00:38:32,498 --> 00:38:38,798 + 어떻게 잘못되는지 볼 수 있습니다. 제가 여기 작은 스크립트를 써 두었는데 + +539 +00:38:38,798 --> 00:38:43,608 + 무엇을 하는 코드인지 단계별로 간단히 설명하겠습니다. 저는 + +540 +00:38:43,608 --> 00:38:48,369 + 차원이 500인 데이터 포인트 1,000개를 샘플링하고 + +541 +00:38:48,369 --> 00:38:52,170 + 은닉층과 비선형성을 잔뜩 만듭니다. 지금은 + +542 +00:38:52,170 --> 00:38:58,749 + 500 유닛짜리 층 10개를 쓰고, 비선형성으로는 tanh를 사용합니다. 제가 하는 일은 + +543 +00:38:58,748 --> 00:39:03,798 + 기본적으로 단위 가우시안 데이터를 받아서 네트워크에 순전파하는 것입니다 + +544 +00:39:03,798 --> 00:39:07,509 + 지금 사용하는 특정 초기화 전략은 + +545 +00:39:07,509 --> 00:39:10,920 + 앞 슬라이드에서 설명한 것입니다. 즉 + +546 +00:39:10,920 --> 00:39:14,869 + 가우시안에서 샘플링해서 0.01을 곱하는 방식이죠. 그래서 여기서 하는 일은 + +547 +00:39:14,869 --> 00:39:18,608 + 그냥 이 네트워크를 순전파하는 것인데, 이 네트워크는 + +548 +00:39:18,608 --> 00:39:25,208 + 같은 크기(500)의 층이 10개 쌓인 구조입니다. 그렇게 + +549 +00:39:25,208 --> 00:39:29,328 + 단위 가우시안 데이터를 이 초기화 전략으로 순전파하면서, 제가 보고 싶은 것은 + +550 +00:39:29,329 --> 00:39:34,109 + 은닉 뉴런들의 활성값 통계에 무슨 일이 일어나는가입니다 + +551 +00:39:34,108 --> 00:39:37,719 + 이 초기화로 네트워크를 통과하는 활성값들에 대해서요. 우리는 + +552 +00:39:37,719 --> 00:39:40,429 + 구체적으로 평균과 표준편차를 볼 것이고 + +553 +00:39:40,429 --> 00:39:44,498 + 평균과 표준편차를 그래프로 그리고, 히스토그램도 그릴 것입니다 + +554 +00:39:44,498 --> 00:39:48,159 + 모든 데이터를 통과시킨 다음, 예를 들어 다섯 번째 층에서 + +555 +00:39:48,159 --> 00:39:52,368 + 다섯 번째, 여섯 번째, 일곱 번째 층 내부의 값들이 어떻게 되었는지 보고 + +556 +00:39:52,369 --> 00:39:56,338 + 그 값들의 히스토그램을 만드는 것이죠. 이 초기화로 + +557 +00:39:56,338 --> 00:39:59,588 + 이 실험을 실행해 보면 결과는 다음과 같이 나옵니다 + +558 +00:39:59,588 --> 00:40:03,889 + 여기 출력해 둔 것을 보면, 우리는 평균 0 + +559 +00:40:03,889 --> 00:40:07,368 + 표준편차 1인 데이터로 시작해서 순전파를 하는데 + +560 +00:40:07,369 --> 00:40:13,019 + 10개 층을 지나면서 보면, tanh는 대칭이므로 + +561 +00:40:13,018 --> 00:40:16,868 + 예상대로 평균은 0 근처에 머뭅니다. 하지만 표준편차가 + +562 +00:40:16,869 --> 00:40:21,440 + 어떻게 되는지 보면, 1이었던 것이 0.2로 내려가고 + +563 +00:40:21,440 --> 00:40:27,420 + 다시 0.04로, 계속 0을 향해 곤두박질칩니다. 뉴런들의 표준편차가 + +564 +00:40:27,420 --> 00:40:31,639 + 0으로 수렴하는 거죠. 아래 히스토그램을 보면, 층마다 하나씩 있는데 + +565 +00:40:31,639 --> 00:40:33,338 + 첫 번째 층의 히스토그램은 + +566 +00:40:33,338 --> 00:40:37,778 + -1과 1 사이에 숫자들이 퍼져 있지만, 결국 일어나는 일은 + +567 +00:40:37,778 --> 00:40:42,889 + 분포가 정확히 0에 딱 붙은 좁은 분포로 축소된다는 것입니다. 결국 + +568 +00:40:42,889 --> 00:40:46,328 + 이 초기화에서 우리 네트워크가 만들어 내는 모든 + +569 +00:40:46,329 --> 00:40:50,930 + tanh 뉴런의 출력이 마지막 층에서는 아주 작은 수가 되어 + +570 +00:40:50,929 --> 00:40:58,719 + 0에 아주 가까워집니다. 그래서 모든
직업은 기본적으로 0이된다 및 번호 + +571 +00:40:58,719 --> 00:41:01,219 + 왜이 문제입니다 + +572 +00:41:01,219 --> 00:41:05,568 + 그라디언트에 후방 패스의 역학에 무슨 생각 + +573 +00:41:05,568 --> 00:41:10,969 + 당신이 정품 인증에 작은 번호가있을 때 당신의 텍스트는 작은 숫자입니다 + +574 +00:41:10,969 --> 00:41:12,548 + 지난 몇 층에 + +575 +00:41:12,548 --> 00:41:17,159 + 어떤 이들 성분처럼 무엇을 중요시하는 방법에 이들 계층 무엇에 있어요 + +576 +00:41:17,159 --> 00:41:27,478 + 후방에 발생하는 모든의 첫 번째 내 너무 층이 가정 통과 + +577 +00:41:27,478 --> 00:41:32,399 + 여기에 나중에 전에 몇 가지를 살펴보고 거의 모든 입력은 너무 작은 것을 + +578 +00:41:32,400 --> 00:41:37,789 + 당신이 기대하는 일 기울기가 무엇인지 작은 수의 X 축입니다 번호 + +579 +00:41:37,789 --> 00:41:45,509 + 그라디언트에 W 해당 레이어의 경우에하는 주셔서 매우 + +580 +00:41:45,509 --> 00:41:55,528 + 작은 왜 그들은 것입니다 매우 작은 W는 X 시간 기울기와 동일합니다 + +581 +00:41:55,528 --> 00:41:56,278 + 상부로부터 + +582 +00:41:56,278 --> 00:42:00,789 + 확인 등의 효과뿐만 아니라 WR 작은 숫자에 대한 이유보다 작은 숫자입니다 + +583 +00:42:00,789 --> 00:42:06,640 + 그래서이 사람은 기본적으로 지금 유관 거의 이유가 없습니다 우리 + +584 +00:42:06,639 --> 00:42:13,228 + 또한 다시이 행렬에 무슨 볼 수 있습니다 우리는 우리이었다 데이터를했다 + +585 +00:42:13,228 --> 00:42:16,659 + 단위주의와 처음으로 배포 한 후 우리는 결국 + +586 +00:42:16,659 --> 00:42:20,278 + W 및 활성화 기능에 의해 그것을 곱 우리는 기본적으로 그 보았다 + +587 +00:42:20,278 --> 00:42:24,699 + 모든 이것은 단지 시간이 지남에 붕괴하고 생각 제로로 간다 + +588 +00:42:24,699 --> 00:42:27,939 + 뒤로 패스 우리가 이러한 레이어를 통해 그라데이션을 변경 같이 + +589 +00:42:27,940 --> 00:42:31,380 + 우리가 효과적으로 무슨 일을하는지 다시 전파는 그라데이션 종류의 일부입니다 + +590 +00:42:31,380 --> 00:42:35,989 + 우리 그라데이션 W에 사람들 오프의 우리는 숫자를 보았다하지만 다시 던졌다 + +591 +00:42:35,989 --> 00:42:39,108 + 전파는 우리가 계약 효과를 통해거야 그리고 우리는 결국 + +592 +00:42:39,108 --> 00:42:41,969 + 여기를 통해 우리 배경은 당신이 무엇을 얻을 때 일 + +593 +00:42:41,969 --> 00:42:47,419 + 당신이 장치를 가지고가는 경우 모든 단일 계층에서 또 다시 W에 의해 곱한 + +594 +00:42:47,420 --> 00:42:51,460 + 이 규모에서 화장실에 의해 다중 데이터를 분출하면 당신은 모든 것을 볼 수 있습니다 + +595 +00:42:51,460 --> 00:42:55,010 + 제로하고 같은 일이 후 뒤로 패스했다 일어날 간다 + +596 +00:42:55,010 --> 00:42:59,180 + 연속적으로 매일 공기에 우리 다시 전파로 W에 의해 두 행위를 곱하여 + +597 +00:42:59,179 --> 00:43:03,529 + 우리는 당신을있는 것과 합리적인 숫자 진형이 그라데이션 + +598 +00:43:03,530 --> 00:43:07,300 + 당신의 손실 함수는이 일을 계속 같이 그냥 0으로가는 종료됩니다 + +599 +00:43:07,300 --> 00:43:11,519 + 프로세스 및 당신은 기본적으로 작은 단지 작은 여기에 그라디언트 결국 + +600 +00:43:11,519 --> 00:43:17,530 + 숫자는 그래서 당신은 기본적으로이 전반에 걸쳐 매우 매우 낮은 기울기와 끝까지 + +601 +00:43:17,530 --> 00:43:21,500 + 이 때문에 이유의 네트워크 및 이것은 우리가 추방으로 참조 뭔가 + +602 +00:43:21,500 --> 00:43:24,070 + 이 그라데이션 등의 성분이 특정과를 통해 이동 + +603 +00:43:24,070 --> 00:43:27,160 + 초기화는 녹색의 그​​룹이 크기를 볼 수 있습니다 우리는거야 + +604 +00:43:27,159 --> 00:43:34,239 + 단지 우리가 두 가지 중 하나를 사용 사용될 때 가서 우리는 극단적 인 다른 시도 할 수 있습니다 + +605 +00:43:34,239 --> 00:43:38,569 + 당신이 시도 할 수 있습니다에 여기 스케일링 우리는 토끼와 음의 확장으로 대신 + +606 +00:43:38,570 --> 00:43:45,530 + 초기화에서 W 매트릭스의 다른 규모 그래서 나는 110001 시도 가정 + +607 +00:43:45,530 --> 00:43:51,099 + 이제 우리는 다른 방법을 오버 슈트 때문에 또 다른 재미 일이 일어 볼 수 있습니다 + +608 +00:43:51,099 --> 00:43:56,260 + 당신이 잘 볼 수있는 감각은 아마 여기에서 결정을보고하는 것이 가장 좋습니다 + +609 +00:43:56,260 --> 00:44:00,250 + 당신은 모든 것이 완전히 중이 10 시간을 포화 볼 수 있습니다 + +610 +00:44:00,250 --> 00:44:05,079 + 모든 부정적인 하나 내가 분포를 의미하는 모든 사람은 정말 모든 것 + +611 +00:44:05,079 --> 00:44:08,389 + 네트워크 카드를 통해 신경 세포의 전체 네트워크를 슈퍼 포화 + +612 +00:44:08,389 --> 00:44:12,509 + 하나 음 (101) 무게가 너무 큰 그들은 것을 계속 추가하기 때문에 + +613 +00:44:12,510 --> 00:44:15,859 + 비선형 성을 겪고 결국이 과정이기 때문에 다른 사람 + +614 +00:44:15,858 --> 00:44:19,949 + 단지 매우 큰 가중치가 큰 그래서 모든 슈퍼 때문에 + +615 +00:44:19,949 --> 00:44:25,669 + 네트워크가 그냥 통해 재료가 흐르는 무엇 때문에 포화 + +616 +00:44:25,670 --> 00:44:28,869 + 끔찍한 그냥 모든 단지에 대한 제로로의 완벽한 재난 권리입니다 + +617 +00:44:28,869 --> 00:44:34,180 + 기하 급수적으로 0 당신은 그래서 당신은 매우 긴 시간과 훈련을 할 수 죽을 + +618 
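위 자막에서 설명한 실험(tanh를 쓰는 500 유닛 × 10층 네트워크에 단위 가우시안 데이터를 순전파하며 층별 통계를 보는 것)을 대략 재구성한 스케치입니다. 주석 처리된 다른 스케일들은 이어지는 자막에서 다루는 변형들입니다.

~~~python
import numpy as np

D = np.random.randn(1000, 500)      # 단위 가우시안 입력
hidden_sizes = [500] * 10           # 500 유닛짜리 층 10개
H = D
for i, fan_out in enumerate(hidden_sizes):
    fan_in = H.shape[1]
    # W = np.random.randn(fan_in, fan_out) * 0.01   # 순진한 초기화: 활성값이 0으로 붕괴
    # W = np.random.randn(fan_in, fan_out) * 1.0    # 너무 크면: tanh가 ±1로 포화
    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)  # Xavier 초기화
    H = np.tanh(H.dot(W))
    print('layer %d: mean %+.4f, std %.4f' % (i, H.mean(), H.std()))
~~~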
+00:44:34,179 --> 00:44:37,889 + 이 모든 때문에 당신의 손실 그냥 아무것도되는 일이 없을 때 당신은 볼 수 있습니다 + +619 +00:44:37,889 --> 00:44:41,299 + 모든 신경 세포가 포화 아무것도하지 않기 때문에 아무것도 다시 전파되지 않습니다 + +620 +00:44:41,300 --> 00:44:46,490 + 당신이 실제로 예상대로이 초기화가 슈퍼처럼 있도록 업데이트되는 + +621 +00:44:46,489 --> 00:44:50,469 + 까다로운 설정하고 그것이 있어야 특히이 경우 가지 있어야 + +622 +00:44:50,469 --> 00:44:54,629 + 어딘가 10 10 10 K 등 사이 + +623 +00:44:54,630 --> 00:44:58,259 + 그래서 당신은 약간 더 대신 몇 가지 다른 값을 시도의 원칙에 따른 될 수있다 + +624 +00:44:58,259 --> 00:45:03,059 + 2010 년 예를 들어이 있었다 있도록이 작성 몇 가지 서류가 + +625 +00:45:03,059 --> 00:45:07,589 + 우리가 지금의 초기화를 호출하는 것에 대해 제안이 전혀 나가서 + +626 +00:45:07,588 --> 00:45:11,199 + 의 종류가 겪은 그들은의 분산에 대한 표현 보았다 + +627 +00:45:11,199 --> 00:45:15,318 + 당신의 신경 및이를 읽을 수 있습니다 당신은 기본적으로 특정을 제안 할 수있다 + +628 +00:45:15,318 --> 00:45:19,608 + 그래서 난 필요 없어 당신이 당신의 구배를 주문 방법에 대한 초기화 전략 + +629 +00:45:19,608 --> 00:45:24,088 + 나는 그들이 이런 종류의 추천 어떤 다른 하나를 시도 할 필요가 없습니다 2001 시도 + +630 +00:45:24,088 --> 00:45:27,500 + 초기화 우리는 입력의 수의 제곱근으로 나눈 + +631 +00:45:27,500 --> 00:45:28,750 + 하나 하나 신경 + +632 +00:45:28,750 --> 00:45:33,630 + 입력의 많은 당신은 낮은 무게와 끝까지 직관적으로 그 수 + +633 +00:45:33,630 --> 00:45:36,539 + 당신이 더 많은 일을하고 있기 때문에 의미 당신은 당신으로가는 더 많은 물건을 가지고와 + +634 +00:45:36,539 --> 00:45:39,619 + 무게 일부는 그래서 당신은 그들 모두와 경우 상호 작용이 덜합니다 + +635 +00:45:39,619 --> 00:45:43,660 + 더 큰하려는 은신처로 공급되는 단위의 적은 수의 + +636 +00:45:43,659 --> 00:45:46,980 + 무게 다음 거기에 그들 중 몇 그리고 당신은 변화를 원하기 때문에 + +637 +00:45:46,980 --> 00:45:51,019 + 18의 조금 백업 + +638 +00:45:51,018 --> 00:45:54,659 + 여기에 아이디어는 그들은 하나의 신경 세포 더 활성화에서 찾고있다 + +639 +00:45:54,659 --> 00:45:58,118 + 함수는 선형 신경 세포입니다 포함하고있는 경우가 말을하는지 모든입니다 + +640 +00:45:58,119 --> 00:46:02,099 + 당신이 입력으로 데이터를 받고하는 경우 원하는 당신에게이 학습자를 좋아한다 + +641 +00:46:02,099 --> 00:46:06,079 + 다음은이 금액하여 가중치를 초기화해야 하나의 분산을 + +642 +00:46:06,079 --> 00:46:10,670 + 그리고 노트에 난이 파생하는 방법을 정확하게 단지 우리 두 개의 표준되어가는 + +643 +00:46:10,670 --> 00:46:15,650 + 편차 내가 사용할 수 있도록 기본적으로이 합리적인 초기화입니다 + +644 +00:46:15,650 --> 00:46:18,700 + 대신 당신은 볼 수 여기를 사용하는 경우 + +645 +00:46:18,699 --> 00:46:22,399 + 분포 다시보고를 통해보다 합리적인 끝나게 + +646 +00:46:22,400 --> 00:46:25,660 + 이러한 열 에이전트 중 하나에 부정적 일 간의 역사와 당신은 더 많은 것을 얻을 수 + +647 +00:46:25,659 --> 00:46:31,000 + 현명한 여기 수와 실제로의 활성 영역 내에서이 + +648 +00:46:31,000 --> 00:46:33,929 + 모든 청소년은 그래서 당신이 훨씬 더 좋을 것으로 예상 할 수있다 + +649 +00:46:33,929 --> 00:46:38,518 + 초기화 가지 활성 영역에있는 것들 훈련 때문에 + +650 +00:46:38,518 --> 00:46:42,318 + 시작 무에서 시작하는 이유 슈퍼 포화 + +651 +00:46:42,318 --> 00:46:45,179 + 이 단지 아주 좋은, 우리가 아직도 가지고있는 이유 끝나게하지 않습니다 + +652 +00:46:45,179 --> 00:46:48,139 + 이 문서는 계정을 고려하지 않기 때문에 아래로 여기 융​​합이다 + +653 +00:46:48,139 --> 00:46:52,308 + 이 경우 비선형 세입자 등 테니스 비선형 최대 + +654 +00:46:52,309 --> 00:46:57,650 + 전체에 분산의 형성 통계의 종류 같은 그래서 당신 경우 + +655 +00:46:57,650 --> 00:47:02,309 + 그것을 떨어져이 시작하고 최대 여전히이 경우 유통에 일을 + +656 +00:47:02,309 --> 00:47:05,410 + 이 표준 편차가 다운 될 것 같다하지만 경우처럼 극적인 아니다 + +657 +00:47:05,409 --> 00:47:08,179 + 이 안녕 안녕을 설정 단지 시험했다 + +658 +00:47:08,179 --> 00:47:11,299 + 그래서 이것은 합리적인 초기화처럼 거기입니다 + +659 +00:47:11,300 --> 00:47:15,280 + 비교하여 내부 네트워크를 사용하는 단지 2001 그래서 사람들을 설정합니다 + +660 +00:47:15,280 --> 00:47:20,760 + 때로는 같은 관행을 사용하게하지만, 그래서 이것은 10 세의 경우 작동 + +661 +00:47:20,760 --> 00:47:24,349 + 당신이 정류에 넣어하려고하면 합리적인 무언가를 그것은 밝혀 + +662 +00:47:24,349 --> 00:47:30,019 + 선형 단위 네트워크는 그것뿐만 아니라 작동하지 않고 감소 부문이 될 것입니다 + +663 +00:47:30,019 --> 00:47:34,679 + 훨씬 더 빠른 그래서 테헤란에서 집회를보고 첫 번째 레이어는 몇 가지가 있습니다 + +664 +00:47:34,679 --> 00:47:37,769 + 당신이 볼 수있는 분배하고 분배는 더욱 더 얻는다 + +665 +00:47:37,769 --> 00:47:43,130 + 제로 그래서 점점 더 많은 뉴런에서 까다로운이 초기화로 활성화된다 + +666 +00:47:43,130 
--> 00:47:48,440 + 그래서 바로 잡기 층 층 그물에 초기화를 사용하는 것은 좋은 일을하지 않습니다 + +667 +00:47:48,440 --> 00:47:52,659 + 그래서 다시 그들이 실제로에 대해 얘기하지 않는이 논문에 대해 생각 + +668 +00:47:52,659 --> 00:47:57,578 + 비선형 및 관련이란의 컴퓨터입니다이 가중 합 + +669 +00:47:57,579 --> 00:48:02,068 + 방법 후 여기에 있지만 그들의 수요에서 뭔가 당신은 그래서 당신이 수행하는 것이 + +670 +00:48:02,068 --> 00:48:05,858 + 당신이 직관적으로 그가 무엇을 0으로 설정하고 분배의 절반을 죽일 + +671 +00:48:05,858 --> 00:48:10,380 + 당신의 최대의 배포하지만, 기본적으로 절반 변형 등이 + +672 +00:48:10,380 --> 00:48:14,849 + 이 사실은 누군가에 작년에 본 논문에서 제안되었다 밝혀 말했다 + +673 +00:48:14,849 --> 00:48:19,000 + 기본적으로 당신은 그가 때문에에 대한 회사 아니에요 2 배 거기에 보면 + +674 +00:48:19,000 --> 00:48:22,809 + 정말 당신은 론의 효과적으로 행복을 모르거나 때마다 변형하지 않는다 + +675 +00:48:22,809 --> 00:48:26,510 + 당신이 입력을 확보하지 못했 있도록 모든 것을 가지고 있기 때문에 당신이 그들을 통해 소요 + +676 +00:48:26,510 --> 00:48:29,960 + 당신의 비선형 당신은 당신이 나는 것 물건을 받고 있지만 당신이 정말로 그렇게하지 + +677 +00:48:29,960 --> 00:48:35,530 + 그래서 당신이 두 가지 변종을 가진 끝 그것도으로 고려하는 것 그리고 + +678 +00:48:35,530 --> 00:48:38,859 + 당신은 당신이 데럴을 위해 특별히 적절한 분포를 얻을 수행 할 때 + +679 +00:48:38,858 --> 00:48:43,719 + 이란 등이 초기화에 사용 된 그물하면 약을 걱정할 필요가 + +680 +00:48:43,719 --> 00:48:48,618 + 여분의 세수 모든 것이 잘 올 것이다 당신은받지 않습니다 + +681 +00:48:48,619 --> 00:48:52,358 + 이 구축 계속 두 가지의 요인과는 나사까지 당신의 활성화를 + +682 +00:48:52,358 --> 00:48:56,769 + 기하 급수적 그래서 기본적으로이 까다로운 까다로운 물건과는 정말 + +683 +00:48:56,769 --> 00:49:01,159 + 예를 들어 자신의 논문에서 연습에 연습 문제는을 가진 비교 + +684 +00:49:01,159 --> 00:49:04,519 + 당신이 너무 요인을 가지고 있지 않으며이 중요한 경우 요인 우리가 정말 깊이가 + +685 +00:49:04,519 --> 00:49:08,500 + 당신이 고려하는 경우이 경우 네트워크는 나는 그들이 수십 플레이어를 가졌다 고 생각 + +686 +00:49:08,500 --> 00:49:12,940 + 당신은 아무것도 그냥하지 않습니다에 당신이 감소를 계산하지 않는 경우가 수렴한다는 사실 + +687 +00:49:12,940 --> 00:49:14,950 + 제로 많이 확인 + +688 +00:49:14,949 --> 00:49:19,469 + 그래서 당신이 정말 필요한 매우 중요한 물건을 조심해야 당신을 통해 그것을 생각하는 + +689 +00:49:19,469 --> 00:49:24,789 + 그것은 잘못 같은 나쁜 일이 너무 구체적으로 어떻게하고 있는지 인플레이션 + +690 +00:49:24,789 --> 00:49:28,108 + 이 레일 장치와 함께 작동 당신이 알고있는 경우 케이스는 정확 + +691 +00:49:28,108 --> 00:49:36,460 + 사용 대답하고 그래서 이것이 오는이 초기화이다 + +692 +00:49:36,460 --> 00:49:40,220 + 부분적으로이 오랫동안 당신의 말은 우리가 내가 생각하는 일부 이유 + +693 +00:49:40,219 --> 00:49:46,088 + 사람들은 완전히 어쩌면이 잘 얻을 수 있었다 얼마나 어려운 감사하지 않았다 + +694 +00:49:46,088 --> 00:49:51,219 + 터키 그래서 난 그냥 적절한 초기화 기본적으로 지적하고 싶은 + +695 +00:49:51,219 --> 00:49:54,419 + 연구의 활성 영역은 당신이 논문은 아직이에 게시되고있다 볼 수 있습니다 + +696 +00:49:54,420 --> 00:49:58,849 + 논문 다수 단지 초기화하는 다른 방법을 마주하여 + +697 +00:49:58,849 --> 00:50:03,019 + 그들은 당신에게를 제공하지 않기 때문에 네트워크는 지난 몇도 흥미 롭다 + +698 +00:50:03,019 --> 00:50:06,659 + 초기화 공식 그들은 이러한 데이터를 초기화 구동 폐기물이 + +699 +00:50:06,659 --> 00:50:10,399 + 네트워크와 지금 당신은 당신의 네트워크에 전달할 데이터의 배치를 취할 + +700 +00:50:10,400 --> 00:50:13,530 + 임의의 네트워크와는 차이를 보면 그 모든 단일 지점에서 + +701 +00:50:13,530 --> 00:50:16,690 + 네트워크 및 직관적으로 당신은 당신의 차이는 0으로 가고 싶지 않아 + +702 +00:50:16,690 --> 00:50:20,200 + 원하지 않는 그들은 모든 약이 같은 일 말하고 싶은 폭발 + +703 +00:50:20,199 --> 00:50:24,328 + 네트워크 전반에 걸쳐 단위주의는 그래서 그들은 항변 스케일이 입력 + +704 +00:50:24,329 --> 00:50:28,349 + 당신이 사방에 활성화에 크게 가질 수 있도록 네트워크에 무게 + +705 +00:50:28,349 --> 00:50:33,568 + 그 순서는 기본적 등 일부 데이터 중심의 기술과 라인이 있습니다 + +706 +00:50:33,568 --> 00:50:39,139 + 제대로 난 난 일부에 갈거야, 그래서 확인을 초기화하는 방법에 대한 작업 + +707 +00:50:39,139 --> 00:50:41,848 + 이러한 많은 문제를 완화하는 기술로 갈하지만 + +708 +00:50:41,849 --> 00:50:55,369 + 지금 나는 몇 가지 질문을 수 그리고 그들은으로 나누어에만있어 + +709 +00:50:55,369 --> 00:50:59,800 + 분산 가능하지만 다시없는거야 전파 당신이 경우 때문에 + +710 +00:50:59,800 --> 00:51:02,710 + 그라데이션 만난 후 당신의 목표는 더 이상 무엇인지 분명하지 않다과 + +711 +00:51:02,710 --> 00:51:06,710 + 그래서 당신은 그라데이션 반드시 못하고있어 그래서 이것은 유일한 문제가 될 수있다 + +712 +00:51:06,710 --> 00:51:11,170 + 난 당신이 그라데이션 
I 정상화를 시도 할 경우 무슨 일이 일어날 지 확실하지 않다 + +713 +00:51:11,170 --> 00:51:13,730 + 이 방법은 내가 조금에 제안 할 것 같네요 + +714 +00:51:13,730 --> 00:51:19,960 + 실제로 그 효과에하지만 무엇 깨끗한 방법으로 일을하고있다 + +715 +00:51:19,960 --> 00:51:23,550 + 실제로 그건 실제로 이러한 많은 문제를 해결 뭔가로 이동 + +716 +00:51:23,550 --> 00:51:26,630 + 나의 비전이라고하며 그것은 단지 작년에 제안하고, 그래서 심지어 캔트 + +717 +00:51:26,630 --> 00:51:30,809 + 이 클래스에서이 작년에 덮여 있지만, 지금은 실제로 많은 도움이 있습니다 + +718 +00:51:30,809 --> 00:51:37,119 + 확인하고 기본 개념을 극대화 종이는 대략 기기가 받고 싶은 괜찮습니다 + +719 +00:51:37,119 --> 00:51:42,039 + 네트워크의 모든 단일 부분에서 활성화하고 그래서 그냥 그냥 그냥 할 + +720 +00:51:42,039 --> 00:51:46,369 + 그냥 만들 당신은 당신이 할 수있는 확인주의를 알고있는 무언가를 만들기 때문에 + +721 +00:51:46,369 --> 00:51:50,720 + 단위주의는 완전히 다른 기능이며, 그래서 확인 당신이 할 수 있어요 + +722 +00:51:50,719 --> 00:51:54,980 + 그것을 통해 전파하고 데이터에서 날 다시 복용 무엇 그들이 참조 + +723 +00:51:54,980 --> 00:51:57,480 + 당신은 우리가 만날거야 네트워크를 통해 따기있어 + +724 +00:51:57,480 --> 00:52:00,900 + 네트워크와 최상의으로 이러한 전문화 층을 삽입 + +725 +00:52:00,900 --> 00:52:06,400 + 정규화 층은 귀하의 입력 X를 가지고 그들은 모든 있는지 확인 + +726 +00:52:06,400 --> 00:52:10,420 + 배치에서 하나의 기능 치수는 단위 분출 활성화를 + +727 +00:52:10,420 --> 00:52:15,909 + 그래서 그는 어쩌면이있는 네트워크를 통과 백 예제의 배치를했습니다 + +728 +00:52:15,909 --> 00:52:19,779 + 여기에 좋은 예는 당신의 돈에 더 나은 활성화 너무 많은 일을하다 + +729 +00:52:19,780 --> 00:52:25,530 + 뒤로 어떤 점에있다 D 기능 또는 신경 세포의 불 활성화가 일부 + +730 +00:52:25,530 --> 00:52:28,869 + 부분이는 다시 나중에 입력 + +731 +00:52:28,869 --> 00:52:32,550 + 그래서 이것은 활성화 및 국유화의 주요 주제이다 + +732 +00:52:32,550 --> 00:52:39,390 + 효과적으로 모든 단일 기능을 함께 경험 평균과 분산을 평가 + +733 +00:52:39,389 --> 00:52:44,989 + 그것은 단지 어떤 있도록 전 그냥 확인했다 그것으로 나눈 모든 것을 + +734 +00:52:44,989 --> 00:52:49,088 + 단일 열 여기서 단위는 Univision의입니다 가지고 있으며, 그래서 완벽하게있어 + +735 +00:52:49,088 --> 00:52:54,219 + 미분 기능은 모든 단일 기능 또는 활성화에 그것을 적용 + +736 +00:52:54,219 --> 00:53:02,818 + 독립적으로 배치를 통해 당신은 아주 좋은 것으로 판명 할 수 있도록 + +737 +00:53:02,818 --> 00:53:08,548 + 아이디어는 지금이 팀과 함께 하나의 문제는 그래서 이것은이뿐만 아니라 작동 방법입니다 + +738 +00:53:08,548 --> 00:53:11,670 + 일반적으로 우리는 비선형 다음에 한 한 + +739 +00:53:11,670 --> 00:53:15,900 + 이 파티 네트워크 이제 우리는 이러한 국유화를 삽입 할거야 + +740 +00:53:15,900 --> 00:53:19,670 + 바로 정치 상속인 후 또는 동등 길쌈 층 후에 층 + +741 +00:53:19,670 --> 00:53:24,490 + 상용 네트워크와 잘 CCNA 등 기본적으로 우리가 그들을 시작할 수 있습니다 + +742 +00:53:24,489 --> 00:53:28,159 + 그들은 모든이의 매 단계에서 분출되어 있는지 확인 + +743 +00:53:28,159 --> 00:53:30,190 + 우리가 그냥 그렇게 만들 있기 때문에 네트워크 + +744 +00:53:30,190 --> 00:53:36,500 + 내가이이 함께 생각하는 한 가지 문제는 그것이 불필요한 것 같아 것입니다 + +745 +00:53:36,500 --> 00:53:41,088 + 제약 그래서 당신은 후에 여기 다시 넣을 때 출력은 확실히 것 + +746 +00:53:41,088 --> 00:53:45,389 + 당신이 그들을 정상화 때문에 분출 될 수 있지만 명확하지 않다가 10 H 실제로 + +747 +00:53:45,389 --> 00:53:50,288 + 당신이 10 H의 형태에 대해 생각, 그래서 만약 한 번 단위주의 입력을 후퇴에 + +748 +00:53:50,289 --> 00:53:54,450 + 그들이에 한 번 작동 모든 것을 걸 분명하지 않다 그것에 특정 기술을 가지고 + +749 +00:53:54,449 --> 00:53:59,730 + 출력은 당신이 협상 정확히 것을 확인이 어려운 제약 조건이 + +750 +00:53:59,730 --> 00:54:06,009 + 10 TH 전에 당신은 당신의 10 각을 원하는 경우 선택하는 네트워크를 좋아하기 때문에 + +751 +00:54:06,009 --> 00:54:10,429 + 지금 그것을 더 많거나 적은 포화 확산 및 더 많은 이하로 무엇이 다른 + +752 +00:54:10,429 --> 00:54:14,268 + 그것은이 상단에 작은 패치의 두 번째 부분, 그래서 죽음을 수있을 것입니다 + +753 +00:54:14,268 --> 00:54:19,429 + 아이 티어 행위를 정상화하지 않을하지만 정상화 한 후에는 네트워크를 살 + +754 +00:54:19,429 --> 00:54:25,068 + 감마으로 이동하고 모든 단일 기능에 대한 있어야했다 그래서이 허용하는 + +755 +00:54:25,068 --> 00:54:28,358 + 네트워크가 수행하는 이들은 감마 그래서 우리의 매개 변수이며 수있어 + +756 +00:54:28,358 --> 00:54:33,869 + 우리가 다시거야 매개 변수로 백업하고 그들은 단지 허용 + +757 +00:54:33,869 --> 00:54:38,690 + 일반 ICU 후 배송 네트워크 (22)는이 폭탄이 이동 할 수 있도록 협상 + +758 +00:54:38,690 --> 00:54:44,108 + 규모가 네트워크가 원하는 경우 우리가 초기화 아마도 웹의 110 + +759 +00:54:44,108 --> 
00:54:48,250 + 1과 0 같은 값으로 초기화해 두고, 그다음에는 네트워크가 원하는 만큼 조정하도록 둘 수 있습니다 + +760 +00:54:48,250 --> 00:54:51,239 + 이것이 tanh로 들어간다고 상상해 보면, 이 조정을 통해 + +761 +00:54:51,239 --> 00:54:54,719 + 네트워크는 역전파 신호로 tanh 입력을 더 + +762 +00:54:54,719 --> 00:54:58,618 + 포화되게 하거나 덜 포화되게 만들 수 있습니다. 하지만 + +763 +00:54:58,619 --> 00:55:01,910 + 모든 것이 완전히 죽거나 폭발해 버리는 문제에는 빠지지 않게 되죠 + +764 +00:55:01,909 --> 00:55:06,359 + 그래서 최적화 초기부터 훈련이 제대로 진행됩니다 + +765 +00:55:06,360 --> 00:55:10,579 + 시간이 지나면서 일어나는 일에 대해 한 가지 더 짚을 + +766 +00:55:10,579 --> 00:55:16,170 + 중요한 점은, 이 감마와 베타를 훈련하면 + +767 +00:55:16,170 --> 00:55:20,230 + 역전파를 통해, 경험적 분산으로 나누고 평균을 빼는 연산을 + +768 +00:55:20,230 --> 00:55:24,829 + 네트워크가 취소(undo)할 수 있는 능력을 갖게 된다는 것입니다 + +769 +00:55:24,829 --> 00:55:30,519 + 감마와 베타가 그 정규화 부분을 되돌리는 법을 배울 수 있으니, 배치 정규화 층은 + +770 +00:55:30,519 --> 00:55:34,059 + 항등 함수(identity)처럼 작동하는 법을 배울 수도 있고 + +771 +00:55:34,059 --> 00:55:37,599 + 마치 그 층이 없었던 것처럼 될 수도 있습니다. 그래서 이 층을 + +772 +00:55:37,599 --> 00:55:42,460 + 네트워크에 넣어 두면, 역전파가 그것을 빼내는 법을 배우거나 + +773 +00:55:42,460 --> 00:55:45,110 + 유용하다고 판단하면 활용하는 법을 배울 수 있습니다 + +774 +00:55:45,110 --> 00:55:51,010 + 역전파를 통해 그런 식으로 동작한다는 것이 좋은 점입니다 + +775 +00:55:51,010 --> 00:55:58,470 + 정리하면 배치 정규화에는 기본적으로 몇 가지 좋은 성질이 있습니다 + +776 +00:55:58,469 --> 00:56:03,639 + 첫 번째 성질은, 설명했듯이 네트워크 전반에서 그래디언트 흐름을 개선한다는 것입니다 + +777 +00:56:03,639 --> 00:56:09,049 + 그래서 더 높은 학습률을 견딜 수 있게 되어 네트워크가 + +778 +00:56:09,050 --> 00:56:13,080 + 더 빨리 학습합니다. 중요한 또 하나는, 초기화에 대한 강한 의존성을 + +779 +00:56:13,079 --> 00:56:16,269 + 줄여 준다는 것입니다. 초기화 스케일의 여러 선택지를 훑어 보면 + +780 +00:56:16,269 --> 00:56:19,659 + 배치 정규화가 없을 때는 큰 차이가 나지만, 있으면 + +781 +00:56:19,659 --> 00:56:23,469 + 훨씬 더 넓은 범위의 초기 스케일 설정에서 + +782 +00:56:23,469 --> 00:56:27,539 + 더 많은 것들이 잘 작동하는 것을 보게 됩니다. 그래서 걱정할 것이 + +783 +00:56:27,539 --> 00:56:34,139 + 줄어들고, 실제로 도움이 됩니다. 여기서 지적하고 싶은 좀 더 미묘한 점 하나는 + +784 +00:56:34,139 --> 00:56:39,299 + 배치 정규화가 일종의 규제(regularization)처럼 작동해서, 드롭아웃의 필요성을 + +785 +00:56:39,300 --> 00:56:43,900 + 줄여 준다는 것입니다. 드롭아웃은 수업 후반에 다루겠지만, 이것이 규제 역할을 하는 + +786 +00:56:43,900 --> 00:56:51,559 + 방식이 재미있습니다. 어떤 입력 X가 네트워크를 통과할 때 + +787 +00:56:51,559 --> 00:56:55,849 + 네트워크 뒤쪽 층에서의 그 표현은 기본적으로 + +788 +00:56:55,849 --> 00:56:59,858 + 그 입력만의 함수가 아니라, 같은 배치에 어떤 다른 예제들이 있는지의 + +789 +00:56:59,858 --> 00:57:02,049 + 함수이기도 합니다 + +790 +00:57:02,050 --> 00:57:05,570 + 원래는 각 예제를 완전히 독립적으로 처리하는데, 배치 정규화는 + +791 +00:57:05,570 --> 00:57:09,840 + 배치 안의 예제들을 실제로 함께 묶어 버립니다. 그래서 + +792 +00:57:09,840 --> 00:57:12,880 + 네트워크 깊은 층에서 어떤 예제의 표현은 사실 + +793 +00:57:12,880 --> 00:57:16,539 + 그 배치에 함께 샘플링된 다른 예제들이 무엇이었는지의 함수이고, 이것이 + +794 +00:57:16,539 --> 00:57:19,809 + 표현 공간에서 입력의 위치를 약간씩 흔들어 주는(jitter) 효과를 냅니다. 이것이 실제로 + +795 +00:57:19,809 --> 00:57:26,139 + 좋은 규제 효과를 만들어 냅니다 + +796 +00:57:26,139 --> 00:57:31,609 + 우연히 생긴 효과이긴 하지만, 실제로 이런 효과가 있고 + +797 +00:57:31,610 --> 00:57:33,920 + 실제로 도움이 됩니다 + +798 +00:57:33,920 --> 00:57:38,950 + 좋습니다. 테스트 시에는 배치 정규화 층이 조금 다르게 동작한다는 점을 말씀드리죠 + +799 +00:57:38,949 --> 00:57:42,699 + 테스트 시에는 이것이 결정론적 함수이기를 원하므로 + +800 +00:57:42,699 --> 00:57:46,500 + 테스트 시에는 배치 정규화 층을 다르게 사용한다는 것이 요점입니다 + +801 +00:57:46,500 --> 00:57:52,019 + 특히, 정규화에 사용하는 뮤(μ)와 시그마(σ)가 있는데 + +802 +00:57:52,019 --> 00:57:55,519 + 데이터셋 전체에서 본 뮤와 시그마를 기억해 둡니다 + +803 +00:57:55,519 --> 00:57:59,250 + 네트워크의 각 지점에서 평균과 시그마가 무엇인지 계산해 두는 것이죠 + +804 +00:57:59,250 --> 00:58:02,309 + 훈련 세트 전체에 대해 한 번에 계산할 수도 있고 + +805 +00:58:02,309 --> 00:58:05,759 + 훈련하는 동안 지수 이동 평균(running average) 같은 것으로 추정해 둘 수도 있습니다 + +806 +00:58:05,760 --> 00:58:08,800 + 테스트 시에 그 값을 그대로 쓸 수 있도록 기억해 두면 됩니다 + +807 +00:58:08,800 --> 00:58:12,460 + 테스트 시에는 배치에 대해 경험적 평균과 분산을 다시 추정하고 싶지 않으니 + +808 +00:58:12,460 --> 00:58:17,000 + 기억해 둔 값들을 직접 사용하면 됩니다 + +809 +00:58:17,000 --> 00:58:26,179 + 이것은 그냥 작은 세부 사항입니다. 자, 여기까지 질문 있나요
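위에서 설명한 배치 정규화의 순전파(훈련 시에는 배치 통계, 테스트 시에는 기억해 둔 이동 평균 사용, 그리고 감마/베타에 의한 스케일·시프트)를 단순화한 스케치입니다. 함수와 변수 이름은 예시로 정한 것입니다.

~~~python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mu, running_var,
                      train=True, eps=1e-5, momentum=0.9):
    """배치 정규화 순전파의 단순화된 스케치 (x: (N, D))."""
    if train:
        mu = x.mean(axis=0)                 # 배치의 경험적 평균
        var = x.var(axis=0)                 # 배치의 경험적 분산
        # 테스트 시를 위해 이동 평균으로 기억해 둔다
        running_mu = momentum * running_mu + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mu, var = running_mu, running_var   # 기억해 둔 값을 그대로 사용
    x_hat = (x - mu) / np.sqrt(var + eps)   # 정규화
    out = gamma * x_hat + beta              # 학습 가능한 스케일/시프트 (정규화를 되돌릴 수도 있음)
    return out, running_mu, running_var
~~~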
+810 +00:58:26,179 --> 00:58:29,049 + 요컨대 이것이 배치 정규화이고, 좋은 기법입니다 + +811 +00:58:29,050 --> 00:58:35,559 + 실제로 과제에서도 사용하게 될 것입니다 + +812 +00:58:35,559 --> 00:58:41,039 + 질문은 '훈련이 느려지지 않느냐'는 것인데, 네, 감사합니다 + +813 +00:58:41,039 --> 00:58:44,219 + 아쉽게도 런타임 페널티가 있습니다. 정확히 얼마인지는 모르지만 + +814 +00:58:44,219 --> 00:58:49,088 + 30% 정도라고 말하는 사람을 들은 적이 있습니다. 얼마나 비싼지 + +815 +00:58:49,088 --> 00:58:54,318 + 저도 완전히 확인해 본 것은 아니지만, 기본적으로 페널티가 있는 이유는 + +816 +00:58:54,318 --> 00:58:58,548 + 보통 모든 합성곱 층 뒤마다 이것을 넣어야 해서 매우 흔하게 쓰이고 + +817 +00:58:58,548 --> 00:59:02,458 + 층이 많아지면 이 비용이 전부 쌓이기 때문입니다 + +818 +00:59:02,458 --> 00:59:16,719 + 네, 우리가 지불하는 비용이 맞습니다. 또 다른 질문도 있었는데 + +819 +00:59:16,719 --> 00:59:20,249 + 네트워크가 건강하지 않다는 것을 감지할 방법이 있느냐는 것이죠. 몇 슬라이드 뒤에 + +820 +00:59:20,248 --> 00:59:24,228 + 그 이야기로 다시 돌아오겠습니다 + +821 +00:59:24,228 --> 00:59:30,318 + 좋습니다, 이제 학습 과정(learning process) 이야기로 넘어가죠. 20분 정도 + +822 +00:59:30,318 --> 00:59:36,489 + 남았으니 할 수 있을 것 같습니다. 자, 우리는 + +823 +00:59:36,489 --> 00:59:41,420 + 데이터를 전처리하기로 했다고 합시다. 이 실험의 목적으로는 + +824 +00:59:41,420 --> 00:59:44,719 + CIFAR-10을 사용하겠습니다. 그리고 + +825 +00:59:44,719 --> 00:59:48,688 + 2층 신경망을 쓸 것입니다. 여기서 드리고 싶은 것은 + +826 +00:59:48,688 --> 00:59:51,538 + 신경망을 훈련할 때 실제로 어떤 느낌인지 + +827 +00:59:51,539 --> 00:59:52,699 + 어떻게 다루는지 + +828 +00:59:52,699 --> 00:59:56,849 + 실제로 어떻게 시작해서 + +829 +00:59:56,849 --> 00:59:59,380 + 실전에서 일을 진행하며 어떤 것들을 보게 되는지에 대한 감입니다 + +830 +00:59:59,380 --> 01:00:03,019 + 그래서 저는 제 데이터에 작은 신경망을 쓰기로 했고 + +831 +01:00:03,018 --> 01:00:08,248 + 제일 먼저 할 일은 데이터를 전처리하는 것입니다. 그다음으로 볼 것은 + +832 +01:00:08,248 --> 01:00:11,728 + 제 예측이 보정되어 있는지, 즉 제대로 동작하는지 확인하는 일입니다 + +833 +01:00:11,728 --> 01:00:16,028 + 가장 먼저, 여기서 2층 신경망을 초기화합니다 + +834 +01:00:16,028 --> 01:00:19,679 + 가중치와 편향의 초기화는 그냥 순진한 방식인데 + +835 +01:00:19,679 --> 01:00:23,969 + 아주 작은 네트워크라서 그렇게 해도 괜찮기 때문입니다 + +836 +01:00:23,969 --> 01:00:28,259 + 가우시안에서 순진하게 샘플링하는 것으로 충분하죠. 그리고 이 함수는 + +837 +01:00:28,259 --> 01:00:31,329 + 기본적으로 신경망을 훈련하는 함수입니다. 구현을 보여 드리지는 않겠지만 + +838 +01:00:31,329 --> 01:00:35,949 + 중요한 것은 이 함수가 손실과 + +839 +01:00:35,949 --> 01:00:39,170 + 모델 파라미터에 대한 그래디언트를 반환한다는 점입니다. 먼저 + +840 +01:00:39,170 --> 01:00:42,869 + 예를 들어 정규화(regularization)를 비활성화하고, 0을 전달해서 + +841 +01:00:42,869 --> 01:00:45,818 + 제 손실이 올바르게 나오는지 확인합니다 + +842 +01:00:45,818 --> 01:00:49,358 + 앞에서 언급했듯이 클래스가 10개이고 + +843 +01:00:49,358 --> 01:00:53,318 + 소프트맥스 분류기를 사용하므로, 제가 기대하는 손실은 + +844 +01:00:53,318 --> 01:00:59,099 + 1/10의 음의 로그, 즉 -log(1/10)입니다 + +845 +01:00:59,099 --> 01:01:03,180 + 그 값은 약 2.3이죠. 실행해 보니 실제로 2.3이 나옵니다 + +846 +01:01:03,179 --> 01:01:05,708 + 기본적으로 신경망이 클래스들에 대해 고르게 퍼진 + +847 +01:01:05,708 --> 01:01:09,728 + 분포를 출력하고 있다는 뜻입니다. 아직 아무것도 모르니 당연하죠. 이렇게 + +848 +01:01:09,728 --> 01:01:12,778 + 하나씩 확인해 나갑니다. 다음으로 할 수 있는 것은, 예를 들어 + +849 +01:01:12,778 --> 01:01:17,318 + 정규화를 켜는 것입니다. 그러면 당연히 손실이 올라가야 합니다. 왜냐하면 + +850 +01:01:17,318 --> 01:01:20,380 + 목적 함수에 항이 추가되었기 때문이죠. 실제로 그런지 확인합니다 + +851 +01:01:20,380 --> 01:01:20,940 + 좋습니다
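위에서 설명한 초기 손실 새너티 체크는 다음과 같이 계산해 볼 수 있습니다. 절차 주석은 자막 내용을 요약한 것입니다.

~~~python
import numpy as np

num_classes = 10
expected_initial_loss = -np.log(1.0 / num_classes)
print(expected_initial_loss)   # 약 2.302 - 초기화 직후 소프트맥스 손실이 이 근처여야 함

# 새너티 체크 절차 (자막 내용을 요약한 스케치):
# 1) 정규화를 끄고 초기 손실이 -log(1/C)에 가까운지 확인
# 2) 정규화를 켜면 손실이 올라가는지 확인
# 3) 훈련 예제 ~20개만으로 손실을 0 근처까지 (정확도 100%) 과적합해 보기
~~~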
+852 +01:01:20,940 --> 01:01:25,409 + 다음으로 제가 보통 하는, 아주 좋은 새너티 체크는 + +853 +01:01:25,409 --> 01:01:28,478 + 네트워크를 다룰 때 데이터의 작은 조각을 가져와 보는 것입니다 + +854 +01:01:28,478 --> 01:01:32,139 + 그 작은 조각에 대해서만큼은 완전히 과적합할 수 있는지 확인하는 것이죠 + +855 +01:01:32,139 --> 01:01:36,608 + 예를 들어 훈련 예제 20개 정도의 작은 샘플과 + +856 +01:01:36,608 --> 01:01:41,858 + 그 레이블 20여 개만 가져와서, 그 작은 조각으로만 훈련하고 + +857 +01:01:41,858 --> 01:01:45,179 + 손실을 거의 0까지 내릴 수 있는지, 즉 완전히 과적합할 수 있는지 확인합니다 + +858 +01:01:45,179 --> 01:01:48,379 + 제 생각에, 데이터의 작은 조각에조차 과적합할 수 없다면 + +859 +01:01:48,380 --> 01:01:54,608 + 분명히 뭔가 고장난 것입니다. 그래서 여기서 훈련을 시작하는데 + +860 +01:01:54,608 --> 01:01:58,969 + 파라미터는 임의의 숫자로 시작합니다. 전체 세부 사항으로 + +861 +01:01:58,969 --> 01:02:04,150 + 들어가지는 않겠지만, 기본적으로 비용이 0으로 가는지 확인하고 + +862 +01:02:04,150 --> 01:02:08,519 + 이 작은 데이터 조각에서 정확도 100%를 얻는지 봅니다. 그러면 + +863 +01:02:08,518 --> 01:02:12,659 + 역전파가 아마 작동하고 있고, 업데이트도 작동하고 있고 + +864 +01:02:12,659 --> 01:02:16,798 + 학습률도 어느 정도 합리적으로 설정되어 있다는 확신을 얻게 됩니다 + +865 +01:02:16,798 --> 01:02:21,190 + 작은 데이터셋에서 이 시점에 문제가 없으면, 이제 더 큰 것으로 + +866 +01:02:21,190 --> 01:02:28,079 + 확장하는 것을 생각할 수 있습니다 + +867 +01:02:28,079 --> 01:02:33,960 + 때로는 더 작게, 예를 들어 하나 + +868 +01:02:33,960 --> 01:02:37,409 + 둘, 셋 개의 예제로 시도해 볼 수도 있습니다. 정말 연습해 보고 싶다면 + +869 +01:02:37,409 --> 01:02:40,460 + 더 작은 네트워크로도 과적합할 수 있어야 합니다. 이것이 매우 좋은 새너티 체크인 이유는 + +870 +01:02:40,460 --> 01:02:45,289 + 작은 네트워크로 시작해서 과적합이 되는지 확인할 수 있고 + +871 +01:02:45,289 --> 01:02:49,039 + 안 된다면 구현에 뭔가 매우 이상한 것이 잘못되어 있는 것이기 때문입니다 + +872 +01:02:49,039 --> 01:02:52,039 + 말씀드렸듯이 이것을 통과하기 전에는 규모를 키우면 안 됩니다 + +873 +01:02:52,039 --> 01:03:02,380 + 좋습니다. 저는 데이터의 작은 조각을 가져와 이 접근을 확인했으니 + +874 +01:03:02,380 --> 01:03:05,990 + 이제 전체 데이터로 확장합니다. 이제 하려는 일은 + +875 +01:03:05,989 --> 01:03:10,049 + 더 큰 데이터셋에서 잘 작동하는 학습률을 찾는 것입니다. 여러분은 + +876 +01:03:10,050 --> 01:03:13,289 + 이것을 직접 실험하면서, 결과를 눈으로 보며 찾아야 합니다 + +877 +01:03:13,289 --> 01:03:17,219 + 대략적인 스케일부터 잡죠. 먼저 1e-6 같은 아주 작은 학습률을 + +878 +01:03:17,219 --> 01:03:22,559 + 시도해 보면, 손실이 아주 조금밖에 내려가지 않는 것을 보게 됩니다 + +879 +01:03:22,559 --> 01:03:27,509 + 손실이 거의 변하지 않으니 1e-6이라는 학습률은 아마 너무 작습니다 + +880 +01:03:27,510 --> 01:03:30,250 + 물론 손실이 안 변하는 데에는 다른 많은 이유가 있을 수 있지만 + +881 +01:03:30,250 --> 01:03:34,409 + 우리는 새너티 체크를 통과했기 때문에, 그냥 너무 작다고 보는 것입니다 + +882 +01:03:34,409 --> 01:03:38,339 + 손실이 너무 느리게 내려가니 학습률을 키워야겠다고 생각하게 되죠 + +883 +01:03:38,340 --> 01:03:43,130 + 그런데 여기 재미있는(funky) 일이 일어나는 좋은 예가 있습니다 + +884 +01:03:43,130 --> 01:03:48,280 + 제 손실은 거의 내려가지 않았는데, 실제로 제 훈련 정확도는 + +885 +01:03:48,280 --> 01:03:54,000 + 기본값인 10%에서 20%까지 뛰어올랐습니다. 어떻게 이게 말이 될까요 + +886 +01:03:54,000 --> 01:03:58,050 + 손실은 거의 변하지 않았는데, 제 정확도는 + +887 +01:03:58,050 --> 01:04:08,130 + 당연한 10%보다 훨씬 좋아졌습니다. 어떻게 가능할까요 + +888 +01:04:08,130 --> 01:04:38,860 + 생각해 보세요 + +889 +01:04:38,860 --> 01:04:46,120 + 좋습니다. 답은 정확도를 어떻게 계산하는지 생각해 보면 나옵니다 + +890 +01:04:46,119 --> 01:05:04,799 + 지금 일어나는 일은 이렇습니다. 점수들의 분포는 아직 거의 고르게 퍼져 있어서 + +891 +01:05:04,800 --> 01:05:08,769 + 손실은 대략 같은 값에 머물지만, 점수들이 조금씩 올바른 방향으로 움직이면서 + +892 +01:05:08,769 --> 01:05:12,619 + 정답 클래스의 점수가 아주 조금 더 커집니다. 그런데 정확도는 + +893 +01:05:12,619 --> 01:05:16,210 + 점수가 최대인 클래스로 계산되기 때문에, 갑자기 정답을 맞히기 시작하는 것이죠 + +894 +01:05:16,210 --> 01:05:19,530 + 이런 것들이 훈련할 때 실제로 마주치게 되는 재미있는 현상들입니다 + +895 +01:05:19,530 --> 01:05:24,900 + 좋습니다. 이제 다른 극단을 + +896 +01:05:24,900 --> 01:05:27,619 + 시도해 보죠. 아주 낮은 학습률에서는 거의 아무 일도 일어나지 않았으니 + +897 +01:05:27,619 --> 01:05:30,719 + 이번에는 1,000,000 같은 극단적으로 큰 학습률을 시도하면 어떻게 될까요 + +898 +01:05:30,719 --> 01:05:36,199 + 아마 이상한 오류가 나거나 뭔가 크게 잘못될 텐데, 그런 경우에는 + +899 +01:05:36,199 --> 
01:05:40,429 + 일이 낸시 정말 재미있는 물건은 1,000,000 중 하나 너무 좋아 어떻게 얻을 폭발 + +900 +01:05:40,429 --> 01:05:44,639 + 내가 노력 제가 그럼이 시점에서 생각하고 등이 아마 너무 높은 + +901 +01:05:44,639 --> 01:05:48,179 + 거친 지역에있는 좁힐 실제로 나에게 내 비용의 감소를 제공하는 + +902 +01:05:48,179 --> 01:05:51,409 + 그게 내가 몇 가지 여기 나의 이진 검색으로 할 노력하고있어 무엇 스레드 + +903 +01:05:51,409 --> 01:05:54,739 + 포인트 난 당신이 내가 십자가 있어야 할 곳에 대략 알고에 대한 몇 가지 아이디어를 얻을 + +904 +01:05:54,739 --> 01:05:55,929 + 검증 + +905 +01:05:55,929 --> 01:06:00,019 + 이 시점에서 적절한 최적화처럼 내가 약속 최선을 찾을려고 + +906 +01:06:00,019 --> 01:06:04,030 + 내 네트워크 권리를 우리가 찾는 과정에서 이동되는 연습 할 좋아 + +907 +01:06:04,030 --> 01:06:07,820 + 전략은 그래서 처음 난 그냥 우리가 배우고 함께 연주하여 거친 생각을 가지고 + +908 +01:06:07,820 --> 01:06:11,550 + 리처드 난 코스 검색을 수행 한 후있는 것은 더 큰 유사한 속도를 놀라운된다 + +909 +01:06:11,550 --> 01:06:16,180 + 세그먼트 후 내가 어떤 작품을보고 나서이 좁은이 과정을 반복 + +910 +01:06:16,179 --> 01:06:20,500 + 이 지역의 그 일을 잘 확인 여기에 이​​렇게 빨리 그리고 당신의 코드에 대한 + +911 +01:06:20,500 --> 01:06:23,719 + 예를 폭발을 감지하고 초기는 측면에서 좋은 단계처럼 탈옥 + +912 +01:06:23,719 --> 01:06:28,339 + 내가 루프 위치를 가지고 구현 그래서 효과적으로 여기서 뭘하는지 + +913 +01:06:28,340 --> 01:06:31,579 + 나는이 사건을 정규화 말을 배우는 내 총리 샘플 + +914 +01:06:31,579 --> 01:06:36,849 + 속도는 나는 이러한 정확성 그래서 내가 여기에 몇 가지 결과를 얻을 훈련을 샘플링 + +915 +01:06:36,849 --> 01:06:40,179 + 검증 데이터에 이러한 그들을 생산 너무 높은 예비 선거는 + +916 +01:06:40,179 --> 01:06:44,440 + 정확성의 일부는 당신은 그들이 몇 가지를 아주 잘 그래서 50 %였다 40 % 있음을 볼 수있다 + +917 +01:06:44,440 --> 01:06:47,409 + 그들 모두에서 잘 작동하지 않는 것은 그래서 이것은 나에게 어떤 범위에 대한 아이디어를 제공합니다 + +918 +01:06:47,409 --> 01:06:50,659 + 학습의 요금 및 규정은 상대적으로 잘 작동됩니다 + +919 +01:06:50,659 --> 01:06:55,079 + 이 최적화를 수행 할 때 당신은 단지 작은 먼저 밖으로 시작할 수 있습니다 + +920 +01:06:55,079 --> 01:06:58,090 + 당신은 아주 긴 시간 동안 실행하려고 시대의 수는 몇 가지에 대해 실행 + +921 +01:06:58,090 --> 01:07:02,680 + 다른보다 더 일하고 어떤 분은 이미 감각을 얻을 수 있습니다 + +922 +01:07:02,679 --> 01:07:08,259 + 사물과 하나 노트 당신은 정규화 학습을 통해 최적화하고 + +923 +01:07:08,260 --> 01:07:12,320 + 단순히 그냥 유니폼에서 샘플링하고 싶지 않은 공간을 산책하는 것이 가장 좋습니다 평가 + +924 +01:07:12,320 --> 01:07:16,510 + 유통이 학습 요금과 정규화 그들이 행동 때문에 + +925 +01:07:16,510 --> 01:07:20,180 + 곱셈 허리 전파의 역학에 리와는 그래서는 이유 + +926 +01:07:20,179 --> 01:07:25,319 + 당신은 내가 흑인 (326)에서 샘플링하고있어 볼 수 있도록 잠금 공간에서이 작업을 수행 할 수 + +927 +01:07:25,320 --> 01:07:28,350 + 하여 학습 속도와 지수와 나는 10의 힘으로 상승하고있어 + +928 +01:07:28,349 --> 01:07:33,319 + 그것의 전원을 10 놀라운 그래서 당신은 단지에서 샘플링되고 싶지 않아 + +929 +01:07:33,320 --> 01:07:38,610 + 귀하의 샘플의 대부분 때문에 백 같은 균일 한 0012은에 가지 있습니다 + +930 +01:07:38,610 --> 01:07:41,820 + 불량 영역 바로 학습 속도가 곱셈 상호 작용 때문에 + +931 +01:07:41,820 --> 01:07:50,050 + 뭔가 비교적 잘 내가 두 번째 패스를하고있어 어떤 작품을 알고 있어야합니다 + +932 +01:07:50,050 --> 01:07:52,950 + 어디 가지에 갈거야 그리고 난 다시 약간의 이러한 변화 그리고 난 해요 + +933 +01:07:52,949 --> 01:07:58,139 + 그래서 어떤 작품을보고 난 지금이​​ 작업의 일부를 (253)를 얻을 수있는 것을 발견 + +934 +01:07:58,139 --> 01:08:02,460 + 정말 잘 한 일이 가끔이 같은 결과를 얻을 알고 있어야합니다 + +935 +01:08:02,460 --> 01:08:06,920 + (53)는 아주 잘 작동하고 난 난 이것을 볼 경우이 사실은 더 나쁘다 + +936 +01:08:06,920 --> 01:08:11,440 + 나는이 교차 검증을 통해 너무이​​기 때문에 실제로이 시점에서 걱정 + +937 +01:08:11,440 --> 01:08:14,490 + 여기에 나는 이것에 대해 실제로 뭔가 문제가 여기에 결과를 가지고 + +938 +01:08:14,489 --> 01:08:21,880 + 일부 문제에서 힌트 결과 + +939 +01:08:21,880 --> 01:08:31,279 + 문제 + +940 +01:08:31,279 --> 01:08:54,109 + 사실은 꽤 일관성이 너​​무 여기에 일어나는 놀라운 학습 봐 + +941 +01:08:54,109 --> 01:08:58,759 + 93 (94) 사이의 속도는 경향 및 I는 아주 좋은 결과와 끝까지 + +942 +01:08:58,760 --> 01:09:00,690 + 난 무엇을 단지 경계 + +943 +01:09:00,689 --> 01:09:06,960 + 이상 최적화하기 때문에 이것이 거의 13 그것은 거의 0001 인 끝나는이다 + +944 +01:09:06,960 --> 01:09:10,510 + 정말 정말 좋은 결과를 얻는 몇 가지를 통해 찾고 있어요 무엇의 경계 + +945 +01:09:10,510 --> 01:09:14,780 + 의 가장자리에 내가 바라는 건 그게 잘되지 무슨 아마 올해 때문에 + 
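위 자막들이 설명하는 거친(coarse) 탐색에서 세밀한(fine) 탐색으로 가는 학습률 찾기는, 로그 공간에서의 무작위 샘플링으로 스케치할 수 있습니다. 범위와 train_and_eval 함수는 설명을 위한 가정입니다.

~~~python
import numpy as np

# 거친 탐색: 로그 공간에서 무작위 샘플링 (범위는 예시로 가정)
for _ in range(10):
    lr  = 10 ** np.random.uniform(-6, -3)   # 학습률은 곱셈적으로 작용하므로 로그 스케일로 샘플링
    reg = 10 ** np.random.uniform(-5, 5)    # 정규화 세기도 마찬가지
    # val_acc = train_and_eval(lr, reg)     # (가상의 함수) 몇 에폭만 돌려 검증 정확도 확인
    print('lr %e, reg %e' % (lr, reg))
# 잘 되는 영역을 찾으면 범위를 좁혀 두 번째(fine) 탐색을 반복한다.
# 좋은 결과가 탐색 범위의 '가장자리'에 있으면 범위를 옮겨 다시 탐색한다.
~~~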
+946 +01:09:14,779 --> 01:09:18,719 + 나는 그것을 정의한 방법은 실제로 최적이며, 그래서 있는지 확인하려면 + +947 +01:09:18,720 --> 01:09:21,560 + 더 나은이있을 수 있기 때문에 나는이 일을 파악하고 난 그냥 내 범위 + +948 +01:09:21,560 --> 01:09:22,520 + 결과 + +949 +01:09:22,520 --> 01:09:26,390 + 내가 부정적인 변경하려면 어쩌면 약간이 길을가는 32 음의 두 개 + +950 +01:09:26,390 --> 01:09:32,570 + 2.5하지만 정규화에 대한 그게 어쩌면 내가 해요 아주 잘 작동 참조 + +951 +01:09:32,569 --> 01:09:38,529 + 약간 더 나은 장소 등의 난을 좋아하는이 한 일에 대해 너무 걱정 + +952 +01:09:38,529 --> 01:09:42,739 + 당신이 나를 샘플 꿀벌 무작위도의 균일 한 경향 볼로 지적 + +953 +01:09:42,739 --> 01:09:46,639 + 이 일이 어떤 샘플링 임의의 정규화 학습 복귀 무엇을 + +954 +01:09:46,640 --> 01:09:49,829 + 사람들은 그래서 정말 그리드 검색이라고 무엇을 함께 할 때로 볼 수 있습니다 + +955 +01:09:49,829 --> 01:09:53,920 + 사람들이에 가고 싶어 무작위로 여기의 차이는 대신 샘플링입니다 + +956 +01:09:53,920 --> 01:09:58,789 + 학습 속도 조절 등 모두 일정량 씩 + +957 +01:09:58,789 --> 01:10:02,519 + 당신은 심지어 학습의 일부 설정을 통해 여기 더블 루프 끝 + +958 +01:10:02,520 --> 01:10:03,740 + 정규화의 설정 + +959 +01:10:03,739 --> 01:10:07,590 + 철저한 되려고 노력이이 실제로 나쁜 생각은 실제로하지 않습니다 + +960 +01:10:07,590 --> 01:10:12,720 + 실제로 언제나 당신을 몇 무작위로 간단하고 직관적으로 잘 작동하지만, + +961 +01:10:12,720 --> 01:10:16,280 + 다음 단계로 가고 싶지 않아 무작위로 샘플링 할 여기에 이​​유가 + +962 +01:10:16,279 --> 01:10:23,319 + 에 대한 그것의 종류 그것에 대해 생각이야 그러나 이것은 내가 샘플링 좋은 검색 방법입니다 + +963 +01:10:23,319 --> 01:10:31,579 + 간격을 설정하고 난 당신이 과세 표준을 쓸어 알고있는 회사와을 가질 수 없습니다 + +964 +01:10:31,579 --> 01:10:35,090 + 난 그냥 무작위로 문제로에서 샘플링 무작위 표본 추출은이다 + +965 +01:10:35,090 --> 01:10:38,930 + 최적화 및 훈련 그들이 작동하는 모든 무엇 자주 발생하는 것으로되어있어 + +966 +01:10:38,930 --> 01:10:41,800 + 그녀는 매개 변수 중 하나가 훨씬 더 중요한 다른 것보다이 될 수있어 + +967 +01:10:41,800 --> 01:10:43,039 + 매개 변수 + +968 +01:10:43,039 --> 01:10:45,989 + 그래서 이것은 중요한 파라미터이다라고 그 성능 + +969 +01:10:45,989 --> 01:10:49,349 + 당신의 손실 함수의 성능은 정말 흰색 차원의 함수가 아니다 + +970 +01:10:49,350 --> 01:10:52,510 + 하지만 정말 당신이 더 나은 결과를 얻을 전시의 함수이다 + +971 +01:10:52,510 --> 01:10:58,699 + 종종 인 X 축을 따라이 다음에 해당되는 경우, 특정 영역 + +972 +01:10:58,699 --> 01:11:02,170 + 경우는이 경우에 당신은 실제로 뭔가를 많이 끝날거야 + +973 +01:11:02,170 --> 01:11:06,300 + 다른 세금과 당신은 당신이했습니다 여기보다 더 좋은 자리와 끝까지 + +974 +01:11:06,300 --> 01:11:09,850 + 정확한 지점에서 샘플링하면에 걸쳐 모든 종류의 정보를 얻기하지 않는 + +975 +01:11:09,850 --> 01:11:14,910 + 그 말이 경우 전 그렇게 때문에 항상 있습니다 이러한 경우 임의 사용 + +976 +01:11:14,909 --> 01:11:24,220 + 실제로 내가 약속 당신에게 벅에 대한 더 많은 강타를 줄 것이다 일반적인 임의 + +977 +01:11:24,220 --> 01:11:28,520 + 아마 학습 속도를 수있는 가장 일반적인 사람과 놀고 싶어 + +978 +01:11:28,520 --> 01:11:32,920 + 업데이트는 어쩌면 우리가 조금이에 갈 거 야에 거 야를 입력합니다 + +979 +01:11:32,920 --> 01:11:36,899 + 정규화와 드롭 아웃 금액은 우리는 그래서이가로 갈거야 + +980 +01:11:36,899 --> 01:11:42,979 + 정말 너무 재미는 방법은 실제로 그래서 그러나 이것은 우리가를 가지고있는 것 같습니다 + +981 +01:11:42,979 --> 01:11:46,679 + 컴퓨터 비전 클러스터의 예를 들어 우리는 그래서 난 그냥 수 많은 기계가 + +982 +01:11:46,680 --> 01:11:49,829 + 많은 기계에서 내 훈련을 배포하고 나는 자신을 작성했습니다 + +983 +01:11:49,829 --> 01:11:53,100 + 예를 들어 의견은 이러한 모든 모든 손실 기능이 어디 얼굴을 설정 + +984 +01:11:53,100 --> 01:11:56,880 + 다른 기계와 컴퓨터와이 여기에 모두 몇 가지 클러스터 + +985 +01:11:56,880 --> 01:12:01,270 + 기본적으로 어떻게 작동하고 무엇을 통해 검색하고 내가 볼 수있는 것은 아니고, 내가 할 수있는 + +986 +01:12:01,270 --> 01:12:04,370 + 내가 확인이 모든 단계에서 작동하지 않는 말을 할 수 있도록 내 노동자들에게 명령을 보낼 + +987 +01:12:04,369 --> 01:12:07,399 + 재 샘플 당신은 전혀 잘하지 않는 이들 중 일부는 아주 잘하고있다 + +988 +01:12:07,399 --> 01:12:10,960 + 나는 정확히 잘 작동하고 무엇을보고 나는 그에게 동적 조정 해요 + +989 +01:12:10,960 --> 01:12:14,020 + 내가 통과해야 할 과정은 실제로 잘 작동하는 물건을 얻기 위해 + +990 +01:12:14,020 --> 01:12:17,490 + 그는 단지 너무 많은 물건을 가지고 있기 때문에 이상 최적화하고 당신은 단지에 여유가 있습니다 + +991 +01:12:17,489 --> 01:12:21,569 + 스프레이 당신이 함께 일해야기도 + +992 +01:12:21,569 --> 01:12:25,759 + 확인 그래서 당신은 최적화 당신은 손실 함수에서 찾고 + +993 +01:12:25,760 --> 01:12:29,289 + 손실 
함수는 여러 가지 다른 형태를 취할 수 있으며, 당신은 할 수 있어야합니다 + +994 +01:12:29,289 --> 01:12:34,510 + 당신이보고에서 당신은 꽤 좋은거야있을거야, 그래서 그게 무슨 뜻인지에 읽기 + +995 +01:12:34,510 --> 01:12:38,289 + 그것이 예를 들어이 일을 무슨 재미로 손실 함수 + +996 +01:12:38,289 --> 01:12:42,409 + 아마에 사용 된 이전 강의가 같은 지수 아니라고 지적 내 + +997 +01:12:42,409 --> 01:12:47,359 + 당신이 그래서 우물쭈물하고 조금 보이는 알고에 손실 함수를 내가 원하는 + +998 +01:12:47,359 --> 01:12:50,949 + 어쩌면이되지 않도록으로 학습 속도가 약간 너무 낮은 수 있음을 알려줍니다 + +999 +01:12:50,949 --> 01:12:53,069 + 학습 속도가 그냥 고려할 수 있음을 의미 너무 낮 의미 + +1000 +01:12:53,069 --> 01:12:54,359 + 견딜 수 없는 + +1001 +01:12:54,359 --> 01:12:58,549 + 당신이 고원을 가질 수 있도록 아침 때로는 재미 모든 종류의 것들을 얻을 + +1002 +01:12:58,550 --> 01:13:04,199 + 어떤 시점에서 그 결정 것 인 지금 당신이 그렇게 일반적으로 최적화 실행 + +1003 +01:13:04,198 --> 01:13:15,948 + 의 경우 이러한 종류의 유력한 용의자가 무엇 단지 저를 생각하고 나는 생각 + +1004 +01:13:15,948 --> 01:13:19,388 + 총리가 올바르게 그라디언트를 초기화하고 거의 의심 + +1005 +01:13:19,389 --> 01:13:23,579 + 흐르는하지만 어떤 점에서 그들은까지 추가하고 그냥 몇 가지 연구 훈련을 보았다 + +1006 +01:13:23,579 --> 01:13:27,420 + 사실 많은 재미가 내가 잠시 동안 전체 텀블러를 너무 재미있어 시작이 + +1007 +01:13:27,420 --> 01:13:34,260 + 이 사람들이 이러한 기여를 통해 그들이 갈 수 있도록 전 및 기능을 상실 + +1008 +01:13:34,260 --> 01:13:38,300 + 어떤 좋은 서비스 나는 그렇게 생각하고 훈련 특히 네트워크 전송 + +1009 +01:13:38,300 --> 01:13:43,550 + 우리가이 들어갈거야 것은 이국적인 모양의 모든 종류 I 정확히 아니에요입니다 + +1010 +01:13:43,550 --> 01:13:48,730 + 이 중 하나는 그것이 무슨 의미인지 정말 모르는 어떤 점에서 알 수 + +1011 +01:13:48,729 --> 01:13:52,569 + 잘 + +1012 +01:13:52,569 --> 01:14:04,469 + 그래, 그래서 여기 몇 가지 동시에 훈련하는 작업과 그냥이 + +1013 +01:14:04,469 --> 01:14:08,139 + 그런데 나는이 사실을 훈련한다 무엇을 여기에 무슨 일이 있었 알고 + +1014 +01:14:08,139 --> 01:14:11,170 + 그렇지으로 강화 학습 강화에 에이전트에게 문제를 학습 + +1015 +01:14:11,170 --> 01:14:14,679 + 만약 고정 자산 투자 학습 갖고 있지 않은 고정식 분포를 갖도록 + +1016 +01:14:14,679 --> 01:14:17,800 + 정책 변화와 당신이 끝날 경우 에이전트 환경과 상호 작용 + +1017 +01:14:17,800 --> 01:14:21,199 + 벽에 응시하거나 당신의 공간의 다른 부분을보고 결국 같은 + +1018 +01:14:21,198 --> 01:14:24,629 + 당신은 다른 데이터 분포와 끝까지 그래서 갑자기 난 + +1019 +01:14:24,630 --> 01:14:27,109 + 내가 사용하는 것보다 매우 다른 것을보고는보고있다 및 난 + +1020 +01:14:27,109 --> 01:14:30,098 + 내 에이전트를 훈련 손실은 에이전트가 익숙하기 때문에까지 간다 + +1021 +01:14:30,099 --> 01:14:33,569 + 그 템플릿의 종류 그래서 당신은 재미있는 물건이 일어나고있는 모든 종류가있다 + +1022 +01:14:33,569 --> 01:14:40,578 + 다음이 하나 내가 아무 생각이 무엇을 기본적으로 일어나지 않았을 내 즐겨 찾기 중 하나입니다 + +1023 +01:14:40,578 --> 01:14:45,988 + 여기 손실은 진동하지만 대략 수행하고 그냥 폭발 온다 + +1024 +01:14:45,988 --> 01:14:53,238 + 분명히 뭔가가이 경우 오른쪽 않았고, 또한 여기에 그냥 사람이 있어요 + +1025 +01:14:53,238 --> 01:14:57,789 + 수렴하기로하고 재미의 모든 종류를 얻을 수 있도록 아무 생각이 잘못 없었다 + +1026 +01:14:57,789 --> 01:15:01,368 + 일이 당신의 임무에 재미 플롯으로 끝날 경우에 보내 마십시오 + +1027 +01:15:01,368 --> 01:15:02,948 + 트리오 로스 판초 스하지만, + +1028 +01:15:02,948 --> 01:15:06,219 + 훈련 도중 강력한 + +1029 +01:15:06,219 --> 01:15:09,899 + 만 손실 함수와 보는 다른 것은 보지 않는 당신의 정확성이다 + +1030 +01:15:09,899 --> 01:15:14,929 + 가끔 정확성을보고 선호하므로 특히 예를 들어 정확도 + +1031 +01:15:14,929 --> 01:15:18,248 + 무슨 기능을 통해 정확도는 해석하기 때문에 나는 어떤이 알고 + +1032 +01:15:18,248 --> 01:15:22,519 + 손실 함수는에 대한 분류의 정확도는 절대적인 의미 + +1033 +01:15:22,519 --> 01:15:27,369 + 아마로 해석 등 특히 난에 대한 손실이없는 내 + +1034 +01:15:27,368 --> 01:15:31,589 + 구원 데이터와 나의 훈련 때문에이 경우 예를 들어 나는 그 말을 해요 + +1035 +01:15:31,590 --> 01:15:35,288 + 내 트레이닝 데이터의 정확도가 훨씬 더 검증 정확성을 받고 + +1036 +01:15:35,288 --> 01:15:38,929 + 당신에게 힌트를 줄 수있는이 사람에 따라 이렇게 개선 중지 것을 + +1037 +01:15:38,929 --> 01:15:42,380 + 특히이 경우에 후드에 갈 수있는 큰 격차가 여기에있다 + +1038 +01:15:42,380 --> 01:15:44,440 + 그래서 어쩌면 내가 overfitting 생각 해요 + +1039 +01:15:44,439 --> 01:15:48,069 + 100 % 확신하지만 난 강하게 나는 정기적으로 시도 할 수도 있습니다 과불 수 있습니다 + +1040 +01:15:48,069 --> 01:15:57,038 + 물건도보고 될 수 있습니다 때의 차이를 추적 + +1041 
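여기서부터 이어지는 자막들이 설명하는 '업데이트 크기 / 파라미터 크기' 비율 점검은 대략 다음과 같이 해 볼 수 있습니다. 변수와 수치는 예시이며, 아래 자막의 경험칙(약 1e-3)을 코드 주석으로 옮긴 것입니다.

~~~python
import numpy as np

# 업데이트 크기 / 파라미터 크기 비율 점검 스케치 (변수는 예시)
W = 0.01 * np.random.randn(500, 500)
dW = np.random.randn(*W.shape)            # 역전파로 얻은 그래디언트라고 가정
learning_rate = 1e-3
update = -learning_rate * dW

ratio = np.linalg.norm(update) / np.linalg.norm(W)
print(ratio)   # 경험칙: 대략 1e-3 근처가 적당
               # 훨씬 크면 학습률을 낮추고, 훨씬 작으면 높이는 것을 고려
~~~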
+01:15:57,038 --> 01:16:01,988 + 당신의 매개 변수의 규모와 그 매개 변수로 업데이트의 규모 때문에 + +1042 +01:16:01,988 --> 01:16:06,748 + 당신이 당신의 무게 단위 분출의 순서에 있다고 가정하고 그래서있어 말 + +1043 +01:16:06,748 --> 01:16:10,599 + 다음 직관적으로 당신에 의해 당신의 무게를 증가 업데이트 및 + +1044 +01:16:10,599 --> 01:16:14,349 + 역 전파 당신은보다 훨씬 큰 것으로 해당 업데이트를하지 않으 + +1045 +01:16:14,349 --> 01:16:16,679 + 분명히 가중치 또는 당신은 그들이 작은되고 싶어 + +1046 +01:16:16,679 --> 01:16:20,529 + 당신의 무게의 순서에있을 때 당신의 업데이트는 1987 년 정도가 될 수 있습니다 + +1047 +01:16:20,529 --> 01:16:25,359 + 음 너무 그래서 당신이 증가하는 약이야 업데이트를 보면 하나 + +1048 +01:16:25,359 --> 01:16:29,439 + 당신의 무게에 그냥 예를 들어,이 표준 보는 색상 사각형과 + +1049 +01:16:29,439 --> 01:16:34,129 + 일반적으로 귀하의 매개 변수의 규모와의 좋은 규칙 업데이트에 비해 + +1050 +01:16:34,130 --> 01:16:38,550 + 엄지 손가락이 대략 13 그래서 기본적으로 모든 업데이트 할 수 있어야 당신의 + +1051 +01:16:38,550 --> 01:16:41,360 + 의 하나 하나에 대한 세 번째 유효 숫자와 같은 순서에 수정 + +1052 +01:16:41,359 --> 01:16:44,118 + 매개 변수를 오른쪽 당신은 당신이 매우 작은 결정하지 않는 거대한 업데이트를 제작하지 않는 + +1053 +01:16:44,118 --> 01:16:49,708 + 그래서 업데이트는이 경우 일반적으로 확인 작동 대략 13를보고 한 가지 + +1054 +01:16:49,708 --> 01:16:53,038 + 너무 높은 어쩌면 말 내 학습 등의 방법이 너무 낮게 감소 할 + +1055 +01:16:53,038 --> 01:17:00,069 + 107 아마 내 학습 속도를 증가 할 그래서 요약 오늘날 우리 것 + +1056 +01:17:00,069 --> 01:17:05,308 + 교육 신경 네트워크 청록색 함께 할 수있는 일의 전체 무리 보았다 + +1057 +01:17:05,309 --> 01:17:09,729 + 그들 모두의 팔은 당신이 사용하는 트랙을 의미 잃게 기본적으로 있습니다 + +1058 +01:17:09,729 --> 01:17:11,869 + 초기화 + +1059 +01:17:11,869 --> 01:17:15,750 + 당신은 당신이 작은 네트워크를 생각하는 경우 또는 당신은 어쩌면 그냥 멀리 얻을 수 있습니다 + +1060 +01:17:15,750 --> 01:17:20,399 + 당신의 규모 2001 선택하거나 어쩌면 당신은 그와 조금 놀고 싶어하고있다 + +1061 +01:17:20,399 --> 01:17:26,719 + 여기에 강한 권고 난 그냥 생각하지 사용할 때 당신은 내가 아니에요하고있는 + +1062 +01:17:26,720 --> 01:17:34,110 + 내 결정 프로그램을 샘플링해야하고 많은에게 기부를 할 때 적절하고 + +1063 +01:17:34,109 --> 01:17:39,449 + 그주의해야 할 뭔가 이것은 우리가 아직도 충당하기 위해 무엇을하고 그 + +1064 +01:17:39,449 --> 01:17:44,269 + 이 경우 내가 질문을 할 수 있도록 우리가 두 분 이상을해야합니까 옆에있을 것입니다 + +1065 +01:17:44,270 --> 01:18:01,520 + 어떤 + +1066 +01:18:01,520 --> 01:18:11,120 + 사이의 상관 관계 + +1067 +01:18:11,119 --> 01:18:15,729 + 나는 어떤 분명히 당신이 얻을 필요가 추천 할 수 있다고 생각하지 않습니다 + +1068 +01:18:15,729 --> 01:18:18,769 + 그 검사는 그게 분명 나를 밖으로 점프 거기에 아무것도 생각하지 않습니다 + +1069 +01:18:18,770 --> 01:18:35,210 + 확인 위대한 질문에서 다른 커플 + +1070 +01:18:35,210 --> 01:18:35,949 + 에 대한 질문 + diff --git a/captions/Ko/Lecture6_ko.srt b/captions/Ko/Lecture6_ko.srt new file mode 100644 index 00000000..5153a78f --- /dev/null +++ b/captions/Ko/Lecture6_ko.srt @@ -0,0 +1,3652 @@ +1 +00:00:00,000 --> 00:00:07,009 + 확인 그래서 우리는 다시 신경망을 훈련에 대해 얘기하자 오늘은 이제 첫 무엇과 + +2 +00:00:07,009 --> 00:00:10,449 + 나는 우리가 다이빙을하기 전에 작동 쇼에 오는 당신에게 인터뷰의 비트를 줄 것이다 + +3 +00:00:10,449 --> 00:00:15,489 + 그 소재 단지 일부 관리 것을 먼저 첫 번째 I로 + +4 +00:00:15,490 --> 00:00:18,618 + 기회를하지 않았다 실제로 인터뷰 저스틴 마지막 강의 저스틴입니다하기 + +5 +00:00:18,618 --> 00:00:21,579 + 이 클래스 또한 강사 그는 처음 2 주 동안 실종됐다 + +6 +00:00:21,579 --> 00:00:28,409 + 그들은 그가 어쩌면 매우 지식의 나에게 아무것도에 대해 아무것도 요청할 수 있습니다 + +7 +00:00:28,410 --> 00:00:29,428 + 즉, 삼가의 + +8 +00:00:29,428 --> 00:00:37,960 + 확인하고 72 그렇게 꽤 오랫동안의 알림 내가 시작하는 것이 좋습니다으로 밖으로 + +9 +00:00:37,960 --> 00:00:43,850 + 여기에 구축하고는 기본적으로 다음 주 금요일은 그래서 가능한 한 빨리 그에 시작 할 수있어합니다 + +10 +00:00:43,850 --> 00:00:47,679 + 가능하면 앞으로의 적절한 API와 함께 작동 노하우를 구현 + +11 +00:00:47,679 --> 00:00:50,429 + 뒤로 클래스와 당신은 경쟁의 추상화 사로 잡고 볼 수 있습니다 + +12 +00:00:50,429 --> 00:00:54,820 + 다시 내 세션으로 이동 중퇴하고 실제로 구현합니다 + +13 +00:00:54,820 --> 00:00:57,770 + 상업 네트워크 실제로이이 과제의 말 때문에 + +14 +00:00:57,770 --> 00:01:00,770 + 강한에 오는 방법의 모든 낮은 수준의 세부 사항을 매우 잘 이해하고 + +15 +00:01:00,770 --> 00:01:06,530 + 네트워크 분류 난 그냥 확인 해요 그래서 우리는 단지 신호로이 클래스에 위치 + +16 
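The entries above describe keeping an eye on the ratio between the scale of the updates and the scale of the parameters, with roughly 1e-3 as the rule of thumb. A minimal sketch of that diagnostic, with stand-in arrays for a weight matrix `W` and its gradient `dW`:

~~~python
import numpy as np

W = 0.01 * np.random.randn(500, 10)   # stand-in for one layer's weights
dW = np.random.randn(500, 10)         # stand-in for its backprop gradient

learning_rate = 1e-3
update = -learning_rate * dW
update_scale = np.linalg.norm(update.ravel())
param_scale = np.linalg.norm(W.ravel())
print(update_scale / param_scale)  # aim for roughly 1e-3
# much larger: consider decreasing the learning rate; much smaller: increase it
W += update
~~~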
+00:01:06,530 --> 00:01:10,140 + 다시 우리는 네트워크에서 훈련을한다 밖으로 신경망을 훈련하고 회전하는 + +17 +00:01:10,140 --> 00:01:15,590 + 정말 4 단계 프로세스는 전체 데이터 세트의 이미지와 라벨 우리가 + +18 +00:01:15,590 --> 00:01:18,920 + 우리가 네트워크를 통해 전파 생각했다 데이터 세트에서 작은 백을 샘플링 + +19 +00:01:18,920 --> 00:01:23,060 + 우리는 현재 분류​​하고 얼마나 잘 우리에게 말하고있는 손실에 도착합니다 + +20 +00:01:23,060 --> 00:01:26,390 + 데이터의 파견 그리고 우리는 모두의 기울기를 완료하기 위해 전파 + +21 +00:01:26,390 --> 00:01:29,969 + 무게와이 그라데이션이 우리에게 말하고 우리가 어떻게 매일 대기 확실하지해야 + +22 +00:01:29,969 --> 00:01:33,789 + 네트워크에 우리는 더 나은 다음 번에이 이미지를 분류하고 있도록 + +23 +00:01:33,790 --> 00:01:36,700 + 우리가 실제로 그렇게 할 경우 우리는 그라데이션 우리가 차 업데이트를 사용할 수있다 + +24 +00:01:36,700 --> 00:01:38,930 + 작은 홈 + +25 +00:01:38,930 --> 00:01:42,659 + 지난 시간 우리는 활성화 기능으로 보면서 나는 활성화 피곤 해요 + +26 +00:01:42,659 --> 00:01:45,368 + 기능과 어떤 장점과 이러한 내부자 신경 중 하나를 사용의 단점 + +27 +00:01:45,368 --> 00:01:49,060 + 물었을 때 좋은 질문이 너무 광장에서 들어오는 네트워크 이유도 당신 것 + +28 +00:01:49,060 --> 00:01:53,939 + 정품 인증 기능을 사용하는 이유는 그냥 건너 뛰고 질문을 제기했다하지 + +29 +00:01:53,938 --> 00:01:57,618 + 난 정말 기본적으로 마지막 강의에 매우 능숙하게이 문제를 해결하는 데있어 + +30 +00:01:57,618 --> 00:02:00,790 + 전체 신경망이 끝나는 경우보다 활성화 함수를 사용하지 않는다면 + +31 +00:02:00,790 --> 00:02:05,500 + 당신의 샌드위치 하나 하나가되는 등 용량 단지의 그것과 동일하다 + +32 +00:02:05,500 --> 00:02:10,080 + 그 활성화 기능이 정말 중요하다, 그래서 선형 분류 + +33 +00:02:10,080 --> 00:02:13,880 + 사이에 그들은 그들은 당신에게 당신이 사용할 수있는 모든 방법을 제공 것들입니다 + +34 +00:02:13,879 --> 00:02:17,490 + 실제로 데이터를 넣어 우리는 전처리에 대해 간단히 이야기 + +35 +00:02:17,490 --> 00:02:21,860 + 기술하지만, 아주 간단히 우리는 또한 활성화 기능을 보았고, + +36 +00:02:21,860 --> 00:02:24,830 + 신경망 여기 그래서 문제 전반에 걸쳐 자신의 분포 I + +37 +00:02:24,830 --> 00:02:31,370 + 우리는 이러한 초기 가중치를 선택해야하고 특히 전화가 참조 + +38 +00:02:31,370 --> 00:02:34,930 + 기다리는 사람들을 방법을 큰 규모는 처음에하고 우리는 보았다 + +39 +00:02:34,930 --> 00:02:38,260 + 이 경우 그 그 무게는 신경에​​ 활성화 너무 작은 경우 + +40 +00:02:38,259 --> 00:02:41,909 + 네트워크는 깊은 네트워크가 0으로 이동이 있고 당신이 설정 한 경우 그 기술은 그대로 + +41 +00:02:41,909 --> 00:02:45,129 + 그들 모두보다 높은에 가능성이 대신 폭발하고 그래서 당신은 끝낼 + +42 +00:02:45,129 --> 00:02:48,939 + 다른 네트워크 슈퍼 포화 또는 해당 단지에 대한 모든 네트워크와 끝까지 + +43 +00:02:48,939 --> 00:02:54,189 + 0과 1 그래서 그 규모는 우리가 들여다 설정하는 매우 매우 까다로운 일이다 + +44 +00:02:54,189 --> 00:02:59,579 + 당신이에 사용하는 것은 합리적인 종류를 제공 초기화 + +45 +00:02:59,580 --> 00:03:03,290 + 형성하고는 기본적으로 대략 좋은 활동 활성화 또는를 제공합니다 + +46 +00:03:03,289 --> 00:03:06,459 + 훈련의 시작 부분에서 네트워크를 통해 활성화의 분포 + +47 +00:03:06,459 --> 00:03:10,959 + 그리고, 우리는 많이 경감이 일에 가장 정상화에 들어갔다 + +48 +00:03:10,959 --> 00:03:14,120 + 실제로 제대로 그 기술을 설정하고 세바스찬 이러한 두통의 + +49 +00:03:14,120 --> 00:03:16,689 + 법안이에게 그들이 필요 없어 훨씬 더 강력한 선택한다 + +50 +00:03:16,689 --> 00:03:20,550 + 정확하게 맞 초기 규모를 얻고 우리는 현재의 모든 호출에 갔다 + +51 +00:03:20,550 --> 00:03:23,620 + 우리는 잠시 동안 그것에 대해 이야기하고 우리는 학습에 대해 이야기 + +52 +00:03:23,620 --> 00:03:26,920 + 당신이 실제로 할 방법에 대한 팁과 트릭의 종류를 표시하려고에 의해 처리 + +53 +00:03:26,919 --> 00:03:29,809 + 말했다 당신이 그들을 어떻게 또한 제대로 훈련받을 방법이 신경망 + +54 +00:03:29,810 --> 00:03:34,860 + 위반에 걸쳐 실행 방법 천천히 시간이 지남에 너무 렌더링 일어나 + +55 +00:03:34,860 --> 00:03:37,769 + 우리는 몇 가지로 갈거야 그래서이 시간에 대한 모든 것을 지난 시간에 이야기 + +56 +00:03:37,769 --> 00:03:41,060 + 위 특정 매개 변수에 훈련 신경 네트워크의 나머지 항목 + +57 +00:03:41,060 --> 00:03:44,989 + 계획은 나는 대부분의 부분을 생각하고 우리는 내 난 앙상블 드롭 아웃에 대해 조금 얘기하자 + +58 +00:03:44,989 --> 00:03:49,480 + 나는 그 어떤 행정 일에 난 내 길을 뛰어 등등 전에 있도록 + +59 +00:03:49,479 --> 00:03:53,509 + 잊고 반드시 그렇게 + +60 +00:03:53,509 --> 00:03:58,030 + 차 업데이트 신경망을 훈련에 프로세스가 거기에 있기 때문에 + +61 +00:03:58,030 --> 00:04:01,199 + 이것은 정말 당신이 위반에 대해는 그 모습에 의사입니다 + +62 +00:04:01,199 --> 00:04:04,419 + 법에 심각한 그라데이션 내가 얘기 공연 차 업데이트 + +63 +00:04:04,419 --> 00:04:08,030 + 매개 변수 업데이트는 특히 여기 어디에서이 마지막 줄보고 
있었다
+
+64
+00:04:08,030 --> 00:04:12,129
+ that is the part we are now going to make more complex, so right now what we are doing
+
+65
+00:04:12,129 --> 00:04:17,129
+ is just the simple update: we take the gradient at our current position and
+
+66
+00:04:17,129 --> 00:04:21,639
+ just scale it by the learning rate, our one hyperparameter, and add it in, but we can
+
+67
+00:04:21,639 --> 00:04:23,159
+ do much more elaborate things than that
+
+68
+00:04:23,160 --> 00:04:27,960
+ and so on. A few lectures ago I briefly flashed an image where
+
+69
+00:04:27,959 --> 00:04:30,759
+ you can see different parameter update schemes and how quickly they actually
+
+70
+00:04:30,759 --> 00:04:35,129
+ optimize a simple loss function here, and in particular you can see that SGD,
+
+71
+00:04:35,129 --> 00:04:38,550
+ which is what we are currently using on the fourth line here, sets off quickly, but what
+
+72
+00:04:38,550 --> 00:04:41,710
+ you can see is that it is in fact the slowest one of them all, so reading into that,
+
+73
+00:04:41,709 --> 00:04:45,139
+ in practice you almost never use basic SGD alone; there are better schemes we
+
+74
+00:04:45,139 --> 00:04:48,979
+ can use, and we are going to go into them now. So let us first look at
+
+75
+00:04:48,980 --> 00:04:54,810
+ why SGD has this problem of being so slow. Consider this particular
+
+76
+00:04:54,810 --> 00:04:58,589
+ setup where we have a loss function surface, a contrived example here, where the
+
+77
+00:04:58,589 --> 00:05:02,099
+ loss is much steeper along one direction than along the other, long direction,
+
+78
+00:05:02,100 --> 00:05:05,500
+ so basically this loss function is very shallow
+
+79
+00:05:05,500 --> 00:05:10,199
+ horizontally but very steep vertically. Of course, to minimize this we
+
+80
+00:05:10,199 --> 00:05:13,469
+ want to get to the minimum, which is this point; right now we are at the X,
+
+81
+00:05:13,470 --> 00:05:19,240
+ the smiley face over there. Think about what the trajectory
+
+82
+00:05:19,240 --> 00:05:22,980
+ looks like in both the x and the y direction
+
+83
+00:05:22,980 --> 00:05:30,650
+ if we try to optimize over this landscape. So what
+
+84
+00:05:30,649 --> 00:05:35,729
+ does it do horizontally, and vertically? What would someone's guess be?
+
+85
+00:05:35,730 --> 00:05:43,540
+ You end up bouncing up and down, going like this, and the reason you
+
+86
+00:05:43,540 --> 00:05:52,030
+ get that zig-zag is that basically this does not make much progress:
+
+87
+00:05:52,029 --> 00:05:56,969
+ if we look at the gradient, we see that the horizontal component
+
+88
+00:05:56,970 --> 00:06:00,680
+ is very small, because this is a shallow function horizontally, but we have this
+
+89
+00:06:00,680 --> 00:06:03,439
+ very steep function vertically, so the gradient is large there, and what happens
+
+90
+00:06:03,439 --> 00:06:06,389
+ when you launch off from these kinds of points is that you end up with
+
+91
+00:06:06,389 --> 00:06:10,250
+ this kind of pattern where you are going too slowly in the horizontal direction
+
+92
+00:06:10,250 --> 00:06:13,300
+ and too fast in the vertical direction, so you end up with this
+
+93
+00:06:13,300 --> 00:06:17,918
+ zig-zagging behavior. One way of addressing this situation is momentum, so
+
+94
+00:06:17,918 --> 00:06:22,189
+ the momentum update changes our update in the following way:
+
+95
+00:06:22,189 --> 00:06:25,319
+ so far we have just been implementing gradient descent,
+
+96
+00:06:25,319 --> 00:06:28,409
+ taking the gradient and integrating our current position by it
+
+97
+00:06:28,410 --> 00:06:34,220
+ directly; instead, we are going to take the gradient we computed, and
+
+98
+00:06:34,220 --> 00:06:36,449
+ instead of integrating the position directly,
+
+99
+00:06:36,449 --> 00:06:40,840
+ we are going to increment this variable v, which I will refer to as the velocity,
+
+100
+00:06:40,839 --> 00:06:44,049
+ and we will see in a bit why. So we are incrementing
+
+101
+00:06:44,050 --> 00:06:48,020
+ a velocity variable instead, and we are basically building up this
+
+102
+00:06:48,019 --> 00:06:53,278
+ exponentially decaying sum of past gradients, and integrating the position with that.
+
+103
+00:06:53,278 --> 00:06:58,610
+ This new hyperparameter mu here is a number between zero and one,
+
+104
+00:06:58,610 --> 00:07:03,629
+ and the previous v gets decayed by it while the gradient gets added on, so
+
+105
+00:07:03,629 --> 00:07:07,180
+ what is nice about the momentum update is that you can interpret it in very physical
+
+106
+00:07:07,180 --> 00:07:14,310
+ terms, in the following way: basically the momentum update corresponds to
+
+107
+00:07:14,310 --> 00:07:18,899
+ interpreting this loss surface as a landscape that a ball rolls around in, which allows
+
+108
+00:07:18,899 --> 00:07:22,459
+ you to interpret the gradient, in this case, as a force on that particle,
+
+109
+00:07:22,459 --> 00:07:26,408
+ so this particle is feeling a force given by the gradient; and instead of the force
+
+110
+00:07:26,408 --> 00:07:31,158
+ directly integrating the position, in physics the force is proportional
+
+111
+00:07:31,158 --> 00:07:36,019
+ to the acceleration, so it is the acceleration that we are computing here.
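A minimal sketch of the two updates contrasted in the entries above; `x` is the parameter vector and `dx` the gradient at `x`, both assumed given, and `mu` is the friction-like hyperparameter:

~~~python
def sgd_step(x, dx, learning_rate=1e-2):
    """Vanilla update: integrate the position directly with the gradient."""
    return x - learning_rate * dx

def momentum_step(x, dx, v, learning_rate=1e-2, mu=0.9):
    """Momentum update: treat the gradient as a force and integrate a velocity."""
    v = mu * v - learning_rate * dx   # decay the old velocity, add the new force
    x = x + v                         # integrate the position with the velocity
    return x, v
~~~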
+
+112
+00:07:36,019 --> 00:07:39,938
+ So the velocity here gets integrated by the acceleration, and then this mu times v
+
+113
+00:07:39,939 --> 00:07:43,039
+ has, in that case, the interpretation of friction, so that on every
+
+114
+00:07:43,038 --> 00:07:47,759
+ iteration the ball slows down a little bit. Intuitively, if this mu were not there,
+
+115
+00:07:47,759 --> 00:07:51,550
+ the ball would never come to rest, because it would just keep rolling around;
+
+116
+00:07:51,550 --> 00:07:54,509
+ there would be no loss of energy and it would never settle at the bottom of the
+
+117
+00:07:54,509 --> 00:07:58,158
+ loss function. So the final momentum update is this
+
+118
+00:07:58,158 --> 00:08:01,810
+ physical interpretation of the optimization as a ball rolling around,
+
+119
+00:08:01,810 --> 00:08:08,249
+ and it will slow down over time and so on. What is very nice about the way
+
+120
+00:08:08,249 --> 00:08:11,669
+ this update works is that you end up building up velocity, in particular along
+
+121
+00:08:11,668 --> 00:08:14,959
+ shallow directions: it is very easy to see that if you have a shallow but
+
+122
+00:08:14,959 --> 00:08:18,449
+ consistent direction, then the momentum update will slowly build up the velocity
+
+123
+00:08:18,449 --> 00:08:21,360
+ vector along it, and you end up accelerating across the shallow
+
+124
+00:08:21,360 --> 00:08:24,999
+ direction; but in the very steep directions, what happens is that your
+
+125
+00:08:24,999 --> 00:08:28,919
+ progress is usually strong at first, but you always get pulled back the other way,
+
+126
+00:08:28,918 --> 00:08:32,429
+ toward the center, and you get these damped kinds of oscillations in that
+
+127
+00:08:32,429 --> 00:08:36,338
+ direction, so momentum kind of damps out the oscillations in the steep directions
+
+128
+00:08:36,339 --> 00:08:41,139
+ toward the middle, and it kind of encourages progress along the consistent,
+
+129
+00:08:41,139 --> 00:08:44,889
+ shallow directions, and that is why it ends up improving convergence
+
+130
+00:08:44,889 --> 00:08:49,600
+ in most cases. So here is a visualization: for example, we see the SGD update,
+
+131
+00:08:49,600 --> 00:08:53,459
+ and the momentum update is the green one, so you can see how the green one
+
+132
+00:08:53,458 --> 00:08:57,008
+ shoots right past, because it builds up all this velocity;
+
+133
+00:08:57,009 --> 00:09:00,909
+ it overshoots the minimum, but it eventually ends up converging, and
+
+134
+00:09:00,909 --> 00:09:04,169
+ of course it did end up overshooting, but once it comes back you can see that it
+
+135
+00:09:04,169 --> 00:09:07,879
+ still ends up converging much faster than the default update:
+
+136
+00:09:07,879 --> 00:09:11,230
+ even though it builds up too much velocity, it gets there faster than if
+
+137
+00:09:11,230 --> 00:09:17,110
+ you had no velocity at all. There is a
+
+138
+00:09:17,110 --> 00:09:20,430
+ particular variation of momentum coming up in a bit, but let me just ask
+
+139
+00:09:20,429 --> 00:09:34,289
+ if there are any questions about the momentum update. [A student asks about mu as a single hyperparameter.]
+
+140
+00:09:34,289 --> 00:09:40,078
+ Usually people take values of about 0.5 or 0.9, and sometimes,
+
+141
+00:09:40,078 --> 00:09:43,219
+ it is not super common, but people sometimes anneal it from 0.5 to 0.99
+
+142
+00:09:43,220 --> 00:09:54,200
+ slowly over time, but it is just a single number.
+
+143
+00:09:54,200 --> 00:09:57,180
+ Yes, so you could instead use a small learning rate, but the problem with
+
+144
+00:09:57,179 --> 00:10:03,000
+ a slow learning rate is that it is applied globally to all directions
+
+145
+00:10:03,000 --> 00:10:06,070
+ of the gradient, so you would basically make no progress in the
+
+146
+00:10:06,070 --> 00:10:09,390
+ horizontal direction; right, you would not bounce around as much, but it would take you
+
+147
+00:10:09,389 --> 00:10:12,710
+ forever to go across horizontally, so a small learning rate has this kind of trade-off.
+
+148
+00:10:12,710 --> 00:10:25,350
+ [A student asks how to initialize the velocity in the momentum update.]
+
+149
+00:10:25,350 --> 00:10:29,050
+ Usually at zero, and it does not end up mattering too much, because it
+
+150
+00:10:29,049 --> 00:10:32,490
+ builds up over the first few steps, and if you
+
+151
+00:10:32,490 --> 00:10:35,480
+ expand this recurrence, you see that it is basically exponentially
+
+152
+00:10:35,480 --> 00:10:39,330
+ decaying the previous contributions, so once you have run it for a while
+
+153
+00:10:39,330 --> 00:10:46,020
+ the initialization does not matter. There is a particular variation of momentum called
+
+154
+00:10:46,019 --> 00:10:53,449
+ Nesterov momentum, or Nesterov accelerated gradient, and the idea here is: this is
+
+155
+00:10:53,450 --> 00:10:57,550
+ the ordinary momentum equation again, and the way to think about it is that your
+
+156
+00:10:57,549 --> 00:10:59,789
+ update really consists of two parts:
+
+157
+00:10:59,789 --> 00:11:03,279
+ there is one part that is the force you have built up in a particular direction,
+
+158
+00:11:03,279 --> 00:11:06,799
+ which is the momentum step, drawn in green, and that is where
+
+159
+00:11:06,799 --> 00:11:09,959
+ momentum is trying to carry you right now, and the second part is
+
+160
+00:11:09,960 --> 00:11:12,610
+ the contribution from the gradient: the gradient is pulling you toward
+
+161
+00:11:12,610 --> 00:11:17,450
+ a decrease of the loss function, and the actual step ends up being the vector sum
+
+162
+00:11:17,450 --> 00:11:21,350
+ of the two, so the blue step you end up with is just green plus red. The
+
+163
+00:11:21,350 --> 00:11:24,840
+ idea behind Nesterov momentum, which actually ends up working better, is
+
+164
+00:11:24,840 --> 00:11:29,629
+ the following: we know, at this point, regardless of what the current gradient is,
+
+165
+00:11:29,629 --> 00:11:33,439
+ so before we have even computed it, that we have built up some
+
+166
+00:11:33,440 --> 00:11:37,240
+ momentum, and we know that we are definitely going to take this green direction, OK,
+
+167
+00:11:37,240 --> 00:11:41,220
+ we are definitely going to take this green velocity component here; so
+
+168
+00:11:41,220 --> 00:11:45,310
+ instead of evaluating at our current spot, Nesterov momentum looks ahead, to
+
+169
+00:11:45,309 --> 00:11:49,379
+ the top of that arrow, and evaluates the gradient at that point, so
+
+170
+00:11:49,379 --> 00:11:53,679
+ you end up with the following difference here: we know we are going to go
+
+171
+00:11:53,679 --> 00:11:57,089
+ this way anyway, so why not look ahead, get to that part of the
+
+172
+00:11:57,090 --> 00:12:00,420
+ objective, and evaluate the gradient at that point; and of course it is not where you are,
+
+173
+00:12:00,419 --> 00:12:02,309
+ so the reading will be somewhat different, because you are at a different
+
+174
+00:12:02,309 --> 00:12:05,669
+ position in the loss function, and this one step ahead gives you a slightly better
+
+175
+00:12:05,669 --> 00:12:06,259
+ direction.
+
+176
+00:12:06,259 --> 00:12:11,109
+ So you get a slightly different update there, and you can
+
+177
+00:12:11,109 --> 00:12:14,379
+ show theoretically that this in fact enjoys better theoretical guarantees on
+
+178
+00:12:14,379 --> 00:12:18,069
+ its convergence rate, and it is not only the theory: in practice as well it
+
+179
+00:12:18,068 --> 00:12:23,068
+ almost always works better than plain momentum. So the difference is about
+
+180
+00:12:23,068 --> 00:12:28,358
+ how this gets written in code; it is still written in our notation as:
+
+181
+00:12:28,359 --> 00:12:29,589
+ at time t,
+
+182
+00:12:29,589 --> 00:12:33,089
+ you take the previous velocity vector and the gradient where you are currently
+
+183
+00:12:33,089 --> 00:12:37,629
+ evaluating, and we do the update here; so the only
+
+184
+00:12:37,629 --> 00:12:41,720
+ difference for Nesterov, this minus mu v_prev plus (1 + mu) v term held here, is that
+
+185
+00:12:41,720 --> 00:12:44,949
+ we evaluate the gradient at a slightly different position: the
+
+186
+00:12:44,948 --> 00:12:48,278
+ looked-ahead position. So that is really Nesterov momentum, and it almost
+
+187
+00:12:48,278 --> 00:12:51,698
+ always works a bit better. Now, there is a technicality here that I should not
+
+188
+00:12:51,698 --> 00:12:57,068
+ go into too much, but it is actually slightly inconvenient:
+
+189
+00:12:57,068 --> 00:13:00,418
+ usually we think of a forward pass and a backward pass, and what we end up
+
+190
+00:13:00,418 --> 00:13:04,288
+ with is the parameter vector and the gradient at those parameters; but
+
+191
+00:13:04,288 --> 00:13:09,088
+ here you have the parameters and a gradient taken at a different point, so it
+
+192
+00:13:09,089 --> 00:13:12,600
+ does not quite fit the simple API between the two, if you just want
+
+193
+00:13:12,600 --> 00:13:16,019
+ plain code. So it turns out there is a way around this; I do not really want to
+
+194
+00:13:16,019 --> 00:13:19,899
+ spend too much time on it, but basically there is a way to do a variable
+
+195
+00:13:19,899 --> 00:13:23,379
+ transform: you rename the variables, perform some rearranging, and notice that you get
+
+196
+00:13:23,379 --> 00:13:26,079
+ something that looks much more like a vanilla update, just with
+
+197
+00:13:26,078 --> 00:13:29,538
+ the terms swapped around, because what you end up with
+
+198
+00:13:29,538 --> 00:13:34,119
+ only requires the gradient at the thing you are updating, and that thing is
+
+199
+00:13:34,119 --> 00:13:35,209
+ really the looked-ahead
+
+200
+00:13:35,208 --> 00:13:38,159
+ version of the parameters rather than the raw parameter vector, so
+
+201
+00:13:38,159 --> 00:13:40,608
+ you can go to the notes and check this out if you are interested.
+
+202
+00:13:40,609 --> 00:13:46,709
+ OK, so here is Nesterov accelerated gradient in magenta: you can see
+
+203
+00:13:46,708 --> 00:13:50,208
+ the original momentum here shoots through, but Nesterov accelerated
+
+204
+00:13:50,208 --> 00:13:53,958
+ momentum, you can see, curls around more quickly, because this one step ahead
+
+205
+00:13:53,958 --> 00:13:57,738
+ makes it slightly better, as all these little contributions
+
+206
+00:13:57,739 --> 00:14:01,619
+ end up summing up the gradient from where you are about to go, and it almost always
+
+207
+00:14:01,619 --> 00:14:08,600
+ converges faster. Until recently, this Nesterov update was the standard default.
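A minimal sketch of the Nesterov update after the variable transform mentioned above, following the formulation in the course notes; `x` is then the looked-ahead version of the parameters, so only the gradient at `x` itself is needed:

~~~python
def nesterov_step(x, dx, v, learning_rate=1e-2, mu=0.9):
    """Nesterov momentum in the transformed variables."""
    v_prev = v
    v = mu * v - learning_rate * dx        # velocity update as in plain momentum
    x = x - mu * v_prev + (1 + mu) * v     # position update absorbs the look-ahead
    return x, v
~~~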
+ +208 +00:14:08,600 --> 00:14:11,329 + 훈련 상용 네트워크와 많은 사람들의 표준 기본 방법 + +209 +00:14:11,328 --> 00:14:14,658 + 아직이에서 볼 수있는 일반적인 일 업데이트하기 위해 잠시를 사용하여 훈련 + +210 +00:14:14,658 --> 00:14:17,610 + 연습과 필요한 경우 더 나은 + +211 +00:14:17,610 --> 00:14:20,990 + 그래서 잡지는 여기에 일주일을 의미합니다 + +212 +00:14:20,990 --> 00:14:44,350 + 당신이 그것에 대해 생각하는지 질문은 그래서 나는 그것이 약간 잘못된 생각 + +213 +00:14:44,350 --> 00:14:46,990 + 만 일반적으로 생각 신경 네트워크에 대한 옵션을 많이 생각 + +214 +00:14:46,990 --> 00:14:50,350 + 이 미친 계곡과 지역 최소값을 많이 사방 실제로는 아니다 + +215 +00:14:50,350 --> 00:14:53,670 + 그것은 그 보는 올바른 방법은 개념이 할 수있는 올바른 접근이다 + +216 +00:14:53,669 --> 00:14:56,278 + 당신의 마음에 당신은 아주 작은 신경 네트워크와 사람들이 생각하는 데 사용 때 + +217 +00:14:56,278 --> 00:14:59,769 + 지역 최소값 것을 문제 및 최적화 네트워크 그러나 실제로집니다 + +218 +00:14:59,769 --> 00:15:04,269 + 당신이 당신의 모델을 확장으로 최근의 이론적 작업의 많은 아웃 + +219 +00:15:04,269 --> 00:15:10,740 + 이 지역의 최소 갈수록 문제의 사진에 있도록되어 있습니다 + +220 +00:15:10,740 --> 00:15:14,389 + 생각하고있는 것은 지역의 최소값 많이있다하지만 그들은 같은에 대한 모든 것 + +221 +00:15:14,389 --> 00:15:18,958 + 실제로이 때문에 이러한 기능의 신경을보고 더 나은 방법 손실 + +222 +00:15:18,958 --> 00:15:22,078 + 실제로 연습 네트워크와 나는 그릇 등 같은 훨씬 더 찾고 있어요 + +223 +00:15:22,078 --> 00:15:25,599 + 대신 미친 계곡 풍경과 당신은 여전히​​ 당신으로 그것을 표시 할 수 있습니다 + +224 +00:15:25,600 --> 00:15:28,360 + 신경망 최선보다는 최악의 등의 차이 + +225 +00:15:28,360 --> 00:15:29,259 + 지역 최소값 + +226 +00:15:29,259 --> 00:15:32,448 + 실제로 좀 좋아도 일부 연구자와 시간이 지남에 따라 아래로 축소 + +227 +00:15:32,448 --> 00:15:36,120 + 기본적으로이 매우 소규모 네트워크에서 일어나는 나쁜 지역 최소값가 없습니다 + +228 +00:15:36,120 --> 00:15:41,409 + 당신이 다른과 초기화하면 그렇게 연습에서 실제로 당신이 찾는 것은 + +229 +00:15:41,409 --> 00:15:44,610 + 임의의 초기화는 거의 항상 같은처럼 같은 대답을 받고 결국 + +230 +00:15:44,610 --> 00:15:48,009 + 결국 손실은 그래서 당신은 같은 나쁜 지방의 최소값은 없습니다 결국하지 마십시오 + +231 +00:15:48,009 --> 00:15:57,429 + 때로는 특히 당신이 질문을 네트워크 질문을 시작했다 때 + +232 +00:15:57,429 --> 00:16:10,849 + 네 스테 로프 진동 기능을하는 부분으로 + +233 +00:16:10,850 --> 00:16:14,819 + 확인 당신이 여러 슬라이드로 이동하려고했다가에 의해 아마했다 점프 있다고 생각 + +234 +00:16:14,818 --> 00:16:19,849 + 약간의 두 번째 또는 두 가지 방법이 괜찮 날 정말 또 다른 업데이트에 뛰어 보자 + +235 +00:16:19,850 --> 00:16:23,069 + 이 접지라고하고 원래 개발 된 사례에서 볼 것이 일반적 + +236 +00:16:23,068 --> 00:16:25,969 + 다음 볼록 최적화 문학과는 가지에 포팅되었다 + +237 +00:16:25,970 --> 00:16:30,019 + 다른 큰 업데이트로 보이는 있도록 신경망 사람들은 가끔 사용 + +238 +00:16:30,019 --> 00:16:30,560 + 다음 + +239 +00:16:30,559 --> 00:16:35,619 + 우리가 일반적으로 몇 가지 기본적인 확률 그라데이션 하강을 참조로 우리는이 업데이트가 + +240 +00:16:35,620 --> 00:16:37,500 + 여기에 여기에 큰 시간을 학습 + +241 +00:16:37,500 --> 00:16:42,259 + 그라데이션하지만 지금 우리는이 그라데이션 있지만이 추가 변수를 확장하고 + +242 +00:16:42,259 --> 00:16:47,589 + 우리는 있었다이 현금 구축하고 있음을 여기에 메모를 축적 유지하는 것이 + +243 +00:16:47,589 --> 00:16:52,199 + 그라데이션 사각형의 합이 캐시는 양수 만 포함 + +244 +00:16:52,198 --> 00:16:55,599 + 여기 캐시 변수가 같은 크기의 합작 투자 참고하여 + +245 +00:16:55,600 --> 00:17:00,730 + 개인 차원에서 구축 요인 등이 현금과 최대이었다 + +246 +00:17:00,730 --> 00:17:03,839 + 그라디언트 또는 제곱의 합을 추적하는 데 우리는 때때로을에 좋아 + +247 +00:17:03,839 --> 00:17:07,679 + 이들의 두 번째 순간이라는 Oncenter은 잠시 시간을내어 그래서 우리는 계속 + +248 +00:17:07,679 --> 00:17:12,409 + 이 현금을 구축하고 우리가 요소를 분할하는 이유에 의해이 단계 기능입니다 + +249 +00:17:12,409 --> 00:17:21,709 + 그 이유는 그래서 광장 현금의 루트 그래서 무슨 일이 일어나고 끝이 + +250 +00:17:21,709 --> 00:17:26,189 + 사람들은 그것을 푸르르의 푸르르 매개 변수 적응 학습 율법 때문에 호출 + +251 +00:17:26,189 --> 00:17:31,090 + 모든 단일 제품 이제 매개 변수 공간의 모든 단일 차원 + +252 +00:17:31,089 --> 00:17:34,569 + 동적으로 내용에 따라 조정됩니다 같은 학습 속도의 자신의 종류가 + +253 +00:17:34,569 --> 00:17:39,079 + 재료의 종류이 너무 그 규모면에서 볼 수있다 + +254 +00:17:39,079 --> 00:17:42,859 + 우리의 경우 특히이 경우 사인으로 발생하는 해석 + +255 +00:17:42,859 --> 00:17:47,019 + 이 어떤 수평 및 수직 방향으로 발생하지만,이 종류의 작업을 수행 + +256 +00:17:47,019 --> 00:17:51,359 + 역학 
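A minimal sketch of the AdaGrad step just described; `x`, `dx`, and `cache` are assumed to be numpy arrays of the same shape:

~~~python
import numpy as np

def adagrad_step(x, dx, cache, learning_rate=1e-2, eps=1e-7):
    """AdaGrad: every parameter gets its own effective step size."""
    cache = cache + dx ** 2                               # running sum of squared gradients
    x = x - learning_rate * dx / (np.sqrt(cache) + eps)   # steep directions get divided down
    return x, cache                                       # note: the cache only ever grows
~~~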
+ +257 +00:17:51,359 --> 00:18:03,789 + 우리가 수직으로 큰 것을 큰 경사를 가지고 당신은 무엇을 볼 수 있습니다 + +258 +00:18:03,789 --> 00:18:07,259 + 그라데이션은 현금까지 추가되고 우리는 더 크고로 나누어 결국 + +259 +00:18:07,259 --> 00:18:11,359 + 큰 숫자는 너무 너무 수직 단계에서 더 작은 업데이트를 얻을 것이다 + +260 +00:18:11,359 --> 00:18:14,798 + 우리는 매우 깨끗 큰 영역을 많이보고있는 때문에이 학습을 부패한다 + +261 +00:18:14,798 --> 00:18:18,859 + 속도가 수직 방향뿐만에서 더 작은 단계들을 만들 + +262 +00:18:18,859 --> 00:18:22,009 + 우리가 끝낼 수 있도록 수평 방향으로는 매우 얕은 방향의 + +263 +00:18:22,009 --> 00:18:25,750 + 분모 작은 숫자는 당신이 볼 수 있다는 Y에 대한 상대 + +264 +00:18:25,750 --> 00:18:29,058 + 치수는 우리가이 성능 조정이 있도록 빠른 진행을 끝낼거야 + +265 +00:18:29,058 --> 00:18:35,058 + 이 회계의 효과는 기울기와 알라 신의 뜻 방향을 당신에게 + +266 +00:18:35,058 --> 00:18:40,319 + 실제로 수직 대신 바로 그때 훨씬 더 큰 학습을 할 수 있습니다 + +267 +00:18:40,319 --> 00:18:48,048 + 방향 및하지만 그래서는 대학원이없는 한 문제이다는 생각이 무엇인지 + +268 +00:18:48,048 --> 00:18:53,009 + 우리가 원한다면 우리는이 위치를 업데이트하고, 상기 공정 크기로 발생 + +269 +00:18:53,009 --> 00:18:55,900 + 오랜 시간 동안 전체 깊은 신경망에게 지분을 훈련하고 우리는있어 + +270 +00:18:55,900 --> 00:19:01,970 + 그래서 물론 정도에 무슨 일이 일어날 이번 여름에 오랜 시간 훈련 + +271 +00:19:01,970 --> 00:19:05,169 + 현금은 이러한 모든 긍정적 인 번호를 추가 모든 시간을 구축 결국 + +272 +00:19:05,169 --> 00:19:09,100 + 분모에 들어가는 당신은 말 그대로 단지의 경우 20이고 당신은 중지 끝 + +273 +00:19:09,099 --> 00:19:14,579 + 완전히 같은 학습 및 그래서 그래서 아니에요 확인 소득세 문제입니다 + +274 +00:19:14,579 --> 00:19:17,970 + 아마도 우리는 그냥 가지 볼링을 최적의 아래로 붕괴 당신이있어 + +275 +00:19:17,970 --> 00:19:21,919 + 수행하지만 신경 네트워크에서 물건 그건 좀 다음 주위에 왕복 같다 + +276 +00:19:21,919 --> 00:19:24,549 + 그에 따라 그림을 시도하는 것은 그래서이 그것을 생각하고 더 좋은 방법처럼 + +277 +00:19:24,548 --> 00:19:28,329 + 것은 당신의 데이터를 얻을 에너지의 지속적인 종류를 필요로하고 그래서 당신은 싶지 않아 + +278 +00:19:28,329 --> 00:19:33,009 + 이었다 사인에 매우 간단한 변화가 그래서 그냥 중단 붕괴 + +279 +00:19:33,009 --> 00:19:37,829 + 최근 제프 힌튼에 의해 제안 여기 아이디어는 대신 유지하는 것입니다 + +280 +00:19:37,829 --> 00:19:42,289 + 완전히 그냥 제곱의 합과 나는 우리가 있는지 확인 주말을 언급 할 수 있었다 + +281 +00:19:42,289 --> 00:19:46,250 + 새는 카운터 카운터 그래서 대신에 우리는 하이킹이 붕괴 속도와 끝까지 + +282 +00:19:46,250 --> 00:19:52,500 + 주 우리는 0.99 % 사각형과 같은 설정 만 제곱의 합이다 + +283 +00:19:52,500 --> 00:19:57,750 + 천천히 누출하지만 괜찮 것은 그래서 우리는 여전히 좋은 동점을 유지하는 우리 + +284 +00:19:57,750 --> 00:20:01,569 + 가파른 또는 포격 방향으로 스텝 크기를 등화 효과 + +285 +00:20:01,569 --> 00:20:05,869 + 우리는 단지 무기를 판매 완전히 20 업데이트를 변환하지 않을거야 + +286 +00:20:05,869 --> 00:20:10,299 + 19 법안 무기 적절한 방법에 대한 역사적 접촉하는 방식이었다입니다 + +287 +00:20:10,299 --> 00:20:11,430 + 우리에게 소개 + +288 +00:20:11,430 --> 00:20:14,340 + 당신은이 방법을 제안 종이 될 것이라고 생각하지만 사실 그것은이었다 + +289 +00:20:14,339 --> 00:20:18,789 + 슬라이드 저스틴 스콧 사라 클래스 불과 몇 년 전 그래서 저스틴 단지 + +290 +00:20:18,789 --> 00:20:22,240 + 삶의 슬라이드이되어 번쩍이 해적 클래스를 제공 하였다 + +291 +00:20:22,240 --> 00:20:25,630 + 게시되지 않은 그러나 이것은 일반적으로 실제로 잘 작동하고 이렇게하고 있어요 + +292 +00:20:25,630 --> 00:20:29,920 + 기본적으로 우리의 수학 문제는 그래서 나는 그 다음 내가 더 잘 같은 본 구현 + +293 +00:20:29,920 --> 00:20:34,060 + 바로 내 최적화 결과와 나는 그 정말 재미라고 생각하고 + +294 +00:20:34,059 --> 00:20:37,769 + 논문뿐만 아니라 내 논문하지만 많은 사람들 다른 논문에서 사실 마이크에 너무 + +295 +00:20:37,769 --> 00:20:44,559 + 코 세라에서 슬라이드를 인용 한 바로 강의 6 슬라이드 그냥 밀어 + +296 +00:20:44,559 --> 00:20:48,389 + 이후 문제는 다음이 지금 실제로 실제 용지이며 더 많은 결과가있다 + +297 +00:20:48,390 --> 00:20:52,300 + 정확히 그가하고있어 및 등등하지만 잠시 동안이 정말 우스웠다에 + +298 +00:20:52,299 --> 00:20:57,609 + 그래서이까지 내 관점에서 우리는 여기 땅이 파란색과 아라미스입니다 볼 수 있습니다 + +299 +00:20:57,609 --> 00:20:58,579 + 소품이입니다 + +300 +00:20:58,579 --> 00:21:02,490 + 블랙 우리는 둘 다 아래로 여기 아주 빨리 덮여 있음을 알 수 + +301 +00:21:02,490 --> 00:21:07,519 + 보다 약간 빠른 변환이 대학원에서이 특정한 경우에 방법과 + +302 +00:21:07,519 --> 00:21:11,589 + 무기 문제 그러나 그것은 항상 당신이 볼 일반적으로 어떤 경우 뭔가 아니다 + +303 +00:21:11,589 --> 00:21:15,839 + 대학원 너무 일찍 중지하고 그대로 실천하면 펜 
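The leaky counter just described, as a minimal RMSProp sketch, with the decay rate of 0.99 mentioned above:

~~~python
import numpy as np

def rmsprop_step(x, dx, cache, learning_rate=1e-3, decay_rate=0.99, eps=1e-7):
    """RMSProp: AdaGrad with a leaky cache, so step sizes do not decay to zero."""
    cache = decay_rate * cache + (1 - decay_rate) * dx ** 2
    x = x - learning_rate * dx / (np.sqrt(cache) + eps)
    return x, cache
~~~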
Jillette에 훈련 작품 + +304 +00:21:15,839 --> 00:21:21,329 + 비참 말까지 일반적으로 이러한 이러한 방법 및 질문에서 승리 + +305 +00:21:21,329 --> 00:21:24,509 + 우리의 가장 확률값은 진행에 대해 + +306 +00:21:24,509 --> 00:21:55,150 + 이 방법은에 문제가 매우 가파른 길 당신은 아마하지 않으려한다 + +307 +00:21:55,150 --> 00:21:58,800 + 자신 다운 그래서 어쩌면에서 그 방향으로 매우 빠르게 업데이트 할 말 + +308 +00:21:58,799 --> 00:22:02,220 + 당신이 좋아하는 것 특히이 경우 빠른 이동하지만 당신은 가지에 읽고 + +309 +00:22:02,220 --> 00:22:05,019 + 이 특정 예 그것은 일반적으로 이들의 진정한 종류 아니다 + +310 +00:22:05,019 --> 00:22:09,940 + 어떤 네트워크가 좋은 전략의 구성되지 않은 최적화 풍경 적용 + +311 +00:22:09,940 --> 00:22:22,930 + 처음에 이러한 경우에 + +312 +00:22:22,930 --> 00:22:25,730 + 오 그런데 나는 17이 탐사를 통해 건너하지만 너희들은 할 수 + +313 +00:22:25,730 --> 00:22:30,380 + 희망 (127)가 움직이는 0으로 나누기를 방지하기 위해 단지가 있음을 볼 수 + +314 +00:22:30,380 --> 00:22:34,550 + 다시 높은 소유주에 일반적으로 우리는이 1 ~ 5 또는 6 ~ 7 개에 앉아 + +315 +00:22:34,549 --> 00:22:39,139 + 시작하여 현금처럼 뭔가 그래서 다음에 올 수 0 + +316 +00:22:39,140 --> 00:22:46,540 + 당신이 무엇을 얻을 당신의 생활 학습 속도 (22)이 적응 행동하지만 스케일입니다 + +317 +00:22:46,539 --> 00:22:50,420 + 이 증류 그것의 절대 규모가 컨트롤에 아직도의 또는 + +318 +00:22:50,420 --> 00:22:57,370 + 컨트롤은 여전히​​이 이야기는 단지 물건의 종류를 방해 속도를 배우고 + +319 +00:22:57,369 --> 00:23:00,989 + 다른 프라이머 방법에 대해 상대적인 것 같은 더 볼 수 있습니다 + +320 +00:23:00,990 --> 00:23:12,190 + 당신은 단계 동점 골을하지만 절대 글로벌 단계는 최대 아직있다 + +321 +00:23:12,190 --> 00:23:18,710 + 아주 당신이 바로 설명하는 일을 매우 효율적으로부터의 + +322 +00:23:18,710 --> 00:23:23,038 + 이 전 아주 긴 시간에서 재료의 종류를 얻기 위해 끝 때문에 + +323 +00:23:23,038 --> 00:23:27,750 + 정말 시간 t에서의 발현은 지난 몇의 기능 만있어 + +324 +00:23:27,750 --> 00:23:36,480 + 재료는하지만, 지수 함수 적으로 감쇠 가중 합에 우리가 갈거야 + +325 +00:23:36,480 --> 00:23:43,819 + 다행 마지막 업데이트로 이동 + +326 +00:23:43,819 --> 00:24:03,039 + 기하 급수적으로 가중 방식과 유사하고 그래서 당신은이 할 것 + +327 +00:24:03,039 --> 00:24:09,789 + 나는 사람들을 생각하지 않습니다 또는이에 유한 창 정말 당신에게 나중에 할 수 있습니다 시도 + +328 +00:24:09,789 --> 00:24:19,889 + 당신이 10 최적화 네트워크있을 때를 위해 그 X를 볼 것이다 너무 많은 메모리를 필요 + +329 +00:24:19,890 --> 00:24:23,560 + 예되도록 240,000,000 매개 변수의 메모리가 꽤 많이 복용하고 그래서 + +330 +00:24:23,559 --> 00:24:29,659 + 당신은 우리가있어 다음도 좋아 (10) 이전의 불만을 추적하고 싶지 않아 + +331 +00:24:29,660 --> 00:24:37,540 + 거하면 성능이 저하 된 모멘텀을 결합하면 20 있는지에 가서 주셔서 감사합니다 + +332 +00:24:37,539 --> 00:24:45,269 + 질문이 너무 너무 대충 무슨 일이 일어나고 있는지 슬라이드의 아담이을이다 + +333 +00:24:45,269 --> 00:24:49,119 + 마지막 업데이트는 감옥 실제로 최근에 제안되었다 그리고있다 + +334 +00:24:49,119 --> 00:24:52,959 + 당신이 기세를 알 수 있습니다로 모두의 요소는 가지의 트랙을 유지하고있다 + +335 +00:24:52,960 --> 00:24:57,190 + 잘못된 그라디언트를 요약하여 독서의의 첫 번째 순서의 순간 + +336 +00:24:57,190 --> 00:25:02,350 + 이 지수 일부와 손자를 유지하는 두 번째의 트랙을 유지하고 있습니다 + +337 +00:25:02,349 --> 00:25:07,869 + 순간 기울기와 당신이 종료 아담 아담 업데이트 당신이와 끝까지이다 + +338 +00:25:07,869 --> 00:25:13,389 + 기본적으로의 단계와 그것의 같은 종류의 것이 네 같은 종류의 수행 + +339 +00:25:13,390 --> 00:25:16,980 + 조금 그래서 당신처럼 보이는이 일을 끝낼 가장 아마 모멘텀 + +340 +00:25:16,980 --> 00:25:21,650 + 그것은 기본적으로 부패 방법이 속도를 추적 그리고 그건 + +341 +00:25:21,650 --> 00:25:25,420 + 당신의 단계하지만 당신은이 기하 급수적까지 추가하여 아래로 확장 + +342 +00:25:25,420 --> 00:25:29,490 + 새는 당신의 광장 그라디언트의 카운터 등 동일한에서 모두 끝 + +343 +00:25:29,490 --> 00:25:36,009 + 공식과 사람들은 그래서 당신이 모두 힘을 다하고 않는 조합 그게 업데이트 및 + +344 +00:25:36,009 --> 00:25:41,759 + 당신은 또한이 적응 스케일링을하고있는 그래서 여기에있는 군대의 확률값하자 + +345 +00:25:41,759 --> 00:25:44,789 + 이를 비교했을 때 실제로 정말 심지어이 이전 버전을 번쩍해야 + +346 +00:25:44,789 --> 00:25:46,339 + 기본적으로 가장 확률값 + +347 +00:25:46,339 --> 00:25:52,079 + 빨간색은 여기에 우리가 대체 한 것을 제외하고는 동일한 것입니다 단지가 있었다 TX + +348 +00:25:52,079 --> 00:25:56,220 + 이전 단지 그라데이션 현재 지금 우리는이 그라데이션 TX를 교체하고 + +349 +00:25:56,220 --> 00:25:56,630 + 그것으로 + +350 +00:25:56,630 --> 00:26:01,170 + 예를 한 가지 방법에 대한 상상 그래서 만약 RDX이 실행 카운터 인 + +351 +00:26:01,170 
--> 00:26:04,090 + 또한 샘플링 많은 배치를 설정하여 불쾌한 kasich입니다 그것을 보면 + +352 +00:26:04,089 --> 00:26:07,359 + 야이 나쁜 패스 난수의 많은 수 그리고 당신은이 모든 잡음을 얻을 수있어 + +353 +00:26:07,359 --> 00:26:10,990 + 그라디언트 그래서 대신에 우리가있어 매번 단계를 어떤 큰 영향을 사용하여 + +354 +00:26:10,990 --> 00:26:14,309 + 실제로 이전 인사의 일부가되었고, 그것을 할 수있는 사용하는 것 + +355 +00:26:14,309 --> 00:26:19,139 + 그것의 그라디언트 방향을 안정시키고 그 기세의 기능입니다 + +356 +00:26:19,140 --> 00:26:23,720 + 여기와 여기에 스케일링이 있는지 확인하는 것입니다 스텝 크기의 운동에 대하여 + +357 +00:26:23,720 --> 00:26:29,940 + 서로 스티븐 L 방향이 감사에 당신은 당신이 것을 싶지 않아 + +358 +00:26:29,940 --> 00:26:31,269 + 하이퍼 매개 변수 + +359 +00:26:31,269 --> 00:26:36,119 + (801)는 일반적으로 보통 9802 포인트 995 가리 + +360 +00:26:36,119 --> 00:26:42,869 + 내 자신의 일에 선두에 걸쳐 높은 프리미엄을의 어딘가에있을 정도로 나는 발견 + +361 +00:26:42,869 --> 00:26:45,719 + 내가 실제로 일반적으로하지 않습니다에 걸쳐이 상대적으로 강력한 설정입니다 + +362 +00:26:45,720 --> 00:26:50,690 + 이러한 난 그냥 보통 스마일을 넣어으로 설정 떠나 결국하지만 당신은 재생할 수 있습니다 + +363 +00:26:50,690 --> 00:27:04,259 + 당신이 추진력을 얻을 수 있습니다 그것의 사람들과 때때로 우리는 보았다 + +364 +00:27:04,259 --> 00:27:08,789 + 그래 당신은 실제로 단지 용지를 읽을 수 않는 것이 레스토랑 작동 더 나은 청소 + +365 +00:27:08,789 --> 00:27:12,849 + 실제로 어제는 종이 아니었다 대해이 229에서 프로젝트 보​​고서이었다 + +366 +00:27:12,849 --> 00:27:17,149 + 나는 그것에 대해 용지가 있는지 모르겠어요하지만 당신이 할 수있는 것을 실제로 사람 + +367 +00:27:17,150 --> 00:27:20,250 + 즉 단순히 여기에 수행되지 않습니다 놀이 + +368 +00:27:20,250 --> 00:27:25,759 + 확인 나는 내가 여기에 아담이 약간 더 복잡하게 할 한 가지 더 + +369 +00:27:25,759 --> 00:27:30,849 + 그것은 불완전 당신이 볼 정도로 나를 그냥 아담의 완전한 몰입에 넣어 보자 + +370 +00:27:30,849 --> 00:27:33,949 + 당신이이 거기에 참조 할 때 혼동 될 수 있습니다 한가지 더있다 + +371 +00:27:33,950 --> 00:27:38,220 + 바이어스 보정이라는 것은 자신의 삽입 및 수정을하는 방식을 경멸하는 + +372 +00:27:38,220 --> 00:27:40,920 + I는 루프의 확대 야하는 이유는 바이어스 보정가에 달려 있다는 + +373 +00:27:40,920 --> 00:27:46,940 + 절대 시간 단계 00 T T 여기에서 사용되며, 그 이유는 이것이 무엇 + +374 +00:27:46,940 --> 00:27:49,730 + 의 작은 점 같은 종류의 일을하고 나는 이것에 대해 혼동하지 않으 + +375 +00:27:49,730 --> 00:27:54,049 + 너무하지만 기본적으로 그 MMV 사실을 보상하기위한 보상있어 + +376 +00:27:54,049 --> 00:27:58,659 + 오니 쉬 (500) 통계는 처음에 잘못 그래서 그가 무엇을하고 있는지입니다 + +377 +00:27:58,660 --> 00:28:01,269 + 정말 메가를 확장에서 + +378 +00:28:01,269 --> 00:28:04,250 + 당신이 편견의 매우 친절와 끝까지하지 않도록 처음 몇 반복 + +379 +00:28:04,250 --> 00:28:07,359 + 제 1 및 제 2 순간의 추정은 그래서 그것에 대해 걱정하지 마십시오 + +380 +00:28:07,359 --> 00:28:11,279 + 너무 많은 이것은 단지이 매우 먼저 귀하의 업데이트를 변화한다 + +381 +00:28:11,279 --> 00:28:15,190 + 항목 등으로의 몇 번 예열되고, 그래서는 적절한에서 이루어집니다 + +382 +00:28:15,190 --> 00:28:18,210 + 통계 메가 측면에서 방법 + +383 +00:28:18,210 --> 00:28:23,380 + 나는 우리가 여러 가지 업데이트에 대한 이야기​​ 그 확인으로 너무 많이 가지 않는다 + +384 +00:28:23,380 --> 00:28:26,710 + 우리는 이러한 모든 업데이트가 여전히이 배우는 좋은 프라이머를 보았다 + +385 +00:28:26,710 --> 00:28:31,279 + 그래서 난 그냥 여전히 필요하지만 것을 간략하게 사실에 대해 얘기하고 싶지 + +386 +00:28:31,279 --> 00:28:34,369 + 학습과 우리 모두를위한 전면 인종 차별주의 학습 속도로 일어나는 보았다 + +387 +00:28:34,369 --> 00:28:37,639 + 이러한 방법과 내가 제기하고자하는 질문을 다음의 어느 하나 + +388 +00:28:37,640 --> 00:28:47,290 + 속도를 학습 사용하는 것이 가장 좋습니다 + +389 +00:28:47,289 --> 00:28:55,509 + 당신이 신경 네트워크를 실행하는 경우 그래서 이것은 레이트 학습에 대한 슬라이드입니다 + +390 +00:28:55,509 --> 00:28:59,819 + 트릭 답을 구분하는 것은 그 중에 무엇을 사용하는 좋은 학습 레이스가 없다는 것입니다 + +391 +00:28:59,819 --> 00:29:04,259 + 이 최적화 때문에 당신은 당신이 먼저 높은 학습 속도를 사용해야한다해야 + +392 +00:29:04,259 --> 00:29:07,869 + 좋은 학습 속도보다 더 빨리 당신이 매우 빠른 진전을 볼 수 있지만, + +393 +00:29:07,869 --> 00:29:10,779 + 어떤 점에서 두 확률 될거야 당신은에 수렴 할 수 없습니다 + +394 +00:29:10,779 --> 00:29:13,829 + 주 내 아주 잘 당신이 시스템에 너무 많은 에너지를 가지고 있기 때문에 + +395 +00:29:13,829 --> 00:29:17,869 + 당신은 당신의 손실 함수의 검은 좋은 부품 등 무엇으로 정착 할 수 없습니다 + +396 +00:29:17,869 --> 00:29:21,399 + 당신은 당신이 속도 배우고 UDK는 다음 종류의이 탈 수 할 + +397 +00:29:21,400 --> 00:29:26,269 + 
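A minimal sketch of the full Adam step, including the bias correction discussed above; the beta values are the ones suggested in the lecture, and `t` is the iteration count starting at 1:

~~~python
import numpy as np

def adam_step(x, dx, m, v, t, learning_rate=1e-3, beta1=0.9, beta2=0.995, eps=1e-7):
    """Adam: a leaky first moment (momentum-like) plus a leaky second
    moment (RMSProp-like), with warm-up corrections for both."""
    m = beta1 * m + (1 - beta1) * dx         # leaky sum of gradients
    v = beta2 * v + (1 - beta2) * dx ** 2    # leaky sum of squared gradients
    mb = m / (1 - beta1 ** t)                # corrects the zero-initialized
    vb = v / (1 - beta2 ** t)                # statistics in the first iterations
    x = x - learning_rate * mb / (np.sqrt(vb) + eps)
    return x, m, v
~~~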
감소 학습 속도의 드래곤과 그들 모두에 최선을 다할가 많다 + +398 +00:29:26,269 --> 00:29:28,670 + 사람들이 시작하는 다른 방법은 시간이 지남에 따라 요금을 배울 당신은해야 + +399 +00:29:28,670 --> 00:29:32,400 + 또한 같은 종류의 그들의 물건 붕괴의 과제가되었다 + +400 +00:29:32,400 --> 00:29:36,810 + 당신이했습니다에 간단한 하나는 아마도 훈련 데이터의 한 시대는 참조 후 + +401 +00:29:36,809 --> 00:29:41,619 + 파키스탄 새끼가 부패 무슨 말을 한 후 한 번에 너무 매 훈련 샘플을 볼 수 + +402 +00:29:41,619 --> 00:29:45,219 + 내 포인트 9 또는 당신은 또한 사용할 수있는 뭔가에 요금을 학습 + +403 +00:29:45,220 --> 00:29:49,600 + 지수 붕괴하거나 여러 거기 TDK 중 하나 여러가는거야 + +404 +00:29:49,599 --> 00:29:54,379 + 그것은 가능성이 향상 이론적 특성의 일부에 확대하고있어 알고 + +405 +00:29:54,380 --> 00:29:58,260 + 내가 생각하기 때문에 서로 다른 경우에 대한 그들의 불행하게도 많은하지 적용 + +406 +00:29:58,259 --> 00:30:01,150 + 그들은 볼록 최적화 문학에서 대부분이고 우리는 매우 상대하고 + +407 +00:30:01,150 --> 00:30:05,160 + 목표 다르지만 일반적으로 실제로 나는 뭔가에 사용되는 + +408 +00:30:05,160 --> 00:30:12,330 + 질문이었다 + +409 +00:30:12,329 --> 00:30:25,259 + 훈련 동안 이들 사이의 어느 하나의 커밋되지 + +410 +00:30:25,259 --> 00:30:28,470 + 그래, 난 그 모든 표준 생각하지 않습니다 + +411 +00:30:28,470 --> 00:30:32,990 + 흥미로운 점 나는 당신이 그래 사용할 줄 때 확실하지 않다 확실하지 않다 + +412 +00:30:32,990 --> 00:30:37,839 + 그것은 나에게 분명하지 않다 당신이 시도하고 I가 좋아 연습 뭔가를 시도 할 수 있습니다 + +413 +00:30:37,839 --> 00:30:42,079 + 적어도 영향은 바로 지금이다 당신은 거의 항상 내가 발견 지점을 + +414 +00:30:42,079 --> 00:30:46,189 + 일반적으로 좋은 기본값은 지금 모든 것을 위해 시간을 사용하므로 함께 갈 장미 + +415 +00:30:46,190 --> 00:30:49,840 + 아주 잘 우리의 대부분의 문제는 모멘텀보다 더 나은 또는 작동하는 것 같다 + +416 +00:30:49,839 --> 00:30:56,638 + 그들 때문에 우리가 그들에게 전화로 그런 아무것도 그래서 키가 큰 주문 방법이다 + +417 +00:30:56,638 --> 00:31:00,579 + 우리가 평가 한 있도록 만 손실 함수에 그라디언트 정보를 사용하여 + +418 +00:31:00,579 --> 00:31:03,720 + 그라데이션은 우리가 기본적으로 기울기와 모든 단일 방향을 알고 + +419 +00:31:03,720 --> 00:31:05,710 + 즉, 우리가 사용하는 유일한 것이다 + +420 +00:31:05,710 --> 00:31:09,600 + 이 최적화를위한 2 차 방법의 전체 세트입니다하지만 당신은해야 + +421 +00:31:09,599 --> 00:31:13,168 + 내가 너무 많은 세부 사항에 가고 싶지 않는 2 차 반대의 인식 + +422 +00:31:13,169 --> 00:31:17,919 + 그러나 결국 최대 그래서 당신의 손실 함수에 더 큰 근사치를 형성 + +423 +00:31:17,919 --> 00:31:20,820 + 그들 만이 기본적으로 초평면에 근사하지 않는 방법 I 등 + +424 +00:31:20,819 --> 00:31:26,069 + 희망하지만 당신도 토론에 의해 근사 한을 알리는 방법입니다 + +425 +00:31:26,069 --> 00:31:29,710 + 그래서 당신은 그가 또한 독일인 필요한 그라데이션이 필요하지 않습니다 억제 서비스 + +426 +00:31:29,710 --> 00:31:36,808 + 뿐만 아니라 그 계산해야하고 당신에게 내가 말할 것 오늘 밤에 볼지도 모른다 + +427 +00:31:36,808 --> 00:31:38,500 + 229 예 + +428 +00:31:38,500 --> 00:31:44,190 + 뉴턴의 방법은 기본적으로 당신이 그릇을 형성 업데이 트를주고 + +429 +00:31:44,190 --> 00:31:47,259 + 당신의 목적에 같은 패션 근사이 업데이트 사용할 수 있습니다 + +430 +00:31:47,259 --> 00:31:54,259 + 수는 그래서 그 근사 방식의 최소로 직접 이동합니다 + +431 +00:31:54,259 --> 00:31:58,490 + 어떤이 그들을 사용된다 사람을 왜 2 차 방법에 대한 좋은 데요 + +432 +00:31:58,490 --> 00:32:02,099 + 특히 뉴턴 방법은 이것에 대해 좋은 무엇을 여기에 제시 + +433 +00:32:02,099 --> 00:32:05,399 + 컨버전스에 대한 업데이트 + +434 +00:32:05,400 --> 00:32:13,410 + 당신은 학습 속도가 확인이 업데이트의 방법 차 알지 알 수 있습니다 그리고 그건 + +435 +00:32:13,410 --> 00:32:17,220 + 이 손실 기능이 손실 함수에 그라데이션을 보는 경우에 있기 때문에 + +436 +00:32:17,220 --> 00:32:20,480 + 당신은 또한 곡률과 그 장소를 알고 당신은 근사 그렇다면 + +437 +00:32:20,480 --> 00:32:23,920 + 정확히 알고있는이 황소는 어디에 때문에 최소 주문 근사치로 이동합니다 + +438 +00:32:23,920 --> 00:32:26,900 + 그의 최소로 직접 이동할 수 있습니다 당신 학습을위한 필요가 없습니다 + +439 +00:32:26,900 --> 00:32:30,610 + 그게 내가 그 생각 아주 좋은 기능 그래서 그릇에 근접하면 두 가지가 I + +440 +00:32:30,609 --> 00:32:32,969 + 당신은 두 번째 순서를 사용하고 있기 때문에 생각했던 당신은 빠른 수렴을 + +441 +00:32:32,970 --> 00:32:38,839 + 뿐만 아니라 정보가 왜이 단계 업데이트를 사용하도록 종류의 불가능하다 + +442 +00:32:38,839 --> 00:32:47,069 + 과정의 문제에 대한 작품을 모든되는 교육 열정은 백을 말한다 + +443 +00:32:47,069 --> 00:32:48,500 + 만 기본 네트워크 + +444 +00:32:48,500 --> 00:32:52,299 + 백 만 백 만 행렬 그리고 당신은 그것을 변환 할 + +445 +00:32:52,299 --> 00:32:59,259 + 그이 
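The three decay schedules mentioned above, as minimal sketches; the decay constants here are illustrative assumptions, not values from the lecture:

~~~python
import numpy as np

base_lr = 1e-2

def step_decay(epoch, drop=0.9):
    return base_lr * drop ** epoch        # multiply by a constant once per epoch

def exp_decay(epoch, k=0.1):
    return base_lr * np.exp(-k * epoch)   # exponential decay

def one_over_t_decay(epoch, k=0.1):
    return base_lr / (1.0 + k * epoch)    # 1/t decay
~~~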
너무 행운 그래서 몇 가지가 발생하지 않을 + +446 +00:32:59,259 --> 00:33:02,480 + 알고리즘과 난 그냥 당신이 당신이 그들을 사용하지 않을 알고 싶습니다 + +447 +00:33:02,480 --> 00:33:05,650 + 기본적으로 뭔가 불리는 곳 DHS하는 아래 클래스 + +448 +00:33:05,650 --> 00:33:08,360 + 수 있습니다 당신은 패션을 변환하지 멀리 얻을 구축 + +449 +00:33:08,359 --> 00:33:11,819 + 모든 순위 연속 업데이트를 통해 헤센의 근사 + +450 +00:33:11,819 --> 00:33:15,000 + 하나는 그것의 종류의 세션을 구축하지만 당신은 여전히​​ 헤 시안을 저장해야 + +451 +00:33:15,000 --> 00:33:18,279 + 대규모 네트워크에 대한 다음 거기에 뭔가 더 좋은 때문에 여전히 메모리에 + +452 +00:33:18,279 --> 00:33:22,710 + 제한 제레미 BFGS의 약자라는 파운드 실제로 가을에 저장되지 않았습니다 + +453 +00:33:22,710 --> 00:33:26,980 + 패션 아니면 근사 회원 그리고 그 사람들이 실제로 사용하는 무엇을 + +454 +00:33:26,980 --> 00:33:33,549 + 때로는 지금 당신은 때때로 최적화 문헌에 언급 참조합니다 LBS + +455 +00:33:33,549 --> 00:33:37,769 + 그것은 우리를 위해 정말 정말 잘 작동 특히 당신은 작은 하나가있는 경우 + +456 +00:33:37,769 --> 00:33:42,450 + 이처럼 상자 같은 결정 기능에는 확률 적 노이즈가 없습니다 + +457 +00:33:42,450 --> 00:33:47,920 + 과 모든 것을 더 도시는 일반적으로 손실을 분쇄 할 수 있습니다 메모리 주소에 맞는 없다 + +458 +00:33:47,920 --> 00:33:53,200 + 기능을 아주 쉽게 그러나 아주 아주 기본적 파운드 GS2을 연장으로 까다로운 + +459 +00:33:53,200 --> 00:33:56,539 + 대규모 데이터 세트 및 이유는이 많은 의사를 서브 샘플링 하였다된다 + +460 +00:33:56,539 --> 00:33:59,730 + 우리는 많은 그래서 WASSUP에 간단한 메모리에 모든 훈련 데이터를 맞지 않을 수 있기 때문에 + +461 +00:33:59,730 --> 00:34:02,930 + 배치는 다음 나는이 많은 경기와에 작품의 위험이있을거야 그 + +462 +00:34:02,930 --> 00:34:06,810 + 근사는 서로 다른 여러 배치를 교환하고 같이있는 잘못에 + +463 +00:34:06,809 --> 00:34:10,449 + 또한 당신이 조심해야 할 능력을 가지고 당신은 확인해야합니다 + +464 +00:34:10,449 --> 00:34:12,539 + 당신이 드롭 아웃을 수정해야 + +465 +00:34:12,539 --> 00:34:17,690 + 당신이 있는지 확인해야하므로 내부적으로 불량배이기는하지만 함수 당신의 + +466 +00:34:17,690 --> 00:34:20,679 + 기능 많은 많은 다른 시간이 모든 근사하고 거짓말을하고있다 + +467 +00:34:20,679 --> 00:34:24,480 + 검색 물건이 매우 무거운 함수의 같은 것을 그래서 당신은 확인해야합니다 + +468 +00:34:24,480 --> 00:34:26,668 + 당신이 사용할 때 사용하지 않거나 출처 확인 + +469 +00:34:26,668 --> 00:34:29,889 + 랜덤 정말 연습 우리에 그래서 기본적으로 그것을 좋아하지 않을 때문에 + +470 +00:34:29,889 --> 00:34:33,779 + 큰 잘 못했습니다 정말 일을하지하지 않는 것 때문에 모든 BHS를 사용하지 않는 + +471 +00:34:33,780 --> 00:34:36,970 + 지금은 다른 방법에 비해 너무 많은 재료가 갖는 기본적 + +472 +00:34:36,969 --> 00:34:41,529 + 일이 당신이 더 나은 것은 바로이 우리의 물건을 잡음을하지만 이상을 수행합니다 + +473 +00:34:41,530 --> 00:34:47,880 + 당신이 할 수있는 경우 그 거래는 좋은 선택으로 사용 요약 그렇게 꺼져과 + +474 +00:34:47,880 --> 00:34:51,570 + 그렇지 않은으로 당신이 아마 하루에 은행 감당할 수있는 여유 + +475 +00:34:51,570 --> 00:34:55,419 + 2009 메모리와 앞으로 매우 큰 소득과 그들에 패스를 얻을 + +476 +00:34:55,418 --> 00:35:00,460 + 메모리 당신은 파운드로 볼 수 있지만에서 사용 관행에 표시되지 않습니다 + +477 +00:35:00,460 --> 00:35:05,220 + 현재 연구 방향 비록 지금 바로 대규모 설정 + +478 +00:35:05,219 --> 00:35:10,009 + 당신이이기 때문에 그래서 다른 개인 업데이트의 제 논의를 결론 + +479 +00:35:10,010 --> 00:35:14,830 + 학습 속도는 우리가 거​​기에이 클래스의 모든 베아트리체 조사하지 않을거야 + +480 +00:35:14,829 --> 00:35:24,739 + 바로 다시 질문 + +481 +00:35:24,739 --> 00:35:34,609 + 당신에 대한 요구하는지 너무 좋은 자동으로 당신이있어 구분 예를 들어 + +482 +00:35:34,610 --> 00:35:38,510 + 그래서 시간이 지남에 속도를 학습 당신은 또한 당신이 있다면 사건을 깰 학습 사용합니다 + +483 +00:35:38,510 --> 00:35:41,930 + 그래서 일반적으로 그랜드 이상을 사용하면 때를 매우 일반적인 영기를 배우는 참조 + +484 +00:35:41,929 --> 00:35:55,379 + 실제로 나는 당신이 대학원 또는 그러나 그것을 사용하는 경우 확실하지 않다 또는 아담 그래 그것은 아닙니다입니다 + +485 +00:35:55,380 --> 00:36:04,900 + 하지 아니 아주 좋은 대답 당신은 그것을 할 확실히 할 수 있지만 어쩌면 항목이 아니라고 + +486 +00:36:04,900 --> 00:36:08,910 + 아담처럼 그냥 방자 안드로이드 때문에에서 학습 (30)를하지 않습니다 + +487 +00:36:08,909 --> 00:36:12,339 + 이 새는 그라데이션입니다하지만 그는 학습 속도가 된 큰 우려했다 + +488 +00:36:12,340 --> 00:36:15,170 + 그것은 인도 자동으로 20을 부패 있기 때문에 아마 이해가되지 않습니다 + +489 +00:36:15,170 --> 00:36:22,710 + 괜찮아 괜찮아 우리는 매우 간단하게 같은 모델 앙상블 I에 갈거야 + +490 +00:36:22,710 --> 00:36:24,829 + 그것은 아주 간단하기 때문에 그것에 대해 얘기 + +491 +00:36:24,829 --> 00:36:28,750 + 당신이 당신의 훈련 데이터에 여러 
개의 독립적 인 모델을 훈련하면 밝혀 + +492 +00:36:28,750 --> 00:36:32,949 + 대신 다음 단 하나의 하나의 당신은 당신이했습니다이 시간에 결과를 평균 + +493 +00:36:32,949 --> 00:36:39,929 + 항상 22 % 추가 성능 확인 지금이 정말 이론적하지있어 + +494 +00:36:39,929 --> 00:36:43,289 + 그 결과 같은 종류의하지만 그냥 연습처럼 여기 결과 + +495 +00:36:43,289 --> 00:36:46,570 + 기본적으로이 거의 항상 더 잘 작동 할 좋은 것 같다 + +496 +00:36:46,570 --> 00:36:48,850 + 물론 단점은 모든 다른 독립이 필요하지 않습니다 + +497 +00:36:48,849 --> 00:36:52,259 + 모델과 앞으로해야 할 필요와 그들과 여러분의 뒤로 클래스 + +498 +00:36:52,260 --> 00:36:56,850 + 그 적합하지 그래서 그들 모두를 훈련 아마 당신은 아래로 느려했다 + +499 +00:36:56,849 --> 00:37:00,989 + 당신의 앙상블 모델의 수와 단지 시간 등 몇 가지 팁이있다 + +500 +00:37:00,989 --> 00:37:05,689 + 및 유용한 정보 비트를위한 그래서 하나의 접근 방식을 따기 어떤 종류의에 사용 + +501 +00:37:05,690 --> 00:37:08,619 + 예를 들어 당신이 가진 당신이 당신의 신경망을 훈련으로 모든 서로 다른를 + +502 +00:37:08,619 --> 00:37:11,680 + 체크 포인트는 일반적으로 체크 포인트를 저장 그들에게 하나 하나 하키를 저장하는 + +503 +00:37:11,679 --> 00:37:14,750 + 당신은 당신이 당신의 검증 성능 그래서 한 가지 당신이 무엇인지 알아낼 + +504 +00:37:14,750 --> 00:37:18,119 + 실제로 판명 예를 위해 할 수있는 것은 때로는 같은 얻을 당신입니다 + +505 +00:37:18,119 --> 00:37:23,420 + 당신의 모델에 대한 몇 가지 체크 포인트를 가지고 당신은 그했다 그 + +506 +00:37:23,420 --> 00:37:26,349 + 실제로 때때로에서 사물과 그렇지 그래서 방법을 개선하기 위해 밝혀 + +507 +00:37:26,349 --> 00:37:29,730 + 한 미국 훈련 칠 독립적 인 모델을 훈련해야하지만 당신은 어떤 앙상블 + +508 +00:37:29,730 --> 00:37:34,809 + 그와 관련된 다른 체크 포인트의 트릭있다 + +509 +00:37:34,809 --> 00:37:39,739 + 이것은 우리가 전에 본 적이 당신의 네 단계를 여기에 무슨 일이 일어나고 있는지에 항의 + +510 +00:37:39,739 --> 00:37:44,709 + 나는 실행으로 여기에 예비 선거 X 테스트의 또 다른 세트와이 텍스트를 유지하고있어 + +511 +00:37:44,710 --> 00:37:49,590 + 일부 기하 급수적으로 내 실제 매개 변수 벡터 X를 썩 때 내가 사용하는 + +512 +00:37:49,590 --> 00:37:52,750 + 텍스트 테스트 및 검증이나 테스트 데이터는 거의 항상이 밝혀 + +513 +00:37:52,750 --> 00:37:57,199 + 이 때문에 종류의 같이하고있는 단독 확인 X를 사용하는 것보다 약간 더 나은 수행 + +514 +00:37:57,199 --> 00:38:00,919 + 마지막으로 이전 몇 주 요인 작은 같은 가중 앙상블 그것은 종류의 + +515 +00:38:00,920 --> 00:38:05,309 + 어려운 종류의 한 가지 방법으로 실제로는하지만, 기본적으로 해석하는 + +516 +00:38:05,309 --> 00:38:08,329 + 그것을 나는이 실제로 할 수있는 좋은 일이 이유에 대해 처리 할 수​​있는 하나의 방법을 해석 + +517 +00:38:08,329 --> 00:38:12,900 + 당신의 공 기능을 최적화에 대해 생각하고, 당신은 너무 많은 스테핑있어 + +518 +00:38:12,900 --> 00:38:16,849 + 실제로 모든 단계의 평균을 복용 최소 당신을 얻을 수 주위에 + +519 +00:38:16,849 --> 00:38:20,980 + I가 할 수있는 최소한의 확인에 가까운이 실제로 약간 중요한 이유 + +520 +00:38:20,980 --> 00:38:25,639 + 더 나은 우리가 가고 있기 때문에 내가 가진 작은 앙상블은 내 인생을 논의하기 위해 수 있도록 + +521 +00:38:25,639 --> 00:38:29,759 + 드롭 아웃으로 보면 이것은 당신이 될 것입니다 매우 중요한 기술이다 + +522 +00:38:29,760 --> 00:38:34,590 + 드롭 아웃에 대한 생각은 매우 흥미로운 그래서 등등 구현 및 사용 + +523 +00:38:34,590 --> 00:38:38,620 + 당신의 전체 목적을하고있는 것처럼 당신이 강하와 함께 할 당신입니다 + +524 +00:38:38,619 --> 00:38:45,429 + 신경 네트워크는 무작위 그래서 그냥 통과 공원에서 일부 뉴런 (20)을 설정합니다 + +525 +00:38:45,429 --> 00:38:49,839 + 당신이 당신의 데이터 X의 전진 패스를하고있는 당신이 어떤 작업을 수행하는지 명확히하는 것은 당신의 + +526 +00:38:49,840 --> 00:38:52,670 + 이 기능에 발언권을 계산 + +527 +00:38:52,670 --> 00:38:57,010 + 첫 번째 숨겨진 층 W의 비선형 하나 배 XP SP1 그래서 + +528 +00:38:57,010 --> 00:39:02,830 + 그건 좀 이상이고 다음 여기 이진수의 마스크를 계산합니다 + +529 +00:39:02,829 --> 00:39:05,230 + 여부에 기초하여 0 또는 1 중 + +530 +00:39:05,230 --> 00:39:09,469 + 0과 1 사이 숫자는 우리가 심각한 펌프를 듣고있는 P보다 작은 + +531 +00:39:09,469 --> 00:39:13,469 + 당신이 원하는이 우리는 0과 1의 절반과 절반의 바이너리 마스크입니다 + +532 +00:39:13,469 --> 00:39:17,469 + 다중 정품 인증 적극적으로 우리가 그들의 절반을 포기 숨겨진되는 + +533 +00:39:17,469 --> 00:39:21,349 + 모든 정품 인증 각 하나의 숨겨진 레이어를 계산 한 다음 우리는 두 가지가 드롭 + +534 +00:39:21,349 --> 00:39:25,730 + 무작위로 유닛, 그리고, 우리는 두 번째, 그리고, 우리는 무작위로 그 중 절반을 드롭 할 + +535 +00:39:25,730 --> 00:39:30,699 + 확인 물론 이것은 단지 전방이 후방 패스이어야 합격입니다 + +536 +00:39:30,699 --> 00:39:35,719 + 적절하게이 방울도 다시 전파 할 수 있도록뿐만 아니라 조정 + +537 +00:39:35,719 --> 
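The parameter-averaging trick just described, as a minimal sketch; the arrays, the update rule, and the 0.995 decay constant are illustrative stand-ins:

~~~python
import numpy as np

x = 0.01 * np.random.randn(100)   # stand-in for the current parameter vector
x_test = x.copy()                 # exponentially decaying average of past parameters

for step in range(1000):
    dx = np.random.randn(100)             # stand-in for this step's gradient
    x -= 1e-3 * dx                        # whatever update rule is in use
    x_test = 0.995 * x_test + 0.005 * x   # keep the running average up to date
# validate and test with x_test rather than x; it usually does slightly better
~~~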
00:39:39,309 + 그것에서뿐만 아니라, 그래서 그렇게 통해 구현할 때 중퇴 그렇게 기억 + +538 +00:39:39,309 --> 00:39:41,980 + 전진은 드롭을 통과하지만, 역 전파는 경우 후방 패스 + +539 +00:39:41,980 --> 00:39:45,829 + U2에 의해 곱하면 하나는 그래서 당신이 장소에서 기본적으로 생기를 죽인를 구입 + +540 +00:39:45,829 --> 00:39:46,559 + 당신은 떨어 곳 + +541 +00:39:46,559 --> 00:39:52,179 + 나는이 방법을 처음으로 당신이를 보였다 때 확인 그래서 당신은 생각 될 수 있습니다 + +542 +00:39:52,179 --> 00:39:56,799 + 이 전혀 이해가 않습니다이 좋은 생각은 왜에 어떻게 원하는 것이되었다 + +543 +00:39:56,800 --> 00:40:00,390 + 당신의 신경증을 계산하고 (20)이 어떠한 의미를에 다음 그들에게 경향을 설정 + +544 +00:40:00,389 --> 00:40:12,369 + 그래서 나도 몰라 그럼 이제 너희들은 앞서 생각에 과열을 방지 할 수 있도록하자 + +545 +00:40:12,369 --> 00:40:23,880 + 어떤 의미 + +546 +00:40:23,880 --> 00:40:27,170 + 당신은 정말 당신이 그것을 할 말을하는지 있도록 올바른 정보를 얻고 + +547 +00:40:27,170 --> 00:40:31,240 + 난 단지 그때 내 네트워크의 절반을 거​​의 사용하고있는 경우 때문에 overfitting을 방지 + +548 +00:40:31,239 --> 00:40:34,500 + 난 단지 내 네트워크를 한 번의 절반을 사용하고 작은 용량의 같은이 + +549 +00:40:34,500 --> 00:40:37,739 + 하나의 작은 네트워크 I 기본적으로 만 너무 거기에있어 단지처럼 거기 + +550 +00:40:37,739 --> 00:40:40,209 + 이 종류의 그래서 나는 다음 전체 네트워크 거기에 직장에서 무슨 일이 있었는지 수행 할 수 있습니다 + +551 +00:40:40,210 --> 00:40:44,798 + 당신이 대표 할 수 있는지의 관점에서 당신의 분산 제어 등 + +552 +00:40:44,798 --> 00:40:55,619 + 그래 나는 종종 내가하지 않은 다양한 무역에 의해 등의 조건을 충족하고 싶습니다 + +553 +00:40:55,619 --> 00:40:59,480 + 정말 우리는 너무 많이하지 않을거야하지만 힘들다는 더 작은 모델을 + +554 +00:40:59,480 --> 00:41:08,579 + 그 이상하지만 서로 다른 신경 네트워크의 여러 앙상블을 갖는 것은 가고 있었다 + +555 +00:41:08,579 --> 00:41:34,289 + 즉 사용 된 하나의 경우 때문에 조금에 그 시점으로 이동 + +556 +00:41:34,289 --> 00:41:38,119 + 위층 확인 내 다음 인생에서 가리 말씨의 더 나은 방법이 + +557 +00:41:38,119 --> 00:41:43,028 + 의 그 괜찮아 우리가하려고하는 것을 가정하는 특정 예를 살펴 보자 + +558 +00:41:43,028 --> 00:41:47,130 + 신경 네트워크의 고양이 점수를 계산하고 여기에 아이디어는 것입니다 + +559 +00:41:47,130 --> 00:41:51,380 + 이러한 모든 다른 단위를 가지고 강하하고있는 스포츠는 많은 노래 + +560 +00:41:51,380 --> 00:41:54,920 + 방법은 드롭 아웃을보고 있지만 그 중 하나는 당신의 코드 당신의 강요 것입니다 + +561 +00:41:54,920 --> 00:41:59,608 + 어떤 이미지의 표현은 당신이 필요로하기 때문에 중복하고 있었다 + +562 +00:41:59,608 --> 00:42:03,318 + 그 중복 당신은 당신이 절반을받을 제어 할 수있는 방법에 대해이기 때문에 + +563 +00:42:03,318 --> 00:42:06,710 + 네트워크의 내려 그래서 당신은 더 많은 당신의 고양이 점수를 확인해야합니다 + +564 +00:42:06,710 --> 00:42:09,900 + 기능은 제대로 요리 고양이 점수 때문에를 계산하기 위하여려고하는 경우 + +565 +00:42:09,900 --> 00:42:14,000 + 어떤 어떤이 삭제 될 수도 있기 때문에 당신이 그것에 의존 할 수 그들 중 하나 등등 + +566 +00:42:14,000 --> 00:42:17,068 + 이 경우 우리는 여전히 캐츠 킬을 분류 할 수 있도록 그게 보는 하나의 방법입니다 + +567 +00:42:17,068 --> 00:42:22,639 + 우리는 매우 중요 그래서 여부에 대한 액세스 권한이없는 경우에도 적절 + +568 +00:42:22,639 --> 00:42:24,768 + 즉, 드롭 아웃의 하나의 해석이다 + +569 +00:42:24,768 --> 00:42:29,088 + 드롭 아웃의 또 다른 해석은 다음과 같이 근육의 관점에서 언급되어, + +570 +00:42:29,088 --> 00:42:33,358 + 드롭 아웃 효과적으로 모델의 큰 앙상블 훈련으로 바라 보았다 될 수있다 + +571 +00:42:33,358 --> 00:42:36,420 + 기본적으로 서브되는 + +572 +00:42:36,420 --> 00:42:43,099 + 하나의 큰 네트워크는하지만, 그들은 당신이 그렇게 좋은 방식으로 예비 선거를 공유 할 수 없습니다 + +573 +00:42:43,099 --> 00:42:46,650 + 이것을 이해하면 우리는 우리와 우리를 위해 그것을 할 경우 다음 사항을주의해야 + +574 +00:42:46,650 --> 00:42:49,970 + 무작위로 무엇을 생각 뒤로 패스에 비해 단위의 일부를 내려 + +575 +00:42:49,969 --> 00:42:53,669 + 나는 우리가 임의의이 내려 가지고 가정 오른쪽도록 그라데이션 발생 + +576 +00:42:53,670 --> 00:42:57,409 + 후방 패스에서 이러한 단위는 우리가 다시 최대를 통해 전파하고 그 + +577 +00:42:57,409 --> 00:43:01,879 + 했다 특히 만 뉴런의 수 있도록 드롭 아웃에 의해 유도 된 + +578 +00:43:01,880 --> 00:43:05,349 + 전진 패스에 사용 실제로 업데이트 또는 불만이 흐르는이됩니다 + +579 +00:43:05,349 --> 00:43:09,599 + 차단 된 모든 신경 세포가 20 아니 그라디언트 흐름 없기 때문에 그들을 통해 + +580 +00:43:09,599 --> 00:43:13,650 + 그것과 이전 계층의 무게를 너무 업데이트되지 않습니다 + +581 +00:43:13,650 --> 00:43:18,550 + 적극적으로 더 이상 그에 이전 계층으로의 연결을 중퇴했다 + +582 +00:43:18,550 --> 00:43:22,750 + 업데이트 그냥했다 그렇게 정말 무엇을이없는 것처럼 그건되지 않습니다 + +583 
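A minimal sketch of the train-time dropout forward pass described above, for a three-layer network; the shapes and names are assumptions:

~~~python
import numpy as np

p = 0.5  # probability of keeping a unit

def train_forward(X, W1, b1, W2, b2, W3, b3):
    """Forward pass with dropout on both hidden layers (train time only)."""
    H1 = np.maximum(0, X.dot(W1) + b1)
    U1 = np.random.rand(*H1.shape) < p    # first binary mask
    H1 *= U1                              # drop about half of the activations
    H2 = np.maximum(0, H1.dot(W2) + b2)
    U2 = np.random.rand(*H2.shape) < p    # second binary mask
    H2 *= U2
    out = H2.dot(W3) + b3
    return out, (U1, U2)  # the backward pass must multiply by the same masks
~~~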
+00:43:22,750 --> 00:43:27,230 + 당신의 신경 네트워크의 일부를 샘플링 마스크 하위 오프 삭제하고 만있어 + +584 +00:43:27,230 --> 00:43:30,789 + 교육 당신이 일이 발생할 것이 그 하나의 예에 신경 네트워크 + +585 +00:43:30,789 --> 00:43:44,980 + 시간의 점은 하나의 모델은 하나의 데이터 포인트에 비가 가져옵니다 있도록 + +586 +00:43:44,980 --> 00:43:51,250 + 확인을 나는 것을 반복 시도 할 수 있습니다 + +587 +00:43:51,250 --> 00:44:04,239 + 여기 어딘가에에서 온 너희들이 아닌지를 이해하려면 + +588 +00:44:04,239 --> 00:44:10,789 + 당신이 당신의 자신의 드롭 드롭을 삭제할 때 확인 그래서 내가이의 예 있었으면 좋겠다 + +589 +00:44:10,789 --> 00:44:14,429 + 내가 곱 값에 드롭하면 신경 세포의 오른쪽하지만 최대 09 그 효과를 구입 + +590 +00:44:14,429 --> 00:44:17,918 + 손실의 기능에 영향은 경사 (10)가 있기 때문에 바로 그래서이 없다 + +591 +00:44:17,918 --> 00:44:21,668 + 그는 손실을 계산에 사용하고 그래서 안되었다에 대한 가중치는을받지 않습니다 + +592 +00:44:21,668 --> 00:44:25,679 + 업데이트 우리는 네트워크의 일부를 표본했는데 것처럼 그래서 우리의 단지 열차 + +593 +00:44:25,679 --> 00:44:28,959 + 현재 만에 훈련과 네트워크 나니 하나의 데이터 포인트 + +594 +00:44:28,958 --> 00:44:32,348 + 모든 시간 우리의 가능성 표본이 다른 부분을 위해 그것을 할 당신의 + +595 +00:44:32,349 --> 00:44:35,899 + 신경망하지만 이상한 같은 종류의 그래서 그들은 모두 공유 매개 변수 + +596 +00:44:35,898 --> 00:44:39,778 + 다른 모델의 많은 모든 교육의 앙상블 월요일이 점하지만 그들은 모두 + +597 +00:44:39,778 --> 00:44:48,458 + 공유 매개 변수 즉 이해가되지 않습니다 여기 종류의 약 아이디어 그래서 + +598 +00:44:48,458 --> 00:45:07,108 + 일반적으로 50 %이이 그렇게 동일한 크기를 발생하는 매우 거친 방법 저장 + +599 +00:45:07,108 --> 00:45:09,798 + 세계의 힘은 우리가 실제로 컴퓨터 H 알 + +600 +00:45:09,798 --> 00:45:14,009 + 우리는 우리가했던 것처럼 컴퓨터의 모든 전에를 계산하는 것이 더의 절반 이상 + +601 +00:45:14,009 --> 00:45:17,119 + 값은 20 떨어졌다 얻을 것이다 + +602 +00:45:17,119 --> 00:45:29,250 + 아무것도 그들이 좋은거야 변경되지 않습니다 + +603 +00:45:29,250 --> 00:45:38,349 + 대신 문제에 대한 경쟁 역은 도로에서 경쟁 할 + +604 +00:45:38,349 --> 00:45:42,150 + 당신은 당신이 할 수 있도록 스포츠 업데이트를 수행 할 경우에 삭제되지 않습니다 + +605 +00:45:42,150 --> 00:45:44,950 + 하지만 이론적으로 나는 실제로 우리가 걱정하지 않는 이상한 생각하지 않습니다 + +606 +00:45:44,949 --> 00:46:12,369 + 너무 많이하고 그래서 항상 돼 작업 훈련 그래서 매일 반복 우리를 + +607 +00:46:12,369 --> 00:46:15,469 + 우리가 거​​ 드롭하는지에 대해 우리가 샘플 분 경기 또는 노이즈 패턴을 얻을 + +608 +00:46:15,469 --> 00:46:19,359 + 앞으로 가고, 뒤로 패스와 그라데이션 우리는이 이상을 선회 계속 + +609 +00:46:19,360 --> 00:46:31,360 + 또 다시 그래서 당신의 질문에 어떻게 든 영리 사실 바이너리 마스크처럼 + +610 +00:46:31,360 --> 00:46:35,829 + 최고의 정말 안되지 않은 모델 또는 뭔가를 최적화하는 방법 등 + +611 +00:46:35,829 --> 00:46:44,769 + 이루어집니다 또는 누군가가 내가 그래 내가 갈거야 너무 미안 들여다했다고 생각 + +612 +00:46:44,769 --> 00:46:47,389 + 하나의 슬라이드 다음 슬라이드에 해당 들어가 + +613 +00:46:47,389 --> 00:46:57,618 + 우리는이 시점에서 볼거야 나는 마지막 질문을 할게요 + +614 +00:46:57,619 --> 00:47:04,519 + 질문 하나 다른 레이어에 다른 양에게 드롭을 수행 할 수 있습니다 + +615 +00:47:04,518 --> 00:47:05,459 + 당신을 중지 아무것도 없다 + +616 +00:47:05,460 --> 00:47:09,338 + 그 직관적으로 당신은 당신이 더 필요하면 밖으로 강한 드롭을 적용 할 + +617 +00:47:09,338 --> 00:47:12,690 + 정규화 그렇게 볼 수 Primaris의 엄청난 금액을 갖는 층 거기 + +618 +00:47:12,690 --> 00:47:16,349 + 하나의 예에있어 소득 당신은 거기에 강한 하락에 의해 명중 할 + +619 +00:47:16,349 --> 00:47:20,269 + 반대로 우리가 어떤 네트워크의 초기에 볼 수 있습니다 몇 가지 레이어가있을 수 있습니다 + +620 +00:47:20,268 --> 00:47:24,248 + 코미디 쇼 층은 그가 정말 많은 드롭을 연주하지 않는 매우 작은 + +621 +00:47:24,248 --> 00:47:27,368 + 거기에 조금이가는 컬러 네트워킹은 예를 들어 아주 흔한 일 + +622 +00:47:27,369 --> 00:47:30,740 + 당신은 그 대답은 그래서 낮은 드롭 아웃 시간이 지남에 끝나는로 시작 + +623 +00:47:30,739 --> 00:47:38,848 + 예 내가 두 번째 질문은 당신이 대신 단위 그냥 드롭 아웃 할 수 잊었다 + +624 +00:47:38,849 --> 00:47:41,880 + 당신이 할 수 있고 그 뭔가라고 각각의 가중치는 우리가 원하는 연결 삭제 + +625 +00:47:41,880 --> 00:47:46,349 + 이 클래스에서 너무 많이 들어가 있지만,뿐만 아니라 내가 가진 것을 할 수있는 방법이있다합니다 + +626 +00:47:46,349 --> 00:47:52,829 + 지금은 내가 당신을 우리는이 모든 것을 도입했습니다되어 수행 할 작업을 이상적으로 신뢰하는 시간이야 + +627 +00:47:52,829 --> 00:47:56,940 + 바로 공원으로 노이즈가 통과하고 그래서 당신은 단지 시간과 지금 좋아하면 + +628 +00:47:56,940 --> 00:48:00,349 + 우리는 모든 소음을 통합하고 근사를 좋아 하죠하려는 싶습니다 + +629 
+00:48:00,349 --> 00:48:03,318
+ something like this: you have a test image that you want to classify, and
+
+630
+00:48:03,318 --> 00:48:06,909
+ you could do many forward passes with different settings of the binary masks,
+
+631
+00:48:06,909 --> 00:48:10,558
+ so you are using different subnetworks, and you could average across all of them,
+
+632
+00:48:10,559 --> 00:48:14,329
+ and in terms of the distribution that would probably be great, but unfortunately it is not
+
+633
+00:48:14,329 --> 00:48:17,818
+ very efficient. So it turns out you can actually approximate this process,
+
+634
+00:48:17,818 --> 00:48:22,338
+ and this was pointed out to some degree when dropout was first introduced:
+
+635
+00:48:22,338 --> 00:48:26,170
+ intuitively, what you want to do is make use of all of your neurons; you
+
+636
+00:48:26,170 --> 00:48:29,509
+ do not want to be randomly dropping them at test time, so we are going to
+
+637
+00:48:29,509 --> 00:48:33,548
+ leave all the neurons turned on in the forward pass of a
+
+638
+00:48:33,548 --> 00:48:39,920
+ test image; but we actually have to be careful about how we do this, so we do
+
+639
+00:48:39,920 --> 00:48:43,480
+ the full forward pass on the test image, we are not going to drop any units, but we have
+
+640
+00:48:43,480 --> 00:48:48,028
+ to watch out for one thing, basically. What is
+
+641
+00:48:48,028 --> 00:48:54,880
+ the issue? Think of one neuron in the network, and suppose it has two inputs;
+
+642
+00:48:54,880 --> 00:48:59,079
+ at test time all the inputs are present, since we are not dropping any units, so
+
+643
+00:48:59,079 --> 00:49:02,630
+ these two inputs have some activations, and at test time this neuron
+
+644
+00:49:02,630 --> 00:49:06,400
+ computes some value of its activation a, whatever that value may be;
+
+645
+00:49:06,400 --> 00:49:12,608
+ but what would the output of this neuron have been during training time? This value of a
+
+646
+00:49:12,608 --> 00:49:18,440
+ at training time... OK, at training time the dropout masks are random, so
+
+647
+00:49:18,440 --> 00:49:21,170
+ there are several different cases of what could have happened,
+
+648
+00:49:21,170 --> 00:49:27,068
+ and these cases end up at a different scale, and that is what we have to worry about, so let
+
+649
+00:49:27,068 --> 00:49:32,259
+ me show you exactly what I mean by this. Say this
+
+650
+00:49:32,260 --> 00:49:35,539
+ neuron is linear, say there is no nonlinearity, and what it computes
+
+651
+00:49:35,539 --> 00:49:39,990
+ during test time, with everything active, is this activation: a becomes w0 times
+
+652
+00:49:39,989 --> 00:49:44,848
+ x plus w1 times y, OK? That is what I want to compute at test time, and the reason
+
+653
+00:49:44,849 --> 00:49:48,420
+ I have to be careful is: what was the expected output during training time?
+
+654
+00:49:48,420 --> 00:49:51,528
+ In this particular case it would have been quite different: we had four
+
+655
+00:49:51,528 --> 00:49:55,619
+ possibilities, we could have dropped one or the other, or both, or neither, so those four
+
+656
+00:49:55,619 --> 00:49:56,720
+ possibilities,
+
+657
+00:49:56,719 --> 00:50:00,750
+ each computing a different value; and you can actually crunch this math and see that when
+
+658
+00:50:00,750 --> 00:50:01,659
+ you reduce it,
+
+659
+00:50:01,659 --> 00:50:07,548
+ you end up with one half times w0 x plus w1 y, so the expected output at training
+
+660
+00:50:07,548 --> 00:50:15,630
+ time of this neuron was actually only half of what you get when you
+
+661
+00:50:15,630 --> 00:50:19,640
+ use all of the units all of the time, and so to compensate for this,
+
+662
+00:50:19,639 --> 00:50:22,730
+ which comes from the fact that we were dropping units with probability one
+
+663
+00:50:22,730 --> 00:50:29,219
+ half, we have to squash this down by one half, so that is why the point
+
+664
+00:50:29,219 --> 00:50:35,358
+ five shows up in the test-time forward pass. So basically, if we did not do this, then
+
+665
+00:50:35,358 --> 00:50:39,019
+ we would end up having activations that are too large compared to what was expected during
+
+666
+00:50:39,019 --> 00:50:42,960
+ training time; the distribution basically changes, and you are
+
+667
+00:50:42,960 --> 00:50:45,639
+ going to break things in the network, because the downstream neurons are not used to seeing these
+
+668
+00:50:45,639 --> 00:50:49,368
+ big, hot activations, and so you have to compensate, and you need to
+
+669
+00:50:49,369 --> 00:50:53,798
+ squash things down: instead of just using all the stuff as is,
+
+670
+00:50:53,798 --> 00:50:57,480
+ you scale down every activation so that you recover your
+
+671
+00:50:57,480 --> 00:51:03,099
+ expected output. OK, so this is actually a tricky point, and I think I once heard
+
+672
+00:51:03,099 --> 00:51:06,559
+ Geoff Hinton give a talk saying that when he first came up with dropout,
+
+673
+00:51:06,559 --> 00:51:10,710
+ he actually did not figure this part out completely, so they tried dropout and it did not
+
+674
+00:51:10,710 --> 00:51:16,088
+ do anything, it did not work, and the reason was actually that he had missed this tricky
+
+675
+00:51:16,088 --> 00:51:19,340
+ part, as he actually admitted when pointing it out: so we have to scale the activations
+
+676
+00:51:19,340 --> 00:51:24,070
+ down, and with this effect in the system, everything then works much better. So I
+
+677
+00:51:24,070 --> 00:51:28,500
+ am just showing what this looks like: we compute these activations basically
+
+678
+00:51:28,500 --> 00:51:33,449
+ as we normally would in the neural network, so first and second hidden layer, but now at test time we
+
+679
+00:51:33,449 --> 00:51:38,869
+ have to multiply by p, the dropout probability, for example one half, to scale
+
+680
+00:51:38,869 --> 00:51:43,139
+ down the activations, so that the expected output now
+
+681
+00:51:43,139 --> 00:51:46,969
+ actually equals the expected output during training time; so when this
+
+682
+00:51:46,969 --> 00:51:52,449
+ recovery is done for dropout, the expected outputs match up, and this actually works
+
+683
+00:51:52,449 --> 00:52:18,069
+ really well. [A student asks a question.] So the only difference is between train and
+
+684
+00:52:18,070 --> 00:52:20,780
+ test: at test time you are using all the neurons, so there is a mismatch,
+
+685
+00:52:20,780 --> 00:52:24,580
+ and either you correct for it at this point, or you can use what we call
+
+686
+00:52:24,579 --> 00:52:29,469
+ inverted dropout, which I will show you in a bit, so we will get to that in a bit.
+
+687
+00:52:29,469 --> 00:52:34,319
+ Dropout summary: you want to drop your units in the forward pass, dropping with
+
+688
+00:52:34,320 --> 00:52:38,210
+ probability of keeping p, and just do not forget to scale at test time; if you
+
+689
+00:52:38,210 --> 00:52:40,820
+ do this, the networks will work well.
+
+690
+00:52:40,820 --> 00:52:44,190
+ OK, and also do not forget to backpropagate the mask.
+
+691
+00:52:44,190 --> 00:52:49,710
+ Showing inverted dropout: the way to do this is to take care of this
+
+692
+00:52:49,710 --> 00:52:53,349
+ mismatch between train and test in a slightly different way
+
+693
+00:52:53,349 --> 00:52:57,710
+ than it was done before. So in particular, what we are going to do is change this:
+
+694
+00:52:57,710 --> 00:53:01,250
+ we are not going to touch the test-time code, those lines stay frozen; what we are going to do is
+
+695
+00:53:01,250 --> 00:53:04,980
+ the scaling here at training time instead, so we are going to scale the activations,
+
+696
+00:53:04,980 --> 00:53:07,960
+ not down but up, at training time, dividing by p, so if p is point five we are
+
+697
+00:53:07,960 --> 00:53:12,079
+ boosting the train-time activations by a factor of two, and at test time we can leave our code
+
+698
+00:53:12,079 --> 00:53:16,029
+ untouched. Right, so we are doing the amplification of the train-time activations;
+
+699
+00:53:16,030 --> 00:53:20,880
+ we are making everything artificially larger by this factor at training time, and then at test
+
+700
+00:53:20,880 --> 00:53:24,450
+ time we do not have to do anything, because now we are just recovering the clean
+
+701
+00:53:24,449 --> 00:53:27,819
+ expectation, since we have already done the scaling at train time, so you can
+
+702
+00:53:27,820 --> 00:53:31,010
+ properly calibrate the expectations between train and test this way as well.
+
+703
+00:53:31,010 --> 00:53:39,290
+ It is this inverted dropout that you will see used most often in all of these
+
+704
+00:53:39,289 --> 00:53:42,779
+ implementations. Right, so actually using dropout really comes down to a few lines:
+
+705
+00:53:42,780 --> 00:53:47,300
+ you change the forward and backward pass a little bit, but the networks almost always work better with
+
+706
+00:53:47,300 --> 00:54:15,070
+ this, unless you are not in a regime where you are seriously overfitting. That
+
+707
+00:54:15,070 --> 00:54:17,230
+ is the reason I mentioned here that this
+
+708
+00:54:17,230 --> 00:54:22,039
+ approximation is an ensemble approximation, and one of the reasons it
+
+709
+00:54:22,039 --> 00:54:25,029
+ is only an approximation is that once you actually have nonlinearities in the picture, then
+
+710
+00:54:25,030 --> 00:54:27,769
+ these expected outputs get messed up, because of all kinds of nonlinear
+
+711
+00:54:27,769 --> 00:54:37,500
+ effects on top of this. Thank you for pointing that out; I will go over that.
+
+712
+00:54:37,500 --> 00:54:44,769
+ [A student asks] whether plain dropout and inverted dropout
+
+713
+00:54:44,769 --> 00:54:49,039
+ are really equivalent, and so whether doing it one way or the other is an issue.
+
+714
+00:54:49,039 --> 00:54:59,309
+ I would say... maybe you are right; you can maybe think about it, you may well be right.
+
+715
+00:54:59,309 --> 00:55:37,949
+ Here I am thinking about all of this just in terms of expectations,
+
+716
+00:55:37,949 --> 00:55:41,349
+ the expectation when you are dropping a half, so it is the right thing to use even though there
+
+717
+00:55:41,349 --> 00:55:44,049
+ is actually some randomness in exactly how much ends up being dropped,
+
+718
+00:55:44,050 --> 00:55:47,370
+ but at large it is fine.
+
+719
+00:55:47,369 --> 00:55:51,869
+ Oh yeah, so I would like to share a funny story about dropout: this was
+
+720
+00:55:51,869 --> 00:55:55,509
+ at the deep learning summer school in 2012, and Geoff Hinton was, for the first time, or at
+
+721
+00:55:55,510 --> 00:55:56,590
+ least the first time I saw it,
+
+722
+00:55:56,590 --> 00:56:00,930
+ presenting dropout.
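Inverted dropout as just described, in a minimal single-layer sketch: the 1/p scaling moves to training time, so the test-time code stays untouched:

~~~python
import numpy as np

p = 0.5  # probability of keeping a unit

def train_layer(X, W1, b1):
    H1 = np.maximum(0, X.dot(W1) + b1)
    U1 = (np.random.rand(*H1.shape) < p) / p  # mask already boosted by 1/p
    return H1 * U1

def test_layer(X, W1, b1):
    return np.maximum(0, X.dot(W1) + b1)      # all neurons on, no scaling needed
~~~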
그는 기본적으로 그냥 괜찮 말하는 것을 당신의 뉴런 (20)에서 + +723 +00:56:00,929 --> 00:56:04,589 + 랜덤 그냥 난 그냥 바쁜 활성화를 해요이 항상 잘 작동 + +724 +00:56:04,590 --> 00:56:07,750 + 더 나은 우리는 내 친구가 앉아으로 그 흥미로운 와우 같은거야 + +725 +00:56:07,750 --> 00:56:10,469 + 내 옆에 그는 단지 바로 그 역이 있음 자신의 노트북을 뽑아 + +726 +00:56:10,469 --> 00:56:13,959 + 대학 기계와 이야기하는 동안과가 바로 그것을 구현 + +727 +00:56:13,960 --> 00:56:17,340 + 시간 제프 힌튼 마무리는 그가 더 나은 결과를 얻고지고 있다고 얘기 + +728 +00:56:17,340 --> 00:56:18,950 + 실제로 미술 기자의 상태 등 + +729 +00:56:18,949 --> 00:56:25,189 + 그는 빠른 작업 한 자신의 데이터에 나는 누군가가 내려면 같이 가서 봤어요 + +730 +00:56:25,190 --> 00:56:30,490 + 일본은 너무 많이 나는 이야기를하려고하면서 추가로 5 %가 바로 다음이었다 + +731 +00:56:30,489 --> 00:56:33,589 + 즉, 무언가가 실제로 매우 몇 번 거기에 정말 재미라고 생각했다 + +732 +00:56:33,590 --> 00:56:36,590 + 이런처럼 그것이 그 중 하나이기 때문에 드롭 아웃은 훌륭한 일이다 + +733 +00:56:36,590 --> 00:56:42,390 + 소수의 투자자는 매우 간단하고 항상 그냥 잘 작동하고있다 + +734 +00:56:42,389 --> 00:56:45,579 + 내가 생각 우리가 주운 팁과 트릭의 이러한 종류의 거의 + +735 +00:56:45,579 --> 00:56:49,659 + 문제는 얼마나 더 많은 간단한 일 드롭 아웃 등을들 수있다 거기입니다 + +736 +00:56:49,659 --> 00:56:50,879 + 당신에게 2 %의 활력을 불어 + +737 +00:56:50,880 --> 00:56:54,140 + 항상 우리는 모른다 + +738 +00:56:54,139 --> 00:57:01,199 + 확인은 그래서 그라데이션 검사로이 시점에서 갈 거라고하지만 난을 생각한다 + +739 +00:57:01,199 --> 00:57:04,588 + 실제로 나는 모든 신경의 피곤 때문에이 작업을 건너 내가거야 결정 + +740 +00:57:04,588 --> 00:57:07,130 + 우리와 같은 네트워크는 모든 훈련의 세부 정보를 많이 얘기했습니다 그 + +741 +00:57:07,130 --> 00:57:10,180 + 작품과 내가 너희들뿐만 아니라 피곤하고 그래서 그라데이션을 건너 뛸 것 같네요 + +742 +00:57:10,179 --> 00:57:13,469 + 그것은 아주 잘 노트 여기에 설명되어 있기 때문에 체크 나는 보시기 바랍니다 + +743 +00:57:13,469 --> 00:57:19,028 + 그것은 까다로운 과정의 종류를 통해 이동하는 시간의 비트에 소요 + +744 +00:57:19,028 --> 00:57:23,190 + 프로세스의 모든 어려움을 감사하고 그래서 그냥 내가 읽어 + +745 +00:57:23,190 --> 00:57:27,250 + 에 더 흥미 수 있도록 내가 주위에 드라이브 수있는 일이 생각하지 않습니다 + +746 +00:57:27,250 --> 00:57:29,469 + 당신은 그래서 난 그냥 그것을 확인하는 것이 좋습니다 것 + +747 +00:57:29,469 --> 00:57:33,118 + 한편 우리는 오른손을 뛰어거야하고 ​​그 작동 올 것와 + +748 +00:57:33,119 --> 00:57:42,358 + 사진을보고 너무 1980년에서이 다섯에서 아일린입니다 같이 + +749 +00:57:42,358 --> 00:57:46,538 + 대략 우리는 어떻게 상용 네트워크 마크의 세부 사항에 갈거야 + +750 +00:57:46,539 --> 00:57:49,609 + 이 클래스에서 우리는 실제로 낮은 수준의 세부 사항을 수행하지 않을거야 + +751 +00:57:49,608 --> 00:57:52,768 + 내가 얼마나이 분야에 대한 수에 대한 당신의 직관을 제공하기 위해 노력하겠습니다 + +752 +00:57:52,768 --> 00:57:56,868 + 어떤 일 전체 상황과 그냥 그에서 오는 일반적으로 작동 그렇다면 + +753 +00:57:56,869 --> 00:57:59,559 + 당신은 당신이 돌아 가야 상업 네트워크의 역사에 대해 이야기하고 싶습니다 + +754 +00:57:59,559 --> 00:58:04,910 + 특히 이렇게 대략 아홉 육십 실험 승인 및 족제비에 + +755 +00:58:04,909 --> 00:58:10,449 + 그들은 차 시각 피질 고양이를 공부하고, 그들은은을 전송했다 + +756 +00:58:10,449 --> 00:58:14,710 + 초기 시각 영역과 고양이와 고양이의 뇌는에 패턴을 찾고 있었다 + +757 +00:58:14,710 --> 00:58:19,500 + 그들은 끝내었고, 화면은 실제로이 언젠가 노벨상을 수상 + +758 +00:58:19,500 --> 00:58:23,449 + 이후이 실험을 위해 우리는이 실험이 보는 무엇을 게재 할로 + +759 +00:58:23,449 --> 00:58:27,518 + 그냥 그렇게처럼 그들은 내가 여기 여든 비디오를 뽑아 그렇게 보면 정말 재미있어 + +760 +00:58:27,518 --> 00:58:32,258 + 여기에 무슨 고양이가 위치에 고정되고, 우리가 기록을하는지 참조 + +761 +00:58:32,259 --> 00:58:35,900 + 뒷면에 처리 영역의 어딘가의 피질에서 + +762 +00:58:35,900 --> 00:58:39,809 + 뇌가 하나가 될 수 있고, 지금 우리가 고양이에 다른 빛의 패턴을 표시하고 있고 + +763 +00:58:39,809 --> 00:58:43,519 + 우리는 이제 살펴 보자 기록과 다른 자극에 대한 신경 세포의 불을 공유하고 + +764 +00:58:43,518 --> 00:58:48,039 + 어떻게 경험과 같이 표시됩니다 + +765 +00:58:48,039 --> 00:59:14,050 + 이리 + +766 +00:59:14,050 --> 00:59:27,410 + 이러한 세포와​​ 같은 실험 그들은 모두 네 모서리를 설정하는 것 + +767 +00:59:27,409 --> 00:59:30,279 + 특정 방향 그리고 그들은 가장자리와 일에 대한 흥분 + +768 +00:59:30,280 --> 00:59:36,360 + 방향과 북쪽 방향은 다음과 같이 그래서 그들을 자극하지 않습니다 + +769 +00:59:36,360 --> 00:59:42,150 + 10 분 비디오와 같은 긴 과정을 통해 우리는이 작업을 수행하지 않을거야 + +770 +00:59:42,150 --> 00:59:45,450 + 오랜 시간 그들은 
+770
+00:59:42,150 --> 00:59:56,759
+Over a long series of experiments they mapped out how the cells fire, and
+this became a model of how the visual cortex processes information in the
+brain; it eventually led to a Nobel Prize. For example, they figured out
+that the visual cortex is laid out
+
+774
+00:59:56,760 --> 01:00:20,510
+retinotopically. What that means is that nearby cells in the cortex
+process nearby regions of your visual field, so the local structure of
+your visual field is preserved in the processing in your cortex.
+
+779
+01:00:20,510 --> 01:00:41,839
+They also figured out that there is a whole hierarchy of these cells: what
+they called simple cells respond to a particular orientation of an edge,
+and then there were other cells with more complex responses, for example
+cells that still prefer a particular orientation but are somewhat
+translation invariant, so they don't care about the exact position of the
+edge, only its orientation.
+
+784
+01:00:41,840 --> 01:01:04,320
+Through all these experiments they hypothesized that the visual cortex has
+this kind of hierarchical organization: simple cells feeding into complex
+cells and so on, each built on top of the other, with simple cells having
+relatively local receptive fields and the later ones building more complex
+types of representations.
+
+791
+01:01:04,320 --> 01:01:20,429
+So the brain processes visual information through successive layers of
+representation, and a lot of people tried to reproduce this process in
+computers. The first attempt to model the visual cortex in code was, I
+believe, Fukushima's neocognitron. He basically ended up setting up
+
+795
+01:01:20,429 --> 01:01:41,849
+a structure of cells with small local receptive fields, each looking at a
+small region of the input, arranged in layers: a sandwich of simple and
+complex cells, then simple and complex again, building on each other. But
+backpropagation wasn't really around yet back then,
+
+800
+01:01:41,849 --> 01:02:00,039
+so he trained these networks with unsupervised learning procedures,
+clustering schemes and so on; this was not backpropagation at the time.
+But it already had this idea of successive layers of small cells building
+on top of each other. Then these ideas got built on further:
+
+804
+01:02:00,039 --> 01:02:23,469
+Yann LeCun kept the architectural layout, but what he actually did was
+train the network with backpropagation, for example to classify digits or
+letters, and these systems actually ended up being used commercially to
+read zip code digits for the postal service and things like that.
+
+809
+01:02:23,469 --> 01:02:37,559
+That actually goes back quite a long time, to the early 1990s, but those
+networks were quite small. When people used them again in 2012, things
+started to get quite big:
+
+812
+01:02:37,559 --> 01:02:56,380
+in that paper, instead of tiny datasets, they used the dataset that
+actually came from our lab, with a thousand classes and about a million
+images, so a huge amount of data, and a model with roughly 60 million
+parameters: AlexNet, named after Alex Krizhevsky.
+
+816
+01:02:56,380 --> 01:03:13,090
+These networks tend to get names, so there is this AlexNet, Google has its
+own ones, and so on. So we gave them names, and this AlexNet was the one
+that outperformed the other algorithms by quite a bit.
+820
+01:03:13,090 --> 01:03:33,460
+What actually happened there is interesting to note historically: the
+difference between AlexNet in 2012 and the networks of the 1990s is
+basically very, very slight. When you look at these two networks, the older
+one used, I think, tanh nonlinearities and this one used ReLU, and this one
+was deeper, was trained on GPUs, and had much more data,
+
+825
+01:03:33,460 --> 01:03:49,480
+and that's basically it; that's roughly the difference. So really what
+happened is that we figured out better ways to initialize, and ReLUs work a
+bit better, but otherwise the big difference was that everything got scaled
+up, in both data and compute.
+
+829
+01:03:49,480 --> 01:04:09,789
+Mostly the actors were very similar. We have picked up a few more tricks
+since then: for example, they used large filters and we now tend to use
+many small filters, and we now have networks with tens of layers (the
+150-layer ones came later), so we have really scaled up quite a bit in
+some ways, but otherwise the basic notion of how the information is
+processed is similar.
+
+835
+01:04:09,789 --> 01:04:29,809
+OK, so these things are basically everywhere now, and they can do all
+kinds of tasks: classification, of course; they are very good at
+retrieval, so you show them an image and they can search for other images
+like it; they can do detection, so finding people and dogs and horses here
+and there in an image,
+
+839
+01:04:29,809 --> 01:04:47,529
+which can be used, for example, in cars. Then there is segmentation, where
+every single pixel gets labeled, for example as person or road or tree or
+sky. And here, for example, is a small NVIDIA Tegra, embedded for use in
+cars,
+
+843
+01:04:47,530 --> 01:04:57,219
+a GPU we can run convnets on. One reason this can be useful is that a car
+that can identify everything around it can perceive its surroundings.
+
+846
+01:04:57,219 --> 01:05:14,900
+Some of you may have noticed that Facebook now identifies faces
+automatically; I'm almost certain at this point that it's a convnet. They
+classify videos on YouTube, identifying what is inside YouTube videos;
+that was a very successful project at Google.
+
+850
+01:05:14,900 --> 01:05:37,710
+Basically, Google was really interested in taking Street View images and
+automatically reading out the house numbers. This turned out to be perfect
+for convnets: they had a lot of human labor label a huge amount of data,
+then put a huge convnet on it, and it ended up working almost as well as a
+human. That's something we're going to see throughout:
+
+856
+01:05:37,710 --> 01:05:57,690
+this stuff works really, really well. They can estimate poses, they can
+play computer games, they can detect all kinds of cancer and such, they
+can read Chinese characters in images and recognize street signs, and this
+one, I think, is segmentation of neural tissue. They can also do tasks
+that are not visual:
+
+860
+01:05:57,690 --> 01:06:11,400
+for example they can recognize speech, they have been used for speech
+processing, and also for text documents, so you can look at text with a
+convnet as well. They have been used to recognize types of galaxies,
+
+863
+01:06:11,400 --> 01:06:28,179
+and in a recent Kaggle competition, to recognize individual whales. This
+one worked particularly well: that whale has a pattern of white spots on
+its head that is specific to that particular individual, so it can be
+recognized by it, which I find amazing.
+867
+01:06:28,179 --> 01:06:43,850
+They work on satellite pictures too: there are now several companies with
+a lot of satellite data, so all of this gets analyzed with big convnets.
+In this case it's finding the roads, but you can also see
+
+871
+01:06:43,849 --> 01:06:56,150
+agricultural applications and so on. They can also caption images, and
+this included some of my own work: we take an image and, instead of a
+single category, produce a whole sentence. And they can be used for
+various artistic endeavors.
+
+874
+01:06:56,150 --> 01:07:11,349
+So this is something called Deep Dream, and we're going to go into how it
+works. You will actually implement it in the third assignment, so maybe
+check that out: you give it an image and you can use it to do this weird
+stuff.
+
+879
+01:07:11,349 --> 01:07:42,769
+In particular, it hallucinates a lot of dogs, and the reason for the dogs
+is that these networks are trained end to end on ImageNet, which has a lot
+of dogs in it, lots and lots of dogs, so the network kind of reuses some
+of those patterns. You take a different image, put it in, run it in a
+loop, and it hallucinates things into the image. We'll see how it works in
+a bit; I won't explain the slide now, but it looks cool, so you can
+
+886
+01:07:42,769 --> 01:08:09,030
+imagine. I also wanted to point out, and I'll probably include it
+somewhere, an interesting paper about deep networks rivaling the
+representation of primate IT cortex for recognition. What they did, I
+think with a macaque monkey, was record here from neurons in IT cortex
+while the monkey gets activated by looking at images, and they fed the
+same images
+
+892
+01:08:09,030 --> 01:08:35,400
+to a convolutional network. What they tried is to decode, from the
+population code, from a population of only a few sparse neurons, and
+likewise from the convnet code, some category classification. And what you
+see is that decoding from IT cortex and classifying from the convnet
+features (this is with 2013-era networks) are almost equally good: in
+terms of the information about the image, you get nearly identical
+performance.
+
+898
+01:08:35,399 --> 01:09:00,450
+And here is maybe a more striking result, where they compare the
+representations. They fed a lot of images through and looked at how these
+images are represented in the brain and in the convnet: these are two
+spaces of image representations, how the images get laid out in the space
+by IT cortex and by the convnet, and you can compare the similarity
+matrices statistically,
+
+905
+01:09:00,449 --> 01:09:20,780
+between IT cortex and the convnet, and there is a mapping between them:
+basically a very, very similar representation. It almost looks like they
+are computing the same kinds of things: how similar items get laid out,
+what ends up close to what in the space, is very, very similar to what you
+see in the brain. So some people think this is at least some evidence that
+these networks are doing something brain-like,
+
+910
+01:09:20,779 --> 01:09:28,609
+which is very interesting. So the only question that remains in that case
+is why this stuff works, and we'll find that out next class.
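The cortex-versus-convnet comparison described above is usually phrased in terms of similarity matrices over a shared image set. A rough numpy sketch of that idea, assuming `feats_a` and `feats_b` are (num_images x num_units) response matrices for the same images; the correlation measure is an assumption for illustration, not the exact statistic from the paper:

~~~python
import numpy as np

def similarity_matrix(feats):
    """Pairwise correlation between image representations (one row per image)."""
    z = feats - feats.mean(axis=1, keepdims=True)
    z /= np.linalg.norm(z, axis=1, keepdims=True) + 1e-8
    return z @ z.T  # (num_images, num_images)

def representation_agreement(feats_a, feats_b):
    """Correlate the two similarity matrices over their off-diagonal entries."""
    sa, sb = similarity_matrix(feats_a), similarity_matrix(feats_b)
    iu = np.triu_indices_from(sa, k=1)  # upper triangle, excluding the diagonal
    return np.corrcoef(sa[iu], sb[iu])[0, 1]
~~~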
diff --git a/captions/Ko/Lecture8_ko.srt b/captions/Ko/Lecture8_ko.srt
new file mode 100644
index 00000000..abdab665
--- /dev/null
+++ b/captions/Ko/Lecture8_ko.srt
@@ -0,0 +1,3528 @@
+1
+00:00:00,000 --> 00:00:07,519
+All right, it's about that time, so let's get started. I'll let you know
+that today is a bit of a breather of a lecture.
+
+2
+00:00:07,519 --> 00:00:16,250
+Last time we talked about all the different parts of convnets; today we're
+kind of putting everything together and looking at some applications.
+
+5
+00:00:16,250 --> 00:00:25,550
+We're actually diving inside images and talking about spatial localization
+and detection. We moved this lecture a little earlier
+
+7
+00:00:25,550 --> 00:00:38,378
+in the schedule because we saw that many of you are interested in this
+type of project, and we wanted to give you an idea of what's possible
+sooner. So first, a few administrative things: project proposals were
+
+10
+00:00:38,378 --> 00:00:52,530
+due on Saturday, so my inbox kind of exploded over the weekend. I think
+most of you submitted, but if you haven't, you probably should. We're
+going to go through the proposals and make sure each one is reasonable,
+and we'll hopefully get back to you about the
+
+14
+00:00:52,530 --> 00:01:04,519
+projects this week. So, who's done with the homework that's due Friday?
+Who's stuck on batch norm?
+
+16
+00:01:04,519 --> 00:01:17,798
+OK, a good number of hands, nice. Also keep in mind that on this homework
+we're asking you to train a pretty big convnet, and the training takes
+quite a while, so if you start training
+
+19
+00:01:17,799 --> 00:01:30,540
+Thursday night that could be tight; maybe start the last part early.
+Homework one is in the process of being graded; hopefully we'll have that
+back to you this week, so you get feedback before the next homework. Also
+keep in mind
+
+22
+00:01:30,540 --> 00:01:41,159
+that a week from now, next week on Wednesday, we have the midterm in
+class, so be ready for that Wednesday; it should be a lot of fun.
+
+24
+00:01:41,159 --> 00:01:53,699
+So, last lecture we talked about convnets. We took a long time to
+understand how the convolution operator works: how we transform one
+feature map into
+
+27
+00:01:53,700 --> 00:02:05,759
+another by sliding a window over it and computing inner products, and how
+we transform the representation through many layers of processing. And if
+you remember, these lower
+
+30
+00:02:05,759 --> 00:02:14,790
+convolutional layers tend to learn things like edges and colors, and the
+higher layers tend to learn more complex object parts. We talked about
+pooling, which shrinks
+
+32
+00:02:14,789 --> 00:02:24,209
+our feature representation inside the network and is a common ingredient
+of these networks. We also did case studies of specific
+
+34
+00:02:24,209 --> 00:02:38,949
+convnet architectures and saw how these things tend to look in practice:
+LeNet, the 1998 network used for digit recognition; we talked about
+AlexNet, which won ImageNet in 2012 and kind of kicked off the big deep
+learning boom;
+
+38
+00:02:38,949 --> 00:02:51,108
+we talked about ZFNet, which won ImageNet classification in 2013 and is
+pretty similar to AlexNet; and then we saw that deeper is often better for
+
+40
+00:02:51,109 --> 00:03:05,430
+classification: we saw GoogLeNet and VGG, which did really well in the
+2014 competitions and were much deeper than AlexNet; and we also saw
+Microsoft's crazy cool new thing called ResNet,
+
+43
+00:03:05,430 --> 00:03:19,109
+an architecture with about 150 layers, from December 2015. So over the
+last few years the different architectures have been getting deeper and
+deeper, but that's just classification.
+
+46
+00:03:19,109 --> 00:03:28,500
+Now, in this lecture we're going to talk about localization and detection,
+which is
+another really big and important problem in computer vision, and the idea
+here is that we'll actually get to revisit all kinds of deep networks for
+these new tasks as well.
+
+50
+00:03:37,799 --> 00:03:53,800
+So far in class we've really talked about classification: given an image,
+we want to classify it into a few object categories. That's a nice, basic
+problem, and we've used it to understand computer vision and convnets, but
+there are actually a lot of other tasks that people work on,
+
+54
+00:03:53,800 --> 00:04:07,349
+and one of these is classification plus localization: now, in addition to
+a class label for the image, we also want to draw a bounding box around
+where that class occurs in the image.
+
+57
+00:04:07,349 --> 00:04:20,238
+Another problem is the detection task: here again there are some object
+categories, and we want to find all instances of those categories in the
+image and draw boxes around them. And another recent,
+
+60
+00:04:20,238 --> 00:04:37,279
+kind of crazy task that people have started working on is instance
+segmentation: again you have some categories, and you want to find all
+instances of those categories in the image, but instead of using boxes you
+actually want to draw a little outline around each one and identify every
+pixel
+
+65
+00:04:37,279 --> 00:04:47,959
+belonging to each instance. That's kind of crazy, so we're not going to
+talk about it today, but I think you should be aware of it. Today we'll
+really focus on these localization and detection tasks.
+
+68
+00:04:47,959 --> 00:05:05,360
+A big difference between them is the number of objects to be found: in
+localization there is a single object, or generally a fixed number of
+objects, whereas in detection we can have multiple objects, or a variable
+number of objects. This looks like a small difference, but it will turn
+out to matter
+
+73
+00:05:05,360 --> 00:05:16,389
+a lot for the architectures. So we'll first talk about classification and
+localization, since it's kind of simpler. Just to recap, as I just said:
+
+76
+00:05:16,389 --> 00:05:30,669
+classification is image to category label; localization is image to box;
+and classification plus localization means we do both at the same time.
+As for the kind of setups people use, we've talked
+
+79
+00:05:30,668 --> 00:05:42,269
+about the ImageNet classification challenge; there is also a
+classification + localization challenge that runs on ImageNet. Similar to
+the classification task, there are a thousand classes, and each
+
+82
+00:05:42,269 --> 00:05:55,709
+training instance comes with one class, with possibly several bounding
+boxes for instances of that class inside the image. At test time your
+algorithm
+
+85
+00:05:55,709 --> 00:06:11,310
+produces five guesses, where instead of just a class label each guess is
+a class label together with a bounding box, and to get it right you need
+to get both the class label and the bounding box right. Getting the box
+right means being close under something called intersection over union;
+don't worry about that too much right now.
+
+89
+00:06:11,310 --> 00:06:18,129
+You get the image right if at least one of your five guesses is correct.
+This is the main dataset that people kind of work on for classification +
+localization,
+
+91
+00:06:18,129 --> 00:06:31,219
+so it's really useful. Now, one really fundamental paradigm to know about
+for localization is this idea of regression. Think back to your machine
+learning class, where you saw things like
+
+93
+00:06:31,220 --> 00:06:42,980
+classification and regression: you can do regression, or something
+fancier, and when we talk about localization it really means we can frame
+it as a regression problem.
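The intersection-over-union criterion mentioned above can be written directly. A minimal sketch, assuming boxes are given as (x, y, w, h) with (x, y) at the top-left corner, which matches the parameterization discussed next:

~~~python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    xa1, ya1, wa, ha = box_a
    xb1, yb1, wb, hb = box_b
    xa2, ya2 = xa1 + wa, ya1 + ha
    xb2, yb2 = xb1 + wb, yb1 + hb
    iw = max(0.0, min(xa2, xb2) - max(xa1, xb1))  # overlap width
    ih = max(0.0, min(ya2, yb2) - max(ya1, yb1))  # overlap height
    inter = iw * ih
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

# A guess counts as correct if the class matches and iou(pred, gt) > 0.5.
~~~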
+97
+00:06:42,980 --> 00:06:53,829
+Here's how: our image comes in, goes through some kind of processing, and
+eventually produces real-valued numbers parameterizing a bounding box.
+There are different parameterizations of the box; what people typically
+use is
+
+100
+00:06:53,829 --> 00:07:04,680
+the x and y coordinates of the upper-left corner, together with the width
+and height of the box. You'll see other variants as well, but it's always
+four numbers for a bounding box. And again there is some ground-truth
+bounding box,
+
+103
+00:07:04,680 --> 00:07:16,339
+which is also just four numbers, and now we can compute a loss, something
+like a Euclidean loss, a pretty standard choice, between the numbers we
+produced and the correct numbers. Now we just set this thing up like
+
+106
+00:07:16,339 --> 00:07:29,359
+our classification networks: we sample a minibatch of data with some
+ground-truth boxes, forward propagate, compute the loss between our
+predictions and the correct predictions, backpropagate, and update the
+network.
+
+109
+00:07:29,360 --> 00:07:37,269
+This paradigm actually makes the localization task really easy to
+implement, so here is a really simple recipe for how
+
+111
+00:07:37,269 --> 00:07:48,139
+you can implement classification + localization. First, just download
+some existing pretrained model, or train one yourself if you're
+ambitious: things like AlexNet, VGG, GoogLeNet, everything we talked
+about last lecture.
+
+115
+00:07:48,139 --> 00:08:04,840
+Now take the fully connected layers that produce the class scores, set
+them aside for a moment, and attach a few new fully connected layers at
+some point in the network. This is called a regression head, but I
+basically mean the same kind of thing, a few fully connected layers,
+except they output
+
+119
+00:08:04,839 --> 00:08:18,360
+real-valued numbers. Now we train this thing just like we trained the
+classification network; the only difference is that now, instead of class
+scores,
+
+122
+00:08:18,360 --> 00:08:28,918
+we use an L2 loss and the ground-truth boxes, and we train this network in
+exactly the same way. Now at test time it's time to use both heads: we do
+both
+
+124
+00:08:28,918 --> 00:08:44,259
+classification and localization on an image: we run it through, and from
+the classification head and the localization head we get class scores and
+boxes, and that's it. That's really all
+
+127
+00:08:44,259 --> 00:08:50,208
+you guys need to do, so this is a really nice, simple recipe you can use
+for classification + localization in different projects.
+
+129
+00:08:50,208 --> 00:08:59,990
+One slight detail of this approach is that there are kind of two ways
+people do this regression head: you can imagine a class-agnostic
+regressor, or
+
+131
+00:08:59,990 --> 00:09:11,600
+a class-specific regressor. Class-agnostic means that no matter the
+class, I use exactly the same architecture with the same weights in these
+layers to produce my bounding box:
+
+134
+00:09:11,600 --> 00:09:19,139
+it's always just four numbers for the box, no matter the class. The
+alternative you sometimes see is class-specific regression: now
+
+136
+00:09:19,139 --> 00:09:31,269
+you output C times 4 numbers, like one bounding box per class. People
+sometimes find that this works better in some cases and not in others,
+but I think intuitively it kind of makes sense:
+
+139
+00:09:31,269 --> 00:09:42,289
+the way you think about localizing a cat might be a little different from
+the way you localize other things, so maybe you want different parts of
+the network to be responsible for those.
+
+142
+00:09:42,289 --> 00:09:49,329
+It's a pretty easy change; it only changes the way your loss works a
+little bit: you compute the loss using only the box for the ground-truth
+class.
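A minimal numpy sketch of the L2 regression loss just described, covering both the class-agnostic and class-specific variants; the array shapes are assumptions for illustration:

~~~python
import numpy as np

def bbox_l2_loss(pred_boxes, gt_box, gt_class=None):
    """L2 loss for box regression.

    pred_boxes: (4,) for a class-agnostic head, or (C, 4) for a
    class-specific head; in the class-specific case the loss is computed
    only on the row for the ground-truth class, as described above.
    """
    pred = pred_boxes if gt_class is None else pred_boxes[gt_class]
    diff = pred - gt_box
    return 0.5 * np.sum(diff ** 2)
~~~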
+144
+00:09:49,328 --> 00:09:52,809
+But even then it's still basically the same idea.
+
+145
+00:09:52,809 --> 00:10:01,360
+Exactly where the regression head gets attached is another design choice
+here, and again, if you look at different papers, it doesn't matter too
+much.
+
+147
+00:10:01,360 --> 00:10:18,909
+Some common choices: one is to attach it right after the last
+convolutional layer; you see things like Overfeat and VGG do this for the
+localization task. Another common choice is to attach it right after
+
+151
+00:10:18,909 --> 00:10:31,099
+the last fully connected layer, which you see in things like R-CNN for
+this kind of task. Either of the two works fine.
+
+154
+00:10:31,100 --> 00:10:54,620
+With that aside, we can actually generalize this framework to localizing
+more than one object. In this classification + localization task we
+generally set up our network to produce exactly one bounding box per
+input image, but in the case where you know ahead of time
+
+158
+00:10:54,620 --> 00:11:07,039
+that you always want to localize some fixed number of objects in every
+image, it's really easy to generalize: your regression head just outputs
+one box for each object you care about, and you train the network in the
+same
+
+162
+00:11:07,039 --> 00:11:16,790
+way as before. And this idea of localizing multiple objects at the same
+time is actually pretty powerful in general. For example, there is this
+kind of approach
+
+164
+00:11:16,789 --> 00:11:34,370
+to human pose estimation that uses it: we want to input a crop of a
+person and figure out the pose of that person. People generally have a
+fixed number of joints, the chest and neck and elbows and that kind of
+stuff, so we know
+
+169
+00:11:34,370 --> 00:11:47,490
+we need to find all the joints: we can take our image, run it through a
+convolutional network, and regress the x, y coordinates for each joint
+position, and that gives us a network that can actually predict a whole
+human pose.
+
+171
+00:11:47,490 --> 00:11:59,740
+There is a paper that does this kind of thing with this localization
+framework, from Google a couple of years ago, with some other bells and
+whistles, but the basic idea was just regressing to these joint positions
+with a CNN.
+
+175
+00:12:05,100 --> 00:12:25,019
+So overall, this idea of treating localization as regression to some
+number of outputs is really, really simple. So, something to know for
+your projects: if you were thinking that you actually want to run
+detection, because you want to understand or find parts of an image, and
+your project idea is along those lines,
+
+180
+00:12:25,019 --> 00:12:38,129
+I'd really encourage you to think about this localization framework
+instead. If there is in fact a fixed number of objects that you want to
+localize in every image, you should try framing it as a localization
+problem; ideas like that tend to be a lot easier to set up.
+
+182
+00:12:38,129 --> 00:12:52,330
+Localization via regression is really simple and actually works, so I'd
+really recommend trying it for projects. But if you want to win
+competitions like ImageNet, you need to add a bit of other fancy stuff.
+So another thing people do for localization is this
+
+187
+00:12:52,330 --> 00:13:04,929
+idea of a sliding window. We'll step through it in more detail, but the
+idea is that you still have your two heads, classification and
+localization, but you're going to run the network on the image not just
+once.
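The pose setup described above only widens the regression target. A tiny sketch, assuming the head outputs 2K numbers and `joints` is a (K, 2) array of ground-truth coordinates; both shapes are illustrative assumptions:

~~~python
import numpy as np

def pose_l2_loss(pred, joints):
    """L2 loss for regressing K joint positions at once.

    pred: (2K,) raw network output; joints: (K, 2) ground-truth (x, y)
    coordinates, one row per joint.
    """
    return 0.5 * np.sum((pred.reshape(-1, 2) - joints) ** 2)
~~~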
+190
+00:13:04,929 --> 00:13:17,290
+Instead you run it at multiple positions in the image and then aggregate
+across those different positions, and it turns out you can actually do
+this in an efficient way. To see more concretely how this sliding-window
+localization works, we'll
+
+193
+00:13:17,290 --> 00:13:29,730
+look at the Overfeat architecture, because Overfeat was actually the
+winner of the ImageNet localization challenge in 2013. Its setup
+basically looks like what we saw
+
+196
+00:13:29,730 --> 00:13:44,450
+a few slides ago: there's an AlexNet-like network, and from it we have a
+classification head that spits out our class scores and a regression head
+that spits out boxes.
+
+200
+00:13:44,450 --> 00:14:02,799
+This type of architecture expects a 221 by 221 input, but in practice we
+can run it on bigger images, and that can sometimes help. So suppose we
+have a bigger image, say 257 by 257. Now imagine taking our
+
+203
+00:14:02,799 --> 00:14:11,799
+classification + localization network and just running it on the top
+corner of this image. That will give us some class scores and some
+bounding box.
+
+205
+00:14:11,799 --> 00:14:26,230
+Then we're going to repeat this same classification + localization
+network on the other corners, running it at all four corners of this
+image. Doing this, we end up with one bounding box from each of those
+
+208
+00:14:26,230 --> 00:14:35,700
+four locations, together with a class score for each location. But we
+actually want just a single bounding box, so they use some
+
+210
+00:14:35,700 --> 00:14:50,958
+heuristic that merges these scored bounding boxes. It's a little ugly and
+I don't want to go into the details here, they're in the paper, but the
+idea is that they combine and aggregate these boxes across the different
+positions into a consensus. That tends to work
+
+214
+00:14:50,958 --> 00:15:08,989
+really well, and they won the challenge that year doing this kind of
+averaging. In practice, though, they actually use many more than four
+positions.
+
+217
+00:15:08,989 --> 00:15:20,149
+[Question] Well, that's actually a good point: once you're doing
+regression and predicting numbers, there's no constraint that the box has
+to be contained inside the image crop.
+
+220
+00:15:20,149 --> 00:15:39,428
+And a good point about training: when you train the network in this
+sliding-window fashion, you actually shift the ground-truth boxes to
+adjust the frame for each of those crops. I'll talk about pieces of that
+later; it's just ugly bookkeeping details, so don't worry too much. In
+practice they actually use many more image positions,
+
+225
+00:15:39,428 --> 00:15:55,678
+as well as multiple scales; this figure is actually from the paper. On
+the left you see all the different positions where they evaluate the
+network; in the middle, the box it outputs at each of those positions,
+one by one; and at the bottom, the score maps for each of those
+positions.
+
+229
+00:15:55,678 --> 00:16:12,869
+I mean, they're very noisy, but they kind of get converted: generally
+they run this fancy aggregation method over them, and they can get a
+final box for the bear and decide, yep, it's a bear. One problem you
+might anticipate with this, though, is
+
+233
+00:16:15,759 --> 00:16:26,048
+that it can actually get pretty expensive to run the whole network on
+every single one of those crops. But there's actually something more
+efficient we can do.
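A rough sketch of the sliding-window evaluation just described, assuming a `run_heads(crop) -> (scores, box)` callable supplied by the caller; the score-weighted averaging at the end is a simple stand-in for the fancier merging heuristic in the Overfeat paper:

~~~python
import numpy as np

def sliding_window_localize(image, run_heads, window=221, step=36):
    """Run a classification+regression network at several offsets and
    aggregate the scored boxes into a single consensus box."""
    H, W = image.shape[:2]
    scores, boxes = [], []
    for y in range(0, H - window + 1, step):
        for x in range(0, W - window + 1, step):
            s, b = run_heads(image[y:y + window, x:x + window])
            scores.append(s)
            boxes.append(b + np.array([x, y, 0, 0]))  # shift (x, y, w, h) box
    scores, boxes = np.array(scores), np.array(boxes)
    w = scores.max(axis=1)                       # best class score per window
    return (boxes * w[:, None]).sum(0) / w.sum()  # score-weighted box average
~~~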
+237
+00:16:26,048 --> 00:16:45,019
+A fully connected layer is just 4096 numbers, just a vector, but instead
+of thinking of it as a vector we can think of it as another convolutional
+feature map, which is kind of crazy: we just add two spatial dimensions of
+size one by one. So the idea is that we can convert our fully
+
+241
+00:16:45,019 --> 00:17:06,970
+connected layers into convolutions. Imagine our fully connected network:
+we had this convolutional feature map, and we had a fully connected layer
+where each element of our 4096-dimensional vector is produced from the
+whole feature map. But instead we can think of it as reshaping, as having
+an equivalent five-by-five
+
+247
+00:17:06,970 --> 00:17:26,409
+convolution. It's a little weird, but if you think about it, it makes
+sense in the end. So we take this fully connected layer and turn it into
+a five-by-five convolution, and the other fully connected layers, the
+4096-to-4096 ones, actually become one-by-one
+
+251
+00:17:26,409 --> 00:17:38,769
+convolutions, right? That's a bit strange, but if you think hard about
+it, go work out the math on paper in a quiet room, you'll figure it out.
+So we basically take each of the fully connected layers in our network
+and
+
+254
+00:17:38,769 --> 00:18:01,840
+convert it to a convolutional layer, and now this is really cool, because
+now our network is composed entirely of convolution, pooling, and
+elementwise operations, so now we can actually run the network on images
+of different sizes, and this gets us, at very low cost, the equivalent of
+running it at different positions, maybe not exactly independently, but
+operationally. Here's how it works:
+
+260
+00:18:02,609 --> 00:18:17,140
+imagine at training time you work with a 14 by 14 input: here we have
+some convolutions, and then what used to be fully connected layers that
+we have now reimagined as convolutions, so this
+
+262
+00:18:17,140 --> 00:18:30,900
+is a five-by-five conv block and then these specially sized one-by-one
+elements (I'm not showing the depth dimension here, but these are one by
+one by 4096, right). So we've converted these layers
+
+266
+00:18:30,900 --> 00:18:43,558
+to convolutions, and now we know that, being convolutions, they can
+actually be run on a larger-sized input. You can see that now we've added
+a few extra pixels, and now we actually run all of this as
+
+269
+00:18:43,558 --> 00:19:00,360
+convolutions and get a two-by-two output. But what's really cool here is
+that we get to share the computation, so this is really efficient: our
+output is now four times bigger, but we've computed much less than four
+times the cost. Think about the difference in the computation we're
+doing here:
+
+273
+00:19:00,359 --> 00:19:19,388
+the only extra computation happens in this yellow part, so we're actually
+being very efficient about evaluating the network at many different
+positions without spending much extra compute. That's how they can
+evaluate the network in this very, very dense multiscale way. Any
+questions about this?
+
+278
+00:19:19,388 --> 00:19:45,249
+OK, so let's actually look at classification + localization results. In
+2012, AlexNet, from Alex Krizhevsky and Geoff Hinton, won localization as
+well as classification, but I couldn't find published details of exactly
+how they did it. In 2013, Overfeat, which we just saw, improved a bit on
+AlexNet's result.
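The fully-connected-to-convolution conversion described above amounts to reshaping the weight matrix into a filter bank. A naive numpy sketch; the loop convolution is for illustration only, and the shapes are assumptions:

~~~python
import numpy as np

def fc_as_conv(fc_weights, field):
    """Reinterpret an FC layer as a convolution.

    fc_weights: (out_dim, C*H*W) matrix that consumed a flattened
    (C, H, W) volume; returns filters of shape (out_dim, C, H, W) that
    compute the same outputs when slid over a bigger feature map.
    """
    C, H, W = field
    return fc_weights.reshape(fc_weights.shape[0], C, H, W)

def conv_valid(x, filters):
    """Naive valid convolution (cross-correlation), for illustration."""
    C, H, W = x.shape
    F, _, FH, FW = filters.shape
    out = np.zeros((F, H - FH + 1, W - FW + 1))
    for f in range(F):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[f, i, j] = np.sum(x[:, i:i + FH, j:j + FW] * filters[f])
    return out

# On a (C, H, W) input this yields a 1x1 output grid, identical to the FC
# layer; on a larger input it cheaply yields a whole grid of FC outputs.
~~~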
+283
+00:19:45,249 --> 00:19:59,139
+The next year, 2014, we have VGG, which we talked about, that really deep
+19-layer network. They got second place in classification, but they
+actually won first place
+
+286
+00:19:59,138 --> 00:20:17,868
+in localization, and VGG used basically exactly the same strategy as
+Overfeat, just a deeper network. Interestingly, VGG actually used fewer
+scales, fewer positions, and fewer of the fancy techniques, yet the error
+still dropped quite a bit, so here basically the only
+
+289
+00:20:17,868 --> 00:20:28,418
+difference between Overfeat and VGG is that VGG is a deeper network, and
+we see that these really powerful image features improved localization
+performance quite a bit without changing the localization architecture at
+all: we only swapped out the CNN and the results improved.
+
+293
+00:20:28,419 --> 00:20:48,738
+That will become a theme. Then in 2015 Microsoft swept everything; this
+lecture of course features ResNet from Microsoft everywhere, and they
+crushed localization performance, from 25 all the way down to 9 or so.
+But I should say this is a bit of a different story, and it's really a
+story of more than
+
+297
+00:20:48,739 --> 00:21:10,139
+deep features alone: yes, they have deep features, but Microsoft actually
+also used a different localization method called RPN, region proposal
+networks, so it's really not clear which part of this is a better
+localization strategy and which is better features. But at any rate, they
+did really well.
+
+301
+00:21:10,138 --> 00:21:19,509
+That's pretty much all I want to say about classification and
+localization; do consider it for projects. Are there questions about this
+task before we move on?
+
+304
+00:21:19,509 --> 00:21:56,399
+[Question about the loss] Right, the L2 loss in particular performs
+really badly when there are outliers. So sometimes people don't use the
+L2 loss: you can instead use an L1 loss, which can help a bit with
+outliers, and sometimes people will do a smooth L1 loss, which looks like
+an L1 as you go out but near zero is quadratic. So actually swapping out
+that regression loss function can sometimes help with outliers, when you
+have a bit of noise in the ground truth; and sometimes, hopefully, you
+just don't have to think about it.
+
+311
+00:21:56,400 --> 00:22:27,230
+[Question: do you backpropagate into the whole network?] Actually, I
+don't remember exactly which one Overfeat did, but VGG actually
+backpropped through the whole network. If you just train the new
+regression head it actually tends to work well already, and it's faster,
+
+315
+00:22:27,230 --> 00:22:50,440
+but you tend to get a little better results if you backprop into the
+whole network. VGG did this experiment, and dropping back through the
+whole thing probably added the one or two points, at the cost of more
+compute and training time. So I would say, as a first pass, just try
+training the new head without backpropping through the whole network.
+
+321
+00:22:50,440 --> 00:23:07,370
+[Question: at test time, are the classes ones you saw in training?]
+Generally yes: you're tested on the same classes you saw at training
+time; obviously you'll see different instances, but I mean, if you never
+saw a bear at training time, we'd have a tough time expecting you to find
+bears. Generalizing across classes would be pretty hard.
+
+325
+00:23:07,369 --> 00:23:38,089
+That's a good question too: yeah, so sometimes people train both heads at
+the same time, and sometimes people will train them separately, with one
+part of the network kind of responsible for the regression and one
+responsible only for classification. People do both. Glad for the
+questions.
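The smooth L1 loss mentioned in the answer above, as a short numpy sketch:

~~~python
import numpy as np

def smooth_l1(diff):
    """Smooth L1 (Huber-like) loss: quadratic near zero, linear farther
    out, so a few outlier boxes don't dominate the gradient."""
    ad = np.abs(diff)
    return np.where(ad < 1, 0.5 * diff ** 2, ad - 0.5).sum()
~~~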
+329
+00:23:38,089 --> 00:23:50,740
+OK, the next thing we're going to talk about is the other task, object
+detection.
+
+331
+00:23:50,740 --> 00:24:13,950
+[Question] Yeah, well, it kind of depends on the training strategy, and
+this also goes back to the idea of class-agnostic versus class-specific
+regression: it doesn't matter for class-agnostic regression, while with
+class-specific regression you're kind of training a separate regressor
+for each class. OK, let's talk about object
+
+336
+00:24:13,950 --> 00:24:31,670
+detection. So detection is much cooler and fancier, but harder. The idea
+is again that we have an input image and some set of classes, and we want
+to find all instances of those classes in the input image. I mean, we
+know regression worked pretty well for localization, so why
+
+340
+00:24:31,670 --> 00:24:57,519
+don't we just try to treat detection as regression? Say this image has
+two dogs and a cat: we have four things, so it seems this image needs
+sixteen numbers out of the regression. But then we look at a different
+image, and this one has just two things coming out, so it needs eight
+numbers; and this one, with a whole bunch of cats, needs a whole bunch of
+numbers. So
+
+345
+00:24:57,519 --> 00:25:12,950
+it's kind of hard to treat detection as straight-up regression, because
+we have this problem of variable-sized outputs. (There is actually
+something fancier that does kind of do this anyway and treats it as
+regression, and we'll talk about it later, but in general you don't want
+to treat this as
+
+350
+00:25:12,950 --> 00:25:29,929
+regression, because you have a variable-sized output.) So there's a
+really easy way around this problem: think of detection not as regression
+but as classification. In machine learning you learn regression and
+classification, right? Those are your hammers, and you can use them to
+just eat any problem.
+
+354
+00:25:29,929 --> 00:25:46,129
+So instead of regression we'll do classification, which we know how to
+do: we'll classify image regions with a CNN. Say this image is the input;
+we take many regions of the image and classify each one: in this region,
+definitely no
+
+357
+00:25:46,129 --> 00:26:02,490
+dog or cat; slide over a little, and we know we found a dog, but we're
+not done; a little more, and it's a cat; a little more, and it's nothing
+again. So we can just try a whole bunch of different image regions, run a
+classifier on each, and this basically solves our variable-sized output
+problem.
+
+362
+00:26:02,490 --> 00:26:25,290
+[Question: how do you decide the window size?] So, the problem of how to
+decide which window sizes: the answer is, we just try them all, all
+right? Just try them all! And that's actually a big problem, right,
+because we'd have to try windows of different sizes, at multiple
+different positions, at different scales, and testing this properly is
+going to get really expensive.
+
+367
+00:26:25,289 --> 00:26:39,089
+There are a whole lot of places we have to look; we'll see how to deal
+with that, though.
+
+368
+00:26:39,089 --> 00:26:56,950
+[Question: what if nothing is in the window?] You can add an extra class:
+a separate background class that says, oh, there's nothing here.
+[Question: what about multiple things in one region?] Right, that's
+multi-label classification, where multiple positives are allowed, and
+
+371
+00:26:56,950 --> 00:27:05,100
+it's actually very easy to do: instead of a softmax loss you just have
+independent logistic regression losses, an independent classifier per
+class. So yes, you can have multiple classes at one point; it's just a
+small change to the loss function.
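The independent per-class logistic losses mentioned above, sketched in numpy; the `scores` and `labels` shapes are assumptions for illustration:

~~~python
import numpy as np

def multilabel_logistic_loss(scores, labels):
    """Independent per-class logistic losses instead of a softmax, so one
    region can be positive for several classes at once.

    scores: (C,) raw outputs; labels: (C,) array with entries in {0, 1}.
    """
    p = 1.0 / (1.0 + np.exp(-scores))  # per-class sigmoid probabilities
    eps = 1e-12                        # numerical safety for the logs
    return -np.sum(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
~~~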
+375
+00:27:10,190 --> 00:27:26,299
+So that's very easy. But the problem we saw with this approach is that
+there are a whole bunch of different positions we need to evaluate, and
+the solution from a few years ago was to use a really fast classifier and
+just try them all. So detection has really been a classic problem
+
+379
+00:27:29,119 --> 00:27:38,490
+in computer vision, and here you probably need a little more historical
+perspective. Starting around 2005 there was a really successful approach
+to detection
+
+382
+00:27:38,490 --> 00:27:55,670
+that used a feature representation called histograms of oriented
+gradients (HOG). If you think back to homework 1, you actually use this
+feature in the last part, and you can actually do classification with it.
+This was kind of the biggest feature in computer vision around 2005.
+
+385
+00:27:55,670 --> 00:28:13,210
+The idea is that our classifier will just be linear, a linear classifier
+on top of these features, and since a linear classifier is really fast to
+compute, this works: we compute the oriented gradient features on the
+whole image at multiple scales, and we run this linear classifier at
+every position and every scale, really fast, everywhere:
+
+391
+00:28:13,210 --> 00:28:29,879
+evaluate the classifier at every position and every scale. This worked
+really well in 2005, and over the next few years this kind of idea got
+pushed a bit further into one of the more important detection paradigms
+of the pre-deep-learning days,
+
+395
+00:28:29,880 --> 00:28:51,370
+a thing called deformable parts models. I don't want to go into too many
+of the details, but the basic idea is that we're still committed to these
+histogram-of-oriented-gradients features, but now our model, rather than
+a single linear classifier, has this linear template for the object, and
+we also have these templates for parts, which can
+
+400
+00:28:51,369 --> 00:29:16,119
+deform a little bit over spatial positions, so you get a bit of fancy
+structure. And these things have, I think, a really cool dynamic
+programming algorithm to actually evaluate them really fast at test time.
+This thing is actually kind of fun; if you enjoy that sort of thing, that
+part is fun to think about. But the end result is that it's a much
+
+405
+00:29:16,119 --> 00:29:26,490
+more powerful classifier, with a little bit of deformability in the
+model, and you can still evaluate it really fast, so again we just
+evaluate it everywhere, at every scale and every aspect ratio.
+
+408
+00:29:26,490 --> 00:29:45,049
+This actually worked really well around 2010; it was kind of the state of
+the art in detection for many problems at the time. And there was a
+really cool paper, I won't spend too much time on this, from last year,
+arguing that these DPM models are actually just a particular type
+
+412
+00:29:45,049 --> 00:30:00,349
+of convnet, right: these histograms are kind of like edge convolutions,
+and the deformations are kind of like pooling, that kind of stuff. So if
+you're interested, check out this paper; it's kind of fun to think
+about.
+
+416
+00:30:00,349 --> 00:30:11,809
+But what we really want is to make this work with classifiers that are
+not fast and lightweight, with CNNs. So here the problem is still hard:
+we have a whole lot of different positions we want to try,
+
+419
+00:30:11,809 --> 00:30:28,720
+and we probably can't actually afford to try them all. So the solution is
+that we don't try them all: we make some educated guesses about where to
+look, and we pay the cost of the classifier at only that small number of
+positions.
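Dense evaluation of a linear template over a feature map, as in the HOG-era detectors described above; a naive numpy sketch, not an actual HOG implementation:

~~~python
import numpy as np

def score_map(features, w, b):
    """Slide a linear classifier (template) over a feature map.

    features: (H, W, D), e.g. a grid of HOG cells; w: (h, ww, D) template;
    b: scalar bias. This dense evaluation is what made HOG/DPM-era
    detectors cheap to run at every position and scale.
    """
    H, W, _ = features.shape
    h, ww, _ = w.shape
    out = np.zeros((H - h + 1, W - ww + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(features[i:i + h, j:j + ww] * w) + b
    return out
~~~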
+423
+00:30:28,720 --> 00:30:45,280
+This idea is called region proposals. A region proposal method is a thing
+that takes in an image and then outputs a whole bunch of regions where
+objects might possibly be found.
+
+426
+00:30:45,279 --> 00:31:00,769
+One way to think about a region proposal method is as a really fast,
+class-agnostic object detector: it doesn't care about the class, and it's
+not very accurate, but it runs pretty fast and gives us a whole bunch of
+boxes. And the general intuition behind these region proposal
+
+430
+00:31:00,769 --> 00:31:27,820
+methods is that they look for blob-like structure in the image. Objects
+generally, I mean, a dog, if you ask me, kind of looks like a blob; a cat
+is a blobby thing; a flower looks like a white blob; eyes and noses are
+kind of blobby. So region proposal methods put boxes around the kinds of
+blobby regions you see a lot of in images. Probably the most famous
+region proposal method
+
+436
+00:31:27,819 --> 00:31:50,740
+is called selective search. You don't really need to know exactly how it
+works in much detail; the idea is just that you start with your pixels,
+and you kind of merge adjacent pixels together if they have similar color
+and texture, so they form connected blob-like regions; then you merge
+those blobby regions into bigger and bigger regions,
+
+441
+00:31:50,740 --> 00:32:11,500
+and then, for each of these different scales, you can convert each of the
+blobby regions into a box, just by drawing a box around it. By doing this
+over multiple scales you end up with a whole bunch of boxes around the
+blobby stuff in the image, and this is reasonably quick to compute and
+actually cuts the search space down quite a lot.
+
+445
+00:32:11,500 --> 00:32:29,950
+But selective search is not the only game in town, just maybe the most
+famous: there are a whole bunch of other region proposal methods that
+people have developed. This paper from last year actually did a really
+cool, thorough, scientific evaluation of all the different region
+proposal methods, giving the pros and cons of each one and all kinds of
+stuff,
+
+450
+00:32:29,950 --> 00:32:49,000
+but my takeaway from the paper was: if you have to pick one, use
+EdgeBoxes. It's really fast, it runs in about a third of a second per
+image, compared to about 10 seconds per image for selective search, and
+it gets a lot of boxes, and more boxes are better, so it does well.
+
+455
+00:32:49,000 --> 00:33:02,830
+OK, so now we have this idea of region proposals, and we have this idea
+of CNN classification, so let's just put everything together. That idea
+was kind of first put together in a really nice way in 2014 with this
+method
+
+458
+00:33:02,829 --> 00:33:21,929
+called R-CNN; it's a region-based CNN. It's pretty simple, everything
+we've already seen: we take an input image, we run a region proposal
+method; for selective search we'd get maybe two thousand boxes at
+different scales and positions; two thousand is still a lot, but
+
+462
+00:33:21,929 --> 00:33:35,898
+it's much, much less than all possible boxes in the image. Now, for each
+of those boxes, we crop out that image region, warp it to some fixed
+size, and run it through a CNN to classify it. And this CNN would have
+
+465
+00:33:35,898 --> 00:33:50,369
+a regression head and a classification head: the classification here
+used SVMs, and the regression head regresses a little correction, since
+the region proposals can be a bit off. And this thing actually worked
+really well.
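A test-time sketch of the R-CNN recipe just described; `propose`, `warp_and_featurize`, and the per-class SVM weights are assumed stand-in components supplied by the caller, not the reference implementation:

~~~python
import numpy as np

def rcnn_detect(image, propose, warp_and_featurize, class_svms, threshold=0.0):
    """R-CNN test-time flow: propose regions, featurize each crop with the
    CNN, score every per-class binary SVM on the features.

    propose(image) yields (x, y, w, h) proposals (e.g. selective search);
    warp_and_featurize(crop) warps a crop to the CNN input size and
    returns its feature vector; class_svms maps class name to a trained
    (weight_vector, bias) pair.
    """
    detections = []
    for (x, y, w, h) in propose(image):          # ~2000 proposals per image
        feat = warp_and_featurize(image[y:y + h, x:x + w])
        for cls, (wv, b) in class_svms.items():  # one forward pass, many SVMs
            score = float(np.dot(wv, feat) + b)
            if score > threshold:
                detections.append((cls, score, (x, y, w, h)))
    return detections
~~~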
+468
+00:33:50,369 --> 00:34:06,970
+Yeah, it's really simple and really cool. But, unfortunately, the
+training pipeline gets to be a little long and complicated. To train your
+R-CNN model, you know, like many, many models, you start by downloading a
+pretrained model from the internet; that works well first.
+
+472
+00:34:06,970 --> 00:34:28,398
+Next, you actually fine-tune this model for detection, because the
+classification model was probably trained on ImageNet with a thousand
+classes, but your detection dataset has a different number of classes, so
+you still train this network a bit more. You're still running
+classification,
+
+477
+00:34:29,679 --> 00:34:49,950
+and you add a few new layers at the end to handle your classes; this also
+helps handle the slightly different statistics of the image data. But
+you're not running on whole images anymore: you run on positive and
+negative regions from your detection dataset. So with the new layers in
+place, you retrain this thing.
+
+483
+00:34:53,599 --> 00:35:12,319
+Next, and this feeds everything we actually want to do afterwards: you
+take your detection images, run selective search, extract those warped
+regions, run them through the CNN, and cache those features to disk.
+
+487
+00:35:12,319 --> 00:35:26,869
+Something important for this step is to have a big hard drive: these
+datasets aren't too big, on the order of maybe tens of thousands of
+images, but extracting these features actually takes hundreds of
+gigabytes, so that's not so great.
+
+490
+00:35:26,869 --> 00:35:45,220
+Then we train SVMs on these cached features: different binary classifiers
+for the different classes. We want a bunch of different binary decisions
+about whether an image region contains or does not contain one particular
+class of object. Going back to a question
+
+495
+00:35:45,219 --> 00:36:01,579
+from before, sometimes you actually want one region to be able to say yes
+for multiple classes, to output multiple positives, and one way they
+handle that is just training a separate binary SVM per class, each with
+its own positive and negative samples. So this is kind of an offline
+process; they just use
+
+499
+00:36:01,579 --> 00:36:27,269
+these cached features. So you have these features for each sample, for
+each class... yeah, the phrasing doesn't quite make sense, but you get
+the idea: you have your different image regions, you have the features
+for those regions stored on disk, and you split them into positive and
+negative samples for each class,
+
+504
+00:36:27,269 --> 00:36:37,029
+and then you just train these binary SVMs on that. You do this for dog,
+you do the same thing for cat, you do this for every class, deciding, is
+there a cat near here, and so on.
+
+506
+00:36:37,029 --> 00:37:06,250
+Then there's one other step, right: this idea of bounding-box regression.
+Sometimes the region proposals are not perfect, so what we actually want
+is to regress, from the cached features of a proposal, to a correction of
+that region proposal. There's this kind of funny business in the paper
+with a normalized representation of the correction; details aside, the
+intuition is that a region proposal can be pretty good but not quite
+right, and we can learn to fix it.
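Assembling the per-class positive and negative training sets from the cached features, as described above; a sketch, where the 0.5 overlap threshold and the data layout are assumptions, and `iou_fn` is the overlap function sketched earlier:

~~~python
import numpy as np

def svm_training_set(cached_feats, proposal_boxes, gt_boxes, cls, iou_fn,
                     pos_thr=0.5):
    """Build (X, y) for one class's binary SVM from disk-cached features.

    cached_feats / proposal_boxes: parallel lists of per-region features
    and (x, y, w, h) boxes; gt_boxes: dict mapping class name to a list of
    ground-truth boxes in the same image set.
    """
    X, y = [], []
    for feat, box in zip(cached_feats, proposal_boxes):
        overlap = max((iou_fn(box, g) for g in gt_boxes.get(cls, [])),
                      default=0.0)
        X.append(feat)
        y.append(1 if overlap >= pos_thr else -1)  # positive vs negative
    return np.array(X), np.array(y)
~~~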
+514
+00:37:06,250 --> 00:37:21,880
+For example, maybe this proposal is pretty good, but it's a little too
+far to the left and should be nudged a bit to the right to match the
+ground truth; regressing to these correction factors tells us we should
+shift a little to the right. Or maybe this one is a little too wide, it
+caught too much stuff that isn't cat, so the correction factors tell us
+we should shrink
+
+519
+00:37:21,880 --> 00:37:39,219
+the region proposal a little bit. And again this can be done with just
+linear regression: you have these features, you have these targets, you
+know linear regression from CS229, so you just run linear regression.
+Before we look at results, we should talk a little about the different
+
+523
+00:37:39,219 --> 00:38:01,550
+datasets used for detection; there are kind of three that you'll see in
+practice. One is the PASCAL VOC dataset, which I think is really
+important; it used to be bigger previously, but now it's a bit smaller.
+This one has about 20 classes and about 20,000 images, with a ratio of
+about 2 objects per image. It's a relatively small dataset, so you see a
+lot of
+
+528
+00:38:01,550 --> 00:38:17,820
+detection papers working on it, just because it's easier to deal with.
+There's also, as you've probably seen by now, the ImageNet detection
+challenge: we saw classification and localization, and there's also a
+detection challenge. That one has
+
+532
+00:38:17,820 --> 00:38:32,760
+two hundred classes, not the thousand from classification, but it's very
+big, almost half a million images, with about one object per image, so
+you don't see many papers working on it; it's kind of a pain to deal
+with. And then more recently there's this thing from Microsoft called
+COCO,
+
+536
+00:38:32,760 --> 00:38:45,300
+which has a smaller number of classes and images, but actually many more
+objects per image, so the ratios make things even more interesting.
+
+538
+00:38:45,300 --> 00:38:59,940
+And when you talk about detection, there's also this kind of funny
+evaluation metric that gets used, called mean average precision (mAP).
+We don't want to get into the details too much; all you really need to
+know is that it's a number between 0 and 100, and higher is better.
+
+542
+00:38:59,940 --> 00:39:19,740
+The intuition is that you want your true positives to get high scores,
+and you also have to produce boxes that are within some threshold of the
+ground-truth boxes; usually that thresholding is done on intersection
+over union, thresholded at 0.5. You'll see slightly different variants,
+but that's the whole thing. All right, so now let's see how our
+
+548
+00:39:19,739 --> 00:39:37,169
+detection datasets get eaten by our CNNs. This is on the last couple of
+versions of PASCAL VOC; as I said it's small, so you see a lot of results
+on it, and there are different versions, 2007 and 2010; people use those
+because the test sets are public, so it's easy to evaluate.
+
+553
+00:39:37,170 --> 00:39:55,280
+So: the deformable parts model that we saw a couple of slides ago was
+getting, by around 2011, roughly 30 mean average precision. A different
+method called Regionlets, from 2013, was kind of the state of the
+
+556
+00:39:55,280 --> 00:40:02,840
+art right before deep learning, and it's a sort of similar flavor: you
+take features and put a classifier on top of them. Then R-CNN, this
+pretty simple thing we just saw, actually jumped
+
+559
+00:40:02,840 --> 00:40:10,510
+the performance up quite a bit. So the first thing we see is a big
+improvement from switching to this very simple R-CNN framework.
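Applying the correction factors discussed at the top of this passage to a proposal; the normalized (dx, dy, dw, dh) parameterization follows the common R-CNN convention, sketched here in numpy:

~~~python
import numpy as np

def apply_box_correction(box, deltas):
    """Apply predicted correction factors to an (x, y, w, h) proposal.

    deltas: (dx, dy, dw, dh) as produced by the regression head; the shift
    is scaled by the box size and the resizing happens in log space,
    which is the usual normalized parameterization.
    """
    x, y, w, h = box
    dx, dy, dw, dh = deltas
    cx = x + 0.5 * w + dx * w           # shifted center x
    cy = y + 0.5 * h + dy * h           # shifted center y
    nw, nh = w * np.exp(dw), h * np.exp(dh)  # rescaled width and height
    return (cx - 0.5 * nw, cy - 0.5 * nh, nw, nh)
~~~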
+561
+00:40:10,510 --> 00:40:26,820
+Actually, this result here is without the bounding-box regression, using
+only the region proposals with the SVMs; adding the bounding-box
+correction on top actually helps another bit. Another pretty fun
+
+564
+00:40:26,820 --> 00:40:42,840
+thing to note is that if you take R-CNN and do everything the same,
+except you use VGG-16 instead of AlexNet, you get another pretty big
+boost in performance. This is kind of similar to what we saw before,
+right: using stronger features tends to help a lot on a lot of different
+tasks.
+
+568
+00:40:42,840 --> 00:41:02,910
+So this is really nice: we made a big improvement in detection compared
+to 2013. Amazing. But R-CNN is not perfect; it has a few problems. First,
+it's pretty slow at test time: doing maybe two thousand regions means
+evaluating our CNN independently for each region, which is kind of slow.
+Then we have this slightly subtle problem where the SVMs
+
+573
+00:41:02,909 --> 00:41:24,309
+and the regressors are trained offline, post hoc, using the cached
+features, so the convolutional part of the network never really gets a
+chance to update in response to the objectives it ultimately wants to
+serve. And third, we had this kind of complex training pipeline that was
+a bit of a mess.
+
+577
+00:41:24,309 --> 00:41:43,550
+So, to fix these problems, a year later this thing called Fast R-CNN was
+presented; it's pretty recent, ICCV in December. But the idea is really
+simple: we're just going to swap the order of extracting regions and
+running the CNN. This is kind of related to the sliding-window idea we
+saw
+
+582
+00:41:43,550 --> 00:42:03,940
+with Overfeat. So here at test time we have a kind of similar-looking
+pipeline: we take this input image, at high resolution, and run it
+through the convolutional layers of our network, and now we get this
+high-resolution convolutional feature map over the whole image. Now our
+
+586
+00:42:03,940 --> 00:42:22,670
+region proposals are going to extract their features directly from that
+convolutional feature map, using this thing called ROI pooling, and the
+features for each region then feed into fully connected layers, and
+again we have a classification head and a
+
+591
+00:42:22,670 --> 00:42:37,289
+regression head, like we saw before. So this is really pretty cool: it
+fixes a lot of the problems we saw with R-CNN. R-CNN was really slow at
+test time, and we fix that by sharing the
+
+594
+00:42:37,289 --> 00:42:50,480
+computation of the convolutional features across the region proposals.
+And R-CNN had those problems at training time, the messy pipeline with
+the different parts of the network trained separately, and the solution
+is very simple: you know, just train everything together, all at once.
+
+599
+00:42:50,480 --> 00:43:10,530
+We don't have that complicated pipeline anymore; we now have this pretty
+nice function going from the input straight to the outputs. So you can
+see that Fast R-CNN pretty much fixes the problems we saw with R-CNN,
+which is really exciting. The one technical bit of Fast R-CNN is this
+region-of-interest (ROI)
+
+603
+00:43:10,530 --> 00:43:27,199
+pooling, so here's the idea. We have this input image, probably at high
+resolution, and we have these region proposals, from selective search or
+something. We can put this high-resolution image through our
+convolutional and pooling layers just fine.
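The Fast R-CNN test-time flow just described, as an orchestration sketch; `conv_layers`, `roi_pool`, and `fc_heads` are assumed components supplied by the caller (an ROI pooling sketch follows in the next section):

~~~python
def fast_rcnn_forward(image, proposals, conv_layers, roi_pool, fc_heads):
    """Fast R-CNN at test time: one expensive convolutional pass over the
    whole image, then cheap per-region pooling and heads.

    conv_layers(image) returns the shared feature map; roi_pool crops a
    proposal out of it at a fixed size; fc_heads returns
    (class_scores, box_deltas) for one region.
    """
    feature_map = conv_layers(image)   # shared by all region proposals
    outputs = []
    for box in proposals:
        region_feat = roi_pool(feature_map, box)
        outputs.append(fc_heads(region_feat))
    return outputs
~~~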
+608
+00:43:27,199 --> 00:43:46,068
+That works because those layers are kind of scale invariant: they work on
+inputs of different sizes. But now the problem is the fully connected
+layers: the fully connected layers of our pretrained network are
+expecting these very low-resolution conv features, whereas our features
+are now high resolution over the whole image. We solve
+
+612
+00:43:46,068 --> 00:44:09,798
+this problem in a very simple way: given this region proposal, we project
+it onto that particular part of the convolutional feature volume; now
+we're going to divide that part of the feature volume into a little grid,
+the h-by-w grid that the downstream layers are expecting, and
+
+616
+00:44:09,798 --> 00:44:23,629
+we max-pool within each cell of that grid. So with this very simple
+strategy we've taken the region proposal and, from the shared
+convolutional features, extracted a fixed-size output for that region
+proposal.
+
+619
+00:44:23,629 --> 00:44:46,758
+One way to think about it is that it basically just swaps the order of
+the convolutions and the cropping and warping. And this is a pretty nice
+operation, because it's basically just max pooling, and we know how to
+backpropagate through max pooling, so we can backpropagate through this
+region-of-interest pooling too, and that's what lets us train this whole
+thing jointly. So these are all pretty cool.
+
+626
+00:44:46,759 --> 00:45:05,229
+Let's see some results. Training time: R-CNN is big and slow, amazingly
+so; it had that complicated pipeline with the disk caching, where you do
+all the stuff independently, and even on the quite small PASCAL dataset
+this took 84 hours to train. Fast
+
+630
+00:45:05,228 --> 00:45:23,439
+R-CNN trains far, far faster. At test time, again, R-CNN is very slow,
+because we run these independent forward passes of the CNN for each
+region proposal, whereas Fast R-CNN kind of shares the computation across
+the different region proposals, and this gets a huge test-time
+
+633
+00:45:23,439 --> 00:45:32,130
+speedup: it's like 146x. Big, amazing. And performance-wise, it actually
+does a little bit better, though not a drastic difference in performance,
+and this can probably be attributed
+
+636
+00:45:32,130 --> 00:45:45,730
+to the fine-tuning property: with Fast R-CNN all the parts of the
+convolutional network actually get to help this final task jointly, and
+that's probably why you see a bit of an increase. So this is great,
+right? What could possibly go wrong with Fast R-CNN; it looks amazing.
+
+641
+00:45:45,730 --> 00:46:09,190
+The big problem is that these test-time speeds don't include the region
+proposals. Fast R-CNN is now really fast, but the actual bottleneck
+becomes computing the region proposals. Once you take into account
+computing these region proposals on the CPU, you see that a lot of our
+speed advantage goes away: it's 25x, and we kind of lose
+
+646
+00:46:09,190 --> 00:46:27,340
+that beautiful hundred-x speedup, because it now takes about two seconds
+per image, and you can't really use this for real-time things; it's
+still kind of an offline processing thing. So the solution to this
+should be pretty obvious: we're already using a convolutional network
+
+650
+00:46:27,340 --> 00:46:50,789
+for classification and using it for regression; why not use it for the
+proposals too? That may sound kind of crazy, but it works. So someone
+wanted to write that paper, and what do you think the name is? Yes, it's
+Faster R-CNN.
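A numpy sketch of the ROI max-pooling step described above; the 7x7 output grid and the rounding scheme are illustrative assumptions:

~~~python
import numpy as np

def roi_max_pool(feature_map, roi, out_h=7, out_w=7):
    """Max-pool one region of a (C, H, W) feature map to a fixed grid.

    roi: (x0, y0, x1, y1), already projected into feature-map coordinates.
    Each output cell takes the max over its sub-window, so gradients can
    flow back exactly as in ordinary max pooling.
    """
    C, H, W = feature_map.shape
    x0, y0, x1, y1 = roi
    out = np.full((C, out_h, out_w), -np.inf)
    ys = np.linspace(y0, y1, out_h + 1).astype(int)
    xs = np.linspace(x0, x1, out_w + 1).astype(int)
    for i in range(out_h):
        for j in range(out_w):
            ya, yb = ys[i], max(ys[i + 1], ys[i] + 1)  # at least one cell
            xa, xb = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = feature_map[:, ya:yb, xa:xb].max(axis=(1, 2))
    return out
~~~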
경우 빠른 우리의 CNN에서 일부 어디 + +655 +00:46:55,460 --> 00:46:59,630 + 전체 입력 화상 위에 빅 컨벌루션 피쳐 맵을 계산 + +656 +00:46:59,630 --> 00:47:05,170 + 그래서 그 대신 일부 외부 방법을 사용하는 지역의 제안을 계산하는 그들 + +657 +00:47:05,170 --> 00:47:09,010 + 직접 보이는 지역 제안 네트워크라는이 작은 일을 추가 + +658 +00:47:09,010 --> 00:47:13,060 + 이들의 이러한 외모에 수 그들의 조성 기능을 지속 + +659 +00:47:13,059 --> 00:47:17,599 + 지도 기능을합니다 그 경쟁에서 직접 지역의 제안을 생산하고 + +660 +00:47:17,599 --> 00:47:21,190 + 이 지역의 제안을 일단 당신은 그냥 빨리 우리의 CNN을 같은 일을 + +661 +00:47:21,190 --> 00:47:25,880 + 이 ROI 풀링을 사용하고 난 상류 정지가 빠르거나 CNN과 동일 수 있습니다 + +662 +00:47:25,880 --> 00:47:31,130 + 그래서 여기 정말에 대한 새로운 비트는 그것의이 지역 제안 네트워크는 그것의이다 + +663 +00:47:31,130 --> 00:47:34,180 + 정말 멋진 바로 우리가 모든 일을하고있는 직장에서 하나의 거대한 경쟁 + +664 +00:47:34,179 --> 00:47:40,500 + 옳았다이이 지역 제안 네트워크가 작동하는 방식은 우리 종류의 + +665 +00:47:40,500 --> 00:47:43,880 + 이 매핑 기능을합니다이 경쟁에서 나오는 수의 입력으로받을 + +666 +00:47:43,880 --> 00:47:47,820 + 마지막으로 우리의 길쌈 기능 층과 우리는 당신이 좋아하는 추가 할거야 + +667 +00:47:47,820 --> 00:47:52,570 + 대부분의 물건이있는 것처럼 그에 최근 포스트는 컨볼 루션 네트워크를 오른쪽으로 일했다 + +668 +00:47:52,570 --> 00:47:57,570 + 그래서 실제로 우리가이 있도록 그 권리에 괴물에 의해이 무료 인 오타입니다 + +669 +00:47:57,570 --> 00:48:01,809 + 우리의 길쌈 기능지도를 통해 슬라이딩 윈도우 방식의 종류 만 + +670 +00:48:01,809 --> 00:48:06,820 + 슬라이딩 창 슬라이딩 그래서 우리는 단지 세 가지가 단지 회선 속도입니다 + +671 +00:48:06,820 --> 00:48:10,920 + 셋이 기능지도의 상단에 컨볼 루션 다음으로 우리는 이것을 가지고 + +672 +00:48:10,920 --> 00:48:14,599 + 독특한이 지역 제안 내부 헤드 구조에이 익숙에게 공격 + +673 +00:48:14,599 --> 00:48:19,670 + 우리가 분류하고있는 네트워크 우리는 우리가인지 말하고 싶은 여기 + +674 +00:48:19,670 --> 00:48:25,430 + 여부는 객체이고 또한 이런 종류의에서 회귀하는 회귀 + +675 +00:48:25,429 --> 00:48:29,829 + 실제 pridgen 제안에에 위치가 너무 생각이 그 + +676 +00:48:29,829 --> 00:48:33,909 + 기능지도에 슬라이딩 윈도우 상대의 위치는 종류의 우리에게 알려줍니다 + +677 +00:48:33,909 --> 00:48:38,239 + 우리는 이미지에 다음이 회귀 일종의 출력 곳의 우리에게 + +678 +00:48:38,239 --> 00:48:43,619 + 기능 맵이이 위치의 상단에 수정하지만 실제로 + +679 +00:48:43,619 --> 00:48:46,940 + 그것은 조금 더보다 복잡한 그래서 대신 주소 확인 + +680 +00:48:46,940 --> 00:48:51,110 + 회선이 위치가지도 기능을합니다에서 직접 자신이 갖고 + +681 +00:48:51,110 --> 00:48:55,280 + 당신이 복용 상상할 수있는 서로 다른 앵커 상자의이 개념 + +682 +00:48:55,280 --> 00:48:59,910 + 다른 크기와 모양의 은행 상자 원래대로 붙여 넣기의 종류 + +683 +00:48:59,909 --> 00:49:03,538 + 이 시점에 대응하는 화상의 시점 화상 + +684 +00:49:03,539 --> 00:49:08,020 + 기능지도 오른쪽 다리와 빠른 RCMP는로 이미지에서 앞으로 돌출 된 + +685 +00:49:08,019 --> 00:49:11,519 + 이 기능지도 이제 우리는 우리가 돌​​출하고 반대를하고있는 + +686 +00:49:11,519 --> 00:49:17,288 + 기능지도 다시이 상자에 대한 이미지로 그렇게 한 다음 이들 각각에 대한 + +687 +00:49:17,289 --> 00:49:21,640 + 앵커 상자 그들은을 사용할 수있는 길쌈 앵커 상자의 종류를 사용 + +688 +00:49:21,639 --> 00:49:27,400 + 이 앵커 박스의 각각에 대한 모든 이미지의 위치와 그들이에서 동일합니다 + +689 +00:49:27,400 --> 00:49:32,119 + 이들은 그 앵커 상자 객체에 해당하는지 여부에 대한 점수를 생성 + +690 +00:49:32,119 --> 00:49:36,809 + 그들은 또한 분노가 잘못의 회귀 좌표에 대한 생산 + +691 +00:49:36,809 --> 00:49:41,880 + 우리가 전에 보았던 유사한 방식으로 상자와 지금이 지역의 제안 네트워크 당신에게 + +692 +00:49:41,880 --> 00:49:45,700 + 그냥 높은 수준의 불가지론 객체의 일종의 예측하려고하는 훈련을 할 수 있습니다 + +693 +00:49:45,699 --> 00:49:52,058 + 원래 논문에서 이렇게 빨리 우리의 CNN 검출기 그들은이 일을 훈련하고 + +694 +00:49:52,059 --> 00:49:55,490 + 그들이 제안을 읽는 훈련을 먼저 재미있는 방법의 종류를 잘 작성하지 다음 작업들이 + +695 +00:49:55,489 --> 00:49:59,500 + 다음 우리의 CNN을 통과 기차 그들은 함께하고 마지막에 병합하는 마법을 + +696 +00:49:59,500 --> 00:50:03,530 + 이 이것은 그래서 오늘의 그들은 모든 것을 생성 한 네트워크를 가지고 + +697 +00:50:03,530 --> 00:50:07,880 + 조금 지저분하지만 개별 종이는이 일을 설명하지만, 그 이후 + +698 +00:50:07,880 --> 00:50:10,470 + 그들은 실제로 단지 전체를 변경 일부 미 출판 일을 했어 + +699 +00:50:10,469 --> 00:50:14,909 + 그들은 종류의 이미지를 가지고 하나의 큰 네트워크를 가지고있어 공동으로 일 + +700 +00:50:14,909 --> 00:50:19,679 + 당신이 지역의 제안서 네트워크 내에서 이것을에 가지고 오는을 설치해야 + +701 +00:50:19,679 --> 00:50:23,538 + 분류는 각 지역의 
제안이 있거나인지 여부를 분류하는 손실 + +702 +00:50:23,539 --> 00:50:27,670 + 당신은이 지역의 제안 안에 이러한 경계 상자의 회귀가 객체 + +703 +00:50:27,670 --> 00:50:33,500 + 하지 빨리 우리가 할에서 다음 대회 앵커의 상단에 작동 + +704 +00:50:33,500 --> 00:50:37,190 + 우리의 삶 풀링의 끝에서 다음이 빠른 우리의 CNN 여행을하고와 + +705 +00:50:37,190 --> 00:50:41,200 + 네트워크 우리는이 분류는 그 어느 클래스 말을 잃었다가이 + +706 +00:50:41,199 --> 00:50:47,659 + 회귀 때문에이이 지역 제안의 상단에 수정을 해결하기 위해 손실 + +707 +00:50:47,659 --> 00:50:53,170 + 이 큰 것은 그래 네 손실을 단지 하나의 큰 네트워크입니다 + +708 +00:50:53,170 --> 00:51:04,019 + 그래서 제안과 억압 좌표는 세에 의해 생산된다 + +709 +00:51:04,019 --> 00:51:07,588 + 세 가지로 삼 세와 명백한 하나씩 회선 종종 있습니다 + +710 +00:51:07,588 --> 00:51:12,358 + 지도 오른쪽 그래서 아이디어는 우리가 서로 다른 앵커 상자에서 찾고 있다는 점이다 + +711 +00:51:12,358 --> 00:51:16,400 + 다른 위치와 비늘 만의 우리는 실제로 같은보고있는 + +712 +00:51:16,400 --> 00:51:20,139 + 기능 맵의 위치는 그 다른 은행 상자를 분류하지만합니다 + +713 +00:51:20,139 --> 00:51:26,179 + 당신은 다른 앵커에 대해 서로 다른 가중치를 다른 학습 한 I + +714 +00:51:26,179 --> 00:51:29,969 + 아이디어에 의해 세 당신이 원하는, 그래서 그것은 주로 경험적 바로 생각 + +715 +00:51:29,969 --> 00:51:33,429 + 비선형의 약간을 가지고 당신은 단지 종류의 일을 상상할 수 + +716 +00:51:33,429 --> 00:51:38,098 + 직접 나는 그들이 생각하지만, 기능 맵 오프 직접 하나씩 회선 + +717 +00:51:38,099 --> 00:51:40,990 + 신문에서이 문제를 논의하지 않습니다하지만 난에 단지 3 × 3 회 같은데요 + +718 +00:51:40,989 --> 00:51:44,669 + 더 조금 작동하지만 당신은 왜 당신이 왜 같은 정말 깊은 이유가 없습니다 + +719 +00:51:44,670 --> 00:51:47,450 + 적은 수 더있을 수 있다는 것을 더 큰 대령은 그냥 수 + +720 +00:51:47,449 --> 00:51:50,548 + 당신의 종류는 메인의 두 개의 머리를 가진 직장이 작은 경쟁이 + +721 +00:51:50,548 --> 00:51:53,710 + 포인트 및 질문 + +722 +00:51:53,710 --> 00:52:18,380 + 그래 나는 때문에 이해 + +723 +00:52:18,380 --> 00:52:22,140 + 전체 이미지에 대응 + +724 +00:52:22,139 --> 00:52:26,098 + 요점은 우리가 실제로 원하는 전체 이미지를 처리​​하지 않는다는 것입니다 + +725 +00:52:26,099 --> 00:52:29,960 + 에 대한 처리를 할 수있는 이미지의 일부 지역을 선택하지만 우리는 선택해야 + +726 +00:52:29,960 --> 00:52:36,048 + 어떻게 든 그 지역 + +727 +00:52:36,048 --> 00:52:42,188 + 예 즉, 그것은 기본적으로 기본적으로 외부 사용이 생각입니다 + +728 +00:52:42,188 --> 00:52:46,428 + 당신이 그 외부 지역의 제안을 수행 할 때 지역의 제안은 바로 그래서 당신은있어 + +729 +00:52:46,429 --> 00:52:50,929 + 종류의 당신이 회선을하기 전에 먼저 따기하지만 그것은 단지 종류의의 + +730 +00:52:50,929 --> 00:52:54,858 + 난처럼 당신이 한 번에 모든 것을 할 수 있다면 좋은 점은, 그래서 그것은 환상이 가지입니다입니다 + +731 +00:52:54,858 --> 00:52:58,748 + 정말 일반 가공 처리 등이 일반적으로하지만 당신은 할 수있는 + +732 +00:52:58,748 --> 00:53:01,608 + 당신이 좀 기여가 충분 것을 바라고 이미지 + +733 +00:53:01,608 --> 00:53:04,869 + 분류 거 침략 당신이이 정보의 유형 + +734 +00:53:04,869 --> 00:53:07,439 + 이러한 기여는 물론 지역을 분류하기위한 충분한 아마 좋은 + +735 +00:53:07,438 --> 00:53:11,958 + 그래서 그것 때문에 끝에 실제로 계산 절감의 사실이다 + +736 +00:53:11,958 --> 00:53:15,719 + 오늘의 당신은 모두를위한 동일한 길쌈 피터 맵을 사용하게 + +737 +00:53:15,719 --> 00:53:18,938 + 댐의 하류 분류 영역 제안서 + +738 +00:53:18,938 --> 00:53:23,389 + 여기에 속도를 얻을 왜 사실이다 하류 회귀 + +739 +00:53:23,389 --> 00:53:29,788 + 질문 그래, 우리는 우리가 네 손실 훈련이 큰 네트워크를 가지고 지금 우리가 할 수있는 + +740 +00:53:29,789 --> 00:53:31,569 + 모든 물체 감지 정렬 한 번에 + +741 +00:53:31,568 --> 00:53:37,858 + 정말 멋진 우리가 CNN의 다양한의 자유를 비교 결과를 보면, 그래서 + +742 +00:53:37,858 --> 00:53:43,630 + 속도는 우리는 약 50 초 테스트 시간 당 걸렸다 원래 우리의 CNN이 + +743 +00:53:43,630 --> 00:53:47,150 + 영상이이 실행중인 계산되는 영역의 제안을 기대하고있다 + +744 +00:53:47,150 --> 00:53:52,439 + CNN 별도로 꽤 느린 지금 전달 우리의 CNN있어 각 지역의 제안에 대한 우리 + +745 +00:53:52,438 --> 00:53:56,909 + 우리가 이동하면 그것은 일종의 지역 제안 시간으로 병목되었지만 보았다 + +746 +00:53:56,909 --> 00:54:01,768 + 빨리 우리의 CNN을 그 지역의 제안보다 기본적으로 무료오고있다 + +747 +00:54:01,768 --> 00:54:06,139 + 그들은 단지 방법에 있기 때문에 우리는 지역의 제안을 계산은 작은 세 내 + +748 +00:54:06,139 --> 00:54:09,199 + 자유 시간 희석과 몇 하나씩 회선들이있어, 그래서 매우 + +749 +00:54:09,199 --> 00:54:13,229 + 우리의 CNN이의 다섯 번째에서 실행이 빠른 테스트 시간을 보낼 평가 저렴 + +750 +00:54:13,228 --> 00:54:23,849 
+ 실제로의 두 번째 꽤 높은 해상도 이미지 그래 + +751 +00:54:23,849 --> 00:54:36,739 + 잘 난 당신이하지 기대하는대로 제로 패딩 뒤에 아이디어 중 하나를하지 않은 의미 + +752 +00:54:36,739 --> 00:54:40,699 + 가장자리에서 너무 멀리 떨어져 정보는 그래서 당신이있을 수 있습니다 어쩌면 생각 + +753 +00:54:40,699 --> 00:54:45,299 + 당신이 제로 패딩을하지 않은 경우에 문제가 어쩌면 더 문제지만 + +754 +00:54:45,300 --> 00:54:48,430 + 우리는 일종의 전에 논의하고 제로 것을 추가하는 사실로 의미 + +755 +00:54:48,429 --> 00:54:52,519 + 그것은 어쩌면이 될 수 있도록 패딩은 이러한 기능의 통계에 영향을 미칠 수 + +756 +00:54:52,519 --> 00:54:56,900 + 문제의 비트는하지만 실제로는 잘 작동하는 것 같다 실제로 + +757 +00:54:56,900 --> 00:55:00,099 + 대한 그래 우리가 실패의 경우 어디에 있습니까 곳의 분석이다 그 + +758 +00:55:00,099 --> 00:55:02,949 + 새를 개발할 때 우리는 정말 중요한 과정으로 잘못된 일을받을 수 있나요 + +759 +00:55:02,949 --> 00:55:08,419 + 알고리즘과 나는 당신에게 더 나은 일을 할 수 있습니다 무엇에 대한 통찰력을 제공 할 수 있습니다 + +760 +00:55:08,420 --> 00:55:26,940 + 그래 그래 + +761 +00:55:26,940 --> 00:55:35,858 + 그래서 어쩌면 도움이 될 수 있지만, 그렇게 할 다음에 좀 어려운 사실이다 + +762 +00:55:35,858 --> 00:55:40,108 + 실험 데이터 세트가 다른 맞아 때문에 때 때를 때문에 + +763 +00:55:40,108 --> 00:55:43,789 + 당신은 한 가지입니다하지만 지금 이미지와 같은 분류 데이터 세트의 종류이었다 + +764 +00:55:43,789 --> 00:55:47,259 + 당신이 검출 작업을 할 때이 다른 데이터 세트 그리고 내가 당신을 좋아하지 않은 + +765 +00:55:47,260 --> 00:55:51,000 + 어떤 객체에 기초하여 상기 검출 된 이미지를 분류하려고 상상할 수 + +766 +00:55:51,000 --> 00:55:54,500 + 존재하지만 난 정말하려고하는 정말 좋은 비교를 보지 못했다 + +767 +00:55:54,500 --> 00:56:00,630 + 연구 분명하지만 내 말은 그 해당 프로젝트의 실험 + +768 +00:56:00,630 --> 00:56:18,088 + 그래 그건 아주 좋은 질문이 너무 그럼 당신은 우리의 방법이 문제가 + +769 +00:56:18,088 --> 00:56:22,119 + 하여 투자 수익 (ROI) 풀링 작업 그뿐만 아니라 오른쪽 방식 때문에 풀링 + +770 +00:56:22,119 --> 00:56:25,720 + 6 학년으로 그 일을 분할하고 당신이 한 번 당겨 최대 일 + +771 +00:56:25,719 --> 00:56:29,949 + 회전 실제로 가지 어려움이 정말 멋진 종이 거기에 있어요 + +772 +00:56:29,949 --> 00:56:33,159 + 공간 변압기라는 여름 동안 마지막에 깊은 마음에서 + +773 +00:56:33,159 --> 00:56:39,250 + 실제로 정말 멋진 방법을 소개 네트워크는이 문제를 해결하기 위해 + +774 +00:56:39,250 --> 00:56:42,239 + 아이디어는 대신 ROI 풀링을하는 우리가 직선에 의해 야 할 것이다 + +775 +00:56:42,239 --> 00:56:46,699 + 좀 당신 같은 보간 ​​그래서 한 번 질감과 그래픽을 사용할 수 있습니다 + +776 +00:56:46,699 --> 00:56:50,009 + 실제로 아마이이 미친 할 수있는 것보다 당신은 선형 보간에 의해 수행 + +777 +00:56:50,010 --> 00:56:53,609 + 지역이 너무 좋아 그 확실히 사람들에 대해 생각하고 뭔가하지만, + +778 +00:56:53,608 --> 00:56:56,848 + 그것은 아직 전체 파이프 라인에 통합되지 않은 + +779 +00:56:56,849 --> 00:57:00,338 + 네 + +780 +00:57:00,338 --> 00:57:11,728 + 당신이 우리의 CNN 정권 바로 이런 종류의에서 다시 둔화 될 수 있고, + +781 +00:57:11,728 --> 00:57:12,449 + 저것 봐 + +782 +00:57:12,449 --> 00:57:16,828 + 250 시간은 느리게 당신은 정말 내가 다른 생각을 의미하는 가격을 지불 할 수 + +783 +00:57:16,829 --> 00:57:20,690 + 회전 된 개체와 실제 관심은 우리가 정말 땅을하지 않아도됩니다 + +784 +00:57:20,690 --> 00:57:25,318 + 진실 데이터가이 검출 데이터 세트의이 대부분의 대부분을 그렇게 설정 만 + +785 +00:57:25,318 --> 00:57:29,190 + 지상 진실 정보 우리는 그래서 이러한 액세스 온라인 경계 상자입니다있다 + +786 +00:57:29,190 --> 00:57:33,150 + 그것은 어려운 당신은 실제의 종류의 접지 진실 위치가 없습니다 + +787 +00:57:33,150 --> 00:57:39,219 + 관심 나는 사람들이 정말 끝 그래서이 너무 많이 탐험하지 않은 생각 + +788 +00:57:39,219 --> 00:57:43,009 + 우리의 CNN은 슈퍼 빠른을 가지고 있으며,이 같은 오른쪽에 있었다 과거와 이야기 + +789 +00:57:43,009 --> 00:57:49,798 + 그 좋은 정말 재미있는 내가 아는이 시점에서 지금 실제로 작동 + +790 +00:57:49,798 --> 00:57:52,949 + 당신은 실제로이 때문에 물체 검출에 예술의 상태를 이해할 수 + +791 +00:57:52,949 --> 00:57:55,669 + 이는 짓 눌린 세계 최고의 개체 검출기 중 하나입니다 + +792 +00:57:55,670 --> 00:58:00,479 + 12 월 이미지와 코코아 도전에 도전 이미지에서 모두 + +793 +00:58:00,478 --> 00:58:06,710 + 대부분 같은 다른 것은 그것이 최고의 목적 때문에이 깊은 잔여 네트워크입니다입니다 + +794 +00:58:06,710 --> 00:58:10,548 + 세계에서 지금 백 하나의 층 잔여 네트워크는 빠른 플러스 + +795 +00:58:10,548 --> 00:58:17,298 + 우리의 CNN 플러스 몇 가지 다른 케이크는 바로 여기 그래서 우리는 우리가 이야기에 대해 이야기 + +796 +00:58:17,298 --> 00:58:23,670 + 그들은 항상 여분을 얻을 수있는 우리의 CNN 과거 우리가 작년에 대통령을 보았다 + +797 +00:58:23,670 --> 
00:58:26,389 + 대회를 위해 당신은 조금을 얻기 위해 미친 몇 가지를 추가해야 + +798 +00:58:26,389 --> 00:58:30,348 + 개선이 실제로 수행이 상자에 바로 그래서 여기에 성능이 약간 향상 + +799 +00:58:30,349 --> 00:58:33,528 + 바운딩 박스를 정제의 여러 단계 + +800 +00:58:33,528 --> 00:58:38,818 + 당신은 빠른 우리의 CNN 프레임 워크에이 보정에 일을 보았다 + +801 +00:58:38,818 --> 00:58:41,929 + 지역 제안 위에 실제로 네트워크로 그를 피드백 할 수 + +802 +00:58:41,929 --> 00:58:46,298 + 즉,이 상자 정제 그래서 안드레아 다른 생산을 얻을 재 분류 + +803 +00:58:46,298 --> 00:58:50,929 + 그것은 당신에게 후원을 조금주는 단계 그들은뿐만 아니라, 그래서 컨텍스트를 추가 + +804 +00:58:50,929 --> 00:58:55,710 + 당신 제공 그들이 배우 나가 그냥 지역 분류 + +805 +00:58:55,710 --> 00:59:00,309 + 종류의 당신에게보다 더 많은 접촉을 제공하는 전체 이미지에 대한 전체 기능 + +806 +00:59:00,309 --> 00:59:03,999 + 그냥 작은 작물 그물 당신에게 조금 더 아파트 또한 제공 + +807 +00:59:03,998 --> 00:59:08,179 + 우리는 그들이 실제로 실행할 수 있도록 다시 피트 이상에서 본 좀처럼 다중 스케일 테스트를 할 + +808 +00:59:08,179 --> 00:59:10,730 + 다른 크기의 이미지에 대한 것은 시험 시간 + +809 +00:59:10,730 --> 00:59:13,949 + 집계 또는 그 서로 다른 크기와 모든 것들을 넣을 때 + +810 +00:59:13,949 --> 00:59:21,129 + 함께 실제로 SOCO에이 일 하나 때문에 대회를 많이 승리 + +811 +00:59:21,130 --> 00:59:24,960 + 마이크로 소프트 코코 실제로 검출 도전을 실행하고 검출 궁금해 + +812 +00:59:24,960 --> 00:59:29,199 + 우리는 또한 이미지에 급속한 진전을 볼 수 코코아에 도전하는 + +813 +00:59:29,199 --> 00:59:32,909 + 지난 몇 년 동안 감지 도전은 그래서 당신은 2013 년에 볼 수 있습니다 + +814 +00:59:32,909 --> 00:59:38,949 + 우리가 이러한 깊은 학습 탐지 모델을 가지고 첫 번째 시간이었다 종류 + +815 +00:59:38,949 --> 00:59:43,789 + 우리가 현지화 보았다 위업을 통해 그들은 실제로 버전을 제출 자신의 + +816 +00:59:43,789 --> 00:59:47,949 + 로와 로직을 변경의 종류에 의해뿐만 아니라 탐지 작동 시스템 + +817 +00:59:47,949 --> 00:59:51,849 + 이들이 상자를 경계 병합 그들은 꽤 좋은했지만 그들은 실제로 있었다 + +818 +00:59:51,849 --> 00:59:57,319 + 이 다른이 다른 그룹보다 실적 일종의했다 당신의 비전을 호출 + +819 +00:59:57,320 --> 01:00:02,289 + 없는 깊은 학습 방법 2014 우리의 기능을 많이하지만 전혀 사용 + +820 +01:00:02,289 --> 01:00:05,840 + 실제로 실제로 깊은 학습 방법과 구글이 모두 한보고 + +821 +01:00:05,840 --> 01:00:09,740 + 하나 구글 맵 플러스의 상단에 다른 검출 재료를 사용하여 원 + +822 +01:00:09,739 --> 01:00:15,029 + 구글은하지 후 2015 일에 미친 이러한 잔류 네트워크 갔다 플러스 + +823 +01:00:15,030 --> 01:00:19,410 + 내가 통해 특히 해당 작업을 생각하도록 통행 나는 CNN은 모든 것을 분쇄 + +824 +01:00:19,409 --> 01:00:22,409 + 우리가 본 적이 있기 때문에 지난 몇 년은 정말 흥미 진진한 일이있다 + +825 +01:00:22,409 --> 01:00:25,429 + 가장 좋아 검출 지난 몇 년 동안 정말 빠른 진행 + +826 +01:00:25,429 --> 01:00:29,129 + 다른 것들과 내가 생각하는 또 다른 포인트는 만드는 재미의 종류이다이다 + +827 +01:00:29,130 --> 01:00:33,800 + 내가 할 수있는 모든 대회 우승을 위해 실제로는 앙드레 당신을 말했다 알고 + +828 +01:00:33,800 --> 01:00:37,830 + 앙상블과 항상 앙상블과 함께 경기를 승리 그래서 2 %를 얻을 수 있지만, + +829 +01:00:37,829 --> 01:00:42,829 + 실제로 종류의 재미 마이크로 소프트는 또한 최고의 단일 거주 제출 + +830 +01:00:42,829 --> 01:00:47,440 + 이 앙상블이 아니었다 모델링하고 단 하나의 거주자 모델은 실제로 모든 일 + +831 +01:00:47,440 --> 01:00:52,400 + 그래 정말 멋진 사실의 다른 모든 년에서 다른 것들 + +832 +01:00:52,400 --> 01:00:58,130 + 즉,이 재미있는 일의 종류를 잘 그래서 그 밖에 최고의 배우입니다입니다 + +833 +01:00:58,130 --> 01:01:03,240 + 그래서 이것은 정말 그래서 우리는 우리가 현지화의이 아이디어로 이야기 우리입니다 + +834 +01:01:03,239 --> 01:01:08,439 + 회귀 그래서 욜로 당신 만 보면이라는 재미있는 것은 한 번 실제로 시도 + +835 +01:01:08,440 --> 01:01:13,519 + 아이디어 정도로 회귀 문제로 직접 검출 문제를 향하여 + +836 +01:01:13,519 --> 01:01:18,389 + 우리는 실제로 우리의 입력 영상을하려고 우리는로 나누어거야 있음 + +837 +01:01:18,389 --> 01:01:22,190 + 일부 공간 격자 그들은 일곱하여 다음 내에서 칠하는 데 사용 + +838 +01:01:22,190 --> 01:01:26,480 + 공간 격자에 대한 각 요소는 우리가 거​​ 경계 상자의 여섯 번호를 확인하고 + +839 +01:01:26,480 --> 01:01:31,039 + 예측은 내가 그렇게 한 다음 실험의 대부분의 생각에 동일하게 사용 + +840 +01:01:31,039 --> 01:01:36,489 + 각 격자 내에는 네 어쩌면 경계 할 상자를 예측하는거야 + +841 +01:01:36,489 --> 01:01:41,229 + 숫자는 미국 당신이 믿는 정도에 대한 하나의 점수를 보호하려고 + +842 +01:01:41,230 --> 01:01:44,969 + 그 경계 상자는 당신은 또한 각 분류 점수를 보호하는거야 + +843 +01:01:44,969 --> 01:01:49,659 + 그래서 다음에 데이비스 근처 클래스는 정렬이이 검출 문제를 취할 수 있으며, + +844 
diff --git a/classification.md b/classification.md
index efeeb8a2..8a405a47 100644
--- a/classification.md
+++ b/classification.md
@@ -4,108 +4,114 @@ mathjax: true
 permalink: /classification/
 ---
-This is an introductory lecture designed to introduce people from outside of Computer Vision to the Image Classification problem, and the data-driven approach. The Table of Contents:
+본 강의노트는 컴퓨터비전 외의 분야를 공부하던 사람들에게 Image Classification(이미지 분류) 문제와 data-driven approach(데이터 기반 방법론)을 소개한다. 목차는 다음과 같다.
-- [Intro to Image Classification, data-driven approach, pipeline](#intro)
-- [Nearest Neighbor Classifier](#nn)
-  - [k-Nearest Neighbor](#knn)
-- [Validation sets, Cross-validation, hyperparameter tuning](#val)
-- [Pros/Cons of Nearest Neighbor](#procon)
-- [Summary](#summary)
-- [Summary: Applying kNN in practice](#summaryapply)
-- [Further Reading](#reading)
+- [Image Classification(이미지 분류), data-driven approach(데이터 기반 방법론), pipeline(파이프라인)](#intro)
+- [Nearest Neighbor 분류기](#nn)
+  - [k-Nearest Neighbor 알고리즘](#knn)
+- [Validation sets, Cross-validation, hyperparameter 튜닝](#val)
+- [Nearest Neighbor의 장단점](#procon)
+- [요약](#summary)
+- [요약: 실제 문제에 kNN 적용하기](#summaryapply)
+- [읽을 자료](#reading)

-## Image Classification

-**Motivation**. In this section we will introduce the Image Classification problem, which is the task of assigning an input image one label from a fixed set of categories. This is one of the core problems in Computer Vision that, despite its simplicity, has a large variety of practical applications. Moreover, as we will see later in the course, many other seemingly distinct Computer Vision tasks (such as object detection, segmentation) can be reduced to image classification.

+## Image Classification(이미지 분류)

-**Example**. For example, in the image below an image classification model takes a single image and assigns probabilities to 4 labels, *{cat, dog, hat, mug}*. As shown in the image, keep in mind that to a computer an image is represented as one large 3-dimensional array of numbers. In this example, the cat image is 248 pixels wide, 400 pixels tall, and has three color channels Red,Green,Blue (or RGB for short). Therefore, the image consists of 248 x 400 x 3 numbers, or a total of 297,600 numbers. Each number is an integer that ranges from 0 (black) to 255 (white). Our task is to turn this quarter of a million numbers into a single label, such as *"cat"*.

+**동기**. 이 섹션에서는 이미지 분류 문제에 대해 다룰 것이다. 이미지 분류 문제란, 입력 이미지를 미리 정해진 카테고리 중 하나의 라벨로 분류하는 문제다. 문제 정의는 매우 간단하지만 다양한 활용 가능성이 있는 컴퓨터 비전 분야의 핵심적인 문제 중 하나이다. 강의의 나중 파트에서도 살펴보겠지만, 이미지 분류와 멀어 보이는 다른 컴퓨터 비전 분야의 여러 문제들(물체 검출, 영상 분할 등)도 이미지 분류 문제로 환원하여 풀 수 있다.
+
+**예시**. 예를 들어, 아래 그림의 이미지 분류 모델은 하나의 이미지를 입력받아 4개의 라벨 *{cat, dog, hat, mug}* 각각에 확률을 할당한다. 그림에서 보다시피, 컴퓨터에서 이미지는 하나의 큰 3차원 숫자 배열로 표현된다는 점을 기억하자. 이 예시에서 고양이 이미지는 가로 248픽셀(모니터의 화면을 구성하는 최소 단위, 역자 주), 세로 400픽셀 크기이고 Red, Green, Blue(줄여서 RGB) 3개의 색상 채널을 가진다. 따라서 이 이미지는 248 x 400 x 3개, 총 297,600개의 숫자로 이루어져 있다. 각 숫자는 0(검정)부터 255(흰색)까지 범위의 정수값이다. 이미지 분류 문제는 이 수십만 개의 숫자들을 *"cat"* 과 같은 하나의 라벨로 바꾸는 것이다.
- -
The task in Image Classification is to predict a single label (or a distribution over labels as shown here to indicate our confidence) for a given image. Images are 3-dimensional arrays of integers from 0 to 255, of size Width x Height x 3. The 3 represents the three color channels Red, Green, Blue.
+ +
이미지 분류는 주어진 이미지에 대해 하나의 라벨(또는 여기서처럼 신뢰도를 나타내기 위한 라벨들에 대한 분포)을 예측하는 일이다. 이미지는 0~255 범위의 정수값을 가지는 Width(너비) x Height(높이) x 3 크기의 3차원 배열이다. 3은 Red, Green, Blue로 구성된 3개의 색상 채널을 의미한다.
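(역자 주) 위 그림처럼 이미지가 그저 하나의 3차원 숫자 배열이라는 점은 코드 몇 줄로 직접 확인해볼 수 있다. 아래는 matplotlib 으로 이미지 파일을 읽어 배열의 모양과 값의 개수를 확인하는 간단한 스케치이며, `cat.png` 라는 파일 이름은 설명을 위해 가정한 것이다:

~~~python
import matplotlib.pyplot as plt

# 'cat.png' 는 설명을 위해 가정한 예시 파일 이름이다.
img = plt.imread('cat.png')   # (세로, 가로, 채널) 모양의 numpy 배열
print(img.shape)              # 본문의 예시라면 (400, 248, 3)
print(img.size)               # 총 숫자의 개수: 400 * 248 * 3 = 297600
~~~

파일 형식에 따라 값이 0~255 정수 대신 0~1 사이의 실수로 읽힐 수도 있다는 점에 유의하자.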
-**Challenges**. Since this task of recognizing a visual concept (e.g. cat) is relatively trivial for a human to perform, it is worth considering the challenges involved from the perspective of a Computer Vision algorithm. As we present (an inexhaustive) list of challenges below, keep in mind the raw representation of images as a 3-D array of brightness values:
+**문제**. 시각적 개념(예: *"cat"*)을 인식하는 일이 사람에게는 대수롭지 않겠지만, 컴퓨터 비전 알고리즘의 관점에서는 해결해야 할 문제들이 있다. 아래에 (전부는 아니지만) 대표적인 문제들을 나열하는 동안, 이미지가 밝기 값들의 3차원 배열로 표현된다는 점을 염두에 두자.

-- **Viewpoint variation**. A single instance of an object can be oriented in many ways with respect to the camera.
-- **Scale variation**. Visual classes often exhibit variation in their size (size in the real world, not only in terms of their extent in the image).
-- **Deformation**. Many objects of interest are not rigid bodies and can be deformed in extreme ways.
-- **Occlusion**. The objects of interest can be occluded. Sometimes only a small portion of an object (as little as few pixels) could be visible.
-- **Illumination conditions**. The effects of illumination are drastic on the pixel level.
-- **Background clutter**. The objects of interest may *blend* into their environment, making them hard to identify.
-- **Intra-class variation**. The classes of interest can often be relatively broad, such as *chair*. There are many different types of these objects, each with their own appearance.
+- **Viewpoint variation(시점 변화)**. 하나의 객체도 카메라의 시점에 따라 여러 방향에서 보일 수 있다.
+- **Scale variation(크기 변화)**. 시각적 클래스는 대부분 크기가 다양하다(이미지 상의 크기뿐만 아니라 실제 세계에서의 크기도 포함).
+- **Deformation(변형)**. 많은 객체들은 고정된 형태가 없고, 극단적인 형태로 변형될 수 있다.
+- **Occlusion(폐색)**. 객체는 다른 것에 가려 전체가 보이지 않을 수 있다. 때로는 물체의 매우 작은 부분(몇 개의 픽셀)만 보이기도 한다.
+- **Illumination conditions(조명 상태)**. 조명이 픽셀 값에 미치는 영향은 매우 크다.
+- **Intra-class variation(클래스 내 다양성)**. 분류해야 할 클래스는 *의자* 처럼 범위가 넓은 것들이 많다. 이런 클래스에는 생김새가 제각각인 다양한 형태의 객체가 존재한다.
+- **Background clutter(배경 혼란)**. 객체가 주변 환경에 섞여(*blend*) 알아보기 힘들게 된다.

-A good image classification model must be invariant to the cross product of all these variations, while simultaneously retaining sensitivity to the inter-class variations.
+좋은 이미지 분류 모델은 클래스 간 차이에 대한 민감도는 유지하면서, 위와 같은 모든 변화들의 조합에 대해서는 결과가 변하지 않아야 한다.
- +
-**Data-driven approach**. How might we go about writing an algorithm that can classify images into distinct categories? Unlike writing an algorithm for, for example, sorting a list of numbers, it is not obvious how one might write an algorithm for identifying cats in images. Therefore, instead of trying to specify what every one of the categories of interest look like directly in code, the approach that we will take is not unlike one you would take with a child: we're going to provide the computer with many examples of each class and then develop learning algorithms that look at these examples and learn about the visual appearance of each class. This approach is referred to as a *data-driven approach*, since it relies on first accumulating a *training dataset* of labeled images. Here is an example of what such a dataset might look like: +**Data-driven approach(데이터 기반 방법론)**. 어떻게 하면 이미지를 각각의 카테고리로 분류하는 알고리즘을 작성할 수 있을까? 숫자를 정렬하는 알고리즘 작성과는 달리 고양이를 분별하는 알고리즘을 작성하는 것은 어렵다. + + 그러므로, 코드를 통해 직접적으로 모든 것을 카테고리로 분류하기 보다는 좀 더 쉬운 방법을 사용할 것이다. 먼저 컴퓨터에게 각 클래스에 대해 많은 예제를 주고 나서 이 예제들을 보고 시각적으로 학습할 수 있는 학습 알고리즘을 개발한다. + 이런 방법을 *data-driven approach(데이터 기반 방법론)* 이라고 한다. 이 방법은 라벨화가 된 이미지들 *training dataset(학습 데이터셋)* 이 처음 학습을 위해 필요하다. 아래 그림은 이런 데이터셋의 예이다.
- -
An example training set for four visual categories. In practice we may have thousands of categories and hundreds of thousands of images for each category.
+ +
4개의 카테고리에 대한 학습 데이터셋에 대한 예. 학습과정에서 천여 개의 카테고리와 각 카테고리당 수십만 개의 이미지가 있을 수 있다.
-**The image classification pipeline**. We've seen that the task in Image Classification is to take an array of pixels that represents a single image and assign a label to it. Our complete pipeline can be formalized as follows: +**The image classification pipeline(이미지 분류 파이프라인)**. 이미지 분류 문제란, 이미지를 픽셀들의 배열로 표현하고 각 이미지에 라벨을 하나씩 할당하는 문제라는 것을 이제까지 살펴보았다. 완전한 파이프라인은 아래와 같이 공식화할 수 있다: -- **Input:** Our input consists of a set of *N* images, each labeled with one of *K* different classes. We refer to this data as the *training set*. -- **Learning:** Our task is to use the training set to learn what every one of the classes looks like. We refer to this step as *training a classifier*, or *learning a model*. -- **Evaluation:** In the end, we evaluate the quality of the classifier by asking it to predict labels for a new set of images that it has never seen before. We will then compare the true labels of these images to the ones predicted by the classifier. Intuitively, we're hoping that a lot of the predictions match up with the true answers (which we call the *ground truth*). +- **Input(입력):** 입력은 *N* 개의 이미지로 구성되어 있고, *K* 개의 별개의 클래스로 라벨화 되어 있다. 이 데이터를 *training set* 으로 사용한다. +- **Learning(학습):** 학습에서 할 일은 트레이닝 셋을 이용해 각각의 클래스를 학습하는 것이다. 이 과정을 *training a classifier* 혹은 *learning a model* 이란 용어를 사용해 표현할 수 있다. +- **Evaluation(평가):** 마지막으로 새로운 이미지에 대해 어떤 라벨로 분류되는지 예측해봄으로써 분류기의 성능을 평가한다. 새로운 이미지의 라벨과 분류기를 통해 예측된 라벨을 비교할 것이다. 직관적으로, 많은 예상치들이 실제 답과 일치하기를 기대하는 것이고, 이 것을 우리는 *ground truth(실측 자료)* 라고 한다. -### Nearest Neighbor Classifier -As our first approach, we will develop what we call a **Nearest Neighbor Classifier**. This classifier has nothing to do with Convolutional Neural Networks and it is very rarely used in practice, but it will allow us to get an idea about the basic approach to an image classification problem. -**Example image classification dataset: CIFAR-10.** One popular toy image classification dataset is the CIFAR-10 dataset. This dataset consists of 60,000 tiny images that are 32 pixels high and wide. Each image is labeled with one of 10 classes (for example *"airplane, automobile, bird, etc"*). These 60,000 images are partitioned into a training set of 50,000 images and a test set of 10,000 images. In the image below you can see 10 random example images from each one of the 10 classes: +## Nearest Neighbor Classifier(최근접 이웃 분류기) + +첫번째 방법으로써, **Nearest Neighbor Classifier** 라 불리는 분류기를 개발할 것이다. 이 분류기는 컨볼루션 신경망 방법과는 아무 상관이 없고 실제 문제를 풀 때 자주 사용되지는 않지만, 이미지 분류 문제에 대한 기본적인 접근 방법을 알 수 있도록 한다. + +**이미지 분류 데이터셋의 예: CIFAR-10.** 간단하면서 유명한 이미지 분류 데이터셋 중의 하나는 CIFAR-10 dataset 이다. 이 데이터셋은 60,000개의 작은 이미지로 구성되어 있고, 각 이미지는 32x32 픽셀 크기이다. 각 이미지는 10개의 클래스중 하나로 라벨링되어 있다(Ex. *"airplane, automobile, bird, etc"*). 이 60,000개의 이미지 중에 50,000개는 학습 데이터셋 (트레이닝 셋), 10,000개는 테스트 (데이터)셋으로 분류된다. 아래의 그림에서 각 10개의 클래스에 대해 임의로 선정한 10개의 이미지들의 예를 볼 수 있다:
- -
Left: Example images from the CIFAR-10 dataset. Right: first column shows a few test images and next to each we show the top 10 nearest neighbors in the training set according to pixel-wise difference.
+ +
좌: CIFAR-10 dataset 의 각 클래스 예시. 우: 첫 번째 열은 몇 개의 테스트 이미지이고, 그 옆의 열들은 각 테스트 이미지에 대해 트레이닝 셋에서 픽셀 단위 차이 기준으로 가장 가까운 상위 10개의 이미지들이다.
-Suppose now that we are given the CIFAR-10 training set of 50,000 images (5,000 images for every one of the labels), and we wish to label the remaining 10,000. The nearest neighbor classifier will take a test image, compare it to every single one of the training images, and predict the label of the closest training image. In the image above and on the right you can see an example result of such a procedure for 10 example test images. Notice that in only about 3 out of 10 examples an image of the same class is retrieved, while in the other 7 examples this is not the case. For example, in the 8th row the nearest training image to the horse head is a red car, presumably due to the strong black background. As a result, this image of a horse would in this case be mislabeled as a car.
+50,000개의 CIFAR-10 트레이닝 셋(하나의 라벨 당 5,000개의 이미지)이 주어진 상태에서 나머지 10,000개의 이미지에 라벨을 붙이는 것을 가정해보자. 최근접 이웃 분류기는 테스트 이미지를 모든 학습 이미지와 비교하여, 가장 가까운 학습 이미지의 라벨로 예측한다. 상단 이미지의 우측과 같이 10개의 테스트 이미지에 대한 결과를 확인해보면, 10개 중 3개만이 같은 클래스의 이미지가 검색된 반면, 나머지 7개는 그렇지 않았다. 예를 들어, 8번째 행의 말 머리 테스트 이미지에 대한 최근접 학습 이미지는 붉은색 차이다. 짐작건대 강한 검은색 배경의 영향이 큰 듯하다. 결과적으로, 이 말 이미지는 차로 잘못 분류될 것이다.

-You may have noticed that we left unspecified the details of exactly how we compare two images, which in this case are just two blocks of 32 x 32 x 3. One of the simplest possibilities is to compare the images pixel by pixel and add up all the differences. In other words, given two images and representing them as vectors \\( I\_1, I\_2 \\) , a reasonable choice for comparing them might be the **L1 distance**:
+두 개의 이미지(이 경우에는 32 x 32 x 3 크기의 두 블록)를 비교하는 정확한 방법을 아직 명시하지 않았다는 점을 눈치챘을 것이다. 가장 간단한 방법 중 하나는 이미지를 픽셀 단위로 비교하고, 그 차이를 모두 더하는 것이다. 다시 말해서 두 개의 이미지가 주어지고 그것들을 $$ I_1, I_2 $$ 벡터로 나타냈을 때, 벡터 간의 **L1 distance(L1 거리)** 를 계산하는 것이 한 가지 방법이다:

$$
-d\_1 (I\_1, I\_2) = \sum\_{p} \left| I^p\_1 - I^p\_2 \right|
+d_1 (I_1, I_2) = \sum_{p} \left| I^p_1 - I^p_2 \right|
$$

-Where the sum is taken over all pixels. Here is the procedure visualized:
+결과는 모든 픽셀값 차이의 합이다. 아래에 그 과정을 시각화하였다:
- -
An example of using pixel-wise differences to compare two images with L1 distance (for one color channel in this example). Two images are subtracted elementwise and then all differences are added up to a single number. If two images are identical the result will be zero. But if the images are very different the result will be large.
+ +
두 개의 이미지를 (각각의 색 채널마다의) L1 거리를 이용해서 비교할 때, 각 픽셀마다의 차이를 사용하는 예시. 두 이미지 벡터(행렬)의 각 성분마다 차를 계산하고, 그 차를 전부 더해서 하나의 숫자를 얻는다. 두 이미지가 똑같을 경우에는 결과가 0일 것이고, 두 이미지가 매우 다르다면 결과값이 클 것이다.
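(역자 주) 위 그림의 계산 과정을 아주 작은 배열로 직접 따라 해보면 다음과 같다. 아래의 4x4 값들은 그림과 같은 상황을 가정하여 만든 예시이다:

~~~python
import numpy as np

# 단일 채널의 4x4 블록 두 개를 가정한다.
I1 = np.array([[56, 32,  10,  18],
               [90, 23, 128, 133],
               [24, 26, 178, 200],
               [ 2,  0, 255, 220]])
I2 = np.array([[10, 20,  24,  17],
               [ 8, 10,  89, 100],
               [12, 16, 178, 170],
               [ 4, 32, 233, 112]])

d1 = np.sum(np.abs(I1 - I2))  # 성분별 차의 절댓값을 전부 더한다
print(d1)                     # 이 예시에서 L1 거리는 456
~~~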
-Let's also look at how we might implement the classifier in code. First, let's load the CIFAR-10 data into memory as 4 arrays: the training data/labels and the test data/labels. In the code below, `Xtr` (of size 50,000 x 32 x 32 x 3) holds all the images in the training set, and a corresponding 1-dimensional array `Ytr` (of length 50,000) holds the training labels (from 0 to 9): +다음으로, 분류기를 실제로 코드 상에서 어떻게 구현하는지 살펴보자. 첫 번째로 CIFAR-10 데이터를 메모리로 불러와 4개의 배열에 저장한다. 각각은 학습(트레이닝) 데이터와 그 라벨, 테스트 데이터와 그 라벨이다. 아래 코드에 `Xtr`(크기 50,000 x 32 x 32 x 3)은 트레이닝 셋의 모든 이미지를 저장하고 1차원 배열인 `Ytr`(길이 50,000)은 트레이닝 데이터의 라벨(0부터 9까지)을 저장한다. -```python -Xtr, Ytr, Xte, Yte = load_CIFAR10('data/cifar10/') # a magic function we provide -# flatten out all images to be one-dimensional -Xtr_rows = Xtr.reshape(Xtr.shape[0], 32 * 32 * 3) # Xtr_rows becomes 50000 x 3072 -Xte_rows = Xte.reshape(Xte.shape[0], 32 * 32 * 3) # Xte_rows becomes 10000 x 3072 -``` +~~~python +Xtr, Ytr, Xte, Yte = load_CIFAR10('data/cifar10/') # 제공되는 함수 +# 모든 이미지가 1차원 배열로 저장된다. +Xtr_rows = Xtr.reshape(Xtr.shape[0], 32 * 32 * 3) # Xtr_rows는 50000 x 3072 크기의 배열. +Xte_rows = Xte.reshape(Xte.shape[0], 32 * 32 * 3) # Xte_rows는 10000 x 3072 크기의 배열. +~~~ -Now that we have all images stretched out as rows, here is how we could train and evaluate a classifier: +이제 모든 이미지를 배열의 각 행들로 얻었다. 아래에는 분류기를 어떻게 학습시키고 평가하는지에 대한 코드이다: -```python -nn = NearestNeighbor() # create a Nearest Neighbor classifier class -nn.train(Xtr_rows, Ytr) # train the classifier on the training images and labels -Yte_predict = nn.predict(Xte_rows) # predict labels on the test images -# and now print the classification accuracy, which is the average number -# of examples that are correctly predicted (i.e. label matches) +~~~python +nn = NearestNeighbor() # Nearest Neighbor 분류기 클래스 생성 +nn.train(Xtr_rows, Ytr) # 학습 이미지/라벨을 활용하여 분류기 학습 +Yte_predict = nn.predict(Xte_rows) # 테스트 이미지들에 대해 라벨 예측 +# 그리고 분류 성능을 프린트한다 +# 정확도는 이미지가 올바르게 예측된 비율로 계산된다 (라벨이 같을 비율) print 'accuracy: %f' % ( np.mean(Yte_predict == Yte) ) -``` +~~~ -Notice that as an evaluation criterion, it is common to use the **accuracy**, which measures the fraction of predictions that were correct. Notice that all classifiers we will build satisfy this one common API: they have a `train(X,y)` function that takes the data and the labels to learn from. Internally, the class should build some kind of model of the labels and how they can be predicted from the data. And then there is a `predict(X)` function, which takes new data and predicts the labels. Of course, we've left out the meat of things - the actual classifier itself. Here is an implementation of a simple Nearest Neighbor classifier with the L1 distance that satisfies this template: +일반적으로 평가 기준으로서 **accuracy(정확도)** 를 사용한다. 정확도는 예측값이 실제와 얼마나 일치하는지 그 비율을 측정한다. 앞으로 만들어볼 모든 분류기는 공통적인 API를 갖게 될 것이다: 데이터(X)와 데이터가 실제로 속하는 라벨(y)을 입력으로 받는 `train(X,y)` 형태의 함수가 있다는 점이다. 내부적으로, 이 함수는 라벨들을 활용하여 어떤 모델을 만들어야 하고, 그 값들이 데이터로부터 어떻게 예측될 수 있는지를 알아야 한다. 그 이후에는 새로운 데이터로 부터 라벨을 예측하는 `predict(X)` 형태의 함수가 있다. 물론, 아직까지는 실제 분류기 자체가 빠져있다. 다음은 앞의 형식을 만족하는 L1 거리를 이용한 간단한 최근접 이웃 분류기의 구현이다: -```python +~~~python import numpy as np class NearestNeighbor(object): @@ -114,171 +120,181 @@ class NearestNeighbor(object): def train(self, X, y): """ X is N x D where each row is an example. Y is 1-dimension of size N """ - # the nearest neighbor classifier simply remembers all the training data + # nearest neighbor 분류기는 단순히 모든 학습 데이터를 기억해둔다. 
self.Xtr = X
    self.ytr = y

  def predict(self, X):
    """ X is N x D where each row is an example we wish to predict label for """
    num_test = X.shape[0]
-    # lets make sure that the output type matches the input type
+    # 출력 type과 입력 type이 같게 되도록 확인해준다.
    Ypred = np.zeros(num_test, dtype = self.ytr.dtype)

    # loop over all test rows
    for i in xrange(num_test):
-      # find the nearest training image to the i'th test image
-      # using the L1 distance (sum of absolute value differences)
+      # i번째 테스트 이미지와 가장 가까운 학습 이미지를
+      # L1 거리(절댓값 차의 총합)를 이용하여 찾는다.
      distances = np.sum(np.abs(self.Xtr - X[i,:]), axis = 1)
-      min_index = np.argmin(distances) # get the index with smallest distance
-      Ypred[i] = self.ytr[min_index] # predict the label of the nearest example
+      min_index = np.argmin(distances) # 가장 작은 distance를 갖는 인덱스를 찾는다.
+      Ypred[i] = self.ytr[min_index] # 가장 가까운 이웃의 라벨로 예측

    return Ypred
-```
+~~~

-If you ran this code, you would see that this classifier only achieves **38.6%** on CIFAR-10. That's more impressive than guessing at random (which would give 10% accuracy since there are 10 classes), but nowhere near human performance (which is [estimated at about 94%](http://karpathy.github.io/2011/04/27/manually-classifying-cifar10/)) or near state-of-the-art Convolutional Neural Networks that achieve about 95%, matching human accuracy (see the [leaderboard](http://www.kaggle.com/c/cifar-10/leaderboard) of a recent Kaggle competition on CIFAR-10).
+이 코드를 실행해보면 이 분류기는 CIFAR-10에 대해 정확도가 **38.6%** 밖에 되지 않는다는 것을 확인할 수 있다. 임의로 답을 결정하는 것(10개의 클래스가 있으므로 10%의 정확도)보다는 낫지만, 사람의 정확도([약 94%](http://karpathy.github.io/2011/04/27/manually-classifying-cifar10/))나 사람 수준에 도달한 최신 컨볼루션 신경망의 성능(약 95%)에는 훨씬 미치지 못한다(최근 Kaggle 대회 [순위표](http://www.kaggle.com/c/cifar-10/leaderboard) 참고).

-**The choice of distance.**
-There are many other ways of computing distances between vectors. Another common choice could be to instead use the **L2 distance**, which has the geometric interpretation of computing the euclidean distance between two vectors. The distance takes the form:
+**거리(distance) 선택**
+벡터 간의 거리를 계산하는 방법은 L1 거리 외에도 매우 많다. 또 다른 일반적인 선택으로, 기하학적으로 두 벡터 간의 유클리디안 거리를 계산하는 것으로 해석할 수 있는 **L2 distance(L2 거리)** 의 사용을 고려해볼 수 있다. 이 거리의 계산 방식은 다음과 같다:

$$
-d\_2 (I\_1, I\_2) = \sqrt{\sum\_{p} \left( I^p\_1 - I^p\_2 \right)^2}
+d_2 (I_1, I_2) = \sqrt{\sum_{p} \left( I^p_1 - I^p_2 \right)^2}
$$

-In other words we would be computing the pixelwise difference as before, but this time we square all of them, add them up and finally take the square root. In numpy, using the code from above we would need to only replace a single line of code. The line that computes the distances:
+즉, 이전처럼 각 픽셀 간의 차를 구하지만 각각에 제곱을 취하고, 전부 더한 다음에 최종적으로 제곱근을 취한다. NumPy를 사용한다면 위 코드에서 거리를 계산하는 아래의 코드 한 줄만 바꾸면 된다.

-```python
+~~~python
distances = np.sqrt(np.sum(np.square(self.Xtr - X[i,:]), axis = 1))
-```
+~~~

-Note that I included the `np.sqrt` call above, but in a practical nearest neighbor application we could leave out the square root operation because square root is a *monotonic function*. That is, it scales the absolute sizes of the distances but it preserves the ordering, so the nearest neighbors with or without it are identical. If you ran the Nearest Neighbor classifier on CIFAR-10 with this distance, you would obtain **35.4%** accuracy (slightly lower than our L1 distance result).
+위 코드에서는 `np.sqrt` 함수를 호출하는 것을 그대로 남겨두었지만, 제곱근 함수는 단조 함수이기 때문에 실제 nearest neighbor 응용에서 제곱근은 빼도 결과에 상관이 없다.
즉, 계산되는 거리들의 크기에는 차이가 생기겠지만 그 순서는 동일하기 때문에, 제곱근 함수를 포함할 때와 포함하지 않을 때의 nearest neighbor(최근접 이웃)는 동일하다. 이 거리 함수를 사용하여 Nearest Neighbor 분류기를 CIFAR-10 데이터셋에 돌린다면, **35.4%** 정확도를 얻을 수 있다 (L1 거리를 사용한 결과보다 조금 낮아졌다). -**L1 vs. L2.** It is interesting to consider differences between the two metrics. In particular, the L2 distance is much more unforgiving than the L1 distance when it comes to differences between two vectors. That is, the L2 distance prefers many medium disagreements to one big one. L1 and L2 distances (or equivalently the L1/L2 norms of the differences between a pair of images) are the most commonly used special cases of a [p-norm](http://planetmath.org/vectorpnorm). +**L1 vs. L2.** 두 거리 함수의 특징을 비교하는 것은 매우 흥미로운 주제이다. 일반적으로, L2 거리는 L1 거리에 비해 두 벡터간의 차가 커지는 것에 대해 훨씬 더 크게 반응한다. 즉, L2 거리는 하나의 큰 차이가 있는 것보다 여러 개의 적당한 차이가 생기는 것을 선호한다. L1/L2 거리(또는 두 이미지의 차에 대한 L1/L2 norm)는 일반적인 [p-norm](http://planetmath.org/vectorpnorm)의 형태 중 가장 많이 사용되는 두 가지이다. -### k - Nearest Neighbor Classifier -You may have noticed that it is strange to only use the label of the nearest image when we wish to make a prediction. Indeed, it is almost always the case that one can do better by using what's called a **k-Nearest Neighbor Classifier**. The idea is very simple: instead of finding the single closest image in the training set, we will find the top **k** closest images, and have them vote on the label of the test image. In particular, when *k = 1*, we recover the Nearest Neighbor classifier. Intuitively, higher values of **k** have a smoothing effect that makes the classifier more resistant to outliers: +## k - Nearest Neighbor (kNN) 분류기 + +여태까지 예측을 할 때 가장 가까운 이미지의 라벨만을 사용하는 것을 이상하다고 생각할 수도 있을 것이다. 확실히, **k-Nearest Neighbor Classifier (kNN 분류기)** 라는 것을 사용한다면 거의 무조건 더 분류를 잘 할 수 있다. 아이디어는 매우 간단하다: 학습 데이터셋에서 가장 가까운 하나의 이미지만을 찾는 것이 아니라, 가장 가까운 **k** 개의 이미지를 찾아서 테스트 이미지의 라벨에 대해 투표하도록 하는 것이다. 여기서 *k = 1* 인 경우, 원래의 Nearest Neighbor 분류기가 된다. 직관적으로 **k** 값이 커질수록 분류기는 이상점(outlier)에 더 강인하고, 분류 경계가 부드러워지는 효과가 있다.
- -
An example of the difference between Nearest Neighbor and a 5-Nearest Neighbor classifier, using 2-dimensional points and 3 classes (red, blue, green). The colored regions show the decision boundaries induced by the classifier with an L2 distance. The white regions show points that are ambiguously classified (i.e. class votes are tied for at least two classes). Notice that in the case of a NN classifier, outlier datapoints (e.g. green point in the middle of a cloud of blue points) create small islands of likely incorrect predictions, while the 5-NN classifier smooths over these irregularities, likely leading to better generalization on the test data (not shown). Also note that the gray regions in the 5-NN image are caused by ties in the votes among the nearest neighbors (e.g. 2 neighbors are red, next two neighbors are blue, last neighbor is green).
+ +
Nearest Neighbor 분류기와 5-Nearest Neighbor 분류기의 차이 예시. 2차원 점과 3개의 클래스(라벨: red, blue, green)를 사용하였다. 색칠된 부분들은 L2 거리를 사용한 분류기를 통해 정해진 결정 경계(decision boundaries)이다. 흰색 부분들은 애매하게 분류(투표를 가장 많이 받은 라벨이 여러 개 있는 경우)된 점들을 나타낸다. NN 분류기의 경우 이상점들(e.g. 수많은 파란 점들 가운데에 있는 하나의 초록색 점)이 올바르지 않은 예측일 가능성이 큰 작은 섬들을 만들지만, 5-NN 분류기는 이런 조그마한 섬들이 생기지 않도록 부드럽게 이어주는 것을 확인하자. 이런 특성 덕분에 실제 테스트 데이터(그림에는 없음)에서는 더 나은 일반화(generalization) 성능을 보일 가능성이 높다. 또한, 5-NN 분류기 결과에서 회색 부분들은 nearest neighbors 간의 투표에서 동점이 발생한 경우(e.g. 2개의 이웃이 red, 다음 2개가 blue, 마지막 이웃이 green)인 것을 확인하자.
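(역자 주) 아래는 앞에서 구현한 NearestNeighbor 분류기의 `predict` 를 k개의 이웃이 다수결 투표를 하도록 확장한 간단한 스케치이다. 라벨이 0부터 시작하는 정수라고 가정한다:

~~~python
import numpy as np

def predict_knn(Xtr, ytr, X, k=5):
    """ 각 테스트 이미지에 대해 가장 가까운 k개의 학습 이미지가 다수결로 라벨을 정한다. """
    num_test = X.shape[0]
    Ypred = np.zeros(num_test, dtype=ytr.dtype)
    for i in range(num_test):
        # 모든 학습 이미지와의 L1 거리(절댓값 차의 합)를 계산
        distances = np.sum(np.abs(Xtr - X[i, :]), axis=1)
        # 거리가 가장 작은 k개의 학습 이미지 인덱스 선택
        closest = np.argsort(distances)[:k]
        # k개의 이웃 라벨 중 최빈값(다수결)으로 예측
        Ypred[i] = np.bincount(ytr[closest]).argmax()
    return Ypred
~~~

동점이 발생하면 `np.bincount(...).argmax()` 는 번호가 가장 작은 라벨을 고른다. 위 그림의 회색 영역이 바로 이런 동점 상황에 해당하므로, 동점 처리 방식은 별도의 설계가 필요하다.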
-In practice, you will almost always want to use k-Nearest Neighbor. But what value of *k* should you use? We turn to this problem next.
+실제 문제에 적용할 경우, 대부분은 NN 분류기보다는 k-Nearest Neighbor (kNN) 분류기를 사용하고 싶을 것이다. 그러나 어떤 *k* 값을 골라야 할까? 이 문제에 대해 지금부터 다룰 것이다.

-### Validation sets for Hyperparameter tuning

-The k-nearest neighbor classifier requires a setting for *k*. But what number works best? Additionally, we saw that there are many different distance functions we could have used: L1 norm, L2 norm, there are many other choices we didn't even consider (e.g. dot products). These choices are called **hyperparameters** and they come up very often in the design of many Machine Learning algorithms that learn from data. It's often not obvious what values/settings one should choose.

+### Hyperparameter 튜닝을 위한 검증 셋 (Validation set)
+
+k-nearest neighbor 분류기는 *k* 를 정해줘야 한다. 그런데 어떤 값이 가장 좋을까? 또한, 앞서 우리는 여러 가지 거리 함수(L1 norm, L2 norm, 여기서 고려하지 않은 다른 종류들, e.g. 내적, 도 매우 많다)에 대해서도 살펴보았다. 이러한 선택들을 **hyperparameters** 라 부르고, 데이터로부터 학습하는 많은 기계학습(머신러닝) 알고리즘 디자인에 등장한다. 그런데 어떤 값/세팅을 골라야 하는지에 대해서 확신이 있는 경우는 거의 없다.

-You might be tempted to suggest that we should try out many different values and see what works best. That is a fine idea and that's indeed what we will do, but this must be done very carefully. In particular, **we cannot use the test set for the purpose of tweaking hyperparameters**. Whenever you're designing Machine Learning algorithms, you should think of the test set as a very precious resource that should ideally never be touched until one time at the very end. Otherwise, the very real danger is that you may tune your hyperparameters to work well on the test set, but if you were to deploy your model you could see a significantly reduced performance. In practice, we would say that you **overfit** to the test set. Another way of looking at it is that if you tune your hyperparameters on the test set, you are effectively using the test set as the training set, and therefore the performance you achieve on it will be too optimistic with respect to what you might actually observe when you deploy your model. But if you only use the test set once at end, it remains a good proxy for measuring the **generalization** of your classifier (we will see much more discussion surrounding generalization later in the class).
+여러 가지 다른 값들을 시도해보고, 어떤 것이 가장 좋은 성능을 보이는지 확인해보는 방법을 생각할 수 있다. 아래에서 우리도 실제로 이렇게 할 것이지만, 이 과정은 매우 조심스럽게 수행되어야 한다. 특히, **hyperparameter 값을 조정하기 위해 테스트 셋을 사용하면 절대 안 된다**. 우리가 머신러닝 알고리즘을 디자인할 때, 테스트 셋은 매우 귀한 리소스이고, 이론적으로는 실제로 알고리즘을 평가할 때인 맨 마지막 단 한 번을 제외하고는 절대 쳐다봐서는 안 된다. 그렇게 하지 않으면 hyperparameter 들이 테스트 셋에만 잘 동작하도록 튜닝되어, 실전에서 모델을 사용(deploy)할 때 성능이 상당히 낮아지는 위험이 있다. 머신러닝에서는 이것을 테스트 셋에 **overfit** 되었다고 말한다. 이를 다른 관점으로 바라본다면, 우리가 테스트 셋을 사용하여 hyperparameter 들을 튜닝했다는 것은 곧 우리가 테스트 셋을 마치 학습 데이터셋(트레이닝 셋)처럼 사용한 것이고, 우리 모델의 테스트 셋에서의 성능은 실제로 다른 데이터에 적용할 때에 비해 너무 낙관적이게 되어버린다. 그러나 테스트 셋을 맨 마지막에 딱 한 번만 사용한다면, 그 때는 우리가 학습한 분류기의 **일반화(generalization)** 된 성능을 잘 평가할 수 있는 척도로 활용될 것이다. (이 수업의 나중 부분에서도 일반화에 관련된 주제를 다룰 것이다.)

-> Evaluate on the test set only a single time, at the very end.
+> 테스트 셋에 성능을 평가하는 것은 맨 마지막에 단 한 번만 하라.

-Luckily, there is a correct way of tuning the hyperparameters and it does not touch the test set at all. The idea is to split our training set in two: a slightly smaller training set, and what we call a **validation set**. Using CIFAR-10 as an example, we could for example use 49,000 of the training images for training, and leave 1,000 aside for validation.
This validation set is essentially used as a fake test set to tune the hyper-parameters.
+다행히도, hyperparameter 들을 튜닝하는 올바른 방법이 존재하고, 이 방법은 테스트 셋을 전혀 건드리지 않는다. 아이디어는 우리가 갖고 있는 트레이닝 셋을 둘로 쪼개는 것이다: 약간 줄어든 트레이닝 셋과, 이른바 **검증 셋(validation set)** 으로 불리는 나머지 부분이다. CIFAR-10 데이터셋을 예로 들면, 학습 이미지들 중에 49,000 장을 트레이닝 셋으로 삼고, 나머지 1,000 개를 검증(validation) 용으로 남겨놓는 것이다. 이 검증 셋은 hyperparameter 들을 튜닝할 때, 가짜 테스트 셋으로 활용된다. (역자 주: 즉, 실전 테스트인 수능을 준비하기 위한 모의고사라고 생각하면 된다.)

-Here is what this might look like in the case of CIFAR-10:
+CIFAR-10의 경우, 이런 식으로 나타낼 수 있을 것이다:

-```python
-# assume we have Xtr_rows, Ytr, Xte_rows, Yte as before
-# recall Xtr_rows is 50,000 x 3072 matrix
-Xval_rows = Xtr_rows[:1000, :] # take first 1000 for validation
+~~~python
+# Xtr_rows, Ytr, Xte_rows, Yte 는 이전과 동일하게 갖고 있다고 가정하자.
+# Xtr_rows 는 50,000 x 3072 행렬이었다.
+Xval_rows = Xtr_rows[:1000, :] # 앞의 1000 개를 검증용으로 선택한다.
Yval = Ytr[:1000]
-Xtr_rows = Xtr_rows[1000:, :] # keep last 49,000 for train
+Xtr_rows = Xtr_rows[1000:, :] # 뒤쪽의 49,000 개를 학습용으로 선택한다.
Ytr = Ytr[1000:]

-# find hyperparameters that work best on the validation set
+# 검증 셋에서 가장 잘 동작하는 hyperparameter 들을 찾는다.
validation_accuracies = []
for k in [1, 3, 5, 10, 20, 50, 100]:

-  # use a particular value of k and evaluation on validation data
+  # 특정 k 값을 정해서 검증 데이터에 대해 평가할 때 사용한다.
  nn = NearestNeighbor()
  nn.train(Xtr_rows, Ytr)
-  # here we assume a modified NearestNeighbor class that can take a k as input
+  # 여기서는 k를 input으로 받을 수 있도록 변형된 NearestNeighbor 클래스가 있다고 가정하자.
  Yval_predict = nn.predict(Xval_rows, k = k)
  acc = np.mean(Yval_predict == Yval)
  print 'accuracy: %f' % (acc,)

-  # keep track of what works on the validation set
+  # 검증 셋에 대한 정확도를 저장해 놓는다.
  validation_accuracies.append((k, acc))
-```
+~~~

-By the end of this procedure, we could plot a graph that shows which values of *k* work best. We would then stick with this value and evaluate once on the actual test set.
+이 과정이 끝나면, 어떤 *k* 값이 가장 잘 동작하는지를 그래프로 그려볼 수 있다. 그 뒤, 가장 잘 동작하는 k 값으로 정하고, 실제 테스트 셋에 대해 한 번 평가를 하면 된다.

-> Split your training set into training set and a validation set. Use validation set to tune all hyperparameters. At the end run a single time on the test set and report performance.
+> 학습 데이터셋을 트레이닝 셋과 검증 셋으로 나누고, 검증 셋을 활용하여 모든 hyperparameter 들을 튜닝하라. 마지막으로 테스트 셋에 대해서는 딱 한 번 돌려보고, 성능을 리포트한다.

-**Cross-validation**.
-In cases where the size of your training data (and therefore also the validation data) might be small, people sometimes use a more sophisticated technique for hyperparameter tuning called **cross-validation**. Working with our previous example, the idea is that instead of arbitrarily picking the first 1000 datapoints to be the validation set and rest training set, you can get a better and less noisy estimate of how well a certain value of *k* works by iterating over different validation sets and averaging the performance across these. For example, in 5-fold cross-validation, we would split the training data into 5 equal folds, use 4 of them for training, and 1 for validation. We would then iterate over which fold is the validation fold, evaluate the performance, and finally average the performance across the different folds.
+**Cross-validation (교차 검증)**.
+학습 데이터셋의 크기가 작을 경우(검증 셋의 크기도 작을 것이다), 조금 더 정교한 방식인 **교차 검증(cross-validation)** 이라는 hyperparameter 튜닝 방법을 사용한다. 앞의 예시에서처럼 첫 1000 개의 데이터를 검증 셋으로 사용하고 나머지를 학습(training) 셋으로 사용하는 대신, 어떤 *k* 값이 더 좋은지를 여러 가지 검증 셋에 대해 시험해보고 평균 성능을 확인해본다면 보다 잡음이 덜 섞이고 나은 예측을 할 수 있을 것이다.
예를 들어, 5-fold 교차 검증에서는 학습 데이터를 5개의 동일한 크기의 그룹(fold)으로 쪼갠 뒤, 4개를 학습용으로, 1개를 검증용으로 사용한다. 그 다음에는 어떤 그룹을 검증 셋으로 사용할지에 따라 iteration(반복)을 돌고, 성능을 평가하고, 각 그룹에 대해 평가한 성능을 평균낸다.
- -
Example of a 5-fold cross-validation run for the parameter k. For each value of k we train on 4 folds and evaluate on the 5th. Hence, for each k we receive 5 accuracies on the validation fold (accuracy is the y-axis, each result is a point). The trend line is drawn through the average of the results for each k and the error bars indicate the standard deviation. Note that in this particular case, the cross-validation suggests that a value of about k = 7 works best on this particular dataset (corresponding to the peak in the plot). If we used more than 5 folds, we might expect to see a smoother (i.e. less noisy) curve.
+ +
파라미터 k 에 대한 5-fold 교차 검증 예시. 각 k 값마다 4개의 그룹에 대해 학습을 하고 다섯 번째 그룹을 사용하여 성능을 평가한다. 따라서, 각 k 마다 검증 셋으로 활용한 그룹들에서 5 개의 정확도가 나온다. (y축이 정확도를 나타내고, 각 결과는 점으로 표시하였다.) 그래프에서 선은 각 k 에서의 결과의 평균으로 그려져 있고, 에러 바는 표준 편차를 나타낸다. 이 경우, 이 데이터셋에 대해서는 k = 7 로 놓는 것이 가장 좋을 것(그래프에서 가장 높은 부분)이라고 교차 검증 결과가 말해준다. 만약 5개보다 더 많은 그룹 수를 사용했다면, 지금보다는 더 부드러운 곡선 형태(즉, 잡음이 덜 섞여있음)의 그래프를 볼 수 있을 것이다.
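(역자 주) 앞 문단의 5-fold 교차 검증을 코드로 옮기면 대략 다음과 같다. 위 예제 코드에서처럼 k를 입력으로 받도록 변형된 NearestNeighbor 클래스와 Xtr_rows, Ytr 이 준비되어 있다고 가정한 스케치이다:

~~~python
import numpy as np

num_folds = 5
X_folds = np.array_split(Xtr_rows, num_folds)
y_folds = np.array_split(Ytr, num_folds)

for k in [1, 3, 5, 10, 20, 50, 100]:
    accuracies = []
    for f in range(num_folds):
        # f번째 그룹을 검증용으로, 나머지 그룹들을 학습용으로 사용한다.
        X_train = np.concatenate(X_folds[:f] + X_folds[f+1:])
        y_train = np.concatenate(y_folds[:f] + y_folds[f+1:])
        nn = NearestNeighbor()
        nn.train(X_train, y_train)
        acc = np.mean(nn.predict(X_folds[f], k=k) == y_folds[f])
        accuracies.append(acc)
    # 각 k에 대해 그룹별 정확도의 평균과 표준편차를 확인한다.
    print(k, np.mean(accuracies), np.std(accuracies))
~~~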
-**In practice**. In practice, people prefer to avoid cross-validation in favor of having a single validation split, since cross-validation can be computationally expensive. The splits people tend to use is between 50%-90% of the training data for training and rest for validation. However, this depends on multiple factors: For example if the number of hyperparameters is large you may prefer to use bigger validation splits. If the number of examples in the validation set is small (perhaps only a few hundred or so), it is safer to use cross-validation. Typical number of folds you can see in practice would be 3-fold, 5-fold or 10-fold cross-validation.
+**실제 활용**. 교차 검증은 계산량이 매우 많아지기 때문에, 실제로 사람들은 교차 검증보다 하나의 검증 셋을 정해놓는 것을 선호한다. 보통은 학습 데이터의 50% ~ 90% 정도를 학습용으로 쓰고 나머지를 검증 데이터로 활용하는데, 이 비율은 여러 가지 변수들에 의해 달라진다. 예를 들어, 튜닝할 hyperparameter 개수가 매우 많다면, 검증 데이터셋의 크기를 늘리는 게 좋을 것이다. 반대로 검증 셋에 있는 데이터의 개수가 매우 적다면 (수백 개 정도), 교차 검증 방법을 사용하는 것이 더 안전하다. 보통은 3-fold, 5-fold, 10-fold 교차 검증을 주로 많이 사용한다.
- -
Common data splits. A training and test set is given. The training set is split into folds (for example 5 folds here). The folds 1-4 become the training set. One fold (e.g. fold 5 here in yellow) is denoted as the Validation fold and is used to tune the hyperparameters. Cross-validation goes a step further iterates over the choice of which fold is the validation fold, separately from 1-5. This would be referred to as 5-fold cross-validation. In the very end once the model is trained and all the best hyperparameters were determined, the model is evaluated a single time on the test data (red).
+ +
데이터를 그룹으로 나누는 일반적인 방법. 학습 데이터셋과 테스트 셋이 주어져 있다. 학습 셋은 이 예시의 경우 5개의 그룹(fold)으로 나누어져 있다. 이 중 1-4 그룹이 학습 셋이 되고, 나머지 하나(노란색 그룹 5)가 검증 셋(validation fold)이 되어 hyperparameter 들을 튜닝하는 데 사용된다. 교차 검증 방법은 여기서 한 단계 더 나아가, 어떤 그룹을 검증 셋으로 사용할지를 1-5까지 바꿔가며 전부 반복하는 것이고, 이를 5-fold 교차 검증이라 부른다. 모델의 학습이 끝나고 가장 좋은 hyperparameter 들이 정해진 이후에는, 마지막으로 모델을 테스트 데이터(빨간색)에 대해 딱 한 번 시험해보고 성능을 평가한다.
-**Pros and Cons of Nearest Neighbor classifier.**
-It is worth considering some advantages and drawbacks of the Nearest Neighbor classifier. Clearly, one advantage is that it is very simple to implement and understand. Additionally, the classifier takes no time to train, since all that is required is to store and possibly index the training data. However, we pay that computational cost at test time, since classifying a test example requires a comparison to every single training example. This is backwards, since in practice we often care about the test time efficiency much more than the efficiency at training time. In fact, the deep neural networks we will develop later in this class shift this tradeoff to the other extreme: They are very expensive to train, but once the training is finished it is very cheap to classify a new test example. This mode of operation is much more desirable in practice.
+**Nearest Neighbor 분류기의 장단점.**
+
+Nearest Neighbor 분류기의 장점과 단점이 무엇인지 분석해보자. 당연히, 한 가지 장점은 방법을 이해하고 구현하는 것이 매우 쉽다는 점이다. 또한, 분류기를 학습할 때 단순히 학습 데이터셋을 저장하고 기억만 해놓으면 되기 때문에 학습 시간이 전혀 소요되지 않는다. 그러나 학습할 때 아낀 계산 비용은 테스트할 때 치르게 된다. 테스트 샘플 하나를 분류하려면 모든 학습 데이터와 비교해야 하기 때문이다. 이것은 거꾸로 된 상황인데, 실제로 우리는 학습에 걸리는 시간보다 테스트 시의 효율에 훨씬 더 관심이 많기 때문이다. 사실, 이 수업에서 나중에 다룰 (깊은) 뉴럴 네트워크, 또는 신경망 구조는 이 교환(tradeoff)을 반대 극단으로 이끈다. 뉴럴 네트워크는 학습할 때 매우 많은 계산량을 필요로 하지만, 학습이 끝나면 새로운 테스트 샘플을 매우 적은 계산만으로 분류할 수 있다. 실제 환경에서는 이러한 형태가 더 바람직하다.

-As an aside, the computational complexity of the Nearest Neighbor classifier is an active area of research, and several **Approximate Nearest Neighbor** (ANN) algorithms and libraries exist that can accelerate the nearest neighbor lookup in a dataset (e.g. [FLANN](http://www.cs.ubc.ca/research/flann/)). These algorithms allow one to trade off the correctness of the nearest neighbor retrieval with its space/time complexity during retrieval, and usually rely on a pre-processing/indexing stage that involves building a kdtree, or running the k-means algorithm.
+딴 얘기지만, Nearest Neighbor 분류기의 계산량(computational complexity) 문제는 매우 활발한 연구 주제이고, 많은 **Approximate Nearest Neighbor** (ANN, 근사 최근접 이웃) 알고리즘 및 라이브러리들이 있어서 데이터셋 내에서 nearest neighbor를 찾는 것을 가속화해준다 (e.g. [FLANN](http://www.cs.ubc.ca/research/flann/)). 이 알고리즘들은 nearest neighbor를 찾는 것의 정확도를 조금 희생하여 공간(메모리)/시간(계산량) 복잡도를 크게 낮추고, 보통 kdtree나 k-means 알고리즘 등을 이용한 전처리/인덱싱 단계에 의존하는 경우가 많다.

-The Nearest Neighbor Classifier may sometimes be a good choice in some settings (especially if the data is low-dimensional), but it is rarely appropriate for use in practical image classification settings. One problem is that images are high-dimensional objects (i.e. they often contain many pixels), and distances over high-dimensional spaces can be very counter-intuitive. The image below illustrates the point that the pixel-based L2 similarities we developed above are very different from perceptual similarities:
+Nearest Neighbor 분류기가 좋은 경우도 있지만 (특히 데이터의 차원이 낮을 때), 실제 이미지 분류 문제 세팅에서는 대부분 효과적이지 않다. 한 가지 문제는, 이미지가 매우 고차원 물체라는 것이고 (수많은 픽셀들로 이루어져 있다), 고차원 공간에서의 '거리'는 매우 직관적이지 않은 경우가 많다. 아래 그림을 보면, 사람이 보기에 비슷하다고 느끼는 이미지와 위에서 살펴본 픽셀 값들의 L2 거리를 기준으로 비슷한 이미지는 매우 다르다는 것을 알 수 있다.
- -
Pixel-based distances on high-dimensional data (and images especially) can be very unintuitive. An original image (left) and three other images next to it that are all equally far away from it based on L2 pixel distance. Clearly, the pixel-wise distance does not correspond at all to perceptual or semantic similarity.
+ +
고차원 데이터(이미지)에서의 픽셀값 기준 거리는 매우 비직관적인 경우가 많다. 원본 이미지(왼쪽)와 그 옆의 세 이미지는 픽셀값의 L2 거리를 기준으로 모두 같은 거리만큼 떨어져 있다. 이로 보아 픽셀값을 기준으로 한 거리는 인지적, 의미적으로 거의 연관이 없다고 생각할 수 있다.
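(역자 주) 고차원 공간에서 거리가 직관과 어긋난다는 점은 간단한 실험으로도 확인할 수 있다. 아래는 임의의 점들 사이의 L2 거리가 차원이 커질수록 서로 비슷해지는(집중되는) 경향을 보여주는 스케치이다:

~~~python
import numpy as np

np.random.seed(0)
for d in [2, 100, 3072]:  # 3072 = 32 x 32 x 3, CIFAR-10 이미지 한 장의 차원
    X = np.random.rand(1000, d)
    # 첫 번째 점에서 나머지 점들까지의 L2 거리
    dists = np.sqrt(np.sum((X[1:] - X[0]) ** 2, axis=1))
    # 가장 먼 점과 가장 가까운 점의 상대적 거리 차이
    print(d, (dists.max() - dists.min()) / dists.min())
~~~

차원이 커질수록 출력되는 비율이 작아지는데, 이는 '가장 가까운 이웃'과 '가장 먼 점'의 구분이 고차원에서 점점 무의미해진다는 뜻이다.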
-Here is one more visualization to convince you that using pixel differences to compare images is inadequate. We can use a visualization technique called t-SNE to take the CIFAR-10 images and embed them in two dimensions so that their (local) pairwise distances are best preserved. In this visualization, images that are shown nearby are considered to be very near according to the L2 pixelwise distance we developed above: +아래는 픽셀값의 차이만으로는 불충분하다는 점을 다시 한 번 보여주기 위한 시각화이다. 여기서는 t-SNE 라는 시각화 기법을 사용하여 CIFAR-10 이미지들을 서로간의 거리가 잘 보존되도록 2차원으로 투사시킨 것이다. 이 시각화에서, 가까이 있는 이미지들은 픽셀간의 L2 거리가 매우 가까울 것이라고 생각하면 된다.
- -
CIFAR-10 images embedded in two dimensions with t-SNE. Images that are nearby on this image are considered to be close based on the L2 pixel distance. Notice the strong effect of background rather than semantic class differences. Click here for a bigger version of this visualization.
+ +
t-SNE로 2차원으로 투사시킨 CIFAR-10 이미지들. 여기서 서로 가까이 있는 이미지들은 픽셀간의 L2 거리가 가까울 것이라고 생각하면 된다. 실제 클래스의 의미적인 차이보다 배경이 끼치는 영향이 얼마나 큰지 확인할 수 있다. 시각화의 큰 버전은 여기 에서 확인할 수 있다.
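(역자 주) 위와 같은 시각화는 scikit-learn 의 t-SNE 구현으로 비슷하게 만들어볼 수 있다. 아래는 계산량 때문에 일부 이미지만 사용한다고 가정한 스케치이다 (Xtr_rows 는 앞에서 만든 50,000 x 3072 배열):

~~~python
import numpy as np
from sklearn.manifold import TSNE

subset = Xtr_rows[:1000].astype(np.float64)          # 일부만 사용 (계산량 절약)
coords = TSNE(n_components=2).fit_transform(subset)  # (국소적) 거리가 보존되도록 2차원에 투사
print(coords.shape)                                  # (1000, 2): 각 이미지의 2차원 좌표
~~~

이렇게 얻은 2차원 좌표 위에 각 이미지를 그려 넣으면 위 그림과 같은 형태의 시각화가 된다.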
-In particular, note that images that are nearby each other are much more a function of the general color distribution of the images, or the type of background rather than their semantic identity. For example, a dog can be seen very near a frog since both happen to be on white background. Ideally we would like images of all of the 10 classes to form their own clusters, so that images of the same class are nearby to each other regardless of irrelevant characteristics and variations (such as the background). However, to get this property we will have to go beyond raw pixels.
+여기서 특히, 서로 가까운 이미지들은 보편적인 색의 분포나 배경의 종류에 영향을 많이 받고 각자의 실제 의미가 담긴 클래스에는 큰 영향을 받지 않는 것을 확인할 수 있다. 예를 들어, 강아지와 개구리가 똑같이 흰 배경에 있어서 (실제 클래스가 다름에도 불구하고) 매우 가까이 위치한 것을 볼 수 있다. 이상적으로는 같은 클래스의 이미지들이 여러 변칙적인 성질과 변화(또는 배경)에 상관없이 가까이 있어서 10개의 클래스들이 각각 군집을 이뤄서 뭉쳐 있었으면 좋겠지만, 이러한 성질을 위해서는 단순 픽셀값 이상의 것이 필요하다.

-### Summary
-In summary:

-- We introduced the problem of **Image Classification**, in which we are given a set of images that are all labeled with a single category. We are then asked to predict these categories for a novel set of test images and measure the accuracy of the predictions.
-- We introduced a simple classifier called the **Nearest Neighbor classifier**. We saw that there are multiple hyper-parameters (such as value of k, or the type of distance used to compare examples) that are associated with this classifier and that there was no obvious way of choosing them.
-- We saw that the correct way to set these hyperparameters is to split your training data into two: a training set and a fake test set, which we call **validation set**. We try different hyperparameter values and keep the values that lead to the best performance on the validation set.
-- If the lack of training data is a concern, we discussed a procedure called **cross-validation**, which can help reduce noise in estimating which hyperparameters work best.
-- Once the best hyperparameters are found, we fix them and perform a single **evaluation** on the actual test set.
-- We saw that Nearest Neighbor can get us about 40% accuracy on CIFAR-10. It is simple to implement but requires us to store the entire training set and it is expensive to evaluate on a test image.
-- Finally, we saw that the use of L1 or L2 distances on raw pixel values is not adequate since the distances correlate more strongly with backgrounds and color distributions of images than with their semantic content.

+### 요약
+
+- 여기서는 **이미지 분류(Image Classification)** 문제에 대해 살펴보았다. 각 이미지별로 한 개의 카테고리로 라벨링 되어있는 이미지들이 주어지고, 새로운 테스트 이미지들이 들어왔을 때 이 카테고리들 중 하나로 분류하도록 하고 예측값들의 정확도를 측정하였다.
+- 간단한 **Nearest Neighbor 분류기** 를 소개하였다. 이 분류기와 관련하여 여러 가지 hyperparameter (k의 값이라든지, 데이터를 비교할 때 사용하는 거리의 종류라든지) 들이 존재하는 것을 보았고, 어떤 것을 선택할지 확실한 답은 없다는 것을 보았다.
+- 이 hyperparameter 들을 올바르게 정하는 방법은 학습 데이터셋을 두 개로 (학습 셋과, 가짜 테스트 셋 역할을 하는 **검증 셋(validation set)** 으로) 나누는 것임을 배웠다. 검증 셋에서 여러 가지 hyperparameter 값들을 시험해 보고, 가장 좋은 성능을 얻는 값을 찾을 수 있었다.
+- 학습 데이터가 적은 경우, 어떤 hyperparameter가 가장 잘 동작하는지를 보다 안정적으로 추정할 수 있는 **교차 검증(cross-validation)** 이라는 방법을 알게 되었다.
+- 가장 좋은 hyperparameter 값들을 찾은 뒤, 그것으로 값을 고정하고 실제 테스트 셋에 대해 마지막에 단 한 번 **평가** 를 한다.
+- Nearest Neighbor 분류기는 CIFAR-10 데이터셋에서 약 40% 정도의 정확도를 보이는 것을 확인하였다. 이 방법은 구현이 매우 간단하지만, 학습 데이터셋 전체를 메모리에 저장해야 하고, 새로운 테스트 이미지를 분류하고 평가할 때 계산량이 매우 많다.
+- 마지막으로, 단순히 픽셀 값들의 L1이나 L2 거리는 이미지의 의미적인 내용보다 배경이나 이미지의 전체적인 색깔 분포 등에 더 큰 영향을 받기 때문에 이미지 분류 문제에 있어서 충분하지 못하다는 점을 보았다.
-In next lectures we will embark on addressing these challenges and eventually arrive at solutions that give 90% accuracies, allow us to completely discard the training set once learning is complete, and they will allow us to evaluate a test image in less than a millisecond. +다음 강의에서는 여기서의 문제들을 해결하기 위한 방법들에 대해 살펴보고, 최종적으로 90% 정도의 성능을 갖고, 학습이 완료된 이후에는 학습 데이터셋을 전부 없애버려도 상관없으며, 테스트 이미지를 1/1000초도 안 되는 시간에 분류하고 평가할 수 있도록 해주는 모델을 살펴볼 것이다. -### Summary: Applying kNN in practice -If you wish to apply kNN in practice (hopefully not on images, or perhaps as only a baseline) proceed as follows: +### 요약: kNN을 실제로 적용하기 + +실제 응용에서 kNN을 사용하고 싶다면 (이미지에는 적용하지 않는 것을 추천하지만, 베이스라인으로 시도해볼 수는 있을 것이다), 다음 과정을 따르면 된다: -1. Preprocess your data: Normalize the features in your data (e.g. one pixel in images) to have zero mean and unit variance. We will cover this in more detail in later sections, and chose not to cover data normalization in this section because pixels in images are usually homogeneous and do not exhibit widely different distributions, alleviating the need for data normalization. -2. If your data is very high-dimensional, consider using a dimensionality reduction technique such as PCA ([wiki ref](http://en.wikipedia.org/wiki/Principal_component_analysis), [CS229ref](http://cs229.stanford.edu/notes/cs229-notes10.pdf), [blog ref](http://www.bigdataexaminer.com/understanding-dimensionality-reduction-principal-component-analysis-and-singular-value-decomposition/)) or even [Random Projections](http://scikit-learn.org/stable/modules/random_projection.html). -3. Split your training data randomly into train/val splits. As a rule of thumb, between 70-90% of your data usually goes to the train split. This setting depends on how many hyperparameters you have and how much of an influence you expect them to have. If there are many hyperparameters to estimate, you should err on the side of having larger validation set to estimate them effectively. If you are concerned about the size of your validation data, it is best to split the training data into folds and perform cross-validation. If you can afford the computational budget it is always safer to go with cross-validation (the more folds the better, but more expensive). -4. Train and evaluate the kNN classifier on the validation data (for all folds, if doing cross-validation) for many choices of **k** (e.g. the more the better) and across different distance types (L1 and L2 are good candidates) -5. If your kNN classifier is running too long, consider using an Approximate Nearest Neighbor library (e.g. [FLANN](http://www.cs.ubc.ca/research/flann/)) to accelerate the retrieval (at cost of some accuracy). -6. Take note of the hyperparameters that gave the best results. There is a question of whether you should use the full training set with the best hyperparameters, since the optimal hyperparameters might change if you were to fold the validation data into your training set (since the size of the data would be larger). In practice it is cleaner to not use the validation data in the final classifier and consider it to be *burned* on estimating the hyperparameters. Evaluate the best model on the test set. Report the test set accuracy and declare the result to be the performance of the kNN classifier on your data. +1. 데이터 전처리 과정을 수행하라: 데이터의 각 특징(feature)들을 평균이 0, 표준편차가 1이 되도록 정규화하라. 정규화 관련된 내용은 강의의 나중 부분에서도 다루겠지만, 여기서 따로 다루지 않았던 이유는 이미지의 픽셀들은 보통 균일한 분포를 갖고 있어 데이터 정규화가 크게 필요하지 않기 때문이다. +2. 
사용할 데이터가 매우 고차원 데이터라면, PCA ([wiki ref](http://en.wikipedia.org/wiki/Principal_component_analysis), [CS229ref](http://cs229.stanford.edu/notes/cs229-notes10.pdf), [blog ref](http://www.bigdataexaminer.com/understanding-dimensionality-reduction-principal-component-analysis-and-singular-value-decomposition/))나 아예 [Random Projection](http://scikit-learn.org/stable/modules/random_projection.html)과 같은 차원 축소 기법들을 적용하는 것을 고려해 보자. +3. 학습 데이터를 랜덤으로 학습/검증 셋(train/val split)으로 나누어라. 일반적으로, 70~90% 정도의 데이터를 학습용으로 사용한다. 이 세팅은 튜닝할 hyperparameter 들이 얼마나 많이 있는지에 따라, 각각이 얼마만큼의 영향을 끼칠지에 따라 달라진다. 정해야 할 hyperparameter의 개수가 많다면, 그것들을 효과적으로 정하기 위해 충분히 큰 검증 셋을 사용해야 한다. 검증 셋의 크기가 적당한지에 대해서 의문이 있다면, 학습 데이터를 그룹으로 나누어서 교차 검증을 하는 방법이 제일 좋다. 계산할 시간만 충분하다면, 교차 검증을 하는 것이 항상 더 안전하다 (그룹이 많을수록 더 좋지만, 그만큼 계산량도 늘어난다). +4. 여러 가지 **k** 값에 대해 (많이 해볼수록 좋다), 다른 종류의 거리 함수에 대해 (L1과 L2 거리를 주로 사용한다) kNN 분류기를 학습하고, 검증 셋으로 (또는 교차 검증을 사용한다면 모든 그룹에 대해) 평가해 보자. +5. 현재의 kNN 분류기가 너무 느리다면, 이를 가속하기 위해 Approximate Nearest Neighbor 라이브러리 (e.g. [FLANN](http://www.cs.ubc.ca/research/flann/))를 사용하는 것을 고려해보라 (정확도는 조금 떨어질 수 있다). +6. 가장 좋은 결과를 주는 hyperparameter 들을 기록해두라. 가장 좋은 hyperparameter 세팅으로 다시 전체 학습 데이터셋을 학습해야 하는지에 대한 점은 아직 확실하지 않다. 학습 셋에서 쪼갠 검증 셋을 다시 합친다면 최적의 hyperparameter 세팅이 바뀔 수도 있기 때문이다 (학습에 사용한 데이터셋의 크기가 커지기 때문에). 실제로는 최종 분류기에서 검증 셋은 사용하지 않는 편이 더 깔끔하고, 검증에 사용한 데이터들은 hyperparameter 들을 고르는 데 사용되어 *날아가 버렸다* 고 생각해도 된다. 그 뒤, 최종 모델로 테스트 셋에 대해 성능을 평가해 보고, 그 테스트 셋에 대한 정확도를 현재 데이터로 학습한 kNN 분류기의 성능으로 발표하라. -#### Further Reading -Here are some (optional) links you may find interesting for further reading: +#### 추가 읽기 자료 -- [A Few Useful Things to Know about Machine Learning](http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf), where especially section 6 is related but the whole paper is a warmly recommended reading. +관심 있을 법한 추가적인 읽기 자료 몇 가지를 선정해 두었다 (optional): -- [Recognizing and Learning Object Categories](http://people.csail.mit.edu/torralba/shortCourseRLOC/index.html), a short course of object categorization at ICCV 2005. +- [A Few Useful Things to Know about Machine Learning](http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf) 에서 section 6이 가장 연관이 있지만, 전체적인 내용을 다 읽는 것도 추천한다. + +- [Recognizing and Learning Object Categories](http://people.csail.mit.edu/torralba/shortCourseRLOC/index.html). 물체 분류에 관한 ICCV 2005 (컴퓨터비전 분야에서 유명한) 학회의 short course. + +--- +

+번역: 이옥민 (OkminLee), + 최명섭(myungsub) +

diff --git a/convolutional-networks.md b/convolutional-networks.md index c55ea1f0..4f924f03 100644 --- a/convolutional-networks.md +++ b/convolutional-networks.md @@ -5,146 +5,148 @@ permalink: /convolutional-networks/ Table of Contents: -- [Architecture Overview](#overview) -- [ConvNet Layers](#layers) - - [Convolutional Layer](#conv) - - [Pooling Layer](#pool) - - [Normalization Layer](#norm) - - [Fully-Connected Layer](#fc) - - [Converting Fully-Connected Layers to Convolutional Layers](#convert) -- [ConvNet Architectures](#architectures) - - [Layer Patterns](#layerpat) - - [Layer Sizing Patterns](#layersizepat) - - [Case Studies](#case) (LeNet / AlexNet / ZFNet / GoogLeNet / VGGNet) - - [Computational Considerations](#comp) -- [Additional References](#add) -## Convolutional Neural Networks (CNNs / ConvNets) -Convolutional Neural Networks are very similar to ordinary Neural Networks from the previous chapter: They are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still express a single differentiable score function: From the raw image pixels on one end to class scores at the other. And they still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer and all the tips/tricks we developed for learning regular Neural Networks still apply. -So what does change? ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduces the amount of parameters in the network. +- [아키텍쳐 개요](#overview) +- [ConvNet을 이루는 레이어들](#layers) + - [컨볼루셔널 레이어](#conv) + - [풀링 레이어](#pool) + - [Normalization 레이어](#norm) + - [Fully-Connected 레이어](#fc) + - [FC 레이어를 CONV 레이어로 변환하기](#convert) +- [ConvNet 구조](#architectures) + - [레이어 패턴](#layerpat) + - [레이어 크기 결정 패턴](#layersizepat) + - [케이스 스터디](#case) (LeNet / AlexNet / ZFNet / GoogLeNet / VGGNet) + - [계산 관련 고려사항들](#comp) +- [추가 레퍼런스](#add) +## 컨볼루션 신경망 (ConvNet) +컨볼루션 신경망 (Convolutional Neural Network, 이하 ConvNet)은 앞 장에서 다룬 일반 신경망과 매우 유사하다. ConvNet은 학습 가능한 가중치 (weight)와 바이어스 (bias)를 갖는 뉴런들로 구성되어 있다. 각 뉴런은 입력을 받아 내적 연산 (dot product)을 한 뒤 선택에 따라 비선형 (non-linear) 연산을 적용한다. 전체 네트워크는 일반 신경망과 마찬가지로 미분 가능한 하나의 스코어 함수 (score function)를 갖게 된다 (맨 앞쪽에서 로우 이미지 (raw image)를 읽고 맨 뒤쪽에서 각 클래스에 대한 점수를 구하게 됨). 또한 ConvNet은 마지막 레이어에 (SVM/Softmax와 같은) 손실 함수 (loss function)를 가지며, 우리가 일반 신경망을 학습시킬 때 사용하던 각종 기법들을 동일하게 적용할 수 있다. +ConvNet과 일반 신경망의 차이점은 무엇일까? ConvNet 아키텍쳐는 입력 데이터가 이미지라는 가정 덕분에 이미지 데이터가 갖는 특성들을 아키텍쳐 안에 인코딩할 수 있다. 이 덕분에 포워드 함수 (forward function)를 더욱 효율적으로 구현할 수 있고, 네트워크를 학습시키는 데 필요한 모수 (parameter)의 수를 크게 줄일 수 있다. -### Architecture Overview -*Recall: Regular Neural Nets.* As we saw in the previous chapter, Neural Networks receive an input (a single vector), and transform it through a series of *hidden layers*. Each hidden layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons in a single layer function completely independently and do not share any connections. The last fully-connected layer is called the "output layer" and in classification settings it represents the class scores. +### 아키텍쳐 개요 -*Regular Neural Nets don't scale well to full images*. 
In CIFAR-10, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in a first hidden layer of a regular Neural Network would have 32\*32\*3 = 3072 weights. This amount still seems manageable, but clearly this fully-connected structure does not scale to larger images. For example, an image of more respectible size, e.g. 200x200x3, would lead to neurons that have 200\*200\*3 = 120,000 weights. Moreover, we would almost certainly want to have several such neurons, so the parameters would add up quickly! Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting. +앞 장에서 보았듯이 신경망은 입력받은 벡터를 일련의 히든 레이어 (hidden layer)를 통해 변형 (transform) 시킨다. 각 히든 레이어는 뉴런들로 이뤄져 있으며, 각 뉴런은 앞쪽 레이어 (previous layer)의 모든 뉴런과 연결되어 있다 (fully connected). 같은 레이어 내에 있는 뉴런들끼리는 연결이 존재하지 않고 서로 독립적이다. 마지막 Fully-connected 레이어는 출력 레이어라고 불리며, 분류 문제에서 클래스 점수 (class score)를 나타낸다. -*3D volumes of neurons*. Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: **width, height, depth**. (Note that the word *depth* here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.) For example, the input images in CIFAR-10 are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth respectively). As we will soon see, the neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner. Moreover, the final output layer would for CIFAR-10 have dimensions 1x1x10, because by the end of the ConvNet architecture we will reduce the full image into a single vector of class scores, arranged along the depth dimension. Here is a visualization: +일반 신경망은 이미지를 다루기에 적절하지 않다. CIFAR-10 데이터의 경우 각 이미지가 32x32x3 (가로, 세로 32, 3개 컬러 채널)로 이뤄져 있어서 첫 번째 히든 레이어 내의 하나의 뉴런에 32x32x3=3072개의 가중치가 필요하다. 이 정도는 아직 감당할 만한 수준이지만, 더 큰 이미지에서는 이런 fully-connected 구조를 그대로 쓸 수 없다는 것이 분명하다. 예를 들어 200x200x3의 크기를 가진 이미지는 같은 뉴런에 대해 200x200x3=120,000개의 가중치를 필요로 하기 때문이다. 더욱이, 이런 뉴런이 레이어 내에 여러 개 존재하므로 모수의 개수가 크게 증가하게 된다. 이와 같이 Fully-connectivity는 심한 낭비이며 많은 수의 모수는 곧 오버피팅(overfitting)으로 귀결된다. +ConvNet은 입력이 이미지로 이뤄져 있다는 특징을 살려 좀 더 합리적인 방향으로 아키텍쳐를 구성할 수 있다. 특히 일반 신경망과 달리, ConvNet의 레이어들은 가로, 세로, 깊이의 3개 차원을 갖게 된다 (여기에서 말하는 깊이란 전체 신경망의 깊이가 아니라 액티베이션 볼륨 (activation volume)의 세 번째 차원을 뜻함). 예를 들어 CIFAR-10 이미지는 32x32x3 (가로, 세로, 깊이)의 차원을 갖는 입력 액티베이션 볼륨 (activation volume)이라고 볼 수 있다. 조만간 보겠지만, 하나의 레이어에 위치한 뉴런들은 일반 신경망과는 달리 앞 레이어의 전체 뉴런이 아닌 일부에만 연결이 되어 있다. ConvNet 아키텍쳐는 전체 이미지를 클래스 점수들로 이뤄진 하나의 벡터로 만들어주기 때문에 마지막 출력 레이어는 1x1x10 (10은 CIFAR-10 데이터의 클래스 개수)의 차원을 가지게 된다. 이에 대한 그림은 아래와 같다:
Left: A regular 3-layer Neural Network. Right: A ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume to a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels).
좌: 일반 3-레이어 신경망. 우: 그림과 같이 ConvNet은 뉴런들을 3차원으로 배치한다. ConvNet의 모든 레이어는 3차원 입력 볼륨을 3차원 출력 볼륨으로 변환 (transform) 시킨다. 이 예제에서 붉은색으로 나타난 입력 레이어는 이미지를 입력으로 받으므로, 이 레이어의 가로/세로는 이미지의 크기가 되고 깊이는 3 (Red, Green, Blue 채널)이 된다.
-> A ConvNet is made up of Layers. Every Layer has a simple API: It transforms an input 3D volume to an output 3D volume with some differentiable function that may or may not have parameters. +> ConvNet은 여러 레이어로 이루어져 있다. 각각의 레이어는 3차원의 볼륨을 입력으로 받고 미분 가능한 함수를 거쳐 3차원의 볼륨을 출력하는 간단한 기능을 한다. -### Layers used to build ConvNets -As we described above, every layer of a ConvNet transforms one volume of activations to another through a differentiable function. We use three main types of layers to build ConvNet architectures: **Convolutional Layer**, **Pooling Layer**, and **Fully-Connected Layer** (exactly as seen in regular Neural Networks). We will stack these layers to form a full ConvNet **architecture**. +### ConvNet을 이루는 레이어들 +위에서 다룬 것과 같이, ConvNet의 각 레이어는 미분 가능한 변환 함수를 통해 하나의 액티베이션 볼륨을 또다른 액티베이션 볼륨으로 변환 (transform) 시킨다. ConvNet 아키텍쳐에서는 크게 컨볼루셔널 레이어, 풀링 레이어, Fully-connected 레이어라는 3개 종류의 레이어가 사용된다. 전체 ConvNet 아키텍쳐는 이 3 종류의 레이어들을 쌓아 만들어진다. -*Example Architecture: Overview*. We will go into more details below, but a simple ConvNet for CIFAR-10 classification could have the architecture [INPUT - CONV - RELU - POOL - FC]. In more detail: +*예제 아키텍쳐: 개요*. 아래에서 더 자세하게 배우겠지만, CIFAR-10 데이터를 다루기 위한 간단한 ConvNet은 [INPUT-CONV-RELU-POOL-FC]로 구축할 수 있다. -- INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and with three color channels R,G,B. -- CONV layer will compute the output of neurons that are connected to local regions in the input, each computing a dot product between their weights and the region they are connected to in the input volume. This may result in volume such as [32x32x12]. -- RELU layer will apply an elementwise activation function, such as the \\(max(0,x)\\) thresholding at zero. This leaves the size of the volume unchanged ([32x32x12]). -- POOL layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in volume such as [16x16x12]. -- FC (i.e. fully-connected) layer will compute the class scores, resulting in volume of size [1x1x10], where each of the 10 numbers correspond to a class score, such as among the 10 categories of CIFAR-10. As with ordinary Neural Networks and as the name implies, each neuron in this layer will be connected to all the numbers in the previous volume. +- INPUT 입력 이미지가 가로 32, 세로 32, 그리고 RGB 채널을 가지는 경우 입력의 크기는 [32x32x3]. +- CONV 레이어는 입력 이미지의 일부 영역과 연결되어 있으며, 이 연결된 영역과 자신의 가중치의 내적 연산 (dot product)을 계산하게 된다. 결과 볼륨은 [32x32x12]와 같은 크기를 갖게 된다. +- RELU 레이어는 max(0,x)와 같이 각 요소에 적용되는 액티베이션 함수 (activation function)이다. 이 레이어는 볼륨의 크기를 변화시키지 않는다 ([32x32x12]). +- POOL 레이어는 (가로, 세로) 차원에 대해 다운샘플링 (downsampling)을 수행해 [16x16x12]와 같이 줄어든 볼륨을 출력한다. +- FC (fully-connected) 레이어는 클래스 점수들을 계산해 [1x1x10]의 크기를 갖는 볼륨을 출력한다. 10개 숫자들은 10개 카테고리에 대한 클래스 점수에 해당한다. 레이어의 이름에서 유추 가능하듯, 이 레이어는 이전 볼륨의 모든 요소와 연결되어 있다. -In this way, ConvNets transform the original image layer by layer from the original pixel values to the final class scores. Note that some layers contain parameters and other don't. In particular, the CONV/FC layers perform transformations that are a function of not only the activations in the input volume, but also of the parameters (the weights and biases of the neurons). On the other hand, the RELU/POOL layers will implement a fixed function. The parameters in the CONV/FC layers will be trained with gradient descent so that the class scores that the ConvNet computes are consistent with the labels in the training set for each image. 
+이와 같이, ConvNet은 픽셀 값으로 이뤄진 원본 이미지를 각 레이어를 거치며 클래스 점수로 변환 (transform) 시킨다. 한 가지 기억할 것은, 어떤 레이어는 모수 (parameter)를 갖지만 어떤 레이어는 모수를 갖지 않는다는 것이다. 특히 CONV/FC 레이어들은 단순히 입력 볼륨의 액티베이션만이 아니라 모수 (뉴런의 가중치와 바이어스)에도 의존하는 변환을 수행한다. 반면 RELU/POOL 레이어들은 고정된 함수이다. CONV/FC 레이어의 모수 (parameter)들은 각 이미지에 대한 클래스 점수가 해당 이미지의 레이블과 같아지도록 그라디언트 디센트 (gradient descent)로 학습된다. -In summary: +요약해보면: -- A ConvNet architecture is a list of Layers that transform the image volume into an output volume (e.g. holding the class scores) -- There are a few distinct types of Layers (e.g. CONV/FC/RELU/POOL are by far the most popular) -- Each Layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function -- Each Layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don't) -- Each Layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn't) +- ConvNet 아키텍쳐는 여러 레이어를 통해 입력 이미지 볼륨을 출력 볼륨 (예: 클래스 점수)으로 변환시켜 준다. +- ConvNet은 몇 가지 종류의 레이어로 구성되어 있다. CONV/FC/RELU/POOL 레이어가 현재 가장 많이 쓰인다. +- 각 레이어는 3차원의 입력 볼륨을 미분 가능한 함수를 통해 3차원 출력 볼륨으로 변환시킨다. +- 모수(parameter)가 있는 레이어도 있고 그렇지 않은 레이어도 있다 (FC/CONV는 모수를 갖고 있고, RELU/POOL 등은 모수가 없음). +- 초모수 (hyperparameter)가 있는 레이어도 있고 그렇지 않은 레이어도 있다 (CONV/FC/POOL 레이어는 초모수를 가지며 RELU는 가지지 않음).
- The activations of an example ConvNet architecture. The initial volume stores the raw image pixels and the last volume stores the class scores. Each volume of activations along the processing path is shown as a column. Since it's difficult to visualize 3D volumes, we lay out each volume's slices in rows. The last layer volume holds the scores for each class, but here we only visualize the sorted top 5 scores, and print the labels of each one. The full web-based demo is shown in the header of our website. The architecture shown here is a tiny VGG Net, which we will discuss later. + ConvNet 아키텍쳐의 액티베이션 (activation) 예제. 첫 볼륨은 로우 이미지(raw image)를 다루며, 마지막 볼륨은 클래스 점수들을 출력한다. 입/출력 사이의 액티베이션들은 그림의 각 열에 나타나 있다. 3차원 볼륨을 시각적으로 나타내기가 어렵기 때문에 각 볼륨의 슬라이스들을 행으로 펼쳐서 나타냈다. 마지막 레이어는 모든 클래스에 대한 점수를 나타내지만 여기에서는 상위 5개 클래스에 대한 점수와 레이블만 표시했다. 전체 웹 데모는 우리의 웹사이트 상단에 있다. 여기에서 사용된 아키텍쳐는 뒤에서 다룰 작은 VGG Net이다.
-We now describe the individual layers and the details of their hyperparameters and their connectivities. +이제 각각의 레이어에 대해 초모수 (hyperparameter)나 연결성 (connectivity) 등의 세부 사항들을 알아보도록 하자. -#### Convolutional Layer -The Conv layer is the core building block of a Convolutional Network, and its output volume can be interpreted as holding neurons arranged in a 3D volume. We now discuss the details of the neuron connectivities, their arrangement in space, and their parameter sharing scheme. +#### 컨볼루셔널 레이어 (이하 CONV) +CONV 레이어는 ConvNet을 이루는 핵심 요소이다. CONV 레이어의 출력은 3차원으로 정렬된 뉴런들로 해석될 수 있다. 이제부터는 뉴런들의 연결성 (connectivity), 그들의 공간상의 배치, 그리고 모수 공유 (parameter sharing)에 대해 알아보자. -**Overview and Intuition.** The CONV layer's parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume, producing a 2-dimensional activation map of that filter. As we slide the filter, across the input, we are computing the dot product between the entries of the filter and the input. Intuitively, the network will learn filters that activate when they see some specific type of feature at some spatial position in the input. Stacking these activation maps for all filters along the depth dimension forms the full output volume. Every entry in the output volume can thus also be interpreted as an output of a neuron that looks at only a small region in the input and shares parameters with neurons in the same activation map (since these numbers all result from applying the same filter). We now dive into the details of this process. +**개요 및 직관적인 설명.** CONV 레이어의 모수(parameter)들은 일련의 학습 가능한 필터들로 이뤄져 있다. 각 필터는 가로/세로 차원으로는 작지만 깊이 (depth) 차원으로는 전체 깊이를 아우른다. 포워드 패스 (forward pass) 때에는 각 필터를 입력 볼륨의 가로/세로 차원으로 슬라이딩 시키며 (정확히는 convolve 시키며) 2차원의 액티베이션 맵 (activation map)을 생성한다. 필터를 입력 위로 슬라이딩 시킬 때, 필터와 입력의 요소들 사이의 내적 연산 (dot product)이 이뤄진다. 직관적으로 설명하면, 이 신경망은 입력의 특정 위치의 특정 패턴에 대해 반응하는 (activate) 필터를 학습한다. 이런 액티베이션 맵 (activation map)을 깊이 (depth) 차원을 따라 쌓은 것이 곧 출력 볼륨이 된다. 그러므로 출력 볼륨의 각 요소들은 입력의 작은 영역만을 취급하고, 같은 액티베이션 맵 내의 뉴런들은 같은 모수들을 공유한다 (같은 필터를 적용한 결과이므로). 이제 이 과정에 대해 좀 더 깊이 파헤쳐보자. -**Local Connectivity.** When dealing with high-dimensional inputs such as images, as we saw above it is impractical to connect neurons to all neurons in the previous volume. Instead, we will connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the **receptive field** of the neuron. The extent of the connectivity along the depth axis is always equal to the depth of the input volume. It is important to note this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: The connections are local in space (along width and height), but always full along the entire depth of the input volume. +**로컬 연결성 (Local connectivity).** 이미지와 같은 고차원 입력을 다룰 때에는, 현재 레이어의 한 뉴런을 이전 볼륨의 모든 뉴런들과 연결하는 것이 비실용적이다. 대신에 우리는 레이어의 각 뉴런을 입력 볼륨의 로컬한 영역 (local region)에만 연결할 것이다. 이 연결되는 영역의 공간적 크기를 리셉티브 필드 (receptive field)라고 부르며, 이는 초모수 (hyperparameter)이다. 깊이 차원 측면에서는 항상 입력 볼륨의 총 깊이를 다룬다 (가로/세로는 작은 영역을 보지만 깊이는 전체를 본다는 뜻). 공간적 차원 (가로/세로)과 깊이 차원을 다루는 방식이 다르다는 걸 기억하자. -*Example 1*. For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). 
If the receptive field is of size 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5\*5\*3 = 75 weights. Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume. +*예제 1*. 예를 들어 입력 볼륨의 크기가 (CIFAR-10의 RGB 이미지와 같이) [32x32x3]이라고 하자. 만약 리셉티브 필드의 크기가 5x5라면, CONV 레이어의 각 뉴런은 입력 볼륨의 [5x5x3] 크기 영역에 대한 가중치 (weight)를 갖게 된다 (총 5\*5\*3=75개의 가중치). 입력 볼륨 (RGB 이미지)의 깊이가 3이므로 마지막 숫자가 3이 된다는 것을 기억하자. -*Example 2*. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3\*3\*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20). +*예제 2*. 입력 볼륨의 크기가 [16x16x20]이라고 하자. 3x3 크기의 리셉티브 필드를 사용하면 CONV 레이어의 각 뉴런은 입력 볼륨과 3\*3\*20=180개의 연결을 갖게 된다. 이번에도 연결이 공간상으로는 로컬이지만 (3x3), 입력의 깊이 (20)는 전부 아우른다는 것을 기억하자.
- Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of neurons in the first Convolutional layer. Each neuron in the convolutional layer is connected only to a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note, there are multiple neurons (5 in this example) along the depth, all looking at the same region in the input - see discussion of depth columns in text below. Right: The neurons from the Neural Network chapter remain unchanged: They still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be local spatially. + 좌: 입력 볼륨 (붉은색, 32x32x3 크기의 CIFAR-10 이미지)과 첫 번째 컨볼루션 레이어 볼륨. 컨볼루션 레이어의 각 뉴런은 입력 볼륨의 일부 영역에만 연결된다 (가로/세로 공간 차원으로는 일부 연결, 깊이 (컬러 채널) 차원은 모두 연결). 컨볼루션 레이어의 깊이 차원의 여러 뉴런 (그림에서 5개)들이 모두 입력의 같은 영역을 처리한다는 것을 기억하자 (깊이 차원과 관련해서는 아래에서 더 자세히 알아볼 것임). 우: 입력의 일부 영역에만 연결된다는 점을 제외하고는, 이전 신경망 챕터에서 다뤄지던 뉴런들과 똑같이 내적 연산과 비선형 함수로 이뤄진다.
-**Spatial arrangement**. We have explained the connectivity of each neuron in the Conv Layer to the input volume, but we haven't yet discussed how many neurons there are in the output volume or how they are arranged. Three hyperparameters control the size of the output volume: the **depth, stride** and **zero-padding**. We discuss these next: +**공간적 배치**. 지금까지는 컨볼루션 레이어의 한 뉴런과 입력 볼륨의 연결에 대해 알아보았다. 그러나 아직 출력 볼륨에 얼마나 많은 뉴런들이 있는지, 그리고 그 뉴런들이 어떤 식으로 배치되는지는 다루지 않았다. 3개의 hyperparameter들이 출력 볼륨의 크기를 결정하게 된다. 그 3개 요소는 바로 **깊이, stride, 그리고 제로 패딩 (zero-padding)** 이다. 이들에 대해 알아보자: -1. First, the **depth** of the output volume is a hyperparameter that we can pick; It controls the number of neurons in the Conv layer that connect to the same region of the input volume. This is analogous to a regular Neural Network, where we had multiple neurons in a hidden layer all looking at the exact same input. As we will see, all of these neurons will learn to activate for different features in the input. For example, if the first Convolutional Layer takes as input the raw image, then different neurons along the depth dimension may activate in presence of various oriented edged, or blobs of color. We will refer to a set of neurons that are all looking at the same region of the input as a **depth column**. -2. Second, we must specify the **stride** with which we allocate depth columns around the spatial dimensions (width and height). When the stride is 1, then we will allocate a new depth column of neurons to spatial positions only 1 spatial unit apart. This will lead to heavily overlapping receptive fields between the columns, and also to large output volumes. Conversely, if we use higher strides then the receptive fields will overlap less and the resulting output volume will have smaller dimensions spatially. -3. As we will soon see, sometimes it will be convenient to pad the input with zeros spatially on the border of the input volume. The size of this **zero-padding** is a hyperparameter. The nice feature of zero padding is that it will allow us to control the spatial size of the output volumes. In particular, we will sometimes want to exactly preserve the spatial size of the input volume. +1. 먼저, 출력 볼륨의 **깊이** 는 우리가 결정할 수 있는 요소이다. 컨볼루션 레이어의 뉴런들 중 입력 볼륨 내 동일한 영역과 연결된 뉴런의 개수를 의미한다. 마치 일반 신경망에서 히든 레이어 내의 모든 뉴런들이 같은 입력값과 연결된 것과 비슷하다. 앞으로 살펴보겠지만, 이 뉴런들은 입력에 대해 서로 다른 특징 (feature)에 활성화된다 (activate). 예를 들어, 이미지를 입력으로 받는 첫 번째 컨볼루션 레이어의 경우, 깊이 축에 따른 각 뉴런들은 이미지의 서로 다른 엣지, 색깔, 블롭(blob) 등에 활성화된다. 앞으로는 입력의 서로 같은 영역을 바라보는 뉴런들을 **깊이 컬럼 (depth column)**이라고 부르겠다. +2. 두 번째로 어떤 간격 (가로/세로의 공간적 간격)으로 깊이 컬럼을 할당할지를 의미하는 **stride**를 결정해야 한다. 만약 stride가 1이라면, 깊이 컬럼을 1칸마다 할당하게 된다 (한 칸 간격으로 깊이 컬럼 할당). 이럴 경우 각 깊이 컬럼들은 receptive field 상 넓은 영역이 겹치게 되고, 출력 볼륨의 크기도 매우 커지게 된다. 반대로, 큰 stride를 사용한다면 receptive field끼리 좁은 영역만 겹치게 되고 출력 볼륨도 작아지게 된다 (깊이는 작아지지 않고 가로/세로만 작아지게 됨). +3. 조만간 살펴보겠지만, 입력 볼륨의 가장자리를 0으로 패딩하는 것이 좋을 때가 있다. 이 **zero-padding**은 hyperparamter이다. zero-padding을 사용할 때의 장점은, 출력 볼륨의 공간적 크기(가로/세로)를 조절할 수 있다는 것이다. 특히 입력 볼륨의 공간적 크기를 유지하고 싶은 경우 (입력의 가로/세로 = 출력의 가로/세로) 사용하게 된다. -We can compute the spatial size of the output volume as a function of the input volume size (\\(W\\)), the receptive field size of the Conv Layer neurons (\\(F\\)), the stride with which they are applied (\\(S\\)), and the amount of zero padding used (\\(P\\)) on the border. You can convince yourself that the correct formula for calculating how many neurons "fit" is given by \\((W - F + 2P)/S + 1\\). 
If this number is not an integer, then the strides are set incorrectly and the neurons cannot be tiled so that they "fit" across the input volume neatly, in a symmetric way. An example might help to get intuitions for this formula: +출력 볼륨의 공간적 크기 (가로/세로)는 입력 볼륨 크기 ($$W$$), CONV 레이어의 리셉티브 필드 크기($$F$$)와 stride ($$S$$), 그리고 제로 패딩 (zero-padding) 사이즈 ($$P$$)의 함수로 계산할 수 있다. 즉, $$(W - F + 2P)/S + 1$$을 통해 알맞은 크기를 계산하면 된다. 만약 이 값이 정수가 아니라면 stride가 잘못 정해진 것이다. 이 경우 뉴런들이 대칭을 이루며 깔끔하게 배치되는 것이 불가능하다. 다음 예제를 보면 이 수식을 좀 더 직관적으로 이해할 수 있을 것이다:
- Illustration of spatial arrangement. In this example there is only one spatial dimension (x-axis), one neuron with a receptive field size of F = 3, the input size is W = 5, and there is zero padding of P = 1. Left: The neuron strided across the input in stride of S = 1, giving output of size (5 - 3 + 2)/1+1 = 5. Right: The neuron uses stride of S = 2, giving output of size (5 - 3 + 2)/2+1 = 3. Notice that stride S = 3 could not be used since it wouldn't fit neatly across the volume. In terms of the equation, this can be determined since (5 - 3 + 2) = 4 is not divisible by 3. -
The neuron weights are in this example [1,0,-1] (shown on very right), and its bias is zero. These weights are shared across all yellow neurons (see parameter sharing below). + 공간적 배치에 관한 그림. 이 예제에서는 가로/세로 공간적 차원 중 하나만 고려한다 (x축). 리셉티브 필드 F=3, 입력 사이즈 W=5, 제로 패딩 P=1. 좌: 뉴런들이 stride S=1을 갖고 배치된 경우, 출력 사이즈는 (5-3+2)/1 + 1 = 5이다. 우: stride S=2인 경우 (5-3+2)/2 + 1 = 3의 출력 사이즈를 가진다. Stride S=3은 사용할 수 없다. (5-3+2) = 4가 3으로 나눠지지 않기 때문에 출력 볼륨의 뉴런들이 깔끔히 배치되지 않는다. + 이 예에서 뉴런들의 가중치는 [1,0,-1] (가장 오른쪽)이며 bias는 0이다. 이 가중치는 노란 뉴런들 모두에게 공유된다 (아래에서 parameter sharing에 대해 살펴보라).
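위 그림의 1차원 예제는 numpy로 직접 확인해 볼 수 있다 (아래는 설명을 위한 스케치로, 패딩이 포함된 입력 값들은 임의로 가정한 것이다):

~~~python
import numpy as np

# W=5, F=3, P=1 설정. 양 끝의 0은 제로 패딩이며, 가운데 값들은 가정한 입력이다.
x = np.array([0, 1, 2, -1, 1, -3, 0])
w = np.array([1, 0, -1])   # 그림과 같은 공유 가중치, bias = 0

for S in (1, 2):
    out = [np.dot(w, x[i:i + 3]) for i in range(0, len(x) - 2, S)]
    print(S, out)          # S=1이면 출력 5개, S=2면 출력 3개
~~~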
-*Use of zero-padding*. In the example above on left, note that the input dimension was 5 and the output dimension was equal: also 5. This worked out so because our receptive fields were 3 and we used zero padding of 1. If there was no zero-padding used, then the output volume would have had spatial dimension of only 3, because that it is how many neurons would have "fit" across the original input. In general, setting zero padding to be \\(P = (F - 1)/2\\) when the stride is \\(S = 1\\) ensures that the input volume and output volume will have the same size spatially. It is very common to use zero-padding in this way and we will discuss the full reasons when we talk more about ConvNet architectures. +*제로 패딩 사용*. 위 예제의 왼쪽 그림에서, 입력과 출력의 차원이 모두 5라는 것을 기억하자. 리셉티브 필드가 3이고 제로 패딩이 1이기 때문에 이런 결과가 나오는 것이다. 만약 제로 패딩이 사용되지 않았다면 출력 볼륨의 크기는 3이 될 것이다. 일반적으로, 제로 패딩을 $$P = (F - 1)/2$$, stride $$S = 1$$로 세팅하면 입/출력의 크기가 같아지게 된다. 이런 방식으로 사용하는 것이 일반적이며, 앞으로 컨볼루션 신경망에 대해 다루면서 그 이유에 대해 더 알아볼 것이다. -*Constraints on strides*. Note that the spatial arrangement hyperparameters have mutual constraints. For example, when the input has size \\(W = 10\\), no zero-padding is used \\(P = 0\\), and the filter size is \\(F = 3\\), then it would be impossible to use stride \\(S = 2\\), since \\((W - F + 2P)/S + 1 = (10 - 3 + 0) / 2 + 1 = 4.5\\), i.e. not an integer, indicating that the neurons don't "fit" neatly and symmetrically across the input. Therefore, this setting of the hyperparameters is considered to be invalid, and a ConvNet library would likely throw an exception. As we will see in the ConvNet architectures section, sizing the ConvNets appropriately so that all the dimensions "work out" can be a real headache, which the use of zero-padding and some design guidelines will significantly alleviate. +*Stride에 대한 constraints*. 공간적 배치와 관련된 hyperparameter들 사이에는 서로 제약 (constraint)이 존재한다는 것을 기억하자. 예를 들어, 입력 사이즈가 $$W=10$$이고 제로 패딩을 사용하지 않으며 ($$P=0$$) 필터 사이즈가 $$F=3$$이라면, stride $$S=2$$를 사용하는 것이 불가능하다. $$(W - F + 2P)/S + 1 = (10 - 3 + 0) / 2 + 1 = 4.5$$이 정수가 아니기 때문이다. 그러므로 hyperparameter를 이런 식으로 설정하면 컨볼루션 신경망 관련 라이브러리들은 exception을 낸다. 컨볼루션 신경망의 구조 관련 섹션에서 확인하겠지만, 전체 신경망이 잘 돌아가도록 이런 숫자들을 설정하는 과정은 매우 골치 아프다. 제로 패딩이나 다른 신경망 디자인 비법들을 사용하면 훨씬 수월하게 진행할 수 있다. -*Real-world example*. The [Krizhevsky et al.](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size \\(F = 11\\), stride \\(S = 4\\) and no zero padding \\(P = 0\\). Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of \\(K = 96\\), the Conv layer output volume had size [55x55x96]. Each of the 55\*55\*96 neurons in this volume was connected to a region of size [11x11x3] in the input volume. Moreover, all 96 neurons in each depth column are connected to the same [11x11x3] region of the input, but of course with different weights. +*실제 예제*. 2012년 이미지넷 대회에서 우승한 [Krizhevsky et al.](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) 의 모델의 경우 [227x227x3] 크기의 이미지를 입력으로 받는다. 첫 번째 컨볼루션 레이어에서는 리셉티브 필드 $$F=11$$, stride $$S=4$$를 사용했고 제로 패딩은 사용하지 않았다 ($$P=0$$). (227 - 11)/4 + 1 = 55이고 컨볼루션 레이어의 깊이는 $$K=96$$이므로 이 컨볼루션 레이어의 크기는 [55x55x96]이 된다. 각각의 55\*55\*96개 뉴런들은 입력 볼륨의 [11x11x3] 크기의 영역과 연결되어 있다. 그리고 각 깊이 컬럼의 96개 뉴런들은 입력 볼륨의 같은 [11x11x3] 영역에 서로 다른 가중치를 가지고 연결된다. 
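위의 출력 크기 공식과 stride constraint는 다음과 같은 간단한 함수로 확인해 볼 수 있다 (함수 이름은 설명을 위해 임의로 정한 것이다):

~~~python
def conv_output_size(W, F, S, P):
    """(W - F + 2P)/S + 1 공식으로 출력의 가로/세로 크기를 구한다."""
    n = W - F + 2 * P
    if n % S != 0:
        # 정수로 나누어떨어지지 않으면 뉴런들을 대칭적으로 배치할 수 없다
        raise ValueError("잘못된 hyperparameter 설정입니다.")
    return n // S + 1

print(conv_output_size(W=227, F=11, S=4, P=0))  # 55 (위의 Krizhevsky et al. 예제)
print(conv_output_size(W=10, F=3, S=2, P=0))    # ValueError: (10-3)/2+1 = 4.5
~~~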
-**Parameter Sharing.** Parameter sharing scheme is used in Convolutional Layers to control the number of parameters. Using the real-world example above, we see that there are 55\*55\*96 = 290,400 neurons in the first Conv Layer, and each has 11\*11\*3 = 363 weights and 1 bias. Together, this adds up to 290400 * 364 = 105,705,600 parameters on the first layer of the ConvNet alone. Clearly, this number is very high. +**파라미터 공유**. 파라미터 공유 기법은 컨볼루션 레이어의 파라미터 개수를 조절하기 위해 사용된다. 위의 실제 예제에서 보았듯, 첫 번째 컨볼루션 레이어에는 55\*55\*96 = 290,400 개의 뉴런이 있고 각각의 뉴런은 11\*11\*3 = 363개의 가중치와 1개의 바이어스를 가진다. 첫 번째 컨볼루션 레이어만 따져도 총 파라미터 개수는 290400 \* 364 = 105,705,600개가 된다. 분명히 이 숫자는 너무 크다. -It turns out that we can dramatically reduce the number of parameters by making one reasonable assumption: That if one patch feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2). In other words, denoting a single 2-dimensional slice of depth as a **depth slice** (e.g. a volume of size [55x55x96] has 96 depth slices, each of size [55x55]), we are going to constrain the neurons in each depth slice to use the same weights and bias. With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique set of weights (one for each depth slice), for a total of 96\*11\*11\*3 = 34,848 unique weights, or 34,944 parameters (+96 biases). Alternatively, all 55*55 neurons in each depth slice will now be using the same parameters. In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice. +사실 적절한 가정을 통해 파라미터 개수를 크게 줄이는 것이 가능하다: (x,y)에서 어떤 patch feature가 유용하게 사용되었다면, 이 feature는 다른 위치 (x2,y2)에서도 유용하게 사용될 수 있다. 3차원 볼륨의 한 슬라이스 (깊이 차원으로 자른 2차원 슬라이스)를 **depth slice**라고 하자 ([55x55x96] 사이즈의 볼륨은 각각 [55x55]의 크기를 가진 96개의 depth slice임). 앞으로는 각 depth slice 내의 뉴런들이 같은 가중치와 바이어스를 가지도록 제한할 것이다. 이런 파라미터 공유 기법을 사용하면, 예제의 첫 번째 컨볼루션 레이어는 (depth slice당 하나씩) 96개의 고유한 가중치 집합을 가지며, 총 96\*11\*11\*3 = 34,848개의 고유한 가중치, 바이어스까지 합치면 34,944개의 파라미터를 갖게 된다. 바꿔 말하면, 각 depth slice에 존재하는 55\*55개의 뉴런들이 모두 같은 파라미터를 사용하게 되는 것이다. 실제로는 backpropagation 과정에서 각 depth slice 내의 모든 뉴런들이 가중치에 대한 gradient를 계산하겠지만, 가중치를 업데이트할 때에는 depth slice별로 이 gradient들을 합해 slice당 하나의 가중치 집합만 갱신한다. -Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can in each depth slice be computed as a **convolution** of the neuron's weights with the input volume (Hence the name: Convolutional Layer). Therefore, it is common to refer to the sets of weights as a **filter** (or a **kernel**), which is convolved with the input. The result of this convolution is an *activation map* (e.g. of size [55x55]), and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output volume (e.g. [55x55x96]). +한 depth slice 내의 모든 뉴런들이 같은 가중치 벡터를 갖기 때문에 컨볼루션 레이어의 forward pass는 입력 볼륨과 가중치 간의 **컨볼루션**으로 계산될 수 있다 (컨볼루션 레이어라는 이름이 붙은 이유). 그러므로 컨볼루션 레이어의 가중치는 **필터(filter)** 또는 **커널(kernel)**이라고 부른다. 컨볼루션의 결과물은 **액티베이션 맵(activation map, [55x55] 사이즈)** 이 되며, 서로 다른 필터의 액티베이션 맵들을 깊이 차원을 따라 쌓으면 최종 출력 볼륨 ([55x55x96] 사이즈)이 된다.
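파라미터 공유 유무에 따른 개수 차이는 간단히 계산해 볼 수 있다 (본문의 AlexNet 예제 수치를 그대로 옮긴 스케치이다):

~~~python
# 첫 번째 CONV 레이어의 파라미터 개수 비교
neurons = 55 * 55 * 96             # 출력 볼륨의 뉴런 수: 290,400
weights_per_neuron = 11 * 11 * 3   # 뉴런 하나의 가중치 수: 363 (+ bias 1)

no_sharing = neurons * (weights_per_neuron + 1)  # 105,705,600
with_sharing = 96 * weights_per_neuron + 96      # 34,944 (필터 96개 + bias 96개)
print(no_sharing, with_sharing)
~~~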
- Example filters learned by Krizhevsky et al. Each of the 96 filters shown here is of size [11x11x3], and each one is shared by the 55*55 neurons in one depth slice. Notice that the parameter sharing assumption is relatively reasonable: If detecting a horizontal edge is important at some location in the image, it should intuitively be useful at some other location as well due to the translationally-invariant structure of images. There is therefore no need to relearn to detect a horizontal edge at every one of the 55*55 distinct locations in the Conv layer output volume. + Krizhevsky et al. 에서 학습된 필터의 예. 96개의 필터 각각은 [11x11x3] 사이즈이며, 하나의 depth slice 내 55*55개 뉴런들이 이 필터들을 공유한다. 파라미터 공유 가정이 상당히 합리적이라는 것을 알 수 있다: 만약 이미지의 특정 위치에서 가로 엣지 (edge)를 검출하는 것이 중요했다면, 이미지의 다른 위치에서도 같은 특성이 중요할 수 있다 (이미지의 translationally-invariant한 특성 때문). 그러므로 55*55개 뉴런 각각에 대해 가로 엣지 검출 필터를 재학습 할 필요가 없다.
-Note that sometimes the parameter sharing assumption may not make sense. This is especially the case when the input images to a ConvNet have some specific centered structure, where we should expect, for example, that completely different features should be learned on one side of the image than another. One practical example is when the input are faces that have been centered in the image. You might expect that different eye-specific or hair-specific features could (and should) be learned in different spatial locations. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a **Locally-Connected Layer**. +가끔은 파라미터 공유에 대한 가정이 부적절할 수도 있다. 특히 입력 이미지가 중심을 기준으로 찍힌 경우 (예를 들면 이미지 중앙에 얼굴이 있는 이미지), 이미지의 각 영역에 대해 완전히 다른 feature들이 학습되어야 할 수 있다. 눈과 관련된 feature나 머리카락과 관련된 feature 등은 서로 다른 영역에서 학습될 것이다. 이런 경우에는 파라미터 공유 기법을 접어두고 대신 **Locally-Connected Layer**라는 레이어를 사용하는 것이 좋다. -**Numpy examples.** To make the discussion above more concrete, lets express the same ideas but in code and with a specific example. Suppose that the input volume is a numpy array `X`. Then: +**Numpy 예제.** 위에서 다룬 것들을 더 확실히 알아보기 위해 코드를 작성해보자. 입력 볼륨을 numpy 배열 `X`라고 하면: +- `(x,y)`위치에서의 *depth column*은 액티베이션 `X[x,y,:]`이 된다. +- depth `d`에서의 *depth slice*, 또는 *액티베이션 맵 (activation map)*은 `X[:,:,d]`가 된다. -- A *depth column* at position `(x,y)` would be the activations `X[x,y,:]`. -- A *depth slice*, or equivalently an *activation map* at depth `d` would be the activations `X[:,:,d]`. -*Conv Layer Example*. Suppose that the input volume `X` has shape `X.shape: (11,11,4)`. Suppose further that we use no zero padding (\\(P = 0\\)), that the filter size is \\(F = 5\\), and that the stride is \\(S = 2\\). The output volume would therefore have spatial size (11-5)/2+1 = 4, giving a volume with width and height of 4. The activation map in the output volume (call it `V`), would then look as follows (only some of the elements are computed in this example): +*컨볼루션 레이어 예제*. 입력 볼륨 `X`의 모양이 `X.shape: (11,11,4)`이고 제로 패딩은 사용하지 않으며 ($$P = 0$$) 필터 크기는 $$F = 5$$, stride $$S = 2$$라고 하자. 출력 볼륨의 spatial 크기 (가로/세로)는 (11-5)/2 + 1 = 4가 된다. 출력 볼륨의 액티베이션 맵 (`V`라고 하자)은 아래와 같을 것이다 (아래에는 일부 요소만 나타냄). - `V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0` - `V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0` - `V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0` - `V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0` -Remember that in numpy, the operation `*` above denotes elementwise multiplication between the arrays. Notice also that the weight vector `W0` is the weight vector of that neuron and `b0` is the bias. Here, `W0` is assumed to be of shape `W0.shape: (5,5,4)`, since the filter size is 5 and the depth of the input volume is 4. Notice that at each point, we are computing the dot product as seen before in ordinary neural networks. Also, we see that we are using the same weight and bias (due to parameter sharing), and where the dimensions along the width are increasing in steps of 2 (i.e. the stride). To construct a second activation map in the output volume, we would have: +Numpy에서 `*`연산은 두 배열 간의 elementwise 곱셈이라는 것을 기억하자. 또한 `W0`는 가중치 벡터이고 `b0`은 바이어스라는 것도 기억하자. 여기에서 `W0`의 모양은 `W0.shape: (5,5,4)`라고 가정하자 (필터 사이즈는 5, depth는 4). 각 위치에서 일반 신경망에서와 같이 내적 연산을 수행하게 된다. 또한 파라미터 공유 기법으로 같은 가중치, 바이어스가 사용되고 가로 차원에 대해 2 (stride)칸씩 옮겨가며 연산이 이뤄진다는 것을 볼 수 있다. 
출력 볼륨의 두 번째 액티베이션 맵을 구성하는 방법은: - `V[0,0,1] = np.sum(X[:5,:5,:] * W1) + b1` - `V[1,0,1] = np.sum(X[2:7,:5,:] * W1) + b1` - `V[2,0,1] = np.sum(X[4:9,:5,:] * W1) + b1` - `V[3,0,1] = np.sum(X[6:11,:5,:] * W1) + b1` - `V[0,1,1] = np.sum(X[:5,2:7,:] * W1) + b1` (example of going along y) - `V[2,3,1] = np.sum(X[4:9,6:11,:] * W1) + b1` (or along both) -where we see that we are indexing into the second depth dimension in `V` (at index 1) because we are computing the second activation map, and that a different set of parameters (`W1`) is now used. In the example above, we are for brevity leaving out some of the other operatations the Conv Layer would perform to fill the other parts of the output array `V`. Additionally, recall that these activation maps are often followed elementwise through an activation function such as ReLU, but this is not shown here. +위 예제는 `V`의 두 번째 depth 차원 (인덱스 1)을 인덱싱하고 있다. 두 번째 액티베이션 맵을 계산하므로, 여기에서 사용된 가중치는 이전 예제와 달리 `W1`이다. 보통 액티베이션 맵이 구해진 뒤 ReLU와 같은 elementwise 연산이 가해지는 경우가 많은데, 위 예제에서는 다루지 않았다. -**Summary**. To summarize, the Conv Layer: +**요약**. CONV 레이어를 요약하면 다음과 같다: -- Accepts a volume of size \\(W\_1 \times H\_1 \times D\_1\\) -- Requires four hyperparameters: - - Number of filters \\(K\\), - - their spatial extent \\(F\\), - - the stride \\(S\\), - - the amount of zero padding \\(P\\). -- Produces a volume of size \\(W\_2 \times H\_2 \times D\_2\\) where: - - \\(W\_2 = (W\_1 - F + 2P)/S + 1\\) - - \\(H\_2 = (H\_1 - F + 2P)/S + 1\\) (i.e. width and height are computed equally by symmetry) - - \\(D\_2 = K\\) -- With parameter sharing, it introduces \\(F \cdot F \cdot D\_1\\) weights per filter, for a total of \\((F \cdot F \cdot D\_1) \cdot K\\) weights and \\(K\\) biases. -- In the output volume, the \\(d\\)-th depth slice (of size \\(W\_2 \times H\_2\\)) is the result of performing a valid convolution of the \\(d\\)-th filter over the input volume with a stride of \\(S\\), and then offset by \\(d\\)-th bias. +- $$W_1 \times H_1 \times D_1$$ 크기의 볼륨을 입력받는다. +- 4개의 hyperparameter가 필요하다: + - 필터 개수 $$K$$, + - 필터의 가로/세로 spatial 크기 $$F$$, + - Stride $$S$$, + - 제로 패딩 $$P$$. +- $$W_2 \times H_2 \times D_2$$ 크기의 출력 볼륨을 생성한다: + - $$W_2 = (W_1 - F + 2P)/S + 1$$ + - $$H_2 = (H_1 - F + 2P)/S + 1$$ (i.e. 가로/세로는 같은 방식으로 계산됨) + - $$D_2 = K$$ +- 파라미터 공유로 인해 필터당 $$F \cdot F \cdot D_1$$개의 가중치를 가져서 총 $$(F \cdot F \cdot D_1) \cdot K$$개의 가중치와 $$K$$개의 바이어스를 갖게 된다. +- 출력 볼륨에서 $$d$$번째 depth slice ($$W_2 \times H_2$$ 크기)는 입력 볼륨에 $$d$$번째 필터를 stride $$S$$만큼 옮겨가며 컨볼루션 한 뒤 $$d$$번째 바이어스를 더한 결과이다. -A common setting of the hyperparameters is \\(F = 3, S = 1, P = 1\\). However, there are common conventions and rules of thumb that motivate these hyperparameters. See the [ConvNet architectures](#architectures) section below. +흔히 쓰이는 hyperparameter 기본 세팅은 $$F = 3, S = 1, P = 1$$이다. 뒤에서 다룰 [ConvNet 구조](#architectures)에서 hyperparameter 세팅과 관련된 법칙이나 방식 등을 확인할 수 있다. -**Convolution Demo**. Below is a running demo of a CONV layer. Since 3D volumes are hard to visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output volume (in green)) are visualized with each depth slice stacked in rows. The input volume is of size \\(W\_1 = 5, H\_1 = 5, D\_1 = 3\\), and the CONV layer parameters are \\(K = 2, F = 3, S = 2, P = 1\\). That is, we have two filters of size \\(3 \times 3\\), and they are applied with a stride of 2. Therefore, the output volume size has spatial size (5 - 3 + 2)/2 + 1 = 3. 
Moreover, notice that a padding of \\(P = 1\\) is applied to the input volume, making the outer border of the input volume zero. The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias. +**컨볼루션 데모**. 아래는 컨볼루션 레이어 데모이다. 3차원 볼륨은 시각화하기 힘드므로 각 행마다 depth slice를 하나씩 배치했다. 각 볼륨은 입력 볼륨(파란색), 가중치 볼륨(빨간색), 출력 볼륨(녹색)으로 이뤄진다. 입력 볼륨의 크기는 $$W_1 = 5, H_1 = 5, D_1 = 3$$이고 컨볼루션 레이어의 파라미터들은 $$K = 2, F = 3, S = 2, P = 1$$이다. 즉, 2개의 $$3 \times 3$$ 크기의 필터가 각각 stride 2마다 적용된다. 그러므로 출력 볼륨의 spatial 크기 (가로/세로)는 (5 - 3 + 2)/2 + 1 = 3이다. 제로 패딩 $$P = 1$$ 이 적용되어 입력 볼륨의 가장자리가 모두 0으로 되어있다는 것을 확인할 수 있다. 아래의 데모는 출력 액티베이션(녹색)을 하나씩 순회하며, 하이라이트 표시된 입력(파란색)과 필터(빨간색)가 elementwise로 곱해진 뒤 하나로 더해지고 bias가 더해지는 것을 보여준다.
(컨볼루션 연산 과정을 보여주는 인터랙티브 데모가 들어가는 자리)
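위 데모에서 일어나는 연산은 다음과 같은 나이브한 구현으로 흉내 내볼 수 있다 (데모와 같은 설정을 가정한 numpy 스케치이며, 변수 이름은 임의로 정한 것이다):

~~~python
import numpy as np

# 데모와 같은 설정: W1=5, H1=5, D1=3, K=2, F=3, S=2, P=1
X = np.random.randn(5, 5, 3)                                # 입력 볼륨
Xp = np.pad(X, ((1, 1), (1, 1), (0, 0)), mode='constant')   # 제로 패딩 P=1
W = np.random.randn(2, 3, 3, 3)                             # 필터 2개, 각각 [3x3x3]
b = np.random.randn(2)                                      # 필터별 바이어스
F, S = 3, 2
out = np.zeros((3, 3, 2))                                   # (5 - 3 + 2)/2 + 1 = 3

for d in range(2):                                          # 각 필터 (depth slice)에 대해
    for i in range(3):
        for j in range(3):
            patch = Xp[i*S:i*S+F, j*S:j*S+F, :]             # 리셉티브 필드
            out[i, j, d] = np.sum(patch * W[d]) + b[d]      # elementwise 곱의 합 + bias
~~~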
-**Implementation as Matrix Multiplication**. Note that the convolution operation essentially performs dot products between the filters and local regions of the input. A common implementation pattern of the CONV layer is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply as follows: +**매트릭스 곱으로 구현**. 컨볼루션 연산은 필터와 이미지의 로컬한 영역간의 내적 연산을 한 것과 같다. 컨볼루션 레이어의 일반적인 구현 패턴은 이 점을 이용해 컨볼루션 레이어의 forward pass를 다음과 같이 하나의 큰 매트릭스 곱으로 계산하는 것이다: -1. The local regions in the input image are stretched out into columns in an operation commonly called **im2col**. For example, if the input is [227x227x3] and it is to be convolved with 11x11x3 filters at stride 4, then we would take [11x11x3] blocks of pixels in the input and stretch each block into a column vector of size 11\*11\*3 = 363. Iterating this process in the input at stride of 4 gives (227-11)/4+1 = 55 locations along both width and height, leading to an output matrix `X_col` of *im2col* of size [363 x 3025], where every column is a stretched out receptive field and there are 55*55 = 3025 of them in total. Note that since the receptive fields overlap, every number in the input volume may be duplicated in multiple distinct columns. -2. The weights of the CONV layer are similarly stretched out into rows. For example, if there are 96 filters of size [11x11x3] this would give a matrix `W_row` of size [96 x 363]. -3. The result of a convolution is now equivalent to performing one large matrix multiply `np.dot(W_row, X_col)`, which evaluates the dot product between every filter and every receptive field location. In our example, the output of this operation would be [96 x 3025], giving the output of the dot product of each filter at each location. -4. The result must finally be reshaped back to its proper output dimension [55x55x96]. +1. 이미지의 각 로컬 영역을 열 벡터로 stretch 한다 (이런 연산을 보통 **im2col** 이라고 부름). 예를 들어, 만약 [227x227x3] 사이즈의 입력을 11x11x3 사이즈의 필터로 stride 4마다 컨볼루션 한다면, 이미지에서 [11x11x3] 크기의 픽셀 블록을 가져와 11\*11\*3=363 크기의 열 벡터로 바꾸게 된다. 이 과정을 stride 4마다 하므로 가로, 세로에 대해 각각 (227-11)/4+1=55, 총 55\*55=3025 개 영역에 대해 반복하게 되고, 출력물인 `X_col`은 [363x3025]의 사이즈를 갖게 된다. 각각의 열 벡터는 리셉티브 필드를 1차원으로 stretch 한 것이고, 이 리셉티브 필드는 주위 리셉티브 필드들과 겹치므로 입력 볼륨의 여러 값들이 여러 출력 열벡터에 중복되어 나타날 수 있다. +2. 컨볼루션 레이어의 가중치는 비슷한 방식으로 행 벡터 형태로 stretch된다. 예를 들어 [11x11x3] 사이즈의 총 96개 필터가 있다면, [96x363] 사이즈의 `W_row`가 만들어진다. +3. 이제 컨볼루션 연산은 하나의 큰 매트릭스 연산 `np.dot(W_row, X_col)`를 계산하는 것과 같다. 이 연산은 모든 필터와 모든 리셉티브 필드 영역들 사이의 내적 연산을 하는 것과 같다. 우리의 예에서는 각 위치에 각 필터를 적용한 내적 결과인 [96x3025] 사이즈의 출력물이 얻어진다. +4. 마지막으로 결과물을 [55x55x96] 차원으로 다시 reshape 한다. -This approach has the downside that it can use a lot of memory, since some values in the input volume are replicated multiple times in `X_col`. However, the benefit is that there are many very efficient implementations of Matrix Multiplication that we can take advantage of (for example, in the commonly used [BLAS](http://www.netlib.org/blas/) API). Morever, the same *im2col* idea can be reused to perform the pooling operation, which we discuss next. +이 방식은 입력 볼륨의 여러 값들이 `X_col`에 여러 번 복사되기 때문에 메모리가 많이 사용된다는 단점이 있다. 그러나 매트릭스 연산과 관련된 많은 효율적 구현방식들을 사용할 수 있다는 장점도 있다 ([BLAS](http://www.netlib.org/blas/) API 가 하나의 예임). 뿐만 아니라 같은 *im2col* 아이디어는 풀링 연산에서 재활용 할 수도 있다 (뒤에서 다루게 된다). -**Backpropagation.** The backward pass for a convolution operation (for both the data and the weights) is also a convolution (but with spatially-flipped filters). This is easy to derive in the 1-dimensional case with a toy example (not expanded on for now). 
+**Backpropagation.** 컨볼루션 연산의 backward pass 역시 컨볼루션 연산이다 (가로/세로가 뒤집어진 필터를 사용한다는 차이점이 있음). 간단한 1차원 예제를 가지고 쉽게 확인해볼 수 있다. -#### Pooling Layer -It is common to periodically insert a Pooling layer in-between successive Conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting. The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2 downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation would in this case be taking a max over 4 numbers (little 2x2 region in some depth slice). The depth dimension remains unchanged. More generally, the pooling layer: +#### 풀링 레이어 (Pooling Layer) +ConvNet 구조 내에 컨볼루션 레이어들 중간중간에 주기적으로 풀링 레이어를 넣는 것이 일반적이다. 풀링 레이어가 하는 일은 네트워크의 파라미터의 개수나 연산량을 줄이기 위해 representation의 spatial한 사이즈를 줄이는 것이다. 이는 오버피팅을 조절하는 효과도 가지고 있다. 풀링 레이어는 MAX 연산을 각 depth slice에 대해 독립적으로 적용하여 spatial한 크기를 줄인다. 가장 흔한 형태는 2x2 사이즈의 필터를 stride 2로 적용하는 풀링 레이어로, 각 depth slice를 가로/세로축을 따라 1/2로 downsampling해 75%의 액티베이션을 버리게 된다. 이 경우 MAX 연산은 4개 숫자 중 최대값을 선택하게 된다 (같은 depth slice 내의 2x2 영역). Depth 차원은 변하지 않는다. 풀링 레이어의 특징들은 일반적으로 아래와 같다: -- Accepts a volume of size \\(W\_1 \times H\_1 \times D\_1\\) -- Requires three hyperparameters: - - their spatial extent \\(F\\), - - the stride \\(S\\), -- Produces a volume of size \\(W\_2 \times H\_2 \times D\_2\\) where: - - \\(W\_2 = (W\_1 - F)/S + 1\\) - - \\(H\_2 = (H\_1 - F)/S + 1\\) - - \\(D\_2 = D\_1\\) -- Introduces zero parameters since it computes a fixed function of the input -- Note that it is not common to use zero-padding for Pooling layers +- $$W_1 \times H_1 \times D_1$$ 사이즈의 입력을 받는다 +- 2가지 hyperparameter를 필요로 한다. + - Spatial extent $$F$$ + - Stride $$S$$ +- $$W_2 \times H_2 \times D_2$$ 사이즈의 볼륨을 만든다 + - $$W_2 = (W_1 - F)/S + 1$$ + - $$H_2 = (H_1 - F)/S + 1$$ + - $$D_2 = D_1$$ +- 입력에 대해 항상 같은 연산을 하므로 파라미터는 따로 존재하지 않는다 +- 풀링 레이어에는 보통 제로 패딩을 하지 않는다 -It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: A pooling layer with \\(F = 3, S = 2\\) (also called overlapping pooling), and more commonly \\(F = 2, S = 2\\). Pooling sizes with larger receptive fields are too destructive. +일반적으로 실전에서는 두 종류의 max 풀링 레이어만 널리 쓰인다. 하나는 overlapping 풀링이라고도 불리는 $$F = 3, S = 2$$ 이고 하나는 더 자주 쓰이는 $$F = 2, S = 2$$ 이다. 큰 리셉티브 필드에 대해서 풀링을 하면 보통 너무 많은 정보를 버리게 된다. -**General pooling**. In addition to max pooling, the pooling units can also perform other functions, such as *average pooling* or even *L2-norm pooling*. Average pooling was often used historically but has recently fallen out of favor compared to the max pooling operation, which has been shown to work better in practice. +**일반적인 풀링**. Max 풀링뿐 아니라 *average 풀링*, *L2-norm 풀링* 등 다른 연산으로 풀링할 수도 있다. Average 풀링은 과거에 많이 쓰였으나 최근에는 Max 풀링이 더 좋은 성능을 보이며 점차 쓰이지 않고 있다.
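예를 들어 가장 흔한 $$F = 2, S = 2$$ max 풀링은 다음과 같이 스케치해 볼 수 있다 (입력의 가로/세로가 2로 나누어떨어진다고 가정한 예시 코드이며, 함수 이름은 임의로 정한 것이다):

~~~python
import numpy as np

def max_pool_2x2(X):
    """F=2, S=2 max 풀링: 각 depth slice에 대해 독립적으로 2x2 영역의 최대값을 취한다."""
    H, W, D = X.shape
    out = np.zeros((H // 2, W // 2, D))
    for i in range(H // 2):
        for j in range(W // 2):
            out[i, j, :] = X[2*i:2*i+2, 2*j:2*j+2, :].max(axis=(0, 1))
    return out

X = np.random.randn(224, 224, 64)
print(max_pool_2x2(X).shape)  # (112, 112, 64) -- depth 차원은 유지된다
~~~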
- Pooling layer downsamples the volume spatially, independently in each depth slice of the input volume. Left: In this example, the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into output volume of size [112x112x64]. Notice that the volume depth is preserved. Right: The most common downsampling operation is max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken over 4 numbers (little 2x2 square). + 풀링 레이어는 입력 볼륨의 각 depth slice를 spatial하게 downsampling한다. 좌: 이 예제에서는 입력 볼륨이 [224x224x64]이며 필터 크기 2, stride 2로 풀링해 [112x112x64] 크기의 출력 볼륨을 만든다. 볼륨의 depth는 그대로 유지된다는 것을 기억하자. 우: 가장 널리 쓰이는 downsampling 연산은 max이며, 이를 max 풀링이라고 한다 (여기서는 stride 2 사용). 즉, 2x2 영역의 4개 숫자 중 최대값을 취하게 된다.
-**Backpropagation**. Recall from the backpropagation chapter that the backward pass for a max(x, y) operation has a simple interpretation as only routing the gradient to the input that had the highest value in the forward pass. Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called *the switches*) so that gradient routing is efficient during backpropagation. +**Backpropagation**. Backpropagation 챕터에서 max(x,y)의 backward pass는 그냥 forward pass에서 가장 큰 값을 가졌던 입력의 gradient를 보내는 것과 같다고 배운 것을 기억하자. 그러므로 forward pass 과정에서 보통 max 액티베이션의 위치 (*switch* 라고도 불림)를 저장해두었다가 backpropagation 때 사용한다. -**Recent developments**. +**최근의 발전된 내용들**. -- [Fractional Max-Pooling](http://arxiv.org/abs/1412.6071) suggests a method for performing the pooling operation with filters smaller than 2x2. This is done by randomly generating pooling regions with a combination of 1x1, 1x2, 2x1 or 2x2 filters to tile the input activation map. The grids are generated randomly on each forward pass, and at test time the predictions can be averaged across several grids. -- [Striving for Simplicity: The All Convolutional Net](http://arxiv.org/abs/1412.6806) proposes to discard the pooling layer in favor of architecture that only consists of repeated CONV layers. To reduce the size of the representation they suggest using larger stride in CONV layer once in a while. +- [Fractional Max-Pooling](http://arxiv.org/abs/1412.6071) 은 2x2보다 작은 필터들로 풀링하는 방식을 제안한다. 1x1, 1x2, 2x1, 2x2 크기의 필터들을 임의로 조합해 풀링한다. 매 forward pass마다 grid들이 랜덤하게 생성되고, 테스트 때에는 여러 grid들의 예측 점수들의 평균치를 사용하게 된다. +- [Striving for Simplicity: The All Convolutional Net](http://arxiv.org/abs/1412.6806) 라는 논문은 컨볼루션 레이어만 반복하며 풀링 레이어를 사용하지 않는 방식을 제안한다. Representation의 크기를 줄이기 위해 가끔씩 큰 stride를 가진 컨볼루션 레이어를 사용한다. -Due to the aggressive reduction in the size of the representation (which is helpful only for smaller datasets to control overfitting), the trend in the literature is towards discarding the pooling layer in modern ConvNets. +풀링 레이어가 보통 representation의 크기를 심하게 줄이기 때문에 (이런 효과는 작은 데이터셋에서만 오버피팅 방지 효과 등으로 인해 도움이 됨), 최근 추세는 점점 풀링 레이어를 사용하지 않는 쪽으로 발전하고 있다. -#### Normalization Layer -Many types of normalization layers have been proposed for use in ConvNet architectures, sometimes with the intentions of implementing inhibition schemes observed in the biological brain. However, these layers have recently fallen out of favor because in practice their contribution has been shown to be minimal, if any. For various types of normalizations, see the discussion in Alex Krizhevsky's [cuda-convnet library API](http://code.google.com/p/cuda-convnet/wiki/LayerParams#Local_response_normalization_layer_(same_map)). +#### Normalization 레이어 +실제 두뇌의 억제 메커니즘 구현 등을 위해 많은 종류의 normalization 레이어들이 제안되었다. 그러나 이런 레이어들이 실제로 주는 효과가 별로 없다는 것이 알려지면서 최근에는 거의 사용되지 않고 있다. Normalization에 대해 알고 싶다면 Alex Krizhevsky의 글을 읽어보기 바란다 [cuda-convnet library API](http://code.google.com/p/cuda-convnet/wiki/LayerParams#Local_response_normalization_layer_(same_map)). -#### Fully-connected layer -Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset. See the *Neural Network* section of the notes for more information. +#### Fully-connected 레이어 +Fully connected 레이어 내의 뉴런들은 일반 신경망 챕터에서 보았듯이 이전 레이어의 모든 액티베이션들과 연결되어 있다. 그러므로 Fully connected 레이어의 액티베이션은 매트릭스 곱을 한 뒤 바이어스를 더해 구할 수 있다. 
더 많은 정보를 위해 강의 노트의 "신경망" 섹션을 보기 바란다. -#### Converting FC layers to CONV layers +#### FC 레이어를 CONV 레이어로 변환하기 + +FC 레이어와 CONV 레이어의 차이점은, CONV 레이어는 입력의 일부 영역에만 연결되어 있고, CONV 볼륨의 많은 뉴런들이 파라미터를 공유한다는 것뿐이라는 것을 알아 둘 필요가 있다. 두 레이어 모두 내적 연산을 수행하므로 실제 함수 형태는 동일하다. 그러므로 두 레이어는 서로 변환하는 것이 가능하다: -It is worth noting that the only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters. However, the neurons in both layers still compute dot products, so their functional form is identical. Therefore, it turns out that it's possible to convert between FC and CONV layers: +- 모든 CONV 레이어는 동일한 forward 함수를 수행하는 FC 레이어 짝이 있다. 이 경우의 가중치 매트릭스는 몇몇 블록을 제외하고 모두 0으로 이뤄지며 (local connectivity: 입력의 일부 영역에만 연결된 특성), 이 블록들 중 여러 개는 같은 값을 지니게 된다 (파라미터 공유). -- For any CONV layer there is an FC layer that implements the same forward function. The weight matrix would be a large matrix that is mostly zero except for at certian blocks (due to local connectivity) where the weights in many of the blocks are equal (due to parameter sharing). -- Conversely, any FC layer can be converted to a CONV layer. For example, an FC layer with \\(K = 4096\\) that is looking at some input volume of size \\(7 \times 7 \times 512\\) can be equivalently expressed as a CONV layer with \\(F = 7, P = 0, S = 1, K = 4096\\). In other words, we are setting the filter size to be exactly the size of the input volume, and hence the output will simply be \\(1 \times 1 \times 4096\\) since only a single depth column "fits" across the input volume, giving identical result as the initial FC layer. +- 반대로, 모든 FC 레이어는 CONV 레이어로 변환될 수 있다. 예를 들어, $$7 \times 7 \times 512$$ 크기의 입력을 받고 $$K = 4096$$인 FC 레이어는 $$F = 7, P = 0, S = 1, K = 4096$$인 CONV 레이어로 표현 가능하다. 바꿔 말하면, 필터의 크기를 입력 볼륨의 크기와 동일하게 만들고 $$1 \times 1 \times 4096$$ 크기의 아웃풋을 출력할 수 있다. 각 depth에 대해 하나의 값만 구해지므로 (필터의 가로/세로가 입력 볼륨의 가로/세로와 같으므로) FC 레이어와 같은 결과를 얻게 된다. -**FC->CONV conversion**. Of these two conversions, the ability to convert an FC layer to a CONV layer is particularly useful in practice. Consider a ConvNet architecture that takes a 224x224x3 image, and then uses a series of CONV layers and POOL layers to reduce the image to an activations volume of size 7x7x512 (in an *AlexNet* architecture that we'll see later, this is done by use of 5 pooling layers that downsample the input spatially by a factor of two each time, making the final spatial size 224/2/2/2/2/2 = 7). From there, an AlexNet uses two FC layers of size 4096 and finally the last FC layers with 1000 neurons that compute the class scores. We can convert each of these three FC layers to CONV layers as described above: +**FC->CONV 변환**. 이 두 변환 중, FC 레이어를 CONV 레이어로의 변환은 실전에서 매우 유용하다. 224x224x3의 이미지를 입력으로 받고 일련의 CONV 레이어와 POOL 레이어를 이용해 7x7x512의 액티베이션을 만드는 컨볼루션넷 아키텍쳐를 생각해 보자 (뒤에서 살펴 볼 *AlexNet* 아키텍쳐에서는 입력의 spatial(가로/세로) 크기를 반으로 줄이는 풀링 레이어 5개를 사용해 7x7x512의 액티베이션을 만든다. 224/2/2/2/2/2 = 7이기 때문이다). AlexNet은 여기에 4096의 크기를 갖는 FC 레이어 2개와 클래스 스코어를 계산하는 1000개 뉴런으로 이뤄진 마지막 FC 레이어를 사용한다. 이 마지막 3개의 FC 레이어를 CONV 레이어로 변환하는 방법을 아래에서 배우게 된다: -- Replace the first FC layer that looks at [7x7x512] volume with a CONV layer that uses filter size \\(F = 7\\), giving output volume [1x1x4096]. 
-- Replace the second FC layer with a CONV layer that uses filter size \\(F = 1\\), giving output volume [1x1x4096]
-- Replace the last FC layer similarly, with \\(F=1\\), giving final output [1x1x1000]
+- [7x7x512]의 입력 볼륨을 받는 첫 번째 FC 레이어를 $$F = 7$$의 필터 크기를 갖는 CONV 레이어로 바꾼다. 이 때 출력 볼륨의 크기는 [1x1x4096]이 된다.
+- 두 번째 FC 레이어를 $$F = 1$$ 필터 크기의 CONV 레이어로 바꾼다. 이 때 출력 볼륨의 크기는 [1x1x4096]이 된다.
+- 같은 방식으로 마지막 FC 레이어를 $$F = 1$$의 CONV 레이어로 바꾼다. 최종 출력 볼륨의 크기는 [1x1x1000]이 된다.

-Each of these conversions could in practice involve manipulating (e.g. reshaping) the weight matrix \\(W\\) in each FC layer into CONV layer filters. It turns out that this conversion allows us to "slide" the original ConvNet very efficiently across many spatial positions in a larger image, in a single forward pass.
+각각의 변환은 일반적으로 FC 레이어의 가중치 행렬 $$W$$를 CONV 레이어의 필터로 바꿔주는 (예를 들면 reshape하는) 과정을 수반한다. 이런 변환을 하고 나면, 큰 이미지 (가로/세로가 224보다 큰 이미지)에 대해 단 한 번의 forward pass만으로, 마치 원본 ConvNet을 이미지 위에서 "슬라이딩"하며 여러 위치를 읽은 것과 같은 결과를 매우 효율적으로 얻을 수 있다.

-For example, if 224x224 image gives a volume of size [7x7x512] - i.e. a reduction by 32, then forwarding an image of size 384x384 through the converted architecture would give the equivalent volume in size [12x12x512], since 384/32 = 12. Following through with the next 3 CONV layers that we just converted from FC layers would now give the final volume of size [6x6x1000], since (12 - 7)/1 + 1 = 6. Note that instead of a single vector of class scores of size [1x1x1000], we're now getting and entire 6x6 array of class scores across the 384x384 image.
+예를 들어, 224x224 크기의 이미지를 입력으로 받으면 [7x7x512]의 볼륨을 출력하는 (spatial 크기가 224/7 = 32배 줄어드는) 이 아키텍쳐에 384x384 크기의 이미지를 넣으면 [12x12x512] 크기의 볼륨을 출력하게 된다 (384/32 = 12 이므로). 이후 FC에서 CONV로 변환한 3개의 CONV 레이어를 거치면 [6x6x1000] 크기의 최종 볼륨을 얻게 된다 ((12 - 7)/1 + 1 = 6 이므로). [1x1x1000] 크기를 지닌 하나의 클래스 점수 벡터 대신, 384x384 이미지로부터 6x6개의 클래스 점수 배열을 구했다는 것이 중요하다.

-> Evaluating the original ConvNet (with FC layers) independently across 224x224 crops of the 384x384 image in strides of 32 pixels gives an identical result to forwarding the converted ConvNet one time.
+> 위의 과정은 384x384 크기의 이미지를 32 픽셀의 stride 간격으로 224x224 크기로 잘라 각각을 원본 ConvNet (마지막 3개 레이어가 FC인)에 독립적으로 적용한 것과 동일한 결과를, 변환된 ConvNet의 forward pass 한 번으로 얻는 것이다.

-Naturally, forwarding the converted ConvNet a single time is much more efficient than iterating the original ConvNet over all those 36 locations, since the 36 evaluations share computation. This trick is often used in practice to get better performance, where for example, it is common to resize an image to make it bigger, use a converted ConvNet to evaluate the class scores at many spatial positions and then average the class scores.
+당연히 (CONV 레이어만으로) 변환된 ConvNet을 이용해 한 번에 이미지를 처리하는 것이, 원본 ConvNet으로 36개 위치에 대해 반복적으로 처리하는 것보다 훨씬 효율적이다. 36번의 처리 과정에서 같은 계산이 중복되기 때문이다. 이런 기법은 실전에서 성능 향상을 위해 종종 사용된다. 예를 들어 이미지를 크게 리사이즈한 뒤, 변환된 ConvNet을 이용해 여러 위치에 대한 클래스 점수를 구한 다음, 그 점수들의 평균을 취하는 기법 등이 있다.

-Lastly, what if we wanted to efficiently apply the original ConvNet over the image but at a stride smaller than 32 pixels? We could achieve this with multiple forward passes. For example, note that if we wanted to use a stride of 16 pixels we could do so by combining the volumes received by forwarding the converted ConvNet twice: First over the original image and second over the image but with the image shifted spatially by 16 pixels along both width and height.
+마지막으로, 32 픽셀보다 작은 stride 간격으로 원본 ConvNet을 효율적으로 적용하고 싶다면 어떻게 해야 할까? 포워드 패스 (forward pass)를 여러 번 수행하면 가능하다. 예를 들어 16의 stride 간격으로 처리하고 싶다면, 변환된 ConvNet에 이미지를 2번 통과시킨 뒤 결과 볼륨들을 합치면 된다: 먼저 원본 이미지를 처리하고, 그 다음 원본 이미지를 가로/세로 방향으로 16 픽셀만큼 쉬프트시킨 이미지를 한 번 더 처리하면 된다.
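+위에서 설명한 가중치 변환(reshape)과 출력 크기 계산은 numpy로 대략 다음과 같이 표현해 볼 수 있다. 변수 이름과 초기값은 설명을 위한 가정이며, 실제 변환 코드는 사용하는 프레임워크마다 다르다 (아래의 Net Surgery 예제 참고):
+
+~~~python
+import numpy as np
+
+# 가정: 첫 번째 FC 레이어의 가중치 [4096 x 25088] (= 4096 x 7*7*512), 모양 확인용으로 0으로 채움
+W_fc = np.zeros((4096, 7 * 7 * 512), dtype=np.float32)
+
+# 동일한 forward 함수를 수행하는 CONV 레이어의 필터 [4096 x 7 x 7 x 512]
+# (필터 하나가 입력 볼륨 전체를 덮으므로 224x224 입력에서는 출력이 [1x1x4096]이 된다)
+W_conv = W_fc.reshape(4096, 7, 7, 512)
+
+# 384x384 이미지에 변환된 ConvNet을 적용했을 때의 최종 spatial 크기 확인
+in_size = 384 // 32                # CONV/POOL 부분을 거친 뒤: 12
+out_size = (in_size - 7) // 1 + 1  # F = 7, S = 1 인 첫 변환 레이어를 거친 뒤
+print(out_size)                    # 6 -> 클래스 점수의 6x6 배열
+~~~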
-- An IPython Notebook on [Net Surgery](https://github.com/BVLC/caffe/blob/master/examples/net_surgery.ipynb) shows how to perform the conversion in practice, in code (using Caffe)
+- Caffe를 이용해 실제로 이 변환을 수행하는 IPython Notebook 예제: [Net Surgery](https://github.com/BVLC/caffe/blob/master/examples/net_surgery.ipynb)

-### ConvNet Architectures
+### ConvNet 구조

-We have seen that Convolutional Networks are commonly made up of only three layer types: CONV, POOL (we assume Max pool unless stated otherwise) and FC (short for fully-connected). We will also explicitly write the RELU activation function as a layer, which applies elementwise non-linearity. In this section we discuss how these are commonly stacked together to form entire ConvNets.
+위에서 컨볼루션 신경망이 일반적으로 CONV, POOL (별다른 언급이 없다면 max pool이라고 가정), FC 레이어만으로 이뤄져 있다는 것을 배웠다. 각 원소에 비선형성을 가해주는 RELU 액티베이션 함수도 명시적으로 하나의 레이어로 취급하겠다. 이 섹션에서는 이 레이어들을 어떤 방식으로 쌓아 전체 ConvNet을 구성하는지 알아보겠다.

-#### Layer Patterns
-The most common form of a ConvNet architecture stacks a few CONV-RELU layers, follows them with POOL layers, and repeats this pattern until the image has been merged spatially to a small size. At some point, it is common to transition to fully-connected layers. The last fully-connected layer holds the output, such as the class scores. In other words, the most common ConvNet architecture follows the pattern:
-
+#### 레이어 패턴
+가장 흔한 ConvNet 구조는 몇 개의 CONV-RELU 레이어를 쌓은 뒤 POOL 레이어를 추가하는 형태를 여러 번 반복하며 이미지 볼륨의 spatial (가로/세로) 크기를 줄여나가는 것이다. 이런 방식으로 적절히 쌓은 뒤 FC 레이어들을 쌓아주며, 마지막 FC 레이어는 클래스 점수와 같은 출력을 만들어낸다. 다시 말해서, 일반적인 ConvNet 구조는 다음 패턴을 따른다:

`INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC`

-where the `*` indicates repetition, and the `POOL?` indicates an optional pooling layer. Moreover, `N >= 0` (and usually `N <= 3`), `M >= 0`, `K >= 0` (and usually `K < 3`). For example, here are some common ConvNet architectures you may see that follow this pattern:
+`*`는 반복을 의미하며 `POOL?`은 선택적으로 POOL 레이어를 사용한다는 의미이다. 또한 `N >= 0` (보통 `N <= 3`), `M >= 0`, `K >= 0` (보통 `K < 3`)이다. 예를 들어, 이 패턴을 따르는 흔한 ConvNet 구조에는 아래와 같은 것들이 있다 (마지막 패턴의 전개 예시는 이 목록 아래의 코드 참고):

-- `INPUT -> FC`, implements a linear classifier. Here `N = M = K = 0`.
+- `INPUT -> FC`, 선형 분류기이다. 이 때 `N = M = K = 0`.
- `INPUT -> CONV -> RELU -> FC`
-- `INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC`. Here we see that there is a single CONV layer between every POOL layer.
-- `INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC` Here we see two CONV layers stacked before every POOL layer. This is generally a good idea for larger and deeper networks, because multiple stacked CONV layers can develop more complex features of the input volume before the destructive pooling operation.
+- `INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC`. 이 경우는 POOL 레이어 하나 당 하나의 CONV 레이어가 존재한다.
+- `INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC` 이 경우는 각각의 POOL 레이어를 거치기 전에 여러 개의 CONV 레이어를 거치게 된다. 크고 깊은 신경망에서는 이런 구조가 적합하다. 여러 층으로 쌓인 CONV 레이어는, 정보를 많이 파괴하는 pooling 연산을 거치기 전에 입력 볼륨으로부터 더 복잡한 feature들을 추출할 수 있게 해주기 때문이다.
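+위 표기법이 어떻게 전개되는지는, 마지막 패턴 (`N = 2, M = 3, K = 2`)을 예로 아래처럼 간단히 확인해 볼 수 있다:
+
+~~~python
+# 레이어 패턴 표기법을 그대로 전개해 보는 단순한 예시
+N, M, K = 2, 3, 2
+layers = (['INPUT']
+          + (['CONV', 'RELU'] * N + ['POOL']) * M
+          + ['FC', 'RELU'] * K
+          + ['FC'])
+print(' -> '.join(layers))
+# INPUT -> CONV -> RELU -> CONV -> RELU -> POOL -> (3번 반복) ... -> FC -> RELU -> FC -> RELU -> FC
+~~~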
-*Prefer a stack of small filter CONV to one large receptive field CONV layer*. Suppose that you stack three 3x3 CONV layers on top of each other (with non-linearities in between, of course). In this arrangement, each neuron on the first CONV layer has a 3x3 view of the input volume. A neuron on the second CONV layer has a 3x3 view of the first CONV layer, and hence by extension a 5x5 view of the input volume. Similarly, a neuron on the third CONV layer has a 3x3 view of the 2nd CONV layer, and hence a 7x7 view of the input volume. Suppose that instead of these three layers of 3x3 CONV, we only wanted to use a single CONV layer with 7x7 receptive fields. These neurons would have a receptive field size of the input volume that is identical in spatial extent (7x7), but with several disadvantages. First, the neurons would be computing a linear function over the input, while the three stacks of CONV layers contain non-linearities that make their features more expressive. Second, if we suppose that all the volumes have \\(C\\) channels, then it can be seen that the single 7x7 CONV layer would contain \\(C \times (7 \times 7 \times C) = 49 C^2\\) parameters, while the three 3x3 CONV layers would only contain \\(3 \times (C \times (3 \times 3 \times C)) = 27 C^2\\) parameters. Intuitively, stacking CONV layers with tiny filters as opposed to having one CONV layer with big filters allows us to express more powerful features of the input, and with fewer parameters. As a practical disadvantage, we might need more memory to hold all the intermediate CONV layer results if we plan to do backpropagation.
+*큰 리셉티브 필드를 가지는 CONV 레이어 하나 대신 여러 개의 작은 필터를 가진 CONV 레이어를 쌓는 것이 좋다*. 3x3 크기의 CONV 레이어 3개를 쌓는다고 생각해보자 (물론 각 레이어 사이에는 비선형 함수를 넣어준다). 이 경우 첫 번째 CONV 레이어의 각 뉴런은 입력 볼륨의 3x3 영역을 보게 된다. 두 번째 CONV 레이어의 각 뉴런은 첫 번째 CONV 레이어의 3x3 영역을 보게 되어, 결과적으로 입력 볼륨의 5x5 영역을 보는 효과가 있다. 비슷하게, 세 번째 CONV 레이어의 각 뉴런은 두 번째 CONV 레이어의 3x3 영역을 보게 되어 입력 볼륨의 7x7 영역을 보는 것과 같아진다. 이렇게 3개의 3x3 CONV 레이어를 쌓는 대신 7x7 리셉티브 필드를 가지는 CONV 레이어 하나를 사용한다고 생각해 보자. 이 경우에도 각 뉴런은 입력 볼륨의 7x7 영역을 리셉티브 필드로 갖게 되지만 몇 가지 단점이 존재한다. 먼저, CONV 레이어 3개를 쌓은 경우에는 중간 중간 비선형 함수의 영향으로 표현력 높은 feature를 만드는 반면, 하나의 (7x7) CONV 레이어만 갖는 경우 각 뉴런은 입력에 대해 선형 함수를 계산하게 된다. 두 번째로, 모든 볼륨이 $$C$$개의 채널(또는 깊이)을 갖는다고 가정하면, 7x7 CONV 레이어의 경우 $$C \times (7 \times 7 \times C) = 49 C^2$$개의 파라미터를 갖게 된다. 반면 3개의 3x3 CONV 레이어의 경우는 $$3 \times (C \times (3 \times 3 \times C)) = 27 C^2$$개의 파라미터만 갖게 된다. 직관적으로, 하나의 큰 필터를 갖는 CONV 레이어보다 작은 필터를 갖는 여러 개의 CONV 레이어를 쌓는 것이, 더 적은 파라미터를 사용하면서도 입력으로부터 더 좋은 feature를 추출하게 해준다. 단점이 있다면, backpropagation을 할 때 CONV 레이어의 중간 결과들을 저장하기 위해 더 많은 메모리 공간이 필요하다는 것이다.
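+위 문단의 리셉티브 필드 크기와 파라미터 개수 계산은 아래처럼 간단히 확인해 볼 수 있다 (채널 수 `C = 64`는 예시로 가정한 값이다):
+
+~~~python
+# stride 1인 3x3 CONV 레이어를 쌓을 때 유효 리셉티브 필드가 커지는 과정
+rf = 1
+for layer in range(1, 4):
+    rf += 3 - 1                          # 레이어 하나를 지날 때마다 2씩 증가
+    print('CONV 레이어 %d개 쌓은 뒤: %dx%d' % (layer, rf, rf))   # 3x3, 5x5, 7x7
+
+# 파라미터 개수 비교 (바이어스 제외, 채널 수 C = 64로 가정)
+C = 64
+print('7x7 레이어 하나:', C * (7 * 7 * C))        # 49 C^2 = 200,704
+print('3x3 레이어 세 개:', 3 * C * (3 * 3 * C))   # 27 C^2 = 110,592
+~~~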
-#### Layer Sizing Patterns
+#### 레이어 크기 결정 패턴

-Until now we've omitted mentions of common hyperparameters used in each of the layers in a ConvNet. We will first state the common rules of thumb for sizing the architectures and then follow the rules with a discussion of the notation:
+지금까지는 ConvNet의 각 레이어에서 흔히 쓰이는 하이퍼파라미터에 대해 언급하지 않았다. 여기에서는 먼저 ConvNet 구조의 크기를 결정하는 일반적인 법칙들 (수학적으로 증명된 법칙이 아니라 실험적으로 좋다고 알려진 법칙들)을 살펴보고, 그 뒤에 관련 표기법에 대해 알아보겠다:

-The **input layer** (that contains the image) should be divisible by 2 many times. Common numbers include 32 (e.g. CIFAR-10), 64, 96 (e.g. STL-10), or 224 (e.g. common ImageNet ConvNets), 384, and 512.
+**입력 레이어** (이미지를 담고 있는 레이어)는 여러 번 2로 나눌 수 있는 크기여야 한다. 흔히 사용되는 숫자들로는 32 (CIFAR-10 데이터), 64, 96 (STL-10), 224 (많이 쓰이는 ImageNet ConvNet), 384, 512 등이 있다.

-The **conv layers** should be using small filters (e.g. 3x3 or at most 5x5), using a stride of \\(S = 1\\), and crucially, padding the input volume with zeros in such way that the conv layer does not alter the spatial dimensions of the input. That is, when \\(F = 3\\), then using \\(P = 1\\) will retain the original size of the input. When \\(F = 5\\), \\(P = 2\\). For a general \\(F\\), it can be seen that \\(P = (F - 1) / 2\\) preserves the input size. If you must use bigger filter sizes (such as 7x7 or so), it is only common to see this on the very first conv layer that is looking at the input image.
+**CONV 레이어**는 (3x3 또는 최대 5x5의) 작은 필터들과 $$S = 1$$의 stride를 사용하며, 결정적으로 입력과 출력의 spatial 크기 (가로/세로)가 달라지지 않도록 입력 볼륨에 제로 패딩을 해줘야 한다. 즉, $$F = 3$$이라면 $$P = 1$$로 제로 패딩을 해주면 입력의 spatial 크기가 그대로 유지된다. 만약 $$F = 5$$라면 $$P = 2$$를 사용하면 된다. 일반적인 $$F$$에 대해서는 $$P = (F - 1)/2$$를 사용하면 입력의 크기가 그대로 유지된다. 만약 7x7과 같이 큰 필터를 사용해야 한다면, 보통 입력 이미지와 바로 연결된 첫 번째 CONV 레이어에만 사용한다.

-The **pool layers** are in charge of downsampling the spatial dimensions of the input. The most common setting is to use max-pooling with 2x2 receptive fields (i.e. \\(F = 2\\)), and with a stride of 2 (i.e. \\(S = 2\\)). Note that this discards exactly 75% of the activations in an input volume (due to downsampling by 2 in both width and height). Another sligthly less common setting is to use 3x3 receptive fields with a stride of 2, but this makes. It is very uncommon to see receptive field sizes for max pooling that are larger than 3 because the pooling is then too lossy and agressive. This usually leads to worse performance.
+**POOL 레이어**는 입력의 spatial 차원에 대한 다운샘플링을 담당한다. 가장 일반적인 세팅은 2x2 리셉티브 필드($$F = 2$$)와 stride 2($$S = 2$$)를 사용하는 max 풀링이다. 이 경우 입력 볼륨의 액티베이션 중 정확히 75%가 버려진다는 것을 기억하자 (가로/세로에 대해 각각 절반으로 다운샘플링하므로). 이보다 약간 덜 쓰이는 다른 세팅은 3x3 리셉티브 필드에 stride 2를 사용하는 것이다. Max 풀링에 3보다 큰 리셉티브 필드를 사용하는 경우는 너무 많은 정보를 버리게 되므로 거의 없으며, 보통 성능 하락으로 이어진다.

-*Reducing sizing headaches.* The scheme presented above is pleasing because all the CONV layers preserve the spatial size of their input, while the POOL layers alone are in charge of down-sampling the volumes spatially. In an alternative scheme where we use strides greater than 1 or don't zero-pad the input in CONV layers, we would have to very carefully keep track of the input volumes throughout the CNN architecture and make sure that all strides and filters "work out", and that the ConvNet architecture is nicely and symmetrically wired.
+*크기 계산의 골치거리 줄이기.* 위에서 다룬 방식이 편리한 이유는, 모든 CONV 레이어가 입력의 spatial 크기를 그대로 유지하고 POOL 레이어만 spatial 차원의 다운샘플링을 책임지기 때문이다. 만약 CONV 레이어에서 1보다 큰 stride를 사용하거나 제로 패딩을 주지 않는 다른 방식을 쓴다면, 모든 stride와 필터가 서로 잘 맞아떨어져 전체 ConvNet 구조가 깔끔하고 대칭적으로 연결되도록, 각 레이어의 입력 볼륨들을 매우 주의 깊게 추적해야 한다.

-*Why use stride of 1 in CONV?* Smaller strides work better in practice. Additionally, as already mentioned stride 1 allows us to leave all spatial down-sampling to the POOL layers, with the CONV layers only transforming the input volume depth-wise.
+*왜 CONV 레이어에 stride 1을 사용할까?* 보통 작은 stride가 실전에서 더 잘 동작한다. 뿐만 아니라, 위에서 언급한 것과 같이 stride를 1로 놓으면 모든 spatial 다운샘플링을 POOL 레이어에 맡기고 CONV 레이어는 입력 볼륨의 깊이만 변화시키게 된다.

-*Why use padding?* In addition to the aforementioned benefit of keeping the spatial sizes constant after CONV, doing this actually improves performance. If the CONV layers were to not zero-pad the inputs and only perform valid convolutions, then the size of the volumes would reduce by a small amount after each CONV, and the information at the borders would be "washed away" too quickly.
+*왜 (제로)패딩을 사용할까?* 앞에서 본 것과 같이 CONV 레이어를 통과한 뒤에도 spatial 크기를 그대로 유지하게 해준다는 점 외에도, 패딩을 쓰면 실제로 성능도 향상된다. 만약 제로 패딩을 하지 않고 valid convolution (패딩을 하지 않은 convolution)만 한다면, 볼륨의 크기는 CONV 레이어를 거칠 때마다 조금씩 줄어들고, 가장자리의 정보들이 너무 빨리 "씻겨 나가게" 된다.
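+위의 법칙들은 CONV 레이어의 출력 크기 공식 (W - F + 2P)/S + 1 을 이용해 간단히 확인해 볼 수 있다 (함수 이름은 예시를 위한 가정이다):
+
+~~~python
+def conv_output_size(W_in, F, S, P):
+    """CONV 레이어 출력의 spatial 크기: (W - F + 2P)/S + 1"""
+    return (W_in - F + 2 * P) // S + 1
+
+# P = (F - 1)/2 로 제로 패딩하면 stride 1에서는 입력 크기가 그대로 유지된다
+for F in [3, 5, 7]:
+    P = (F - 1) // 2
+    print('F=%d, P=%d ->' % (F, P), conv_output_size(224, F, 1, P))   # 모두 224
+~~~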
-*Compromising based on memory constraints.* In some cases (especially early in the ConvNet architectures), the amount of memory can build up very quickly with the rules of thumb presented above. For example, filtering a 224x224x3 image with three 3x3 CONV layers with 64 filters each and padding 1 would create three activation volumes of size [224x224x64]. This amounts to a total of about 10 million activations, or 72MB of memory (per image, for both activations and gradients). Since GPUs are often bottlenecked by memory, it may be necessary to compromise. In practice, people prefer to make the compromise at only the first CONV layer of the network. For example, one compromise might be to use a first CONV layer with filter sizes of 7x7 and stride of 2 (as seen in a ZF net). As another example, an AlexNet uses filer sizes of 11x11 and stride of 4.
+*메모리 제한에 따른 타협.* 어떤 경우에는 (특히 예전에 나온 ConvNet 구조에서) 위에서 다룬 법칙들을 그대로 따르면 메모리 사용량이 매우 빠른 속도로 늘어난다. 예를 들어 224x224x3의 이미지를, 각각 64개의 필터와 패딩 1을 사용하는 3x3 CONV 레이어 3개로 필터링하면 [224x224x64] 크기의 액티베이션 볼륨을 총 3개 만들게 된다. 이는 약 1,000만 개의 액티베이션 값에 해당하고, (이미지 1장당, 액티베이션과 그라디언트를 모두 저장하면) 약 72MB의 메모리를 사용하게 된다. GPU에서는 보통 메모리가 병목이 되므로, 이 부분에서는 어느 정도 현실과 타협을 할 필요가 있다. 실전에서는 보통 네트워크의 첫 번째 CONV 레이어에서만 타협점을 찾는다. 예를 들어 첫 번째 CONV 레이어에서 7x7 필터와 stride 2를 사용하는 경우 (ZF net)가 있고, AlexNet의 경우 11x11 필터와 stride 4를 사용한다.
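+본문에 나온 수치들은 아래처럼 간단히 계산해 볼 수 있다:
+
+~~~python
+# [224x224x64] 액티베이션 볼륨 3개의 메모리 사용량 추정
+num_activations = 224 * 224 * 64 * 3            # 9,633,792 -> 약 1,000만 개
+bytes_forward = num_activations * 4             # float 하나당 4바이트
+print(bytes_forward / 1024. / 1024.)            # 약 36.8 MB (forward의 액티베이션만)
+print(bytes_forward * 2 / 1024. / 1024.)        # 약 73.5 MB (그라디언트 포함, 본문의 ~72MB)
+~~~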
-#### Case studies
+#### 케이스 스터디

-There are several architectures in the field of Convolutional Networks that have a name. The most common are:
+이 분야에서 사용되는 몇몇 ConvNet 구조들은 고유한 이름을 갖고 있다. 그 중 가장 많이 쓰이는 구조들은 다음과 같다:

-- **LeNet**. The first successful applications of Convolutional Networks were developed by Yann LeCun in 1990's. Of these, the best known is the [LeNet](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf) architecture that was used to read zip codes, digits, etc.
-- **AlexNet**. The first work that popularized Convolutional Networks in Computer Vision was the [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks), developed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton. The AlexNet was submitted to the [ImageNet ILSVRC challenge](http://www.image-net.org/challenges/LSVRC/2014/) in 2012 and significantly outperformed the second runner-up (top 5 error of 16% compared to runner-up with 26% error). The Network had a similar architecture basic as LeNet, but was deeper, bigger, and featured Convolutional Layers stacked on top of each other (previously it was common to only have a single CONV layer immediately followed by a POOL layer).
-- **ZF Net**. The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus. It became known as the [ZFNet](http://arxiv.org/abs/1311.2901) (short for Zeiler & Fergus Net). It was an improvement on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers.
-- **GoogLeNet**. The ILSVRC 2014 winner was a Convolutional Network from [Szegedy et al.](http://arxiv.org/abs/1409.4842) from Google. Its main contribution was the development of an *Inception Module* that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, this paper uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large amount of parameters that do not seem to matter much.
-- **VGGNet**. The runner-up in ILSVRC 2014 was the network from Karen Simonyan and Andrew Zisserman that became known as the [VGGNet](http://www.robots.ox.ac.uk/~vgg/research/very_deep/). Its main contribution was in showing that the depth of the network is a critical component for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end. It was later found that despite its slightly weaker classification performance, the VGG ConvNet features outperform those of GoogLeNet in multiple transfer learning tasks. Hence, the VGG network is currently the most preferred choice in the community when extracting CNN features from images. In particular, their [pretrained model](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) is available for plug and play use in Caffe. A downside of the VGGNet is that it is more expensive to evaluate and uses a lot more memory and parameters (140M).
-- **ResNet**. [Residual Network](http://arxiv.org/abs/1512.03385) developed by Kaiming He et al. was the winner of ILSVRC 2015. It features an interesting architecture with special *skip connections* and features heavy use of batch normalization. The architecture is also missing fully connected layers at the end of the network. The reader is also referred to Kaiming's presentation ([video](https://www.youtube.com/watch?v=1PGLj-uKT1w), [slides](http://research.microsoft.com/en-us/um/people/kahe/ilsvrc15/ilsvrc2015_deep_residual_learning_kaiminghe.pdf)), and some [recent experiments](https://github.com/gcr/torch-residual-networks) that reproduce these networks in Torch.
+- **LeNet**. 최초의 성공적인 ConvNet 애플리케이션들은 1990년대에 Yann LeCun이 만들었다. 그 중에서도 zip 코드나 숫자를 읽는 [LeNet](http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf) 아키텍쳐가 가장 유명하다.
+- **AlexNet**. 컴퓨터 비전 분야에서 ConvNet을 유명하게 만든 것은 Alex Krizhevsky, Ilya Sutskever, Geoff Hinton이 만든 [AlexNet](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks)이다. AlexNet은 2012년 [ImageNet ILSVRC challenge](http://www.image-net.org/challenges/LSVRC/2014/)에 출전해 2등을 큰 차이로 제치고 1등을 했다 (top 5 에러율 16%, 2등은 26%). 아키텍쳐는 LeNet과 기본적으로 유사하지만 더 깊고 크며, (과거에는 하나의 CONV 레이어 바로 뒤에 POOL 레이어를 쌓은 것과 달리) 여러 개의 CONV 레이어들이 연속으로 쌓여 있다.
+- **ZF Net**. ILSVRC 2013년의 승자는 Matthew Zeiler와 Rob Fergus가 만든 ConvNet으로, 저자들의 이름을 따 [ZFNet](http://arxiv.org/abs/1311.2901)이라고 불린다. AlexNet에서 중간 CONV 레이어들의 크기를 키우는 등 아키텍쳐의 하이퍼파라미터들을 수정해 만들었다.
+- **GoogLeNet**. ILSVRC 2014의 승자는 구글의 [Szegedy et al.](http://arxiv.org/abs/1409.4842) 이 만든 ConvNet이다. 이 모델의 가장 큰 기여는 파라미터의 개수를 엄청나게 줄여주는 *Inception Module*을 제안한 것이다 (4M, AlexNet의 경우 60M). 뿐만 아니라, ConvNet 마지막에 FC 레이어 대신 average 풀링을 사용해 별로 중요하지 않아 보이는 많은 파라미터들을 없앴다.
+- **VGGNet**. ILSVRC 2014에서 2등을 한 네트워크는 Karen Simonyan과 Andrew Zisserman이 만든 [VGGNet](http://www.robots.ox.ac.uk/~vgg/research/very_deep/)이라 불리는 모델이다. 이 모델의 가장 큰 기여는 네트워크의 깊이가 좋은 성능의 핵심 요소라는 것을 보여준 것이다. 이들이 제안한 모델 중 가장 좋은 것은 16개의 CONV/FC 레이어로 이뤄지며, 처음부터 끝까지 모든 컨볼루션은 3x3, 모든 풀링은 2x2만으로 이뤄진 매우 균일한 구조를 갖는다. 비록 GoogLeNet보다 이미지 분류 성능은 약간 낮지만, 여러 transfer learning 과제에서 VGGNet의 feature들이 더 좋은 성능을 보인다는 것이 나중에 밝혀졌다. 그래서 VGGNet은 최근 이미지 feature 추출에 가장 선호되는 모델이다. Caffe에서 [pretrained model](http://www.robots.ox.ac.uk/~vgg/research/very_deep/)을 받아 바로 사용하는 것도 가능하다. VGGNet의 단점은, 매우 많은 메모리와 파라미터(140M)를 사용하며 계산량이 많다는 것이다.
+- **ResNet**. Kaiming He et al.이 만든 [Residual Network](http://arxiv.org/abs/1512.03385)가 ILSVRC 2015에서 우승을 차지했다. *Skip connection*이라는 특이한 구조와 batch normalization의 적극적인 사용이 특징이다. 이 아키텍쳐는 또한 네트워크 마지막 부분에서 FC 레이어를 사용하지 않는다. Kaiming의 발표자료 ([video](https://www.youtube.com/watch?v=1PGLj-uKT1w), [slides](http://research.microsoft.com/en-us/um/people/kahe/ilsvrc15/ilsvrc2015_deep_residual_learning_kaiminghe.pdf))와 Torch로 구현된 [최근 실험들](https://github.com/gcr/torch-residual-networks)도 참고할 만하다.
-**VGGNet in detail**.
-Lets break down the [VGGNet](http://www.robots.ox.ac.uk/~vgg/research/very_deep/) in more detail. The whole VGGNet is composed of CONV layers that perform 3x3 convolutions with stride 1 and pad 1, and of POOL layers that perform 2x2 max pooling with stride 2 (and no padding). We can write out the size of the representation at each step of the processing and keep track of both the representation size and the total number of weights:
+**VGGNet의 세부 사항들**.
+[VGGNet](http://www.robots.ox.ac.uk/~vgg/research/very_deep/)에 대해 좀 더 자세히 파헤쳐 보자. 전체 VGGNet은 필터 크기 3x3, stride 1, 제로 패딩 1로 이뤄진 CONV 레이어들과, 2x2 필터 크기에 stride 2 (패딩 없음)로 max 풀링을 하는 POOL 레이어들로 구성된다. 아래에서 각 단계의 처리 과정을 살펴보고, 각 단계의 결과(representation) 크기와 가중치 개수를 알아본다.

-```
+~~~
 INPUT: [224x224x3]        memory:  224*224*3=150K   weights: 0
 CONV3-64: [224x224x64]  memory:  224*224*64=3.2M   weights: (3*3*3)*64 = 1,728
 CONV3-64: [224x224x64]  memory:  224*224*64=3.2M   weights: (3*3*64)*64 = 36,864
@@ -343,34 +345,40 @@ FC: [1x1x1000] memory: 1000 weights: 4096*1000 = 4,096,000
 TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
 TOTAL params: 138M parameters
-```
-
-As is common with Convolutional Networks, notice that most of the memory is used in the early CONV layers, and that most of the parameters are in the last FC layers. In this particular case, the first FC layer contains 100M weights, out of a total of 140M.
+~~~
+
+ConvNet에서 자주 볼 수 있는 특징으로서, 대부분의 메모리가 앞쪽의 CONV 레이어들에서 소비된다는 점과, 마지막 FC 레이어들이 가장 많은 파라미터들을 갖고 있다는 점을 기억하자. 이 예제에서는, 첫 번째 FC 레이어가 총 140M개 중 100M개의 가중치를 갖는다.

-#### Computational Considerations
-The largest bottleneck to be aware of when constructing ConvNet architectures is the memory bottleneck. Many modern GPUs have a limit of 3/4/6GB memory, with the best GPUs having about 12GB of memory. There are three major sources of memory to keep track of:
+#### 계산 관련 고려사항들
+
+ConvNet을 만들 때 일어나는 가장 큰 병목 현상은 메모리 병목이다. 최신 GPU들은 3/4/6GB의 메모리를 내장하고 있으며, 가장 좋은 GPU들의 경우 약 12GB를 갖고 있다. 메모리와 관련해 주의 깊게 살펴볼 것은 크게 3가지이다.

-- From the intermediate volume sizes: These are the raw number of **activations** at every layer of the ConvNet, and also their gradients (of equal size). Usually, most of the activations are on the earlier layers of a ConvNet (i.e. first Conv Layers). These are kept around because they are needed for backpropagation, but a clever implementation that runs a ConvNet only at test time could in principle reduce this by a huge amount, by only storing the current activations at any layer and discarding the previous activations on layers below.
-- From the parameter sizes: These are the numbers that hold the network **parameters**, their gradients during backpropagation, and commonly also a step cache if the optimization is using momentum, Adagrad, or RMSProp. Therefore, the memory to store the parameter vector alone must usually be multiplied by a factor of at least 3 or so.
-- Every ConvNet implementation has to maintain **miscellaneous** memory, such as the image data batches, perhaps their augmented versions, etc.
+- 중간 단계의 볼륨 크기: 매 레이어에서 발생하는 액티베이션들과 그에 상응하는 그라디언트 (액티베이션과 같은 크기)의 개수이다. 보통 대부분의 액티베이션들은 ConvNet의 앞쪽 레이어들에서 발생된다 (예: 첫 번째 CONV 레이어). 이 값들은 backpropagation에 필요하기 때문에 계속 메모리에 두고 있어야 한다. 학습이 아닌 테스트에만 ConvNet을 사용할 때는 현재 처리 중인 레이어의 액티베이션 값을 제외한 앞쪽 액티베이션들은 버리는 방식으로 구현해 메모리를 크게 줄일 수 있다.
+- 파라미터 크기: 신경망이 갖고 있는 **파라미터**의 개수이며, backpropagation을 위한 각 파라미터의 그라디언트, 그리고 최적화에 momentum, Adagrad, RMSProp 등을 사용한다면 이와 관련된 캐시 값들도 저장해 놓아야 한다. 그러므로 파라미터 저장 공간은 기본적으로 (파라미터 개수의) 3배 정도 더 필요하다.
+- 모든 ConvNet 구현체는 이미지 데이터 배치, 데이터 augmentation 결과 등을 위한 **기타 용도**의 메모리도 유지해야 한다.

-Once you have a rough estimate of the total number of values (for activations, gradients, and misc), the number should be converted to size in GB. Take the number of values, multiply by 4 to get the raw number of bytes (since every floating point is 4 bytes, or maybe by 8 for double precision), and then divide by 1024 multiple times to get the amount of memory in KB, MB, and finally GB. If your network doesn't fit, a common heuristic to "make it fit" is to decrease the batch size, since most of the memory is usually consumed by the activations.
+일단 액티베이션, 그라디언트, 기타 용도에 필요한 값들의 개수를 예상했다면, 이를 GB 단위로 바꿔 보아야 한다. 값의 개수에 4를 곱해 바이트 수를 구하고 (부동소수점 값 하나가 4바이트, double precision이라면 8바이트이므로), 1024로 여러 번 나눠 KB, MB, GB 단위로 바꾼다. 만약 신경망이 메모리에 들어가지 않는다면, 대부분의 메모리가 액티베이션에 사용되므로 배치 크기를 줄이는 것이 가장 흔한 휴리스틱이다.
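+위 계산 과정을 간단한 함수로 만들어 보면 다음과 같다 (24M이라는 값 개수는 위 VGGNet 표의 수치이고, backward까지 고려해 2배로 잡은 것과 배치 크기 32는 예시를 위한 가정이다):
+
+~~~python
+def values_to_gb(num_values, bytes_per_value=4):
+    """값 개수 -> GB 단위 (float은 4바이트, double precision은 8바이트)"""
+    return num_values * bytes_per_value / 1024. / 1024. / 1024.
+
+# VGGNet: 이미지 한 장당 약 24M개의 액티베이션 값, backward까지 ~2배
+print(values_to_gb(24e6 * 2))          # 약 0.18 GB / 이미지
+print(values_to_gb(24e6 * 2) * 32)     # 배치 크기 32라면 약 5.7 GB -> GPU 메모리와 비교
+~~~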
-### Visualizing and Understanding Convolutional Networks
+### ConvNet의 시각화 및 이해

-In the [next section](../understanding-cnn/) of these notes we look at visualizing and understanding Convolutional Neural Networks.
+[다음 섹션](../understanding-cnn/)에서는 ConvNet을 시각화하고, ConvNet이 어떤 정보들을 인코딩하는지 알아본다.

-### Additional Resources
-Additional resources related to implementation:
+### 추가 레퍼런스
+
+구현과 관련된 리소스들:

-- [DeepLearning.net tutorial](http://deeplearning.net/tutorial/lenet.html) walks through an implementation of a ConvNet in Theano
-- [cuda-convnet2](https://code.google.com/p/cuda-convnet2/) by Alex Krizhevsky is a ConvNet implementation that supports multiple GPUs
-- [ConvNetJS CIFAR-10 demo](http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html) allows you to play with ConvNet architectures and see the results and computations in real time, in the browser.
-- [Caffe](http://caffe.berkeleyvision.org/), one of the most popular ConvNet libraries.
-- [Example Torch 7 ConvNet](https://github.com/nagadomi/kaggle-cifar10-torch7) that achieves 7% error on CIFAR-10 with a single model
-- [Ben Graham's Sparse ConvNet](https://www.kaggle.com/c/cifar-10/forums/t/10493/train-you-very-own-deep-convolutional-network/56310) package, which Ben Graham used to great success to achieve less than 4% error on CIFAR-10.
+- [DeepLearning.net tutorial](http://deeplearning.net/tutorial/lenet.html) Theano로 ConvNet을 구현하는 과정을 보여주는 튜토리얼
+- [cuda-convnet2](https://code.google.com/p/cuda-convnet2/) 여러 GPU를 지원하는 Alex Krizhevsky의 ConvNet 구현체
+- [ConvNetJS CIFAR-10 demo](http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html) 브라우저에서 ConvNet의 구조를 바꿔보고 결과와 계산 과정을 실시간으로 볼 수 있는 데모
+- [Caffe](http://caffe.berkeleyvision.org/), 가장 널리 쓰이는 ConvNet 라이브러리 중 하나
+- [Example Torch 7 ConvNet](https://github.com/nagadomi/kaggle-cifar10-torch7) 하나의 모델로 CIFAR-10 데이터에 대해 7% 에러율을 기록한 코드
+- [Ben Graham's Sparse ConvNet](https://www.kaggle.com/c/cifar-10/forums/t/10493/train-you-very-own-deep-convolutional-network/56310) CIFAR-10에서 4% 이하의 에러율을 달성하는 데 사용된 패키지
+
+---
+

+번역: 김택수 (jazzsaxmafia) +

diff --git a/css/main.css b/css/main.css
index 9be974a7..d3e55bb8 100644
--- a/css/main.css
+++ b/css/main.css
@@ -48,12 +48,16 @@ a:visited { color: #205caa; }
   margin-bottom: 5px;
 }
 
+.module-header a{
+  color: #8C1515;
+}
+
 .materials-wrap {
   font-size: 18px;
 }
 .materials-item a{
   color: #333;
-  display: block;
+  display: inline;
   padding: 3px;
 }
 .materials-item {
@@ -63,6 +67,77 @@ a:visited { color: #205caa; }
   background-color: #f7f6f1;
 }
 
+#hor-minimalist-a
+{
+  font-family: "Lucida Sans Unicode", "Lucida Grande", Sans-Serif;
+  font-size: 14px;
+  background: #fff;
+  margin: 45px;
+  width: 480px;
+  border-collapse: collapse;
+  text-align: left;
+}
+#hor-minimalist-a th
+{
+  font-size: 16px;
+  font-weight: normal;
+  color: #039;
+  padding: 10px 8px;
+  border-bottom: 2px solid #6678b1;
+}
+#hor-minimalist-a td
+{
+  color: #669;
+  padding: 9px 8px 0px 8px;
+}
+#hor-minimalist-a tbody tr:hover td
+{
+  color: #009;
+}
+
+
+#hor-minimalist-b
+{
+  font-family: "Lucida Sans Unicode", "Lucida Grande", Sans-Serif;
+  font-size: 12px;
+  background: #fff;
+  margin: 45px;
+  width: 480px;
+  border-collapse: collapse;
+  text-align: left;
+}
+#hor-minimalist-b th
+{
+  font-size: 14px;
+  font-weight: normal;
+  color: #039;
+  padding: 10px 8px;
+  border-bottom: 2px solid #6678b1;
+}
+#hor-minimalist-b td
+{
+  border-bottom: 1px solid #ccc;
+  color: #669;
+  padding: 6px 8px;
+}
+#hor-minimalist-b tbody tr:hover td
+{
+  color: #009;
+}
+
+/* Custom CSS rules for progress bar */
+.progress {
+  position: relative;
+  font-size: 16px;
+}
+.progress span {
+  font-family: "Arial";
+  position: absolute;
+  text-align:center;
+  top: 0%;
+  font-size: small;
+}
+
 /* Custom CSS rules for content */
 .embedded-video {
diff --git a/glossary.md b/glossary.md
new file mode 100644
index 00000000..873f0861
--- /dev/null
+++ b/glossary.md
@@ -0,0 +1,104 @@
+---
+layout: page
+mathjax: true
+permalink: /glossary/
+---
+
+영어 --> 한글 번역시 용어의 통일성을 위한 단어장입니다. 새로운 용어에 대한 추가는 GitHub에 이슈를 파서 서로 논의해 보고 정하도록 하면 좋을 것 같습니다.
+
+<table id="hor-minimalist-a">
+  <thead>
+    <tr><th>English</th><th>한글</th></tr>
+  </thead>
+  <tbody>
+    <tr><td>Accuracy</td><td>정확도, 성능</td></tr>
+    <tr><td>Activation function</td><td>활성 함수</td></tr>
+    <tr><td>Architecture</td><td>구조</td></tr>
+    <tr><td>Backpropagation</td><td>(영어 그대로)</td></tr>
+    <tr><td>Batch</td><td>배치</td></tr>
+    <tr><td>Batch normalization</td><td>배치 정규화</td></tr>
+    <tr><td>Bias</td><td>(영어 그대로)</td></tr>
+    <tr><td>Chain rule</td><td>연쇄 법칙</td></tr>
+    <tr><td>Class</td><td>클래스</td></tr>
+    <tr><td>Classification</td><td>분류</td></tr>
+    <tr><td>Classifier</td><td>분류기</td></tr>
+    <tr><td>Column vector</td><td>열 벡터</td></tr>
+    <tr><td>Convolution</td><td>컨볼루션</td></tr>
+    <tr><td>Convolutional neural network</td><td>컨볼루션 신경망</td></tr>
+    <tr><td>Covariance</td><td>공분산</td></tr>
+    <tr><td>Cross entropy</td><td>(영어 그대로)</td></tr>
+    <tr><td>Cross validation</td><td>교차 검증</td></tr>
+    <tr><td>Depth</td><td>깊이</td></tr>
+    <tr><td>Derivative</td><td>미분값, 도함수</td></tr>
+    <tr><td>Dropout</td><td>(영어 그대로)</td></tr>
+    <tr><td>Error</td><td>에러, 오차</td></tr>
+    <tr><td>Evaluate</td><td>평가하다</td></tr>
+    <tr><td>Feature</td><td>특징, 표현, 피쳐</td></tr>
+    <tr><td>Filter</td><td>필터</td></tr>
+    <tr><td>Forward propagation</td><td>(영어 그대로)</td></tr>
+    <tr><td>Fully-connected</td><td>(영어 그대로)</td></tr>
+    <tr><td>Gate</td><td>게이트</td></tr>
+    <tr><td>Gradient</td><td>그라디언트</td></tr>
+    <tr><td>GRU</td><td>(영어 그대로)</td></tr>
+    <tr><td>Hyperparameter</td><td>(영어 그대로)</td></tr>
+    <tr><td>Image</td><td>이미지</td></tr>
+    <tr><td>Implement</td><td>구현하다</td></tr>
+    <tr><td>Initialization</td><td>초기화</td></tr>
+    <tr><td>Iteration</td><td>반복</td></tr>
+    <tr><td>Label</td><td>라벨</td></tr>
+    <tr><td>Layer</td><td>레이어</td></tr>
+    <tr><td>Learning</td><td>러닝, 학습</td></tr>
+    <tr><td>Loop</td><td>루프</td></tr>
+    <tr><td>Loss (function)</td><td>손실 함수</td></tr>
+    <tr><td>LSTM</td><td>(영어 그대로)</td></tr>
+    <tr><td>Matrix</td><td>행렬</td></tr>
+    <tr><td>Nearest neighbor</td><td>(영어 그대로)</td></tr>
+    <tr><td>Network</td><td>네트워크</td></tr>
+    <tr><td>Neural network</td><td>신경망, 뉴럴 네트워크</td></tr>
+    <tr><td>Neuron</td><td>뉴런</td></tr>
+    <tr><td>Node</td><td>노드</td></tr>
+    <tr><td>Non-linearity</td><td>비선형~</td></tr>
+    <tr><td>Optimization</td><td>최적화</td></tr>
+    <tr><td>Overfitting</td><td>(영어 그대로)</td></tr>
+    <tr><td>Padding</td><td>패딩</td></tr>
+    <tr><td>Parameter</td><td>파라미터</td></tr>
+    <tr><td>Performance</td><td>성능</td></tr>
+    <tr><td>Pixel</td><td>픽셀, 화소</td></tr>
+    <tr><td>Pooling</td><td>풀링</td></tr>
+    <tr><td>Preprocessing</td><td>전처리</td></tr>
+    <tr><td>Receptive Field</td><td>(영어 그대로)</td></tr>
+    <tr><td>Regression</td><td>회귀</td></tr>
+    <tr><td>Regularization</td><td>(영어 그대로)</td></tr>
+    <tr><td>ReLU</td><td>(영어 그대로)</td></tr>
+    <tr><td>Representation</td><td>표현</td></tr>
+    <tr><td>Recurrent neural network (RNN)</td><td>회귀신경망, RNN</td></tr>
+    <tr><td>Row vector</td><td>행 벡터</td></tr>
+    <tr><td>Score</td><td>스코어, 점수</td></tr>
+    <tr><td>Sigmoid</td><td>(영어 그대로)</td></tr>
+    <tr><td>Softmax</td><td>(영어 그대로)</td></tr>
+    <tr><td>Training</td><td>학습, 트레이닝</td></tr>
+    <tr><td>Tuning</td><td>튜닝</td></tr>
+    <tr><td>Validation</td><td>검증</td></tr>
+    <tr><td>Variable</td><td>변수</td></tr>
+    <tr><td>Visualization</td><td>시각화</td></tr>
+    <tr><td>Weights</td><td>파라미터 값, 가중치 (문맥상 사용되는 의미에 따라)</td></tr>
+  </tbody>
+</table>
diff --git a/index.html b/index.html
index 0d1f9bba..5c16f963 100644
--- a/index.html
+++ b/index.html
@@ -4,36 +4,51 @@
- These notes accompany the Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition. + 스탠포드 CS231n 강의 CS231n: Convolutional Neural Networks for Visual Recognition에 대한 강의노트의 한글 번역 프로젝트입니다.
- For questions/concerns/bug reports regarding contact Justin Johnson regarding the assignments, or contact Andrej Karpathy regarding the course notes. You can also submit a pull request directly to our git repo. -
- We encourage the use of the hypothes.is extension to annote comments and discuss these notes inline. + 질문/논의거리/이슈 등은 AI Korea 이메일로 연락주시거나, GitHub 레포지토리에 pull request, 또는 이슈를 열어주세요.
+
- Winter 2016 Assignments
+
+ Glossary +
+ Winter 2016 과제
-
- Module 0: Preparation
+
+ Module 0: 준비
 Terminal.com Tutorial
+ Complete!
 AWS Tutorial
+ Complete!
-
- Module 1: Neural Networks
+
+ Module 1: 신경망 구조
- Image Classification: Data-driven Approach, k-Nearest Neighbor, train/val/test splits
+ 이미지 분류: 데이터 기반 방법론, k-Nearest Neighbor, train/val/test 구분
+ Complete!
- L1/L2 distances, hyperparameter search, cross-validation
+ L1/L2 거리, hyperparameter 탐색, 교차검증(cross-validation)
- Linear classification: Support Vector Machine, Softmax
+ 선형 분류: Support Vector Machine, Softmax
- parameteric approach, bias trick, hinge loss, cross-entropy loss, L2 regularization, web demo
+ parameteric 접근법, bias 트릭, hinge loss, cross-entropy loss, L2 regularization, 웹 데모
- Optimization: Stochastic Gradient Descent
+ 최적화: 확률 그라디언트 하강(Stochastic Gradient Descent)
+ Complete!
- optimization landscapes, local search, learning rate, analytic/numerical gradient
+ '지형'으로서의 최적화 목적 함수 (optimization landscapes), 국소 탐색(local search), 학습 속도(learning rate), 해석적(analytic)/수치적(numerical) 그라디언트
- Backpropagation, Intuitions
+ Backpropagation, 직관
- chain rule interpretation, real-valued circuits, patterns in gradient flow
+ 연쇄 법칙 (chain rule) 해석, real-valued circuits, 그라디언트 흐름의 패턴
- Neural Networks Part 1: Setting up the Architecture
+ 신경망 파트 1: 네트워크 구조 정하기
- model of a biological neuron, activation functions, neural net architecture, representational power
+ 생물학적 뉴런 모델, 활성 함수(activation functions), 신경망 구조, 표현력(representational power)
- Neural Networks Part 2: Setting up the Data and the Loss
+ 신경망 파트 2: 데이터 준비 및 Loss
- preprocessing, weight initialization, batch normalization, regularization (L2/dropout), loss functions
+ 전처리, weight 초기값 설정, 배치 정규화(batch normalization), regularization (L2/dropout), 손실함수
- Neural Networks Part 3: Learning and Evaluation
+ 신경망 파트 3: 학습 및 평가
- gradient checks, sanity checks, babysitting the learning process, momentum (+nesterov), second-order methods, Adagrad/RMSprop, hyperparameter optimization, model ensembles
+ 그라디언트 체크, 버그 점검, 학습 과정 모니터링, momentum (+nesterov), 2차(2nd-order) 방법, Adagrad/RMSprop, hyperparameter 최적화, 모델 ensemble
@@ -165,19 +213,25 @@
- Convolutional Neural Networks: Architectures, Convolution / Pooling Layers
+ 컨볼루션 신경망: 구조, Convolution / Pooling 레이어들
+ Complete!
- layers, spatial arrangement, layer patterns, layer sizing patterns, AlexNet/ZFNet/VGGNet case studies, computational considerations
+ 레이어(층), 공간적 배치, 레이어 패턴, 레이어 사이즈, AlexNet/ZFNet/VGGNet 사례 분석, 계산량에 관한 고려 사항들
- Understanding and Visualizing Convolutional Neural Networks
+ 컨볼루션 신경망 분석 및 시각화
- tSNE embeddings, deconvnets, data gradients, fooling ConvNets, human comparisons
+ tSNE embeddings, deconvnets, 데이터에 대한 그라디언트, ConvNet 속이기, 사람과의 비교
@@ -185,7 +239,14 @@
 Transfer Learning and Fine-tuning Convolutional Neural Networks
diff --git a/ipython-tutorial.md b/ipython-tutorial.md
index 1c894162..07d6716e 100644
--- a/ipython-tutorial.md
+++ b/ipython-tutorial.md
@@ -3,72 +3,60 @@ layout: page
 title: IPython Tutorial
 permalink: /ipython-tutorial/
 ---
+cs231n 수업에서는 프로그래밍 과제 진행을 위해 [IPython notebooks](http://ipython.org/)을 사용합니다. IPython notebook을 사용하면 여러분의 브라우저에서 Python 코드를 작성하고 실행할 수 있으며, 여러 조각의 코드를 아주 쉽게 수정하고 실행할 수 있습니다. 이런 장점 때문에 IPython notebook은 계산과학 분야에서 널리 사용되고 있습니다.

-In this class, we will use [IPython notebooks](http://ipython.org/) for the programming assignments. An IPython notebook lets you write and execute Python code in your web browser. IPython notebooks make it very easy to tinker with code and execute it in bits and pieces; for this reason IPython notebooks are widely used in scientific computing.
+IPython의 설치와 실행은 간단합니다. command line에서 다음 명령어를 입력하여 IPython을 설치합니다.

-Installing and running IPython is easy. From the command line, the following will install IPython:
-
-```
+~~~
 pip install "ipython[notebook]"
-```
+~~~

-Once you have IPython installed, start it with this command:
+IPython의 설치가 완료되면 다음 명령어를 통해 IPython을 실행합니다.

-```
+~~~
 ipython notebook
-```
+~~~

-Once IPython is running, point your web browser at http://localhost:8888 to start using IPython notebooks. If everything worked correctly, you should see a screen like this, showing all available IPython notebooks in the current directory:
+IPython이 실행되면, IPython을 사용하기 위해 웹 브라우저를 실행하여 http://localhost:8888 에 접속합니다. 모든 것이 잘 작동한다면 웹 브라우저에 아래와 같은 화면이 나타나고, 현재 폴더에서 사용 가능한 IPython notebook들이 표시됩니다.
-If you click through to a notebook file, you will see a screen like this: +notebook 파일을 클릭하면 다음과 같은 화면이 나타납니다.
-An IPython notebook is made up of a number of **cells**. Each cell can contain Python code. You can execute a cell by clicking on it and pressing `Shift-Enter`. When you do so, the code in the cell will run, and the output of the cell will be displayed beneath the cell. For example, after running the first cell the notebook looks like this:
+IPython notebook은 여러 개의 **cell**들로 이루어져 있으며, 각각의 cell에는 Python 코드를 담을 수 있습니다. 셀을 클릭한 뒤 `Shift-Enter`를 누르면 해당 셀을 실행할 수 있습니다. 셀의 코드를 실행하면 실행 결과가 셀의 바로 아래에 나타납니다. 예를 들어 첫 번째 cell의 코드를 실행하면 아래와 같은 화면이 나타납니다.
-Global variables are shared between cells. Executing the second cell thus gives -the following result: +전역변수들은 다른 셀들에도 공유됩니다. 두 번째 셀을 실행하면 다음과 같은 결과가 나옵니다.
-By convention, IPython notebooks are expected to be run from top to bottom. -Failing to execute some cells or executing cells out of order can result in -errors: +일반적으로, IPython notebook의 코드를 실행할 때 맨 위에서 맨 아래 순서로 실행합니다. +몇몇 셀을 실행하는 데 실패하거나 셀들을 순서대로 실행하지 않으면 오류가 발생할 수 있습니다.
-After you have modified an IPython notebook for one of the assignments by -modifying or executing some of its cells, remember to **save your changes!** +과제를 진행하면서 notebook의 cell을 수정하거나 실행하여 IPython notebook이 변경되었다면 **저장하는 것을 잊지 마세요.**
-This has only been a brief introduction to IPython notebooks, but it should be enough to get you up and running on the assignments for this course.
+지금까지 IPython notebook의 사용법에 대해서 알아보았습니다. 간략한 내용이지만 위 내용을 잘 숙지하면 무리 없이 과제를 진행할 수 있습니다.
+
+---
+

+번역: 김우정 (gnujoow) +

diff --git a/linear-classify.md b/linear-classify.md
index 01d206a0..36d6347f 100644
--- a/linear-classify.md
+++ b/linear-classify.md
@@ -5,99 +5,102 @@ permalink: /linear-classify/
 
 Table of Contents:
 
-- [Intro to Linear classification](#intro)
-- [Linear score function](#score)
-- [Interpreting a linear classifier](#interpret)
-- [Loss function](#loss)
+- [선형 분류 소개](#intro)
+- [선형 스코어 함수](#score)
+- [선형 분류기 분석하기](#interpret)
+- [손실함수(Loss function)](#loss)
   - [Multiclass SVM](#svm)
-  - [Softmax classifier](#softmax)
+  - [Softmax 분류기](#softmax)
   - [SVM vs Softmax](#svmvssoftmax)
-- [Interactive Web Demo of Linear Classification](#webdemo)
-- [Summary](#summary)
+- [선형 분류 웹 데모](#webdemo)
+- [요약](#summary)

-## Linear Classification
-In the last section we introduced the problem of Image Classification, which is the task of assigning a single label to an image from a fixed set of categories. Morever, we described the k-Nearest Neighbor (kNN) classifier which labels images by comparing them to (annotated) images from the training set. As we saw, kNN has a number of disadvantages:
+## 선형 분류 (Linear Classification)

-- The classifier must *remember* all of the training data and store it for future comparisons with the test data. This is space inefficient because datasets may easily be gigabytes in size.
-- Classifying a test image is expensive since it requires a comparison to all training images.
+지난 섹션에서는 정해진 카테고리 중 하나의 라벨을 이미지에 붙이는 문제인 이미지 분류에 대해 소개하였다. 또한, 학습 데이터셋에 있는 (라벨링된) 이미지들 중 가까운 것들의 라벨을 활용하는 k-Nearest Neighbor (kNN) 분류기에 대해 설명하였다. 앞서 살펴보았듯이, kNN은 몇 가지 단점이 있다:

+- 이 분류기는 모든 학습 데이터를 *기억*해야 하고, 나중에 테스트 데이터와 비교하기 위해 저장해 두어야 한다. 일반적인 데이터셋들은 용량이 기가바이트 단위를 쉽게 넘기 때문에, 이것은 메모리 공간 관점에서 매우 비효율적이다.
+- 테스트 이미지를 분류할 때 모든 학습 이미지와 비교해야 하기 때문에 계산량이 많고 시간이 오래 걸린다.

-**Overview**. We are now going to develop a more powerful approach to image classification that we will eventually naturally extend to entire Neural Networks and Convolutional Neural Networks. The approach will have two major components: a **score function** that maps the raw data to class scores, and a **loss function** that quantifies the agreement between the predicted scores and the ground truth labels. We will then cast this as an optimization problem in which we will minimize the loss function with respect to the parameters of the score function.
+**Overview**. 이번 노트에서는 이미지 분류를 위한 보다 강력한 방법들을 발전시켜 나갈 것이고, 이는 나중에 뉴럴 네트워크와 컨볼루션 신경망으로 자연스럽게 확장될 것이다. 이 방법들에는 두 가지 중요한 요소가 있다: 데이터를 클래스 스코어로 매핑시키는 **스코어 함수**, 그리고 예측한 스코어와 실제(ground truth) 라벨과의 차이를 정량화해주는 **손실 함수**가 그것이다. 우리는 이를 최적화 문제로 바꾸어서, 스코어 함수의 파라미터들에 대해 손실 함수를 최소화할 것이다.

-### Parameterized mapping from images to label scores
-The first component of this approach is to define the score function that maps the pixel values of an image to confidence scores for each class. We will develop the approach with a concrete example. As before, let's assume a training dataset of images \\( x\_i \in R^D \\), each associated with a label \\( y\_i \\). Here \\( i = 1 \dots N \\) and \\( y_i \in \{ 1 \dots K \} \\). That is, we have **N** examples (each with a dimensionality **D**) and **K** distinct categories. For example, in CIFAR-10 we have a training set of **N** = 50,000 images, each with **D** = 32 x 32 x 3 = 3072 pixels, and **K** = 10, since there are 10 distinct classes (dog, cat, car, etc). We will now define the score function \\(f: R^D \mapsto R^K\\) that maps the raw image pixels to class scores.
+### 이미지에서 라벨 스코어로의 파라미터화된 매핑(mapping)
+
+먼저, 이미지의 픽셀 값들을 각 클래스에 대한 신뢰도 점수 (confidence score)로 매핑시켜주는 스코어 함수를 정의한다. 여기서는 구체적인 예시를 통해 각 과정을 살펴볼 것이다. 이전 노트에서처럼, 학습 데이터셋 이미지들인 $$ x_i \in R^D $$가 있고, 각각이 해당 라벨 $$ y_i $$를 갖고 있다고 하자. 여기서 $$ i = 1 \dots N $$, 그리고 $$ y_i \in \{ 1 \dots K \} $$이다. 즉, 학습할 데이터 **N** 개가 있고 (각각은 **D** 차원의 벡터이다), 총 **K** 개의 서로 다른 카테고리(클래스)가 있다. 예를 들어, CIFAR-10 에서는 **N** = 50,000 개의 학습 데이터 이미지들이 있고, 각각은 **D** = 32 x 32 x 3 = 3072 픽셀로 이루어져 있으며, (dog, cat, car 등) 10개의 서로 다른 클래스가 있으므로 **K** = 10 이다. 이제 이미지의 픽셀 값들을 클래스 스코어로 매핑해 주는 스코어 함수 $$f: R^D \mapsto R^K$$을 아래에 정의할 것이다.

-**Linear classifier.** In this module we will start out with arguably the simplest possible function, a linear mapping:
+**선형 분류기 (Linear Classifier).** 이 파트에서는 가장 단순한 함수라고 할 수 있는 선형 매핑 함수로 시작할 것이다.

$$
-f(x\_i, W, b) = W x\_i + b
+f(x_i, W, b) = W x_i + b
$$

-In the above equation, we are assuming that the image \\(x\_i\\) has all of its pixels flattened out to a single column vector of shape [D x 1]. The matrix **W** (of size [K x D]), and the vector **b** (of size [K x 1]) are the **parameters** of the function. In CIFAR-10, \\(x\_i\\) contains all pixels in the i-th image flattened into a single [3072 x 1] column, **W** is [10 x 3072] and **b** is [10 x 1], so 3072 numbers come into the function (the raw pixel values) and 10 numbers come out (the class scores). The parameters in **W** are often called the **weights**, and **b** is called the **bias vector** because it influences the output scores, but without interacting with the actual data \\(x\_i\\). However, you will often hear people use the terms *weights* and *parameters* interchangeably.
+위 식에서, 우리는 각 이미지 $$x_i$$의 모든 픽셀들을 평평하게 펼쳐 [D x 1] 모양을 갖는 하나의 열 벡터로 만들었다고 가정하였다. [K x D] 차원의 행렬 **W** 와 [K x 1] 차원의 벡터 **b** 는 이 함수의 **파라미터** 이다. CIFAR-10 에서 $$x_i$$ 는 i번째 이미지의 모든 픽셀을 [3072 x 1] 크기로 평평하게 모양을 바꾼 열 벡터가 되고, **W** 는 [10 x 3072], **b** 는 [10 x 1] 이어서, 3072 개의 숫자가 함수의 입력(이미지 픽셀 값들)으로 들어와 10개의 숫자가 출력(클래스 스코어)으로 나오게 된다. **W** 안의 파라미터들은 보통 **weight** 라고 불리고, **b** 는 **bias 벡터** 라 불리는데, 그 이유는 b가 실제 입력 데이터인 $$x_i$$와는 아무런 상호 작용 없이 출력 스코어 값에 영향을 주기 때문이다. 그러나 보통 *weight* 와 *파라미터(parameter)* 두 용어를 혼용해서 사용하는 경우가 많다.

-There are a few things to note:
+여기서 몇 가지 짚고 넘어갈 점이 있다.

-- First, note that the single matrix multiplication \\(W x\_i\\) is effectively evaluating 10 separate classifiers in parallel (one for each class), where each classifier is a row of **W**.
-- Notice also that we think of the input data \\( (x\_i, y\_i) \\) as given and fixed, but we have control over the setting of the parameters **W,b**. Our goal will be to set these in such way that the computed scores match the ground truth labels across the whole training set. We will go into much more detail about how this is done, but intuitively we wish that the correct class has a score that is higher than the scores of incorrect classes.
-- An advantage of this approach is that the training data is used to learn the parameters **W,b**, but once the learning is complete we can discard the entire training set and only keep the learned parameters. That is because a new test image can be simply forwarded through the function and classified based on the computed scores.
-- Lastly, note that to classifying the test image involves a single matrix multiplication and addition, which is significantly faster than comparing a test image to all training images.
+- 먼저, 한 번의 행렬곱 $$W x_i$$ 만으로 10개의 서로 다른 분류기(각 클래스마다 하나씩)를 병렬로 계산하는 효과를 낸다는 점을 살펴보자. 이 때 **W** 행렬의 각 행이 각각 하나의 분류기가 된다.
+- 또한, 여기서 입력 데이터 $$ (x_i, y_i) $$는 주어진 값이고 고정되어 있지만, 파라미터들인 **W, b** 의 세팅은 우리가 조절할 수 있다는 점을 생각하자. 우리의 최종 목표는 전체 학습 데이터에 대해서 우리가 계산할 스코어 값들이 실제 (ground truth) 라벨과 가장 잘 일치하도록 이 파라미터 값들을 정하는 것이다. 아래에서 자세한 방법을 다룰 것이지만, 직관적으로 간략하게 말하자면 올바른 클래스가 틀린 클래스들보다 더 높은 스코어를 갖도록 조절할 것이다.
+- 이러한 방식의 장점은, 학습 데이터가 파라미터들인 **W, b** 를 학습하는 데 사용되지만, 학습이 끝난 이후에는 학습된 파라미터들만 남기고 학습에 사용된 데이터셋은 더 이상 필요가 없다는 (따라서 메모리에서 지워버려도 된다는) 점이다. 그 이유는, 새로운 테스트 이미지가 입력으로 들어올 때 위의 함수에 의해 스코어를 계산하고, 계산된 스코어를 통해 바로 분류되기 때문이다.
+- 마지막으로, 테스트 이미지를 분류할 때 행렬곱 한 번과 덧셈 한 번의 계산만 필요하다는 점을 주목하자. 이것은 테스트 이미지를 모든 학습 이미지와 비교하는 것에 비하면 매우 빠르다.

-> Foreshadowing: Convolutional Neural Networks will map image pixels to scores exactly as shown above, but the mapping ( f ) will be more complex and will contain more parameters.
+> 스포일러: 컨볼루션 신경망(Convolutional Neural Networks)은 정확히 위의 방식처럼 이미지 픽셀 값을 스코어 값으로 매핑시켜 주지만, 매핑 함수 ( f ) 가 훨씬 더 복잡해지고 더 많은 수의 파라미터를 갖게 될 것이다.

-### Interpreting a linear classifier
+### 선형 분류기 분석하기
+
+선형 분류기는 이미지의 모든 픽셀 값들의 가중치 합으로 클래스 스코어를 계산하며, 이 때 각 픽셀의 3개 색 채널을 모두 고려한다는 것에 주목하자. 각 가중치(파라미터, weights)에 어떤 값을 주느냐에 따라 스코어 함수는 이미지의 특정 위치에서 특정 색깔을 (가중치 값의 부호에 따라) 선호하거나 선호하지 않을 수 있다. 예를 들어, "ship" 클래스는 이미지의 가장자리 부분에 파란색이 많은 경우에 (강, 바다 등의 물에 해당하는 색) 스코어 값이 더 높아질 것이라고 추측해 볼 수 있을 것이다. 즉, "ship" 분류기는 파란색 채널의 파라미터(weights)들이 양의 값을 갖고 (파란색이 존재하는 것이 ship의 스코어를 증가시키도록), 빨강/초록색 채널에는 음의 값을 갖는 파라미터들이 많을 것이라고 (빨간색/초록색의 존재는 ship의 스코어를 감소시키도록) 예상할 수 있다.
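+참고로, 위의 스코어 함수를 CIFAR-10 크기 기준으로 numpy로 표현해 보면 대략 다음과 같다 (파라미터 값들은 임의로 초기화한 가정이다):
+
+~~~python
+import numpy as np
+
+D, K = 3072, 10                        # CIFAR-10: 32*32*3 = 3072 차원, 10개 클래스
+x_i = np.random.randn(D)               # 펼쳐진 이미지 한 장 [3072 x 1]
+W = 0.001 * np.random.randn(K, D)      # 가중치 행렬 [10 x 3072] (임의 초기화 가정)
+b = np.zeros(K)                        # bias 벡터 [10 x 1]
+
+scores = W.dot(x_i) + b                # 클래스 스코어 [10 x 1] - W의 각 행이 분류기 하나
+print(scores.argmax())                 # 스코어가 가장 높은 클래스의 인덱스
+~~~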
An example of mapping an image to class scores. For the sake of visualization, we assume the image only has 4 pixels (4 monochrome pixels, we are not considering color channels in this example for brevity), and that we have 3 classes (red (cat), green (dog), blue (ship) class). (Clarification: in particular, the colors here simply indicate 3 classes and are not related to the RGB channels.) We stretch the image pixels into a column and perform matrix multiplication to get the scores for each class. Note that this particular set of weights W is not good at all: the weights assign our cat image a very low cat score. In particular, this set of weights seems convinced that it's looking at a dog.
이미지에서 클래스 스코어로의 매핑 예시. 시각화를 위해서, 이미지가 픽셀 4 개 만으로 이루어져 있고 (색 채널도 고려하지 않고, 단일 채널이라고 생각하자), 3 개의 클래스가 있다고 하자 (빨강 (cat), 초록 (dog), 파랑 (ship) 클래스). (주: 여기에서의 색깔은 3 개의 클래스를 나타내기 위함이고, RGB 채널과는 전혀 상관이 없다.) 이제 이미지 픽셀들을 펼쳐서 열 벡터로 만들고 각 클래스에 대해 행렬곱을 수행하면 스코어 값을 얻을 수 있다. 여기서 정해준 파라미터 W 값들은 매우 안 좋은 예시인 것을 확인하자: 현재의 파라미터로는 고양이(cat) 이미지를 매우 낮은 cat 스코어를 갖도록 한다. 이 경우, 현재의 파라미터 값은 우리가 dog 이미지를 보고있다고 생각하고 있다.
-**Analogy of images as high-dimensional points.** Since the images are stretched into high-dimensional column vectors, we can interpret each image as a single point in this space (e.g. each image in CIFAR-10 is a point in 3072-dimensional space of 32x32x3 pixels). Analogously, the entire dataset is a (labeled) set of points. +**이미지와 고차원 공간 상의 점에 대한 비유.** 이미지들을 고차원 열 벡터로 펼쳤기 때문에, 우리는 각 이미지를 이 고차원 공간 상의 하나의 점으로 생각할 수 있다 (e.g. CIFAR-10 데이터셋의 각 이미지는 32x32x3 개의 픽셀로 이루어진 3072-차원 공간 상의 한 점이 된다). 마찬가지로 생각하면, 전체 데이터셋은 라벨링된 고차원 공간 상의 점들의 집합이 될 것이다. -Since we defined the score of each class as a weighted sum of all image pixels, each class score is a linear function over this space. We cannot visualize 3072-dimensional spaces, but if we imagine squashing all those dimensions into only two dimensions, then we can try to visualize what the classifier might be doing: +위에서 각 클래스에 대한 스코어를 이미지의 모든 픽셀에 대한 가중치 합으로 정의했기 때문에, 각 클래스 스코어는 이 공간 상에서의 선형 함수값이 된다. 3072-차원 공간은 시각화할 수 없지만, 2차원으로 축소시켰다고 상상해보면 우리의 분류기가 어떤 행동을 하는지를 시각화하려고 시도해볼 수 있을 것이다:
- Cartoon representation of the image space, where each image is a single point, and three classifiers are visualized. Using the example of the car classifier (in red), the red line shows all points in the space that get a score of zero for the car class. The red arrow shows the direction of increase, so all points to the right of the red line have positive (and linearly increasing) scores, and all points to the left have a negative (and linearly decreasing) scores. + 이미지 공간의 시각화. 각 이미지는 하나의 점에 해당되고, 3 개의 분류기가 표시되어 있다. 자동차(car) 분류기(빨간색)를 예로 들어보면, 빨간색 선이 이 공간 상에서 car 클래스에 대해 스코어 값이 0이 되는 모든 점을 나타낸 것이다. 빨간색 화살표는 스코어가 증가하는 방향을 나타낸 것으로, 빨간색 선의 오른쪽에 있는 점들은 양의 (그리고 선형적으로 증가하는) 스코어 값을 가질 것이고, 왼쪽의 점들은 음의 (그리고 선형적으로 감소하는) 스코어 값을 가질 것이다.
-As we saw above, every row of \\(W\\) is a classifier for one of the classes. The geometric interpretation of these numbers is that as we change one of the rows of \\(W\\), the corresponding line in the pixel space will rotate in different directions. The biases \\(b\\), on the other hand, allow our classifiers to translate the lines. In particular, note that without the bias terms, plugging in \\( x\_i = 0 \\) would always give score of zero regardless of the weights, so all lines would be forced to cross the origin. +위에서 살펴보았듯이, $$W$$의 각 행은 각각의 클래스를 구별하는 분류기이다. 각 행에 있는 숫자들을 기하학적으로 해석해보자면, 우리가 $$W$$의 하나의 행을 바꾸면 픽셀 공간에서 해당하는 선이 다른 방향으로 회전할 것이다. 반면에, bias인 $$b$$는 분류기가 그 선들을 평행이동 할 수 있도록 해준다. 특히, bias가 없다면 $$ x_i = 0 $$가 입력으로 들어왔을 때 파라미터 값들에 상관없이 항상 스코어가 0이 될 것이고, 모든 (분류) 선들이 원점을 지나야만 할 것이다. -**Interpretation of linear classifiers as template matching.** -Another interpretation for the weights \\(W\\) is that each row of \\(W\\) corresponds to a *template* (or sometimes also called a *prototype*) for one of the classes. The score of each class for an image is then obtained by comparing each template with the image using an *inner product* (or *dot product*) one by one to find the one that "fits" best. With this terminology, the linear classifier is doing template matching, where the templates are learned. Another way to think of it is that we are still effectively doing Nearest Neighbor, but instead of having thousands of training images we are only using a single image per class (although we will learn it, and it does not necessarily have to be one of the images in the training set), and we use the (negative) inner product as the distance instead of the L1 or L2 distance. +**템플릿 매칭으로서의 선형 분류기 해석.** +파라미터 $$W$$에 대해 다른 방식으로 해석해보면, $$W$$의 각 행은 각 클래스별 *템플릿* (또는 *프로토타입*)에 해당된다. 이미지의 각 클래스 스코어는 각 템플릿들을 이미지와 *내적(inner product, 또는 dot product)*을 통해 하나하나 비교함으로써 계산되고, 이 스코어를 기준으로 가장 잘 "맞는" 것이 무엇인지 정한다. 즉, 선형 분류기가 결국 템플릿 매칭을 하고 있고, 각 템플릿이 학습을 통해 배워진다고 할 수 있다. 또다른 방식으로 생각해보면, 우리는 Nearest Neighbor와 비슷한 것을 하고 있는데, 수 천 장의 학습 이미지를 갖고 있지 않고 각 클래스마다 한 장의 이미지만 사용한다고 볼 수 있다. (다만, 그 이미지를 학습하고, 학습 데이터셋에 실제로 존재하는 이미지일 필요는 없다.) 이 때, 거리 함수로는 L1이나 L2 거리를 사용하지 않고 서로 내적한 것(의 반대 부호인 값)을 사용한다.
- Skipping ahead a bit: Example learned weights at the end of learning for CIFAR-10. Note that, for example, the ship template contains a lot of blue pixels as expected. This template will therefore give a high score once it is matched against images of ships on the ocean with an inner product. + 약간의 선행학습: CIFAR-10 데이터셋에 학습된 파라미터들의 시각화 예시. 예를 들어 ship 템플릿을 보면, 예상할 수 있듯이 많은 수의 파란색 픽셀들로 이루어져 있다는 점에 주목하자. 이 템플릿은 배(ship)가 바다 위에 떠있는 이미지와 내적을 통해 비교되었을 때, 높은 스코어 값을 가질 것이다.
-Additionally, note that the horse template seems to contain a two-headed horse, which is due to both left and right facing horses in the dataset. The linear classifier *merges* these two modes of horses in the data into a single template. Similarly, the car classifier seems to have merged several modes into a single template which has to identify cars from all sides, and of all colors. In particular, this template ended up being red, which hints that there are more red cars in the CIFAR-10 dataset than of any other color. The linear classifier is too weak to properly account for different-colored cars, but as we will see later neural networks will allow us to perform this task. Looking ahead a bit, a neural network will be able to develop intermediate neurons in its hidden layers that could detect specific car types (e.g. green car facing left, blue car facing front, etc.), and neurons on the next layer could combine these into a more accurate car score through a weighted sum of the individual car detectors.
+추가적으로, horse 템플릿은 머리가 두 개인 말(horse)이 있는 것처럼 보이는데, 이것은 데이터셋 안에 왼쪽을 보고 있는 말과 오른쪽을 보고 있는 말이 섞여있기 때문이다. 선형 분류기는 말에 대한 이 두 가지 모드를 하나의 템플릿으로 *합친* 것을 확인할 수 있다. 이와 비슷한 현상으로, car 분류기는 모든 방향 및 색깔의 자동차 모양들을 하나의 템플릿으로 합쳐 놓았다. 특히, 이 템플릿이 결과적으로 붉은 색을 띠는 것으로 보아 CIFAR-10 데이터셋에는 다른 색깔에 비해 빨간색 자동차가 더 많다는 점을 알 수 있다. 선형 분류기는 여러 가지 색깔의 자동차를 제대로 분류하기에는 너무 단순한 모델이지만, 나중에 배울 뉴럴 네트워크는 이를 해결할 수 있다. 약간만 미리 살펴보자면, 뉴럴 네트워크는 히든 레이어의 각 뉴런들이 특정 자동차 타입 (e.g. 왼쪽을 바라보고 있는 초록색 자동차, 정면을 보고 있는 파란색 차, 등등)을 검출하도록 할 수 있고, 다음 레이어의 뉴런들이 각각의 자동차 타입 검출기 점수의 가중치 합을 통해 이 정보들을 종합하여 보다 정확한 (자동차에 대한) 스코어를 계산할 수 있다.

-**Bias trick.** Before moving on we want to mention a common simplifying trick to representing the two parameters \\(W,b\\) as one. Recall that we defined the score function as:
+**Bias 트릭.** 다음 내용으로 넘어가기 전에, 두 파라미터 $$W, b$$를 하나로 표현하는 간단한 트릭을 소개한다. 앞에서 스코어 함수는 아래와 같이 정의되었다.

$$
-f(x\_i, W, b) = W x\_i + b
+f(x_i, W, b) = W x_i + b
$$

-As we proceed through the material it is a little cumbersome to keep track of two sets of parameters (the biases \\(b\\) and weights \\(W\\)) separately. A commonly used trick is to combine the two sets of parameters into a single matrix that holds both of them by extending the vector \\(x\_i\\) with one additional dimension that always holds the constant \\(1\\) - a default *bias dimension*. With the extra dimension, the new score function will simplify to a single matrix multiply:
+앞으로 내용을 전개해 나갈 때 두 가지 파라미터 (bias $$b$$와 weight $$W$$)를 매번 동시에 고려해야 한다면 표현이 번거로워진다. 흔히 사용하는 트릭은 $$x_i$$에 항상 $$1$$의 값을 갖는 한 차원 - 디폴트 *bias* 차원 - 을 추가하고, 두 파라미터를 하나의 행렬로 합치는 것이다. 이렇게 한 차원을 추가하면, 새 스코어 함수는 행렬곱 한 번으로 계산이 가능해진다:

$$
-f(x\_i, W) = W x\_i
+f(x_i, W) = W x_i
$$

-With our CIFAR-10 example, \\(x\_i\\) is now [3073 x 1] instead of [3072 x 1] - (with the extra dimension holding the constant 1), and \\(W\\) is now [10 x 3073] instead of [10 x 3072]. The extra column that \\(W\\) now corresponds to the bias \\(b\\). An illustration might help clarify:
+With our CIFAR-10 example, $x_i$ is now [3073 x 1] instead of [3072 x 1] - (with the extra dimension holding the constant 1), and $W$ is now [10 x 3073] instead of [10 x 3072]. The extra column that $W$ now has corresponds to the bias $b$. An illustration might help clarify:
Illustration of the bias trick. Doing a matrix multiplication and then adding a bias vector (left) is equivalent to adding a bias dimension with a constant of 1 to all input vectors and extending the weight matrix by 1 column - a bias column (right). Thus, if we preprocess our data by appending ones to all vectors we only have to learn a single matrix of weights instead of two matrices that hold the weights and the biases.
@@ -106,45 +109,48 @@
 **Image data preprocessing.** As a quick note, in the examples above we used the raw pixel values (which range from [0...255]). In Machine Learning, it is a very common practice to always perform normalization of your input features (in the case of images, every pixel is thought of as a feature). In particular, it is important to **center your data** by subtracting the mean from every feature. In the case of images, this corresponds to computing a *mean image* across the training images and subtracting it from every image to get images where the pixels range from approximately [-127 ... 127]. Further common preprocessing is to scale each input feature so that its values range from [-1, 1]. Of these, zero mean centering is arguably more important but we will have to wait for its justification until we understand the dynamics of gradient descent.

-### Loss function
-In the previous section we defined a function from the pixel values to class scores, which was parameterized by a set of weights \\(W\\). Moreover, we saw that we don't have control over the data \\( (x\_i,y\_i) \\) (it is fixed and given), but we do have control over these weights and we want to set them so that the predicted class scores are consistent with the ground truth labels in the training data.
+
+### 손실함수(Loss function)
+
+In the previous section we defined a function from the pixel values to class scores, which was parameterized by a set of weights $W$. Moreover, we saw that we don't have control over the data $ (x_i,y_i) $ (it is fixed and given), but we do have control over these weights and we want to set them so that the predicted class scores are consistent with the ground truth labels in the training data.

 For example, going back to the example image of a cat and its scores for the classes "cat", "dog" and "ship", we saw that the particular set of weights in that example was not very good at all: We fed in the pixels that depict a cat but the cat score came out very low (-96.8) compared to the other classes (dog score 437.9 and ship score 61.95). We are going to measure our unhappiness with outcomes such as this one with a **loss function** (or sometimes also referred to as the **cost function** or the **objective**). Intuitively, the loss will be high if we're doing a poor job of classifying the training data, and it will be low if we're doing well.

-#### Multiclass Support Vector Machine loss
-There are several ways to define the details of the loss function. As a first example we will first develop a commonly used loss called the **Multiclass Support Vector Machine** (SVM) loss. The SVM loss is set up so that the SVM "wants" the correct class for each image to a have a score higher than the incorrect classes by some fixed margin \\(\Delta\\). Notice that it's sometimes helpful to anthropomorphise the loss functions as we did above: The SVM "wants" a certain outcome in the sense that the outcome would yield a lower loss (which is good).
+
+#### Multiclass Support Vector Machine 손실함수
+
+There are several ways to define the details of the loss function. As a first example we will first develop a commonly used loss called the **Multiclass Support Vector Machine** (SVM) loss. The SVM loss is set up so that the SVM "wants" the correct class for each image to have a score higher than the incorrect classes by some fixed margin $\Delta$.
Notice that it's sometimes helpful to anthropomorphise the loss functions as we did above: The SVM "wants" a certain outcome in the sense that the outcome would yield a lower loss (which is good). -Let's now get more precise. Recall that for the i-th example we are given the pixels of image \\( x\_i \\) and the label \\( y\_i \\) that specifies the index of the correct class. The score function takes the pixels and computes the vector \\( f(x\_i, W) \\) of class scores, which we will abbreviate to \\(s\\) (short for scores). For example, the score for the j-th class is the j-th element: \\( s\_j = f(x\_i, W)\_j \\). The Multiclass SVM loss for the i-th example is then formalized as follows: +Let's now get more precise. Recall that for the i-th example we are given the pixels of image $ x_i $ and the label $ y_i $ that specifies the index of the correct class. The score function takes the pixels and computes the vector $ f(x_i, W) $ of class scores, which we will abbreviate to $s$ (short for scores). For example, the score for the j-th class is the j-th element: $ s_j = f(x_i, W)_j $. The Multiclass SVM loss for the i-th example is then formalized as follows: $$ -L\_i = \sum\_{j\neq y\_i} \max(0, s\_j - s\_{y\_i} + \Delta) +L_i = \sum_{j\neq y_i} \max(0, s_j - s_{y_i} + \Delta) $$ -**Example.** Lets unpack this with an example to see how it works. Suppose that we have three classes that receive the scores \\( s = [13, -7, 11]\\), and that the first class is the true class (i.e. \\(y\_i = 0\\)). Also assume that \\(\Delta\\) (a hyperparameter we will go into more detail about soon) is 10. The expression above sums over all incorrect classes (\\(j \neq y\_i\\)), so we get two terms: +**Example.** Let's unpack this with an example to see how it works. Suppose that we have three classes that receive the scores $ s = [13, -7, 11]$, and that the first class is the true class (i.e. $y_i = 0$). Also assume that $\Delta$ (a hyperparameter we will go into more detail about soon) is 10. The expression above sums over all incorrect classes ($j \neq y_i$), so we get two terms: $$ -L\_i = \max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10) +L_i = \max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10) $$ -You can see that the first term gives zero since [-7 - 13 + 10] gives a negative number, which is then thresholded to zero with the \\(max(0,-)\\) function. +You can see that the first term gives zero since [-7 - 13 + 10] gives a negative number, which is then thresholded to zero with the $\max(0,-)$ function.
We get zero loss for this pair because the correct class score (13) was greater than the incorrect class score (-7) by at least the margin 10. In fact the difference was 20, which is much greater than 10 but the SVM only cares that the difference is at least 10; any additional difference above the margin is clamped at zero with the max operation. The second term computes [11 - 13 + 10] which gives 8. That is, even though the correct class had a higher score than the incorrect class (13 > 11), it was not greater by the desired margin of 10. The difference was only 2, which is why the loss comes out to 8 (i.e. how much higher the difference would have to be to meet the margin). In summary, the SVM loss function wants the score of the correct class $y_i$ to be larger than the incorrect class scores by at least $\Delta$ (delta). If this is not the case, we will accumulate loss. -Note that in this particular module we are working with linear score functions ( \\( f(x\_i; W) = W x\_i \\) ), so we can also rewrite the loss function in this equivalent form: +Note that in this particular module we are working with linear score functions ( $ f(x_i; W) = W x_i $ ), so we can also rewrite the loss function in this equivalent form: $$ -L\_i = \sum\_{j\neq y\_i} \max(0, w\_j^T x\_i - w\_{y\_i}^T x\_i + \Delta) +L_i = \sum_{j\neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta) $$ -where \\(w\_j\\) is the j-th row of \\(W\\) reshaped as a column. However, this will not necessarily be the case once we start to consider more complex forms of the score function \\(f\\). +where $w_j$ is the j-th row of $W$ reshaped as a column. However, this will not necessarily be the case once we start to consider more complex forms of the score function $f$. -A last piece of terminology we'll mention before we finish with this section is that the threshold at zero \\(max(0,-)\\) function is often called the **hinge loss**. You'll sometimes hear about people instead using the squared hinge loss SVM (or L2-SVM), which uses the form \\(max(0,-)^2\\) that penalizes violated margins more strongly (quadratically instead of linearly). The unsquared version is more standard, but in some datasets the squared hinge loss can work better. This can be determined during cross-validation. +A last piece of terminology we'll mention before we finish with this section is that the threshold-at-zero $\max(0,-)$ function is often called the **hinge loss**. You'll sometimes hear about people instead using the squared hinge loss SVM (or L2-SVM), which uses the form $\max(0,-)^2$ that penalizes violated margins more strongly (quadratically instead of linearly). The unsquared version is more standard, but in some datasets the squared hinge loss can work better. This can be determined during cross-validation. > The loss function quantifies our unhappiness with predictions on the training set
The Multiclass Support Vector Machine "wants" the score of the correct class to be higher than all other scores by at least a margin of delta. If any class has a score inside the red region (or higher), then there will be accumulated loss. Otherwise the loss will be zero. Our objective will be to find the weights that will simultaneously satisfy this constraint for all examples in the training data and give a total loss that is as low as possible.
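As a sanity check, the worked example above (scores $s = [13, -7, 11]$, correct class 0, $\Delta = 10$) can be reproduced in a few lines of numpy; this is just a sketch with our own variable names:

~~~python
import numpy as np

s = np.array([13.0, -7.0, 11.0])  # class scores for one example
y = 0                             # index of the correct class
delta = 10.0                      # the margin hyperparameter

margins = np.maximum(0, s - s[y] + delta)  # hinge term for every class
margins[y] = 0                             # the j == y_i term is excluded
print(np.sum(margins))                     # max(0,-10) + max(0,8) = 8.0
~~~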
@@ -152,37 +158,38 @@ A last piece of terminology we'll mention before we finish with this section is -**Regularization**. There is one bug with the loss function we presented above. Suppose that we have a dataset and a set of parameters **W** that correctly classify every example (i.e. all scores are so that all the margins are met, and \\(L\_i = 0\\) for all i). The issue is that this set of **W** is not necessarily unique: there might be many similar **W** that correctly classify the examples. One easy way to see this is that if some parameters **W** correctly classify all examples (so loss is zero for each example), then any multiple of these parameters \\( \lambda W \\) where \\( \lambda > 1 \\) will also give zero loss because this transformation uniformly stretches all score magnitudes and hence also their absolute differences. For example, if the difference in scores between a correct class and a nearest incorrect class was 15, then multiplying all elements of **W** by 2 would make the new difference 30. -In other words, we wish to encode some preference for a certain set of weights **W** over others to remove this ambiguity. We can do so by extending the loss function with a **regularization penalty** \\(R(W)\\). The most common regularization penalty is the **L2** norm that discourages large weights through an elementwise quadratic penalty over all parameters: +**Regularization**. There is one bug with the loss function we presented above. Suppose that we have a dataset and a set of parameters **W** that correctly classify every example (i.e. all scores are such that all the margins are met, and $L_i = 0$ for all i). The issue is that this set of **W** is not necessarily unique: there might be many similar **W** that correctly classify the examples. One easy way to see this is that if some parameters **W** correctly classify all examples (so the loss is zero for each example), then any multiple of these parameters $ \lambda W $ where $ \lambda > 1 $ will also give zero loss, because this transformation uniformly stretches all score magnitudes and hence also their absolute differences. For example, if the difference in scores between a correct class and the nearest incorrect class was 15, then multiplying all elements of **W** by 2 would make the new difference 30. + +In other words, we wish to encode some preference for a certain set of weights **W** over others to remove this ambiguity. We can do so by extending the loss function with a **regularization penalty** $R(W)$. The most common regularization penalty is the **L2** norm, which discourages large weights through an elementwise quadratic penalty over all parameters: $$ -R(W) = \sum\_k\sum\_l W\_{k,l}^2 +R(W) = \sum_k\sum_l W_{k,l}^2 $$ -In the expression above, we are summing up all the squared elements of \\(W\\). Notice that the regularization function is not a function of the data, it is only based on the weights. Including the regularization penalty completes the full Multiclass Support Vector Machine loss, which is made up of two components: the **data loss** (which is the average loss \\(L\_i\\) over all examples) and the **regularization loss**. That is, the full Multiclass SVM loss becomes: +In the expression above, we are summing up all the squared elements of $W$. Notice that the regularization function is not a function of the data; it is only based on the weights.
Including the regularization penalty completes the full Multiclass Support Vector Machine loss, which is made up of two components: the **data loss** (which is the average loss $L_i$ over all examples) and the **regularization loss**. That is, the full Multiclass SVM loss becomes: $$ -L = \underbrace{ \frac{1}{N} \sum\_i L\_i }\_\text{data loss} + \underbrace{ \lambda R(W) }\_\text{regularization loss} \\\\ +L = \underbrace{ \frac{1}{N} \sum_i L_i }_\text{data loss} + \underbrace{ \lambda R(W) }_\text{regularization loss} \\\\ $$ Or expanding this out in its full form: $$ -L = \frac{1}{N} \sum\_i \sum\_{j\neq y\_i} \left[ \max(0, f(x\_i; W)\_j - f(x\_i; W)\_{y\_i} + \Delta) \right] + \lambda \sum\_k\sum\_l W\_{k,l}^2 +L = \frac{1}{N} \sum_i \sum_{j\neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta) \right] + \lambda \sum_k\sum_l W_{k,l}^2 $$ -Where \\(N\\) is the number of training examples. As you can see, we append the regularization penalty to the loss objective, weighted by a hyperparameter \\(\lambda\\). There is no simple way of setting this hyperparameter and it is usually determined by cross-validation. +where $N$ is the number of training examples. As you can see, we append the regularization penalty to the loss objective, weighted by a hyperparameter $\lambda$. There is no simple way of setting this hyperparameter and it is usually determined by cross-validation. In addition to the motivation we provided above, there are many desirable properties that come with including the regularization penalty, many of which we will come back to in later sections. For example, it turns out that including the L2 penalty leads to the appealing **max margin** property in SVMs (See [CS229](http://cs229.stanford.edu/notes/cs229-notes3.pdf) lecture notes for full details if you are interested). -The most appealing property is that penalizing large weights tends to improve generalization, because it means that no input dimension can have a very large influence on the scores all by itself. For example, suppose that we have some input vector \\(x = [1,1,1,1] \\) and two weight vectors \\(w\_1 = [1,0,0,0]\\), \\(w\_2 = [0.25,0.25,0.25,0.25] \\). Then \\(w\_1^Tx = w\_2^Tx = 1\\) so both weight vectors lead to the same dot product, but the L2 penalty of \\(w\_1\\) is 1.0 while the L2 penalty of \\(w\_2\\) is only 0.25. Therefore, according to the L2 penalty the weight vector \\(w\_2\\) would be preferred since it achieves a lower regularization loss. Intuitively, this is because the weights in \\(w\_2\\) are smaller and more diffuse. Since the L2 penalty prefers smaller and more diffuse weight vectors, the final classifier is encouraged to take into account all input dimensions to small amounts rather than a few input dimensions and very strongly. As we will see later in the class, this effect can improve the generalization performance of the classifiers on test images and lead to less *overfitting*. +The most appealing property is that penalizing large weights tends to improve generalization, because it means that no input dimension can have a very large influence on the scores all by itself. For example, suppose that we have some input vector $x = [1,1,1,1] $ and two weight vectors $w_1 = [1,0,0,0]$, $w_2 = [0.25,0.25,0.25,0.25] $. Then $w_1^Tx = w_2^Tx = 1$ so both weight vectors lead to the same dot product, but the L2 penalty of $w_1$ is 1.0 while the L2 penalty of $w_2$ is only 0.25.
Therefore, according to the L2 penalty, the weight vector $w_2$ would be preferred since it achieves a lower regularization loss. Intuitively, this is because the weights in $w_2$ are smaller and more diffuse. Since the L2 penalty prefers smaller and more diffuse weight vectors, the final classifier is encouraged to take all input dimensions into account in small amounts, rather than relying on a few input dimensions very strongly. As we will see later in the class, this effect can improve the generalization performance of the classifiers on test images and lead to less *overfitting*. -Note that biases do not have the same effect since, unlike the weights, they do not control the strength of influence of an input dimension. Therefore, it is common to only regularize the weights \\(W\\) but not the biases \\(b\\). However, in practice this often turns out to have a negligible effect. Lastly, note that due to the regularization penalty we can never achieve loss of exactly 0.0 on all examples, because this would only be possible in the pathological setting of \\(W = 0\\). +Note that biases do not have the same effect since, unlike the weights, they do not control the strength of influence of an input dimension. Therefore, it is common to only regularize the weights $W$ but not the biases $b$. However, in practice this often turns out to have a negligible effect. Lastly, note that due to the regularization penalty we can never achieve a loss of exactly 0.0 on all examples, because this would only be possible in the pathological setting of $W = 0$. **Code**. Here is the loss function (without regularization) implemented in Python, in both unvectorized and half-vectorized form: -```python +~~~python def L_i(x, y, W): """ unvectorized version. Compute the multiclass svm loss for a single example (x,y) @@ -229,7 +236,7 @@ def L(X, y, W): """ # evaluate loss over all examples in X without using any for loops # left as exercise to reader in the assignment -``` +~~~ The takeaway from this section is that the SVM loss takes one particular approach to measuring how consistent the predictions on training data are with the ground truth labels. Additionally, making good predictions on the training set is equivalent to minimizing the loss. @@ -237,58 +244,59 @@ The takeaway from this section is that the SVM loss takes one particular approac ### Practical Considerations -**Setting Delta.** Note that we brushed over the hyperparameter \\(\Delta\\) and its setting. What value should it be set to, and do we have to cross-validate it? It turns out that this hyperparameter can safely be set to \\(\Delta = 1.0\\) in all cases. The hyperparameters \\(\Delta\\) and \\(\lambda\\) seem like two different hyperparameters, but in fact they both control the same tradeoff: The tradeoff between the data loss and the regularization loss in the objective. The key to understanding this is that the magnitude of the weights \\(W\\) has direct effect on the scores (and hence also their differences): As we shrink all values inside \\(W\\) the score differences will become lower, and as we scale up the weights the score differences will all become higher. Therefore, the exact value of the margin between the scores (e.g. \\(\Delta = 1\\), or \\(\Delta = 100\\)) is in some sense meaningless because the weights can shrink or stretch the differences arbitrarily. Hence, the only real tradeoff is how large we allow the weights to grow (through the regularization strength \\(\lambda\\)).
+**Setting Delta.** Note that we brushed over the hyperparameter $\Delta$ and its setting. What value should it be set to, and do we have to cross-validate it? It turns out that this hyperparameter can safely be set to $\Delta = 1.0$ in all cases. The hyperparameters $\Delta$ and $\lambda$ seem like two different hyperparameters, but in fact they both control the same tradeoff: the tradeoff between the data loss and the regularization loss in the objective. The key to understanding this is that the magnitude of the weights $W$ has a direct effect on the scores (and hence also their differences): as we shrink all values inside $W$ the score differences will become lower, and as we scale up the weights the score differences will all become higher. Therefore, the exact value of the margin between the scores (e.g. $\Delta = 1$, or $\Delta = 100$) is in some sense meaningless because the weights can shrink or stretch the differences arbitrarily. Hence, the only real tradeoff is how large we allow the weights to grow (through the regularization strength $\lambda$). **Relation to Binary Support Vector Machine**. You may be coming to this class with previous experience with Binary Support Vector Machines, where the loss for the i-th example can be written as: $$ -L\_i = C \max(0, 1 - y\_i w^Tx\_i) + R(W) +L_i = C \max(0, 1 - y_i w^Tx_i) + R(W) $$ -where \\(C\\) is a hyperparameter, and \\(y\_i \in \\{ -1,1 \\} \\). You can convince yourself that the formulation we presented in this section contains the binary SVM as a special case when there are only two classes. That is, if we only had two classes then the loss reduces to the binary SVM shown above. Also, \\(C\\) in this formulation and \\(\lambda\\) in our formulation control the same tradeoff and are related through reciprocal relation \\(C \propto \frac{1}{\lambda}\\). +where $C$ is a hyperparameter, and $y_i \in \\{ -1,1 \\} $. You can convince yourself that the formulation we presented in this section contains the binary SVM as a special case when there are only two classes. That is, if we only had two classes then the loss reduces to the binary SVM shown above. Also, $C$ in this formulation and $\lambda$ in our formulation control the same tradeoff and are related through the reciprocal relation $C \propto \frac{1}{\lambda}$. **Aside: Optimization in primal**. If you're coming to this class with previous knowledge of SVMs, you may have also heard of kernels, duals, the SMO algorithm, etc. In this class (as is the case with Neural Networks in general) we will always work with the optimization objectives in their unconstrained primal form. Many of these objectives are technically not differentiable (e.g. the max(x,y) function isn't because it has a *kink* when x=y), but in practice this is not a problem and it is common to use a subgradient. **Aside: Other Multiclass SVM formulations.** It is worth noting that the Multiclass SVM presented in this section is one of a few ways of formulating the SVM over multiple classes. Another commonly used form is the *One-Vs-All* (OVA) SVM which trains an independent binary SVM for each class vs. all other classes. Related, but less common to see in practice, is the *All-vs-All* (AVA) strategy. Our formulation follows the [Weston and Watkins 1999 (pdf)](https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es1999-461.pdf) version, which is a more powerful version than OVA (in the sense that you can construct multiclass datasets where this version can achieve zero data loss, but OVA cannot.
See details in the paper if interested). The last formulation you may see is a *Structured SVM*, which maximizes the margin between the score of the correct class and the score of the highest-scoring incorrect runner-up class. Understanding the differences between these formulations is outside of the scope of the class. The version presented in these notes is a safe bet to use in practice, but the arguably simplest OVA strategy is likely to work just as well (as also argued by Rifkin et al. 2004 in [In Defense of One-Vs-All Classification (pdf)](http://www.jmlr.org/papers/volume5/rifkin04a/rifkin04a.pdf)). -### Softmax classifier -It turns out that the SVM is one of two commonly seen classifiers. The other popular choice is the **Softmax classifier**, which has a different loss function. If you've heard of the binary Logistic Regression classifier before, the Softmax classifier is its generalization to multiple classes. Unlike the SVM which treats the outputs \\(f(x\_i,W)\\) as (uncalibrated and possibly difficult to interpret) scores for each class, the Softmax classifier gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we will describe shortly. In the Softmax classifier, the function mapping \\(f(x\_i; W) = W x\_i\\) stays unchanged, but we now interpret these scores as the unnormalized log probabilities for each class and replace the *hinge loss* with a **cross-entropy loss** that has the form: +### Softmax classifier + +It turns out that the SVM is one of two commonly seen classifiers. The other popular choice is the **Softmax classifier**, which has a different loss function. If you've heard of the binary Logistic Regression classifier before, the Softmax classifier is its generalization to multiple classes. Unlike the SVM which treats the outputs $f(x_i,W)$ as (uncalibrated and possibly difficult to interpret) scores for each class, the Softmax classifier gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we will describe shortly. In the Softmax classifier, the function mapping $f(x_i; W) = W x_i$ stays unchanged, but we now interpret these scores as the unnormalized log probabilities for each class and replace the *hinge loss* with a **cross-entropy loss** that has the form: $$ -L\_i = -\log\left(\frac{e^{f\_{y\_i}}}{ \sum\_j e^{f\_j} }\right) \hspace{0.5in} \text{or equivalently} \hspace{0.5in} L\_i = -f\_{y\_i} + \log\sum\_j e^{f\_j} +L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) \hspace{0.5in} \text{or equivalently} \hspace{0.5in} L_i = -f_{y_i} + \log\sum_j e^{f_j} $$ -where we are using the notation \\(f\_j\\) to mean the j-th element of the vector of class scores \\(f\\). As before, the full loss for the dataset is the mean of \\(L\_i\\) over all training examples together with a regularization term \\(R(W)\\). The function \\(f\_j(z) = \frac{e^{z\_j}}{\sum\_k e^{z\_k}} \\) is called the **softmax function**: It takes a vector of arbitrary real-valued scores (in \\(z\\)) and squashes it to a vector of values between zero and one that sum to one. The full cross-entropy loss that involves the softmax function might look scary if you're seeing it for the first time but it is relatively easy to motivate. +where we are using the notation $f_j$ to mean the j-th element of the vector of class scores $f$. As before, the full loss for the dataset is the mean of $L_i$ over all training examples together with a regularization term $R(W)$.
The function $f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}} $ is called the **softmax function**: It takes a vector of arbitrary real-valued scores (in $z$) and squashes it to a vector of values between zero and one that sum to one. The full cross-entropy loss that involves the softmax function might look scary if you're seeing it for the first time but it is relatively easy to motivate. -**Information theory view**. The *cross-entropy* between a "true" distribution \\(p\\) and an estimated distribution \\(q\\) is defined as: +**Information theory view**. The *cross-entropy* between a "true" distribution $p$ and an estimated distribution $q$ is defined as: $$ -H(p,q) = - \sum\_x p(x) \log q(x) +H(p,q) = - \sum_x p(x) \log q(x) $$ -The Softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities ( \\(q = e^{f\_{y\_i}} / \sum\_j e^{f\_j} \\) as seen above) and the "true" distribution, which in this interpretation is the distribution where all probability mass is on the correct class (i.e. \\(p = [0, \ldots 1, \ldots, 0]\\) contains a single 1 at the \\(y\_i\\) -th position.). Moreover, since the cross-entropy can be written in terms of entropy and the Kullback-Leibler divergence as \\(H(p,q) = H(p) + D\_{KL}(p\|\|q)\\), and the entropy of the delta function \\(p\\) is zero, this is also equivalent to minimizing the KL divergence between the two distributions (a measure of distance). In other words, the cross-entropy objective *wants* the predicted distribution to have all of its mass on the correct answer. +The Softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities ( $q = e^{f_{y_i}} / \sum_j e^{f_j} $ as seen above) and the "true" distribution, which in this interpretation is the distribution where all probability mass is on the correct class (i.e. $p = [0, \ldots 1, \ldots, 0]$ contains a single 1 at the $y_i$-th position). Moreover, since the cross-entropy can be written in terms of entropy and the Kullback-Leibler divergence as $H(p,q) = H(p) + D_{KL}(p\|\|q)$, and the entropy of the delta function $p$ is zero, this is also equivalent to minimizing the KL divergence between the two distributions (a measure of distance). In other words, the cross-entropy objective *wants* the predicted distribution to have all of its mass on the correct answer. **Probabilistic interpretation**. Looking at the expression, we see that $$ -P(y\_i \mid x\_i; W) = \frac{e^{f\_{y\_i}}}{\sum\_j e^{f\_j} } +P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j} } $$
We mention these interpretations to help your intuitions, but the full details of this derivation are beyond the scope of this class. +can be interpreted as the (normalized) probability assigned to the correct label $y_i$ given the image $x_i$ and parameterized by $W$. To see this, remember that the Softmax classifier interprets the scores inside the output vector $f$ as the unnormalized log probabilities. Exponentiating these quantities therefore gives the (unnormalized) probabilities, and the division performs the normalization so that the probabilities sum to one. In the probabilistic interpretation, we are therefore minimizing the negative log likelihood of the correct class, which can be interpreted as performing *Maximum Likelihood Estimation* (MLE). A nice feature of this view is that we can now also interpret the regularization term $R(W)$ in the full loss function as coming from a Gaussian prior over the weight matrix $W$, where instead of MLE we are performing the *Maximum a posteriori* (MAP) estimation. We mention these interpretations to help your intuitions, but the full details of this derivation are beyond the scope of this class. -**Practical issues: Numeric stability**. When you're writing code for computing the Softmax function in practice, the intermediate terms \\(e^{f\_{y\_i}}\\) and \\(\sum\_j e^{f\_j}\\) may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick. Notice that if we multiply the top and bottom of the fraction by a constant \\(C\\) and push it into the sum, we get the following (mathematically equivalent) expression: +**Practical issues: Numeric stability**. When you're writing code for computing the Softmax function in practice, the intermediate terms $e^{f_{y_i}}$ and $\sum_j e^{f_j}$ may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick. Notice that if we multiply the top and bottom of the fraction by a constant $C$ and push it into the sum, we get the following (mathematically equivalent) expression: $$ -\frac{e^{f\_{y\_i}}}{\sum\_j e^{f\_j}} -= \frac{Ce^{f\_{y\_i}}}{C\sum\_j e^{f\_j}} -= \frac{e^{f\_{y\_i} + \log C}}{\sum\_j e^{f\_j + \log C}} +\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} += \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}} += \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}} $$ -We are free to choose the value of \\(C\\). This will not change any of the results, but we can use this value to improve the numerical stability of the computation. A common choice for \\(C\\) is to set \\(\log C = -\max\_j f\_j \\). This simply states that we should shift the values inside the vector \\(f\\) so that the highest value is zero. In code: +We are free to choose the value of $C$. This will not change any of the results, but we can use this value to improve the numerical stability of the computation. A common choice for $C$ is to set $\log C = -\max_j f_j $. This simply states that we should shift the values inside the vector $f$ so that the highest value is zero. 
In code: -```python +~~~python f = np.array([123, 456, 789]) # example with 3 classes and each having large scores p = np.exp(f) / np.sum(np.exp(f)) # Bad: Numeric problem, potential blowup @@ -296,50 +304,53 @@ p = np.exp(f) / np.sum(np.exp(f)) # Bad: Numeric problem, potential blowup f -= np.max(f) # f becomes [-666, -333, 0] p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer -``` +~~~ **Possibly confusing naming conventions**. To be precise, the *SVM classifier* uses the *hinge loss*, which is also sometimes called the *max-margin loss*. The *Softmax classifier* uses the *cross-entropy loss*. The Softmax classifier gets its name from the *softmax function*, which is used to squash the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied. In particular, note that technically it doesn't make sense to talk about the "softmax loss", since softmax is just the squashing function, but it is a relatively commonly used shorthand. + ### SVM vs. Softmax A picture might help clarify the distinction between the Softmax and SVM classifiers:
Example of the difference between the SVM and Softmax classifiers for one datapoint. In both cases we compute the same score vector f (e.g. by matrix multiplication in this section). The difference is in the interpretation of the scores in f: The SVM interprets these as class scores and its loss function encourages the correct class (class 2, in blue) to have a score higher by a margin than the other class scores. The Softmax classifier instead interprets the scores as (unnormalized) log probabilities for each class and then encourages the (normalized) log probability of the correct class to be high (equivalently the negative of it to be low). The final loss for this example is 1.58 for the SVM and 1.04 for the Softmax classifier, but note that these numbers are not comparable; they are only meaningful in relation to loss computed within the same classifier and with the same data.
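The two numbers in the caption are easy to reproduce side by side. Below is a small sketch (our own variable names; the score vector is an assumption chosen to match the caption's losses of 1.58 and 1.04) that evaluates both losses for a single datapoint:

~~~python
import numpy as np

f = np.array([-2.85, 0.86, 0.28])  # assumed scores; class 2 is the correct one
y = 2
delta = 1.0

# multiclass SVM (hinge) loss
margins = np.maximum(0, f - f[y] + delta)
margins[y] = 0
print(np.sum(margins))             # 1.58

# softmax (cross-entropy) loss, with the max-shift trick for stability
p = np.exp(f - np.max(f)) / np.sum(np.exp(f - np.max(f)))
print(-np.log(p[y]))               # ~1.04
~~~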
-**Softmax classifier provides "probabilities" for each class.** Unlike the SVM which computes uncalibrated and not easy to interpret scores for all classes, the Softmax classifier allows us to compute "probabilities" for all labels. For example, given an image the SVM classifier might give you scores [12.5, 0.6, -23.0] for the classes "cat", "dog" and "ship". The softmax classifier can instead compute the probabilities of the three labels as [0.9, 0.09, 0.01], which allows you to interpret its confidence in each class. The reason we put the word "probabilities" in quotes, however, is that how peaky or diffuse these probabilities are depends directly on the regularization strength \\(\lambda\\) - which you are in charge of as input to the system. For example, suppose that the unnormalized log-probabilities for some three classes come out to be [1, -2, 0]. The softmax function would then compute: +**Softmax classifier provides "probabilities" for each class.** Unlike the SVM, which computes scores for all classes that are uncalibrated and not easy to interpret, the Softmax classifier allows us to compute "probabilities" for all labels. For example, given an image the SVM classifier might give you scores [12.5, 0.6, -23.0] for the classes "cat", "dog" and "ship". The softmax classifier can instead compute the probabilities of the three labels as [0.9, 0.09, 0.01], which allows you to interpret its confidence in each class. The reason we put the word "probabilities" in quotes, however, is that how peaky or diffuse these probabilities are depends directly on the regularization strength $\lambda$ - which you are in charge of as input to the system. For example, suppose that the unnormalized log-probabilities for some three classes come out to be [1, -2, 0]. The softmax function would then compute: $$ [1, -2, 0] \rightarrow [e^1, e^{-2}, e^0] = [2.71, 0.14, 1] \rightarrow [0.7, 0.04, 0.26] $$ -Where the steps taken are to exponentiate and normalize to sum to one. Now, if the regularization strength \\(\lambda\\) was higher, the weights \\(W\\) would be penalized more and this would lead to smaller weights. For example, suppose that the weights became one half smaller ([0.5, -1, 0]). The softmax would now compute: +where the steps taken are to exponentiate and normalize to sum to one. Now, if the regularization strength $\lambda$ were higher, the weights $W$ would be penalized more, and this would lead to smaller weights. For example, suppose that the weights became half as large, so that the scores also halved to [0.5, -1, 0]. The softmax would now compute: $$ [0.5, -1, 0] \rightarrow [e^{0.5}, e^{-1}, e^0] = [1.65, 0.37, 1] \rightarrow [0.55, 0.12, 0.33] $$ -where the probabilites are now more diffuse. Moreover, in the limit where the weights go towards tiny numbers due to very strong regularization strength \\(\lambda\\), the output probabilities would be near uniform. +where the probabilities are now more diffuse. Moreover, in the limit where the weights go towards tiny numbers due to a very strong regularization strength $\lambda$, the output probabilities would be near uniform.
Hence, the probabilities computed by the Softmax classifier are better thought of as confidences where, similar to the SVM, the ordering of the scores is interpretable, but the absolute numbers (or their differences) technically are not. -**In practice, SVM and Softmax are usually comparable.** The performance difference between the SVM and Softmax are usually very small, and different people will have different opinions on which classifier works better. Compared to the Softmax classifier, the SVM is a more *local* objective, which could be thought of either as a bug or a feature. Consider an example that achieves the scores [10, -2, 3] and where the first class is correct. An SVM (e.g. with desired margin of \\(\Delta = 1\\)) will see that the correct class already has a score higher than the margin compared to the other classes and it will compute loss of zero. The SVM does not care about the details of the individual scores: if they were instead [10, -100, -100] or [10, 9, 9] the SVM would be indifferent since the margin of 1 is satisfied and hence the loss is zero. However, these scenarios are not equivalent to a Softmax classifier, which would accumulate a much higher loss for the scores [10, 9, 9] than for [10, -100, -100]. In other words, the Softmax classifier is never fully happy with the scores it produces: the correct class could always have a higher probability and the incorrect classes always a lower probability and the loss would always get better. However, the SVM is happy once the margins are satisfied and it does not micromanage the exact scores beyond this constraint. This can intuitively be thought of as a feature: For example, a car classifier which is likely spending most of its "effort" on the difficult problem of separating cars from trucks should not be influenced by the frog examples, which it already assigns very low scores to, and which likely cluster around a completely different side of the data cloud. +**In practice, SVM and Softmax are usually comparable.** The performance difference between the SVM and Softmax is usually very small, and different people will have different opinions on which classifier works better. Compared to the Softmax classifier, the SVM is a more *local* objective, which could be thought of either as a bug or a feature. Consider an example that achieves the scores [10, -2, 3] and where the first class is correct. An SVM (e.g. with a desired margin of $\Delta = 1$) will see that the correct class already has a score higher than the margin compared to the other classes and it will compute a loss of zero. The SVM does not care about the details of the individual scores: if they were instead [10, -100, -100] or [10, 9, 9] the SVM would be indifferent since the margin of 1 is satisfied and hence the loss is zero. However, these scenarios are not equivalent to a Softmax classifier, which would accumulate a much higher loss for the scores [10, 9, 9] than for [10, -100, -100]. In other words, the Softmax classifier is never fully happy with the scores it produces: the correct class could always have a higher probability and the incorrect classes always a lower probability, and the loss would always get better. However, the SVM is happy once the margins are satisfied and it does not micromanage the exact scores beyond this constraint.
This can intuitively be thought of as a feature: For example, a car classifier which is likely spending most of its "effort" on the difficult problem of separating cars from trucks should not be influenced by the frog examples, which it already assigns very low scores to, and which likely cluster around a completely different side of the data cloud. -### Interactive web demo - +### Linear classification web demo +
We have written an interactive web demo to help your intuitions with linear classifiers. The demo visualizes the loss functions discussed in this section using a toy 3-way classification on 2D data. The demo also jumps ahead a bit and performs the optimization, which we will discuss in full detail in the next section.
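For readers who cannot run the demo, here is a rough sketch in its spirit: it evaluates the full Multiclass SVM loss (data loss plus L2 regularization) on a toy 3-way classification problem over 2-D points. The toy data, the random weights, and all names are ours; the optimization step is deliberately left out since it is covered in the next section:

~~~python
import numpy as np

np.random.seed(0)
X = np.random.randn(2, 9)                  # nine toy 2-D points, one per column
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # toy labels for the 3 classes
W = 0.01 * np.random.randn(3, 2)           # random (untrained) weights
delta, lam = 1.0, 0.1                      # margin and regularization strength

scores = W.dot(X)                            # [3 x 9] class scores
correct = scores[y, np.arange(X.shape[1])]   # score of the correct class per point
margins = np.maximum(0, scores - correct + delta)
margins[y, np.arange(X.shape[1])] = 0        # drop the j == y_i terms
data_loss = np.mean(np.sum(margins, axis=0)) # average L_i over the N examples
reg_loss = lam * np.sum(W * W)               # lambda * R(W)
print(data_loss + reg_loss)                  # the full loss L
~~~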
- - -### Summary + + + +### Summary In summary, @@ -351,7 +362,8 @@ In summary, We have now seen one way to take a dataset of images and map each one to class scores based on a set of parameters, and we saw two examples of loss functions that we can use to measure the quality of the predictions. But how do we efficiently determine the parameters that give the best (lowest) loss? This process is *optimization*, and it is the topic of the next section. -### Further Reading + +### Further Reading These readings are optional and contain pointers of interest. diff --git a/neural-networks-1.md b/neural-networks-1.md index 6ffaafba..afc07e81 100644 --- a/neural-networks-1.md +++ b/neural-networks-1.md @@ -3,13 +3,13 @@ layout: page permalink: /neural-networks-1/ --- -Table of Contents: +Table of Contents: -- [Quick intro without brain analogies](#quick) -- [Modeling one neuron](#intro) - - [Biological motivation and connections](#bio) - - [Single neuron as a linear classifier](#classifier) - - [Commonly used activation functions](#actfun) +- [Quick intro: without brain analogies](#quick) +- [Modeling one neuron](#intro) + - [Biological motivation and connections](#bio) + - [A single neuron as a linear classifier](#classifier) + - [Commonly used activation functions](#actfun) - [Neural Network architectures](#nn) - [Layer-wise organization](#layers) - [Example feed-forward computation](#feedforward) @@ -19,149 +19,158 @@ Table of Contents: - [Additional references](#add) -## Quick intro -It is possible to introduce neural networks without appealing to brain analogies. In the section on linear classification we computed scores for different visual categories given the image using the formula \\( s = W x \\), where \\(W\\) was a matrix and \\(x\\) was an input column vector containing all pixel data of the image. In the case of CIFAR-10, \\(x\\) is a [3072x1] column vector, and \\(W\\) is a [10x3072] matrix, so that the output scores is a vector of 10 class scores. +## Quick intro -An example neural network would instead compute \\( s = W\_2 \max(0, W\_1 x) \\). Here, \\(W\_1\\) could be, for example, a [100x3072] matrix transforming the image into a 100-dimensional intermediate vector. The function \\(max(0,-) \\) is a non-linearity that is applied elementwise. There are several choices we could make for the non-linearity (which we'll study below), but this one is a common choice and simply thresholds all activations that are below zero to zero. Finally, the matrix \\(W\_2\\) would then be of size [10x100], so that we again get 10 numbers out that we interpret as the class scores. Notice that the non-linearity is critical computationally - if we left it out, the two matrices could be collapsed to a single matrix, and therefore the predicted class scores would again be a linear function of the input. The non-linearity is where we get the *wiggle*. The parameters \\(W\_2, W\_1\\) are learned with stochastic gradient descent, and their gradients are derived with chain rule (and computed with backpropagation). +It is possible to introduce neural networks without appealing to brain analogies. In the section on linear classification, we computed scores for the different categories of a given image using a formula of the form $$ s = W x $$, where $$W$$ was a matrix and $$x$$ was an input column vector containing all the pixel values of the image. In the case of CIFAR-10, $$x$$ was a column vector of size [3072x1] and $$W$$ was a matrix of size [10x3072], so the output scores formed a vector of size [10x1]. (Translator's note: one number per class.) -A three-layer neural network could analogously look like \\( s = W\_3 \max(0, W\_2 \max(0, W\_1 x)) \\), where all of \\(W\_3, W\_2, W\_1\\) are parameters to be learned. The sizes of the intermediate hidden vectors are hyperparameters of the network and we'll see how we can set them later.
Lets now look into how we can interpret these computations from the neuron/network perspective. +A neural network would instead compute something of this form: $$ s = W_2 \max(0, W_1 x) $$. Here $$W_1$$ could be, again for example, a matrix of size [100x3072] transforming the image into a 100-dimensional intermediate vector. The function $$\max(0,-)$$ is a non-linearity that is applied to each element of $$W_1 x$$. There are several ways to implement this non-linearity (which we will cover below), but this one is a common choice and simply thresholds all values below zero to zero. Finally, the matrix $$W_2$$ could be of size [10x100], so that in the end we again get 10 numbers that serve as the class scores. Note that the non-linearity is computationally critical: if it were left out, the matrices would multiply together into a single matrix, and the predicted scores would again be a linear function of the input. The non-linearity is where we get the *wiggle*. The parameters $$W_2, W_1$$ are learned with stochastic gradient descent, and their gradients are derived with the chain rule (and computed with backpropagation). + +A three-layer neural network looks analogous: $$ s = W_3 \max(0, W_2 \max(0, W_1 x)) $$, where $$W_3, W_2, W_1$$ are all parameters to be learned. The sizes of the intermediate vectors are hyperparameters; we will see how to set them later. Let us now look at how to interpret these computations from the perspective of a neuron or a network. -## Modeling one neuron + +## Modeling one neuron The field of Neural Networks was originally inspired primarily by the goal of modeling biological neural systems, but it has since diverged and become a matter of engineering and achieving good results in Machine Learning tasks. Nonetheless, we begin our discussion with a very brief and high-level description of the biological system that a large portion of this area has been inspired by. + ### Biological motivation and connections -The basic computational unit of the brain is a **neuron**. Approximately 86 billion neurons can be found in the human nervous system and they are connected with approximately 10^14 - 10^15 **synapses**. +The basic computational unit of the brain is a **neuron**. Approximately 86 billion neurons can be found in the human nervous system and they are connected with approximately 10^14 - 10^15 **synapses**.
The diagram below shows a cartoon drawing of a biological neuron (left) and a common mathematical model (right). Each neuron receives input signals from its **dendrites** and produces output signals along its (single) **axon**. The axon eventually branches out and connects via synapses to dendrites of other neurons. In the computational model of a neuron, the signals that travel along the axons (e.g. $$x_0$$) interact multiplicatively (e.g. $$w_0 x_0$$) with the dendrites of the other neuron based on the synaptic strength at that synapse (e.g. $$w_0$$). The idea is that the synaptic strengths (the weights $$w$$) are learnable and control the strength of influence (and its direction: excitatory (positive weight) or inhibitory (negative weight)) of one neuron on another. In the basic model, the dendrites carry the signal to the cell body where they all get summed. If the final sum is above a certain threshold, the neuron can *fire*, sending a spike along its axon. In the computational model, we assume that the precise timings of the spikes do not matter, and that only the frequency of the firing communicates information. Based on this *rate code* interpretation, we model the *firing rate* of the neuron with an **activation function** $$f$$, which represents the frequency of the spikes along the axon. Historically, a common choice of activation function is the **sigmoid function** $$\sigma$$, since it takes a real-valued input (the signal strength after the sum) and squashes it to the range between 0 and 1. We will see details of these activation functions later in this section.
A cartoon drawing of a biological neuron (left) and its mathematical model (right).
Example code for forward-propagating a single neuron might look as follows: -```python +~~~python class Neuron(object): - # ... + # ... def forward(self, inputs): """ assume inputs and weights are 1-D numpy arrays and bias is a number """ cell_body_sum = np.sum(inputs * self.weights) + self.bias firing_rate = 1.0 / (1.0 + math.exp(-cell_body_sum)) # sigmoid activation function return firing_rate -``` +~~~ -In other words, each neuron performs a dot product with the input and its weights, adds the bias and applies the non-linearity (or activation function), in this case the sigmoid \\(\sigma(x) = 1/(1+e^{-x})\\). We will go into more details about different activation functions at the end of this section. +In other words, each neuron performs a dot product with the input and its weights, adds the bias and applies the non-linearity (or activation function), in this case the sigmoid $$\sigma(x) = 1/(1+e^{-x})$$. We will go into more details about different activation functions at the end of this section. **Coarse model.** It's important to stress that this model of a biological neuron is very coarse: For example, there are many different types of neurons, each with different properties. The dendrites in biological neurons perform complex nonlinear computations. The synapses are not just a single weight, they're a complex non-linear dynamical system. The exact timing of the output spikes in many systems is known to be important, suggesting that the rate code approximation may not hold. Due to all these and many other simplifications, be prepared to hear groaning sounds from anyone with some neuroscience background if you draw analogies between Neural Networks and real brains. See this [review](https://physics.ucsd.edu/neurophysics/courses/physics_171/annurev.neuro.28.061604.135703.pdf) (pdf), or more recently this [review](http://www.sciencedirect.com/science/article/pii/S0959438814000130) if you are interested. + ### Single neuron as a linear classifier The mathematical form of the model Neuron's forward computation might look familiar to you. As we saw with linear classifiers, a neuron has the capacity to "like" (activation near one) or "dislike" (activation near zero) certain linear regions of its input space. Hence, with an appropriate loss function on the neuron's output, we can turn a single neuron into a linear classifier: -**Binary Softmax classifier**. For example, we can interpret \\(\sigma(\sum\_iw\_ix\_i + b)\\) to be the probability of one of the classes \\(P(y\_i = 1 \mid x\_i; w) \\). The probability of the other class would be \\(P(y\_i = 0 \mid x\_i; w) = 1 - P(y\_i = 1 \mid x\_i; w) \\), since they must sum to one. +**Binary Softmax classifier**. For example, we can interpret $$\sigma(\sum_iw_ix_i + b)$$ to be the probability of one of the classes $$P(y_i = 1 \mid x_i; w) $$. The probability of the other class would be $$P(y_i = 0 \mid x_i; w) = 1 - P(y_i = 1 \mid x_i; w) $$, since they must sum to one.
With this interpretation, we can formulate the cross-entropy loss as we have seen in the Linear Classification section, and optimizing it would lead to a binary Softmax classifier (also known as *logistic regression*). Since the sigmoid function is restricted to be between 0 and 1, the predictions of this classifier are based on whether the output of the neuron is greater than 0.5. **Binary SVM classifier**. Alternatively, we could attach a max-margin hinge loss to the output of the neuron and train it to become a binary Support Vector Machine. -**Regularization interpretation**. The regularization loss in both SVM/Softmax cases could in this biological view be interpreted as *gradual forgetting*, since it would have the effect of driving all synaptic weights \\(w\\) towards zero after every parameter update. +**Regularization interpretation**. The regularization loss in both the SVM and Softmax cases could in this biological view be interpreted as *gradual forgetting*, since it would have the effect of driving all synaptic weights $$w$$ towards zero after every parameter update. > A single neuron can be used to implement a binary classifier (e.g. binary Softmax or binary SVM classifiers) + ### Commonly used activation functions Every activation function (or *non-linearity*) takes a single number and performs a certain fixed mathematical operation on it. There are several activation functions you may encounter in practice:
Left: Sigmoid non-linearity squashes real numbers into the range [0,1]. Right: The tanh non-linearity squashes real numbers into the range [-1,1].
-**Sigmoid.** The sigmoid non-linearity has the mathematical form \\(\sigma(x) = 1 / (1 + e^{-x})\\) and is shown in the image above on the left. As alluded to in the previous section, it takes a real-valued number and "squashes" it into range between 0 and 1. In particular, large negative numbers become 0 and large positive numbers become 1. The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1). In practice, the sigmoid non-linearity has recently fallen out of favor and it is rarely ever used. It has two major drawbacks: +**Sigmoid.** The sigmoid non-linearity has the mathematical form $$\sigma(x) = 1 / (1 + e^{-x})$$ and is shown in the image above on the left. As alluded to in the previous section, it takes a real-valued number and "squashes" it into the range between 0 and 1. In particular, large negative numbers become 0 and large positive numbers become 1. The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1). In practice, the sigmoid non-linearity has recently fallen out of favor and it is rarely ever used. It has two major drawbacks: - *Sigmoids saturate and kill gradients*. A very undesirable property of the sigmoid neuron is that when the neuron's activation saturates at either tail of 0 or 1, the gradient at these regions is almost zero. Recall that during backpropagation, this (local) gradient will be multiplied by the gradient of this gate's output for the whole objective. Therefore, if the local gradient is very small, it will effectively "kill" the gradient and almost no signal will flow through the neuron to its weights and recursively to its data. Additionally, one must pay extra caution when initializing the weights of sigmoid neurons to prevent saturation. For example, if the initial weights are too large then most neurons would become saturated and the network will barely learn. - - *Sigmoid outputs are not zero-centered*. This is undesirable since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g. \\(x > 0\\) elementwise in \\(f = w^Tx + b\\))), then the gradient on the weights \\(w\\) will during backpropagation become either all be positive, or all negative (depending on the gradient of the whole expression \\(f\\)). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above. + - *Sigmoid outputs are not zero-centered*. This is undesirable since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g.
$$x > 0$$ elementwise in $$f = w^Tx + b$$)), then the gradient on the weights $$w$$ will during backpropagation become either all be positive, or all negative (depending on the gradient of the whole expression $$f$$). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above. -**Tanh.** The tanh non-linearity is shown on the image above on the right. It squashes a real-valued number to the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered. Therefore, in practice the *tanh non-linearity is always preferred to the sigmoid nonlinearity.* Also note that the tanh neuron is simply a scaled sigmoid neuron, in particular the following holds: \\( \tanh(x) = 2 \sigma(2x) -1 \\). +**Tanh.** The tanh non-linearity is shown on the image above on the right. It squashes a real-valued number to the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered. Therefore, in practice the *tanh non-linearity is always preferred to the sigmoid nonlinearity.* Also note that the tanh neuron is simply a scaled sigmoid neuron, in particular the following holds: $$ \tanh(x) = 2 \sigma(2x) -1 $$.
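As a quick numerical check of this identity, here is a minimal numpy sketch (added for illustration, on arbitrary sample points):

~~~python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 101)  # arbitrary sample points
# tanh is a rescaled, recentered sigmoid: tanh(x) = 2*sigmoid(2x) - 1
assert np.allclose(np.tanh(x), 2 * sigmoid(2 * x) - 1)
~~~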
Left: Rectified Linear Unit (ReLU) activation function, which is zero when x < 0 and then linear with slope 1 when x > 0. Right: A plot from the Krizhevsky et al. (pdf) paper indicating the 6x improvement in convergence with the ReLU unit compared to the tanh unit.
-**ReLU.** The Rectified Linear Unit has become very popular in the last few years. It computes the function \\(f(x) = \max(0, x)\\). In other words, the activation is simply thresholded at zero (see image above on the left). There are several pros and cons to using the ReLUs:
+**ReLU.** The Rectified Linear Unit has become very popular in the last few years. It computes the function $$f(x) = \max(0, x)$$. In other words, the activation is simply thresholded at zero (see image above on the left). There are several pros and cons to using the ReLUs:

- (+) It was found to greatly accelerate (e.g. a factor of 6 in [Krizhevsky et al.](http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf)) the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.
- (+) Compared to tanh/sigmoid neurons that involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero.
- (-) Unfortunately, ReLU units can be fragile during training and can "die". For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be "dead" (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.

-**Leaky ReLU.** Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Instead of the function being zero when x < 0, a leaky ReLU will instead have a small negative slope (of 0.01, or so). That is, the function computes \\(f(x) = \mathbb{1}(x < 0) (\alpha x) + \mathbb{1}(x>=0) (x) \\) where \\(\alpha\\) is a small constant. Some people report success with this form of activation function, but the results are not always consistent. The slope in the negative region can also be made into a parameter of each neuron, as seen in PReLU neurons, introduced in [Delving Deep into Rectifiers](http://arxiv.org/abs/1502.01852), by Kaiming He et al., 2015. However, the consistency of the benefit across tasks is presently unclear.
+**Leaky ReLU.** Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Instead of the function being zero when x < 0, a leaky ReLU will instead have a small negative slope (of 0.01, or so). That is, the function computes $$f(x) = \mathbb{1}(x < 0) (\alpha x) + \mathbb{1}(x \geq 0) (x)$$ where $$\alpha$$ is a small constant. Some people report success with this form of activation function, but the results are not always consistent. The slope in the negative region can also be made into a parameter of each neuron, as seen in PReLU neurons, introduced in [Delving Deep into Rectifiers](http://arxiv.org/abs/1502.01852) by Kaiming He et al., 2015. However, the consistency of the benefit across tasks is presently unclear.

-**Maxout**. Other types of units have been proposed that do not have the functional form \\(f(w^Tx + b)\\) where a non-linearity is applied on the dot product between the weights and the data.
One relatively popular choice is the Maxout neuron (introduced recently by [Goodfellow et al.](http://www-etud.iro.umontreal.ca/~goodfeli/maxout.html)) that generalizes the ReLU and its leaky version. The Maxout neuron computes the function \\(\max(w\_1^Tx+b\_1, w\_2^Tx + b\_2)\\). Notice that both ReLU and Leaky ReLU are a special case of this form (for example, for ReLU we have \\(w\_1, b\_1 = 0\\)). The Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU). However, unlike the ReLU neurons it doubles the number of parameters for every single neuron, leading to a high total number of parameters.
+**Maxout**. Other types of units have been proposed that do not have the functional form $$f(w^Tx + b)$$ where a non-linearity is applied on the dot product between the weights and the data. One relatively popular choice is the Maxout neuron (introduced recently by [Goodfellow et al.](http://www-etud.iro.umontreal.ca/~goodfeli/maxout.html)) that generalizes the ReLU and its leaky version. The Maxout neuron computes the function $$\max(w_1^Tx+b_1, w_2^Tx + b_2)$$. Notice that both ReLU and Leaky ReLU are special cases of this form (for example, for ReLU we have $$w_1, b_1 = 0$$). The Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU). However, unlike the ReLU neurons it doubles the number of parameters for every single neuron, leading to a high total number of parameters.

This concludes our discussion of the most common types of neurons and their activation functions. As a last comment, it is very rare to mix and match different types of neurons in the same network, even though there is no fundamental problem with doing so.

**TLDR**: "*What neuron type should I use?*" Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of "dead" units in a network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU/Maxout.

## Neural Network architectures

### Layer-wise organization

**Neural Networks as neurons in graphs**. Neural Networks are modeled as collections of neurons that are connected in an acyclic graph. In other words, the outputs of some neurons can become inputs to other neurons. Cycles are not allowed since that would imply an infinite loop in the forward pass of a network. Instead of amorphous blobs of connected neurons, Neural Network models are often organized into distinct layers of neurons. For regular neural networks, the most common layer type is the **fully-connected layer** in which neurons between two adjacent layers are fully pairwise connected, but neurons within a single layer share no connections. Below are two example Neural Network topologies that use a stack of fully-connected layers:
Left: A 2-layer Neural Network (one hidden layer of 4 neurons (or units) and one output layer with 2 neurons), and three inputs. Right: A 3-layer neural network with three inputs, two hidden layers of 4 neurons each and one output layer. Notice that in both cases there are connections (synapses) between neurons across layers, but not within a layer.
**Naming conventions.** Notice that when we say N-layer neural network, we do not count the input layer. Therefore, a single-layer neural network describes a network with no hidden layers (input directly mapped to output). In that sense, you can sometimes hear people say that logistic regression or SVMs are simply a special case of single-layer Neural Networks. You may also hear these networks interchangeably referred to as *"Artificial Neural Networks"* (ANN) or *"Multi-Layer Perceptrons"* (MLP). Many people do not like the analogies between Neural Networks and real brains and prefer to refer to neurons as *units*.

-**Output layer.** Unlike all layers in a Neural Network, the output layer neurons most commonly do not have an activation function (or you can think of them as having a linear identity activation function). This is because the last output layer is usually taken to represent the class scores (e.g. in classification), which are arbitrary real-valued numbers, or some kind of real-valued target (e.g. in regression).
+**Output layer.** Unlike all other layers in a Neural Network, the output layer neurons most commonly do not have an activation function (or you can think of them as having a linear identity activation function). This is because the last output layer is usually taken to represent the class scores (e.g. in classification), which are arbitrary real-valued numbers, or some kind of real-valued target (e.g. in regression).

**Sizing neural networks**. The two metrics that people commonly use to measure the size of neural networks are the number of neurons, or more commonly the number of parameters. Working with the two example networks in the above picture:

-- The first network (left) has 4 + 2 = 6 neurons (not counting the inputs), [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases, for a total of 26 learnable parameters.
+- The first network (left) has 4 + 2 = 6 neurons (not counting the inputs), [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases, for a total of 26 learnable parameters.
- The second network (right) has 4 + 4 + 1 = 9 neurons, [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32 weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable parameters.

To give you some context, modern Convolutional Networks contain on the order of 100 million parameters and are usually made up of approximately 10-20 layers (hence *deep learning*). However, as we will see, the number of *effective* connections is significantly greater due to parameter sharing. More on this in the Convolutional Neural Networks module.

### Example feed-forward computation

*Repeated matrix multiplications interwoven with activation function*. One of the primary reasons that Neural Networks are organized into layers is that this structure makes it very simple and efficient to evaluate Neural Networks using matrix vector operations. Working with the example three-layer neural network in the diagram above, the input would be a [3x1] vector. All connection strengths for a layer can be stored in a single matrix. For example, the first hidden layer's weights `W1` would be of size [4x3], and the biases for all units would be in the vector `b1`, of size [4x1]. Here, every single neuron has its weights in a row of `W1`, so the matrix vector multiplication `np.dot(W1,x)` evaluates the activations of all neurons in that layer. Similarly, `W2` would be a [4x4] matrix that stores the connections of the second hidden layer, and `W3` a [1x4] matrix for the last (output) layer.
The full forward pass of this 3-layer neural network is then simply three matrix multiplications, interwoven with the application of the activation function: -```python +~~~python # forward-pass of a 3-layer neural network: f = lambda x: 1.0/(1.0 + np.exp(-x)) # activation function (use sigmoid) x = np.random.randn(3, 1) # random input vector of three numbers (3x1) h1 = f(np.dot(W1, x) + b1) # calculate first hidden layer activations (4x1) h2 = f(np.dot(W2, h1) + b2) # calculate second hidden layer activations (4x1) out = np.dot(W3, h2) + b3 # output neuron (1x1) -``` +~~~ In the above code, `W1,W2,W3,b1,b2,b3` are the learnable parameters of the network. Notice also that instead of having a single input column vector, the variable `x` could hold an entire batch of training data (where each input example would be a column of `x`) and then all examples would be efficiently evaluated in parallel. Notice that the final Neural Network layer usually doesn't have an activation function (e.g. it represents a (real-valued) class score in a classification setting). > The forward pass of a fully-connected layer corresponds to one matrix multiplication followed by a bias offset and an activation function. + ### Representational power -One way to look at Neural Networks with fully-connected layers is that they define a family of functions that are parameterized by the weights of the network. A natural question that arises is: What is the representational power of this family of functions? In particular, are there functions that cannot be modeled with a Neural Network? +One way to look at Neural Networks with fully-connected layers is that they define a family of functions that are parameterized by the weights of the network. A natural question that arises is: What is the representational power of this family of functions? In particular, are there functions that cannot be modeled with a Neural Network? -It turns out that Neural Networks with at least one hidden layer are *universal approximators*. That is, it can be shown (e.g. see [*Approximation by Superpositions of Sigmoidal Function*](http://www.dartmouth.edu/~gvc/Cybenko_MCSS.pdf) from 1989 (pdf), or this [intuitive explanation](http://neuralnetworksanddeeplearning.com/chap4.html) from Michael Nielsen) that given any continuous function \\(f(x)\\) and some \\(\epsilon > 0\\), there exists a Neural Network \\(g(x)\\) with one hidden layer (with a reasonable choice of non-linearity, e.g. sigmoid) such that \\( \forall x, \mid f(x) - g(x) \mid < \epsilon \\). In other words, the neural network can approximate any continuous function. +It turns out that Neural Networks with at least one hidden layer are *universal approximators*. That is, it can be shown (e.g. see [*Approximation by Superpositions of Sigmoidal Function*](http://www.dartmouth.edu/~gvc/Cybenko_MCSS.pdf) from 1989 (pdf), or this [intuitive explanation](http://neuralnetworksanddeeplearning.com/chap4.html) from Michael Nielsen) that given any continuous function $$f(x)$$ and some $$\epsilon > 0$$, there exists a Neural Network $$g(x)$$ with one hidden layer (with a reasonable choice of non-linearity, e.g. sigmoid) such that $$ \forall x, \mid f(x) - g(x) \mid < \epsilon $$. In other words, the neural network can approximate any continuous function. -If one hidden layer suffices to approximate any function, why use more layers and go deeper? 
The answer is that the fact that a two-layer Neural Network is a universal approximator is, while mathematically cute, a relatively weak and useless statement in practice. In one dimension, the "sum of indicator bumps" function \\(g(x) = \sum\_i c\_i \mathbb{1}(a\_i < x < b\_i)\\) where \\(a,b,c\\) are parameter vectors is also a universal approximator, but noone would suggest that we use this functional form in Machine Learning. Neural Networks work well in practice because they compactly express nice, smooth functions that fit well with the statistical properties of data we encounter in practice, and are also easy to learn using our optimization algorithms (e.g. gradient descent). Similarly, the fact that deeper networks (with multiple hidden layers) can work better than a single-hidden-layer networks is an empirical observation, despite the fact that their representational power is equal.
+If one hidden layer suffices to approximate any function, why use more layers and go deeper? The answer is that the fact that a two-layer Neural Network is a universal approximator is, while mathematically cute, a relatively weak and useless statement in practice. In one dimension, the "sum of indicator bumps" function $$g(x) = \sum_i c_i \mathbb{1}(a_i < x < b_i)$$ where $$a,b,c$$ are parameter vectors is also a universal approximator, but no one would suggest that we use this functional form in Machine Learning. Neural Networks work well in practice because they compactly express nice, smooth functions that fit well with the statistical properties of data we encounter in practice, and are also easy to learn using our optimization algorithms (e.g. gradient descent). Similarly, the fact that deeper networks (with multiple hidden layers) can work better than single-hidden-layer networks is an empirical observation, despite the fact that their representational power is equal.

As an aside, in practice it is often the case that 3-layer neural networks will outperform 2-layer nets, but going even deeper (4,5,6-layer) rarely helps much more. This is in stark contrast to Convolutional Networks, where depth has been found to be an extremely important component for a good recognition system (e.g. on the order of 10 learnable layers). One argument for this observation is that images contain hierarchical structure (e.g. faces are made up of eyes, which are made up of edges, etc.), so several layers of processing make intuitive sense for this data domain.

@@ -172,25 +181,26 @@ The full story is, of course, much more involved and a topic of much recent rese
 - [FitNets: Hints for Thin Deep Nets](http://arxiv.org/abs/1412.6550)

### Setting number of layers and their sizes

How do we decide on what architecture to use when faced with a practical problem? Should we use no hidden layers? One hidden layer? Two hidden layers? How large should each layer be? First, note that as we increase the size and number of layers in a Neural Network, the **capacity** of the network increases. That is, the space of representable functions grows since the neurons can collaborate to express many different functions. For example, suppose we had a binary classification problem in two dimensions. We could train three separate neural networks, each with one hidden layer of some size and obtain the following classifiers:
Larger Neural Networks can represent more complicated functions. The data are shown as circles colored by their class, and the decision regions by a trained neural network are shown underneath. You can play with these examples in this ConvNetsJS demo.
In the diagram above, we can see that Neural Networks with more neurons can express more complicated functions. However, this is both a blessing (since we can learn to classify more complicated data) and a curse (since it is easier to overfit the training data). **Overfitting** occurs when a model with high capacity fits the noise in the data instead of the (assumed) underlying relationship. For example, the model with 20 hidden neurons fits all the training data but at the cost of segmenting the space into many disjoint red and green decision regions. The model with 3 hidden neurons only has the representational power to classify the data in broad strokes. It models the data as two blobs and interprets the few red points inside the green cluster as **outliers** (noise). In practice, this could lead to better **generalization** on the test set. -Based on our discussion above, it seems that smaller neural networks can be preferred if the data is not complex enough to prevent overfitting. However, this is incorrect - there are many other preferred ways to prevent overfitting in Neural Networks that we will discuss later (such as L2 regularization, dropout, input noise). In practice, it is always better to use these methods to control overfitting instead of the number of neurons. +Based on our discussion above, it seems that smaller neural networks can be preferred if the data is not complex enough to prevent overfitting. However, this is incorrect - there are many other preferred ways to prevent overfitting in Neural Networks that we will discuss later (such as L2 regularization, dropout, input noise). In practice, it is always better to use these methods to control overfitting instead of the number of neurons. The subtle reason behind this is that smaller networks are harder to train with local methods such as Gradient Descent: It's clear that their loss functions have relatively few local minima, but it turns out that many of these minima are easier to converge to, and that they are bad (i.e. with high loss). Conversely, bigger neural networks contain significantly more local minima, but these minima turn out to be much better in terms of their actual loss. Since Neural Networks are non-convex, it is hard to study these properties mathematically, but some attempts to understand these objective functions have been made, e.g. in a recent paper [The Loss Surfaces of Multilayer Networks](http://arxiv.org/abs/1412.0233). In practice, what you find is that if you train a small network the final loss can display a good amount of variance - in some cases you get lucky and converge to a good place but in some cases you get trapped in one of the bad minima. On the other hand, if you train a large network you'll start to find many different solutions, but the variance in the final achieved loss will be much smaller. In other words, all solutions are about equally as good, and rely less on the luck of random initialization. To reiterate, the regularization strength is the preferred way to control the overfitting of a neural network. We can look at the results achieved by three different settings:
The effects of regularization strength: Each neural network above has 20 hidden neurons, but increasing the regularization strength makes its final decision regions smoother. You can play with these examples in this ConvNetsJS demo.
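To make the role of the regularization strength concrete, here is a minimal numpy sketch (added for illustration; the strengths mirror the three settings in the figure) of how the L2 term enters the loss and the gradient for a layer's weights:

~~~python
import numpy as np

np.random.seed(0)
W = np.random.randn(20, 2)                # toy weights for a 20-neuron layer
for reg in [0.001, 0.01, 0.1]:            # the three strengths in the figure
    reg_loss = 0.5 * reg * np.sum(W * W)  # L2 penalty added to the data loss
    dW_reg = reg * W                      # its contribution to the gradient
    print(reg, reg_loss, np.abs(dW_reg).mean())
~~~

The larger the strength, the harder every weight is pulled toward zero on each update, which is what produces the smoother decision regions above.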
@@ -199,9 +209,10 @@ To reiterate, the regularization strength is the preferred way to control the ov

The takeaway is that you should not be using smaller networks because you are afraid of overfitting. Instead, you should use as big of a neural network as your computational budget allows, and use other regularization techniques to control overfitting.

## Summary

-In summary,
+In summary,

- We introduced a very coarse model of a biological **neuron**
- We discussed several types of **activation functions** that are used in practice, with ReLU being the most common choice

@@ -211,6 +222,7 @@ In summary,

- We discussed the fact that larger networks will always work better than smaller networks, but their higher model capacity must be appropriately addressed with stronger regularization (such as higher weight decay), or they might overfit. We will see more forms of regularization (especially dropout) in later sections.

## Additional References

- [deeplearning.net tutorial](http://www.deeplearning.net/tutorial/mlp.html) with Theano

diff --git a/neural-networks-2.kr.md b/neural-networks-2.kr.md
new file mode 100644
index 00000000..75b89ceb
--- /dev/null
+++ b/neural-networks-2.kr.md
@@ -0,0 +1,313 @@
---
layout: page
permalink: /neural-networks-2-kr/
---

목차:

- [데이터 및 모델 준비](#intro)
  - [데이터 전처리(Data Preprocessing)](#datapre)
  - [가중치 초기화(Weight Initialization)](#init)
  - [배치 정규화(Batch Normalization)](#batchnorm)
  - [Regularization](#reg) (L2/L1/Maxnorm/Dropout)
- [손실 함수(Loss functions)](#losses)
- [요약 (Summary)](#summary)

## 데이터 및 모델 준비

앞 장에서 내적(dot product) 및 비선형성(non-linearity) 연산을 순차적으로 수행하는 뉴런(Neuron) 모델과, 이러한 뉴런들의 다층구조(layers)로 구성된 신경망(Neural Networks)에 대해서 소개하였다. 신경망(Neural Networks) 모델에서는 선형변환(linear mapping) 결과를 비선형 변환에 적용하는 과정이 연속적으로 발생하므로, 선형분류(Linear Classification) 부분에서 소개한 선형변환(linear mapping)을 확장한 새로운 형태의 **score function** 정의가 필요하다. 이번 장에서는 데이터 전처리(data preprocessing), 파라미터 초기화(weight initialization), 손실 함수(loss function)를 소개한다.

### 데이터 전처리(Data Preprocessing)

데이터 행렬 `X`에 대해서 일반적으로 아래의 3가지 전처리 방법을 사용한다 (여기서 데이터 `X`는 `D` 차원의 데이터 벡터 `N`개로 이루어진 `[N x D]` 행렬로 가정한다).

**평균 차감(Mean Subtraction)**
가장 흔하게 사용되는 데이터 전처리 기법이다. 데이터의 모든 *피쳐(feature)* 각각에 대해서 평균값만큼 차감하는 방법으로, 기하학적 관점에서 보자면 데이터 군집을 모든 차원에 대해서 원점으로 이동시키는 것으로 해석할 수 있다. numpy에서는 다음과 같이 구현 가능하다: `X -= np.mean(X, axis = 0)`. 특히 이미지 처리에 있어서는 계산의 간결성을 위해서 모든 픽셀에서 동일한 값을 차감하는 방식으로 구현하기도 한다 (예를 들어 numpy에서 `X -= np.mean(X)`).

**정규화(Normalization)**
정규화(Normalization)는 각 차원의 데이터가 동일한 범위 내의 값을 갖도록 하는 전처리 기법을 의미한다. 일반적으로 다음의 2가지 중 하나를 선택하여 구현한다. (1) 각 데이터 값을 평균만큼 차감하고 표준편차 값으로 나눈다 (`X /= np.std(X, axis = 0)`), 이때 각 차원에 대해서 개별적으로 연산을 수행한다. (2) 또 다른 기법은 각 차원에서 최소/최대 값이 각각 -1/1의 값을 갖도록 정규화하는 것이다. 하지만 이 기법은 스케일(scale)(혹은 단위(units))이 다른 feature들이 (거의) 동일한 비중으로 학습 결과에 영향을 줄 것이라는 가정 하에 사용하는 것이 일반적이다.
이미지 처리에서는 각 픽셀 값이 이미 동일한 스케일(0~255)을 갖고 있는 경우가 대부분이기 때문에 정규화 전처리 기법을 반드시 사용해야 하는 것은 아니다.
Common data preprocessing pipeline. Left: Original toy, 2-dimensional input data. Middle: The data is zero-centered by subtracting the mean in each dimension. The data cloud is now centered around the origin. Right: Each dimension is additionally scaled by its standard deviation. The red lines indicate the extent of the data - they are of unequal length in the middle, but of equal length on the right.
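위 그림의 파이프라인을 numpy로 간단히 표현하면 다음과 같다 (설명을 위해 임의의 데이터 `X`를 가정하고 여기에 추가한 예시 코드이다):

~~~python
import numpy as np

X = np.random.randn(1000, 2) * [3.0, 0.5] + 2.0  # 임의의 [N x D] 예제 데이터
X -= np.mean(X, axis=0)  # 평균 차감: 데이터를 원점 중심으로 이동 (가운데 그림)
X /= np.std(X, axis=0)   # 정규화: 각 차원을 표준편차로 나눈다 (오른쪽 그림)
~~~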
**PCA와 Whitening**
먼저 평균차감(Mean Subtraction) 기법을 이용하여 데이터를 정규화시킨다. 그리고 데이터 간의 상관관계를 나타내는 공분산(Covariance)을 계산한다:

~~~python
# Assume input data matrix X of size [N x D]
X -= np.mean(X, axis = 0) # zero-center the data (important)
cov = np.dot(X.T, X) / X.shape[0] # get the data covariance matrix
~~~

공분산(Covariance) 행렬에서 (i, j) 값은 데이터의 i번째, j번째 차원 간의 **상관정도(covariance)**를 나타내는 값이라고 해석할 수 있다. 특히, 공분산(Covariance) 행렬에서 대각선 상(the diagonal)의 값들은 각 차원의 분산(variance) 값과 같다. 또한 공분산(Covariance) 행렬은 symmetric, [positive semi-definite](http://en.wikipedia.org/wiki/Positive-definite_matrix#Negative-definite.2C_semidefinite_and_indefinite_matrices)의 성질을 갖는다. 공분산(Covariance) 행렬의 SVD factorization은 다음과 같이 구할 수 있는데,

~~~python
U,S,V = np.linalg.svd(cov)
~~~

여기서 `U` 행렬의 컬럼(column) 벡터는 아이겐벡터(eigenvector)이고, `S`는 특이값(singular value)의 1차원 배열이다 (공분산(Covariance) 행렬은 symmetric, positive semi-definite이므로 S 벡터의 각 성분은 아이겐밸류(eigenvalue) 제곱의 값을 갖는다). 데이터 `X`를 고유기저(eigenbasis)에 사상시킴으로써 데이터 간의 상관관계를 없앨 수 있다:

~~~python
Xrot = np.dot(X, U) # decorrelate the data
~~~

`U` 행렬의 컬럼 벡터는 norm 값이 1이고 서로 직교하는 정규직교(orthonormal)의 성질을 갖고 있기 때문에 기저벡터(basis vector)가 됨을 알 수 있다. 따라서 고유기저(eigenbasis)로의 사상(projection)은 아이겐벡터(eigenvector)를 새로운 축으로 하여 `X` 데이터를 회전하는 것으로 해석할 수 있다. (위의 python 코드에서) `Xrot` 행렬의 공분산(Covariance)을 구하면 대각행렬(diagonal matrix)인 것을 알 수 있다. `np.linalg.svd`의 이점 중 하나는 `U` 행렬의 컬럼 벡터가 각 벡터에 상응하는 아이겐밸류(eigenvalue)의 내림차순으로 정렬된다는 것이다. 따라서 처음 몇 개의 벡터만 사용하여 데이터의 차원을 축소할 수 있다 (분산이 거의 없는 차원은 버린다). 이러한 기법을 [Principal Component Analysis (PCA)](http://en.wikipedia.org/wiki/Principal_component_analysis) 차원 축소 기법이라 부르기도 한다.

~~~python
Xrot_reduced = np.dot(X, U[:,:100]) # Xrot_reduced becomes [N x 100]
~~~

위의 연산을 통하여 [N x D] 크기의 `X` 데이터를 [N x 100] 크기의 데이터로 압축할 수 있는데, 이때 데이터의 variance가 가능한 한 큰 값을 갖도록 하는 100개의 차원이 선택된다. PCA로 축소된 데이터를 선형 분류기 혹은 신경망에 학습시킴으로써 좋은 성능을 기대할 수 있을 뿐만 아니라, 트레이닝 시간과 사용 메모리 용량에서도 이득을 볼 수 있다.

마지막으로 살펴볼 기법은 **화이트닝(whitening)**으로, 이는 고유기저(eigenbasis) 데이터를 아이겐밸류(eigenvalue) 값으로 나누어 스케일을 정규화하는 기법이다. 기하학적으로 해석하면, 만약 입력 데이터가 multivariable gaussian 분포라면 화이트닝된 데이터는 평균 0, 공분산(covariance) 단위행렬인 gaussian 분포를 갖게 된다. 화이트닝은 다음과 같이 구할 수 있다:

~~~python
# whiten the data:
# divide by the eigenvalues (which are square roots of the singular values)
Xwhite = Xrot / np.sqrt(S + 1e-5)
~~~

*주의: 노이즈 과장(Exaggerating noise).* 위의 식에서 분모가 0이 되는 것을 방지하기 위해서 1e-5(또는 임의의 작은 상수)를 더한 것에 주목하자. 화이트닝 기법의 단점 중 하나는 모든 차원의 데이터를 동일한 크기로 늘리기 때문에, 분산 값이 매우 작아 대부분 노이즈로 해석할 수 있는 차원의 데이터까지 포함되어 데이터 내의 노이즈가 과장되는 효과가 나타난다는 것이다. 이런 경우 보통 (1e-5와 같은 작은 수가 아닌) 더 큰 수를 분모에 더하는 방식으로 스무딩(smoothing) 효과를 추가하여 이러한 노이즈 과장 현상을 완화할 수 있다.
PCA / Whitening. Left: Original toy, 2-dimensional input data. Middle: After performing PCA. The data is centered at zero and then rotated into the eigenbasis of the data covariance matrix. This decorrelates the data (the covariance matrix becomes diagonal). Right: Each dimension is additionally scaled by the eigenvalues, transforming the data covariance matrix into the identity matrix. Geometrically, this corresponds to stretching and squeezing the data into an isotropic gaussian blob.
CIFAR-10 이미지에 위에서 소개된 변환들을 적용하여 각 변환의 효과를 시각화할 수 있다. CIFAR-10 학습 데이터는 50,000 x 3072 크기이며, 각 이미지 데이터는 3072 차원을 갖는 row 벡터로 표현되어 있다. [3072 x 3072] 크기를 갖는 공분산(covariance) 행렬을 구하고 SVD 분해 (연산 시간이 비교적 오래 걸린다)를 한다. 연산을 통하여 구해진 eigenvector는 어떤 특성을 보이는가? 다음의 이미지를 통하여 그 결과를 확인해 볼 수 있다:
Left:An example set of 49 images. 2nd from Left: The top 144 out of 3072 eigenvectors. The top eigenvectors account for most of the variance in the data, and we can see that they correspond to lower frequencies in the images. 2nd from Right: The 49 images reduced with PCA, using the 144 eigenvectors shown here. That is, instead of expressing every image as a 3072-dimensional vector where each element is the brightness of a particular pixel at some location and channel, every image above is only represented with a 144-dimensional vector, where each element measures how much of each eigenvector adds up to make up the image. In order to visualize what image information has been retained in the 144 numbers, we must rotate back into the "pixel" basis of 3072 numbers. Since U is a rotation, this can be achieved by multiplying by U.transpose()[:144,:], and then visualizing the resulting 3072 numbers as the image. You can see that the images are slightly more blurry, reflecting the fact that the top eigenvectors capture lower frequencies. However, most of the information is still preserved. Right: Visualization of the "white" representation, where the variance along every one of the 144 dimensions is squashed to equal length. Here, the whitened 144 numbers are rotated back to image pixel basis by multiplying by U.transpose()[:144,:]. The lower frequencies (which accounted for most variance) are now negligible, while the higher frequencies (which account for relatively little variance originally) become exaggerated.
**실전 응용** 모든 변환 기법을 소개하기 위해 PCA/화이트닝(Whitening)도 함께 살펴보았지만, 콘볼루션 신경망(Convolutional Networks)에서 이 변환들을 사용하는 경우는 거의 없다. 하지만 (평균차감(Mean Subtraction) 기법을 통하여) zero-centered 데이터로 변환하거나 각 픽셀 값을 정규화하는 기법은 일반적으로 흔하게 쓰는 전처리 기법 중 하나이다.

**흔히 하는 실수**. 전처리 기법을 적용함에 있어서 명심해야 하는 중요한 사항은, 전처리를 위한 여러 통계치들은 학습 데이터만을 대상으로 추출하고 그 결과를 검증, 테스트 데이터에 적용해야 한다는 것이다. 예를 들어 평균차감(mean subtraction) 기법을 적용할 때 흔히 하는 실수 중 하나는 전체 데이터를 대상으로 평균차감 처리를 하고 이 데이터를 학습, 검증, 테스트 데이터로 나누어 사용하는 것이다. 올바른 방법은 학습, 검증, 테스트를 위한 데이터를 먼저 나눈 후에 학습 데이터를 대상으로 평균값을 구하고, 그 평균값으로 평균차감 전처리를 모든 데이터군(학습, 검증, 테스트)에 적용하는 것이다.

### 가중치 초기화

우리는 지금까지 신경망(Neural Network) 구조 및 데이터 전처리 기법에 대해 알아보았다. 실제 데이터를 신경망 내에서 학습시키기 전에 해야 하는 작업이 있는데, 바로 파라미터(parameters) 초기화이다.

**실수: 0으로 초기화하기**. 실은 우리가 하지 말아야 하는 방식을 먼저 적용해보자. 학습된 신경망에서 가중치들이 최종적으로 어떤 값으로 수렴해야 하는지 알 수 없지만, 데이터 정규화 기법을 적절하게 적용하면 가중치의 절반은 양수 값, 나머지 절반은 음수 값을 갖는다는 가정을 할 수 있을 것이다. 더 나아가 모든 가중치를 0으로 초기화함으로써 최상의 학습 결과를 얻을 것이라는 아이디어 또한 합리적인 추론으로 보일 수 있다. 하지만 이러한 방법은 명백히 잘못된 방법이라는 것이 밝혀졌다. 왜냐하면 가중치가 0으로 초기화된 신경망 내의 뉴런들은 모두 동일한 연산 결과를 낼 것이고, 따라서 backpropagation 과정에서 동일한 그라디언트(gradient) 값을 얻게 될 것이고, 결과적으로 모든 파라미터(parameter)는 동일한 값으로 업데이트될 것이기 때문이다. 다시 말해, 모든 가중치 값이 동일한 값으로 초기화된다면 뉴런들의 비대칭성(asymmetry)을 야기할 요소가 사라지게 된다.

**0에 가까운 작은 난수**. 위에서 언급한 이야기를 종합하자면, 가중치 값은 가능한 한 0에 가까운 값이어야 하지만 모두 동일하게 0이 되어서는 안 된다는 것이다. 소위 *symmetry breaking*을 사용하는데, 이는 0에 가까운 (하지만 0이 아닌) 값으로 가중치를 초기화시키는 방법이다. 즉, 모든 가중치들을 난수를 이용하여 고유한 값으로 초기화함으로써 각 파라미터 값이 서로 다른 값으로 업데이트되고, 결과적으로 전체 신경망 내에서 서로 다른 특성을 보이는 다양한 부분으로 분화될 수 있다. 가중치 배열은 다음과 같이 구현할 수 있는데 `W = 0.01* np.random.randn(D,H)`, 여기서 `randn`은 평균 0, 표준편차 1인 정규 분포로부터 얻은 값이다. 앞의 공식에 의하면 모든 가중치 벡터는 다차원 정규 분포로부터 추출된 벡터로 초기화되기 때문에, 공간 상에서 각 벡터들은 (특정한 패턴 혹은 방향성 없이) 무작위의 방향성을 갖게 된다. 정규 분포가 아닌 균일 분포(uniform distribution)로부터 추출된 값으로 가중치를 초기화해도 무방하지만, 이 방법이 학습된 최종 성능에 미치는 영향은 미미한 것으로 알려져 있다.

*주의*: 가중치를 0에 가까운 작은 값으로 초기화하는 것이 항상 좋은 성능을 담보하는 것은 아니다. 예를 들어 아주 작은 가중치 값으로 구성된 신경망의 경우 backpropagation 연산 과정에서 그라디언트(gradient) 또한 작은 값을 갖게 된다(그라디언트(gradient)는 가중치 값에 비례하기 때문). 이는 네트워크의 역방향으로 흐르며 전달되는 "그라디언트 시그널(gradient signal)"을 감소시키게 되고, 신경망 학습에 있어서 중요한 문제를 야기하게 된다.

**분산 보정, 1/sqrt(n)**. 위에서 제안한 방법의 문제점 중 하나는, 랜덤 값으로 초기화된 뉴런의 출력 분포가 입력 데이터 수에 비례하여 커지는 분산을 갖는다는 것이다. 가중치 벡터를 *팬인(fan-in)*(입력 데이터 수)의 제곱근 값으로 나누는 연산을 통하여 뉴런 출력의 분산을 1로 정규화할 수 있다. 권장되는 휴리스틱(heuristic) 기법은 뉴런의 가중치 벡터를 다음과 같이 초기화하는 것이다: `w = np.random.randn(n) / sqrt(n)` (n: 입력 수). 이 방법은 근사적으로 동일한 출력 분포를 갖게 할 뿐만 아니라 신경망의 수렴률 또한 향상시키는 것으로 알려져 있다.

이는 다음의 유도 과정을 통해서 확인할 수 있다: 가중치 값을 나타내는 $$ w $$와 입력 데이터를 나타내는 $$ x $$의 내적 연산 $$ s = \sum_i^n w_i x_i $$가 있다고 하자. 이는 비선형 연산 이전 단계에 일어나는 뉴런 연산이고, $$ s $$의 분산은 다음과 같이 구할 수 있다.

$$
\begin{align}
\text{Var}(s) &= \text{Var}(\sum_i^n w_ix_i) \\\\
&= \sum_i^n \text{Var}(w_ix_i) \\\\
&= \sum_i^n [E(w_i)]^2\text{Var}(x_i) + E[(x_i)]^2\text{Var}(w_i) + \text{Var}(x_i)\text{Var}(w_i) \\\\
&= \sum_i^n \text{Var}(x_i)\text{Var}(w_i) \\\\
&= \left( n \text{Var}(w) \right) \text{Var}(x)
\end{align}
$$

처음 2단계는 [분산의 성질](http://en.wikipedia.org/wiki/Variance)을 이용하여 전개하였다. 가중치와 입력 데이터 모두 평균이 0이라고 가정하고 있기 때문에 $$ E[x_i] = E[w_i] = 0 $$이 되고, 따라서 3번째 단계에서 4번째 단계로 전개가 가능하다. 하지만 평균이 0이라는 가정은 일반적으로 모든 상황에서 성립하는 것은 아니라는 점을 명심해야 한다. 일례로 ReLU 유닛은 0보다 큰 평균값을 갖는다. 마지막 단계는 $$ w_i, x_i $$ 모두 동일한 확률 분포(identical distribution)를 갖는다고 가정하여 전개할 수 있다.

위의 유도 과정을 통하여, $$ s $$가 입력 데이터 $$ x $$와 동일한 분산을 갖기 위해서는 초기화 단계에서 모든 가중치 벡터 $$ w $$의 분산을 $$ 1/n $$로 만들어야 한다는 것을 알 수 있다.
또한 확률 변수 $$ X $$, 스칼라(scalar) 값 $$ a $$에 대해서 $$ \text{Var}(aX) = a^2\text{Var}(X) $$이 성립하므로, 분산이 $$ 1/n $$이 되기 위해서는 표준정규분포에서 값을 뽑아서 $$ a = \sqrt{1/n} $$을 곱해주어야 한다는 것을 알 수 있다. 즉, `w = np.random.randn(n) / sqrt(n)`로 가중치를 초기화하면 된다.

이와 유사한 내용의 연구를 Glorot et al.의 [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) 논문에서 확인할 수 있다. 논문의 저자는 $$ \text{Var}(w) = 2/(n_{in} + n_{out}) $$ ($$ n_{in}, n_{out} $$은 각각 이전 레이어, 다음 레이어의 유닛 수)로 초기화할 것을 권고하며 끝맺고 있다. **This is based on a compromise and an equivalent analysis of the backpropagated gradients.** 동일한 주제에 대한 더 최근의 연구는 He et al.의 [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](http://arxiv-web3.library.cornell.edu/abs/1502.01852)에서 확인할 수 있는데, 특히 ReLU 뉴런에 대한 초기화 방법을 다루고 있다. 이 논문에서는 신경망 뉴런의 분산이 $$ 2.0/n $$이 되어야 한다고 결론 내리고 있다. 즉, `w = np.random.randn(n) * sqrt(2.0/n)`을 이용하여 가중치를 초기화하는 것을 의미하며, 이는 특히 ReLU 뉴런이 사용되는 신경망에서 최근에 권장되고 있는 방식이다.

**희소 초기화(Sparse initialization)**. 보정되지 않은 분산 문제를 해결하는 또 다른 방법은 모든 가중치 행렬을 0으로 초기화하되, 대칭성을 깨기 위해서 모든 뉴런을 고정된 수의 아래 단계 뉴런들과 무작위로 연결하는 것이다 **(with weights sampled from a small gaussian as above)**. 연결하는 뉴런의 수는 대략 10개 정도이다.

**bias 초기화**. 가중치에 랜덤한 값을 설정함으로써 대칭성 문제가 해결되기 때문에 bias는 주로 0으로 초기화한다. ReLU 연산의 비선형성에 의해서 몇몇 경우에는 0.01과 같은 작은 상수값을 사용하기도 하는데, 이는 ReLU 연산이 초기부터 fire되고 따라서 그라디언트(gradient)가 유의미한 값을 갖고 신경망을 통해서 전달되는 것을 보장할 수 있기 때문이다. 하지만 상수값을 사용하는 방식이 성능 향상을 언제나 보장하는 것인가에 대해서는 이견이 존재한다(실제 몇몇 사례에서 더 나쁜 결과를 볼 수 있다). 따라서 bias 값은 0으로 초기화하는 것이 더 일반적이라 할 수 있다.

**실전 응용**. ReLU 유닛을 사용하고 `w = np.random.randn(n) * sqrt(2.0/n)`로 초기화하는 것이 요즘의 추세이다 [He et al.](http://arxiv-web3.library.cornell.edu/abs/1502.01852).

**배치 정규화(Batch Normalization)** 최근 Ioffe and Szegedy에 의해서 제안된 [배치 정규화(Batch Normalization)](http://arxiv.org/abs/1502.03167) 기법은 신경망 학습 단계에서 activation 값이 표준정규분포를 갖도록 강제하는 기법으로, 그동안 많은 연구자들을 괴롭혀왔던 초기화 문제의 상당 부분을 해소해 주었다. 여기서 사용하는 정규화가 단순 미분 가능한 연산이기에 적용 가능한 방법이다. 실제 구현에서는 배치 정규화 레이어를 fully-connected 레이어 (혹은 곧 설명하게 될 컨볼루션 레이어) 다음, 비선형 연산 이전에 위치시키는 방식으로 이 기법을 신경망에 적용할 수 있다. 앞에서 링크된 논문에서 배치 정규화(Batch Normalization) 기법을 자세하게 설명하고 있기 때문에 여기에서는 자세히 다루지 않겠지만, 이 기법은 이미 신경망 학습에서 일반적으로 사용되는 기법 중 하나라는 것을 밝혀두는 바이다. 실제 적용 사례를 보면, 배치 정규화(Batch Normalization)를 사용하여 학습한 신경망은 특히 나쁜 초기화의 영향에 강하다는 것이 밝혀졌다. 배치 정규화(Batch Normalization)는 신경망 내의 모든 레이어에서 전처리 과정을 수행하는 것이지만, 미분 가능하다는 성질에 의해서 신경망 내의 학습 단계로 통합되었다고 볼 수 있다.

### Regularization

이번 파트에서는 신경망 학습에서 overfitting을 막을 수 있는 몇 가지 방법을 소개하고자 한다.

**L2 regularization**은 가장 일반적으로 사용되는 regularization 기법이다. 모든 파라미터의 제곱 크기만큼 목적 함수에 제약을 거는 방식으로 구현된다. 다시 말해, 가중치 벡터 $$w$$가 있을 때 목적 함수에 $$\frac{1}{2} \lambda w^2$$를 더한다 (여기서 $$\lambda$$는 regularization의 강도를 의미한다). $$\frac{1}{2}$$이 항상 붙는 것은 regularization 항을 $$w$$로 미분했을 때 $$2 \lambda w$$가 아닌 $$\lambda w$$의 값을 갖도록 하기 위함이다. L2 regularization은 큰 값이 많이 존재하는 가중치에 제약을 주고, 가중치 값을 가능한 한 널리 퍼지도록 하는 효과를 주는 것으로 볼 수 있다. 선형 분류(Linear Classification) 장에서도 이야기했던 것처럼 가중치와 입력 데이터가 곱해지는 연산이므로, 특정 몇 개의 입력 데이터에 강하게 적용되기보다는 모든 입력 데이터에 약하게 적용되도록 하는 것이 일반적이다. gradient descent 업데이트 과정에서 L2 regularization을 적용하면, 모든 가중치가 `W += -lambda * W`의 형태로 0을 향해 선형적으로 감소하게 된다.

**L1 regularization** 또한 상대적으로 많이 사용되는 regularization 기법으로, 가중치 벡터 $$w$$가 있을 때 목적 함수에 $$\lambda \mid w \mid$$를 더한다.
다음과 같이 L1 regularization과 L2 regularization을 동시에 사용할 수도 있다: $$\lambda_1 \mid w \mid + \lambda_2 w^2$$ ([Elastic net regularization](http://web.stanford.edu/~hastie/Papers/B67.2%20%282005%29%20301-320%20Zou%20&%20Hastie.pdf)라고도 불린다). L1 regularization은 최적화 과정 동안 가중치 벡터들을 sparse하게(거의 0에 가깝게) 만드는 흥미로운 특성이 있다. 다시 말해, L1 regularization이 적용된 뉴런들은 결국 입력 데이터의 sparse한 부분만을 사용하고, "noisy" 입력 데이터에 거의 영향을 받지 않는다. 이에 반해, L2 regularization을 적용하면 최종 가중치 벡터들은 작은 값들이 퍼져있는 형태로 나타나게 된다. 실제 신경망 학습에 적용할 때, 만약 특정한 feature selection 후 학습하는 것이 아니라면 많은 경우에 L2 regularization을 사용하는 것이 훨씬 좋은 성능을 기대할 수 있다.

**Max norm constraints**. 또 다른 regularization 기법 중 하나로, 가중치 벡터의 길이가 미리 정해 놓은 상한 값을 넘지 못하도록 제한하면서 gradient descent 연산도 제한된 조건 안에서만 계산하도록 하는 projected gradient descent를 사용한다. 신경망 학습에 실제 적용하는 방법은, 먼저 일반적인 방법으로 파라미터를 업데이트하고, 모든 뉴런의 가중치 벡터 $$\vec{w}$$에 대해서 $$\Vert \vec{w} \Vert_2 < c$$를 만족하도록 제한을 가하는 것이다. 일반적으로 c 값은 3 혹은 4로 설정한다. 이 regularization 기법을 적용한 몇몇 연구를 통하여 성능 향상이 있음이 알려졌다. 이 기법의 흥미로운 사실 중 하나는 학습률(learning rate)을 큰 값으로 설정하고 학습시키더라도 신경망이 "explode"하지 않는다는 것인데, 이는 업데이트될 때마다 가중치가 제한된 범위 내의 값을 갖기 때문이다.

**Dropout**. [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf)에서 Srivastava et al.에 의해 최근 제안된 기법으로, 간단하지만 아주 효과적인 regularization 방법이며 위에서 소개한 다른 regularization 기법들(L1, L2, maxnorm)과 상호 보완적인 방법으로 알려져 있다. 학습 과정에서 각 뉴런들을 $$p$$의 확률로 활성화시켜 학습에 적용하는 방식으로 구현할 수 있다.
Figure taken from the Dropout paper that illustrates the idea. During training, Dropout can be interpreted as sampling a Neural Network within the full Neural Network, and only updating the parameters of the sampled network based on the input data. (However, the exponential number of possible sampled networks are not independent because they share the parameters.) During testing there is no dropout applied, with the interpretation of evaluating an averaged prediction across the exponentially-sized ensemble of all sub-networks (more about ensembles in the next section).
3-레이어 신경망 회로에 적용된 Vanilla dropout 예제를 아래 구현하였다.

~~~python
""" Vanilla Dropout: Not recommended implementation (see notes below) """

p = 0.5 # probability of keeping a unit active. higher = less dropout

def train_step(X):
  """ X contains the data """

  # forward pass for example 3-layer neural network
  H1 = np.maximum(0, np.dot(W1, X) + b1)
  U1 = np.random.rand(*H1.shape) < p # first dropout mask
  H1 *= U1 # drop!
  H2 = np.maximum(0, np.dot(W2, H1) + b2)
  U2 = np.random.rand(*H2.shape) < p # second dropout mask
  H2 *= U2 # drop!
  out = np.dot(W3, H2) + b3

  # backward pass: compute gradients... (not shown)
  # perform parameter update... (not shown)

def predict(X):
  # ensembled forward pass
  H1 = np.maximum(0, np.dot(W1, X) + b1) * p # NOTE: scale the activations
  H2 = np.maximum(0, np.dot(W2, H1) + b2) * p # NOTE: scale the activations
  out = np.dot(W3, H2) + b3
~~~

`train_step` 함수를 보면 첫 번째 히든 레이어와 두 번째 히든 레이어, 총 2곳에 dropout이 적용된 것을 볼 수 있다. 물론 입력 데이터 `X`를 위한 p=0.5 마스크를 만들어 입력단에도 dropout을 적용할 수 있다. 역전파(backward pass) 과정은 forward에서 사용된 `U1, U2`를 사용하여 수행한다.

`predict` 함수에서는 dropout을 적용하지 않았지만, 히든 레이어 출력 데이터에 $$p$$만큼 스케일링한 것에 주목할 필요가 있다. 테스트 과정에서 모든 뉴런은 모든 입력 데이터를 받기 때문에, 학습 과정에서 얻을 수 있는 출력값과 동일한 조건으로 맞추어 보정해야 한다. dropout 확률 $$p = 0.5$$인 경우를 가정해 보자. 테스트 과정 동안 뉴런의 출력 값은 모두 1/2만큼 줄어들어야 하는데, 이는 학습 과정 동안의 뉴런 출력 데이터의 기대값과 동일하게 맞추기 위함이다. dropout을 적용하지 않았을 때 출력 $$x$$를 내는 뉴런이 있다고 가정하자. dropout을 적용하면 이 뉴런 출력의 기대값은 $$px + (1-p)0$$가 되는데, 이는 $$1-p$$의 확률로 뉴런의 출력 데이터 값이 0이 되기 때문이다. 테스트 과정에서는 모든 뉴런을 사용하기 때문에 동일한 기대값을 갖기 위해서는 $$x \rightarrow px$$로 보정해 주어야 한다. 또 다른 관점에서 보면, $$p$$만큼 값을 줄이는 과정은 모든 가능한 dropout 마스크를 적용한 후 그 결과를 이용하여 ensemble prediction을 수행하는 것으로 해석할 수 있다.

위에서 소개한 방법은 테스트 과정에서 뉴런 출력에 $$p$$를 곱하는 연산을 수행해야 하는데, 이는 원하지 않는 방식인 경우가 많다. 테스트 과정에서의 성능은 매우 중요한 이슈이기 때문에 많은 경우에 **inverted dropout** 방식이 더 선호된다. 이는 스케일링 연산을 학습 과정에서 적용하고, 테스트 과정에서는 추가적인 스케일링 연산 없이 바로 사용하는 방식이다. 이 기법의 또 다른 장점은, 만약 dropout을 수정하기로 했을 때 prediction 코드에는 여전히 변화가 없다는 것이다. Inverted dropout은 다음과 같이 구현할 수 있다.

~~~python
"""
Inverted Dropout: Recommended implementation example.
We drop and scale at train time and don't do anything at test time.
"""

p = 0.5 # probability of keeping a unit active. higher = less dropout

def train_step(X):
  # forward pass for example 3-layer neural network
  H1 = np.maximum(0, np.dot(W1, X) + b1)
  U1 = (np.random.rand(*H1.shape) < p) / p # first dropout mask. Notice /p!
  H1 *= U1 # drop!
  H2 = np.maximum(0, np.dot(W2, H1) + b2)
  U2 = (np.random.rand(*H2.shape) < p) / p # second dropout mask. Notice /p!
  H2 *= U2 # drop!
  out = np.dot(W3, H2) + b3

  # backward pass: compute gradients... (not shown)
  # perform parameter update... (not shown)

def predict(X):
  # ensembled forward pass
  H1 = np.maximum(0, np.dot(W1, X) + b1) # no scaling necessary
  H2 = np.maximum(0, np.dot(W2, H1) + b2)
  out = np.dot(W3, H2) + b3
~~~

dropout이 처음 소개된 이후로, 실제 적용 사례에서 나타난 성능 향상의 근본 원인과 기존의 다른 regularization 기법과의 관계 등에 대한 수많은 연구가 진행되었다. 관련하여 다음의 자료들을 읽어보는 것이 도움이 될 것이다:

- [Dropout 논문](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf) by Srivastava et al. 2014.
- [Dropout Training as Adaptive Regularization](http://papers.nips.cc/paper/4882-dropout-training-as-adaptive-regularization.pdf): "we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix".
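Inverted dropout이 출력의 기대값을 보존한다는 사실은 아래와 같이 간단히 수치적으로 확인해 볼 수 있다 (이해를 돕기 위해 추가한 예시 코드이다):

~~~python
import numpy as np

np.random.seed(0)
p = 0.5
h = np.ones(1000000)                     # 모든 활성값이 1인 가상의 히든 레이어 출력
mask = (np.random.rand(h.size) < p) / p  # inverted dropout 마스크 (/p 에 주목)
print((h * mask).mean())                 # 약 1.0: dropout 전의 기대값이 유지된다
~~~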
**forward pass에서의 노이즈**. 넓은 의미에서 보자면 dropout은 신경망의 forward pass에 stochastic(확률적) 접근을 도입하는 것으로 볼 수 있다. testing 과정에서는 노이즈가 감소하게 되는데, *분석적 해석*은 `확률 $$p$$만큼 곱해진 결과`라고 볼 수 있고, *수치적 해석*은 `랜덤하게 선택된 forward pass를 여러 차례 수행한 결과의 평균`이라고 볼 수 있다. 동일한 관점에서의 연구들 중 하나인 [DropConnect](http://cs.nyu.edu/~wanli/dropc/)는 forward pass 동안 가중치 값을 0으로 설정하는 것으로 볼 수 있다. Convolutional 신경망에서는 dropout과 함께 stochastic(확률적) 풀링(pooling), 부분 풀링, 데이터 augmentation 등의 기법을 같이 사용하여 추가적인 성능 향상을 기대할 수 있다. 이에 대해서는 뒤에서 더 자세히 살펴볼 것이다.

**Bias regularization**. Linear Classification 파트에서 설명했듯이, bias 텀에는 regularization을 적용하지 않는 것이 일반적인데, 이는 학습된 가중치와 곱셈 연산을 하지 않기 때문에 목적 함수에서 데이터 dimension의 영향력을 결정하는 요소로 작용하지 않기 때문이다. 그러나 실제 적용 사례들을 보면 bias 텀에 regularization을 적용하였을 때 심각한 성능 저하가 나타나는 경우는 극히 드문 것으로 알려져 있다. 이는 모든 가중치 텀의 개수와 비교했을 때 bias 텀의 개수는 무시할 만한 수준이어서, so the classifier can "afford to" use the biases if it needs them to obtain a better data loss.

**레이어별 정규화**. 마지막 출력 레이어를 제외하고 레이어를 각각 따로 정규화하는 것은 일반적인 방법이 아니다. 레이어별 정규화를 적용한 논문 수도 상대적으로 매우 적은 편이다.

**실전 응용**: 하나의 공통된 L2 정규화를 사용하는 것이 일반적이다. 또한 모든 레이어 이후에 dropout을 적용하는 것 또한 일반적으로 많이 사용된다. dropout rate로는 $$p = 0.5$$가 주로 사용되지만, validation 과정에서 값을 조정하기도 한다.

### Loss functions

We have discussed the regularization loss part of the objective, which can be seen as penalizing some measure of complexity of the model. The second part of an objective is the *data loss*, which in a supervised learning problem measures the compatibility between a prediction (e.g. the class scores in classification) and the ground truth label. The data loss takes the form of an average over the data losses for every individual example. That is, $L = \frac{1}{N} \sum_i L_i$ where $N$ is the number of training data. Let's abbreviate $f = f(x_i; W)$ to be the activations of the output layer in a Neural Network. There are several types of problems you might want to solve in practice:

**Classification** is the case that we have so far discussed at length. Here, we assume a dataset of examples and a single correct label (out of a fixed set) for each example. One of the two most commonly seen cost functions in this setting is the SVM (e.g. the Weston Watkins formulation):

$$
L_i = \sum_{j\neq y_i} \max(0, f_j - f_{y_i} + 1)
$$

As we briefly alluded to, some people report better performance with the squared hinge loss (i.e. instead using $\max(0, f_j - f_{y_i} + 1)^2$). The second common choice is the Softmax classifier that uses the cross-entropy loss:

$$
L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right)
$$

**Problem: Large number of classes**. When the set of labels is very large (e.g. words in English dictionary, or ImageNet which contains 22,000 categories), it may be helpful to use *Hierarchical Softmax* (see one explanation [here](http://arxiv.org/pdf/1310.4546.pdf) (pdf)). The hierarchical softmax decomposes labels into a tree. Each label is then represented as a path along the tree, and a Softmax classifier is trained at every node of the tree to disambiguate between the left and right branch. The structure of the tree strongly impacts the performance and is generally problem-dependent.

**Attribute classification**. Both losses above assume that there is a single correct answer $y_i$. But what if $y_i$ is a binary vector where every example may or may not have a certain attribute, and where the attributes are not exclusive?
For example, images on Instagram can be thought of as labeled with a certain subset of hashtags from a large set of all hashtags, and an image may contain multiple. A sensible approach in this case is to build a binary classifier for every single attribute independently. For example, a binary classifier for each category independently would take the form: + +$$ +L_i = \sum_j \max(0, 1 - y_{ij} f_j) +$$ + +where the sum is over all categories $j$, and $y_{ij}$ is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute, and the score vector $f_j$ will be positive when the class is predicted to be present and negative otherwise. Notice that loss is accumulated if a positive example has score less than +1, or when a negative example has score greater than -1. + +An alternative to this loss would be to train a logistic regression classifier for every attribute independently. A binary logistic regression classifier has only two classes (0,1), and calculates the probability of class 1 as: + +$$ +P(y = 1 \mid x; w, b) = \frac{1}{1 + e^{-(w^Tx +b)}} = \sigma (w^Tx + b) +$$ + +Since the probabilities of class 1 and 0 sum to one, the probability for class 0 is $P(y = 0 \mid x; w, b) = 1 - P(y = 1 \mid x; w,b)$. Hence, an example is classified as a positive example (y = 1) if $\sigma (w^Tx + b) > 0.5$, or equivalently if the score $w^Tx +b > 0$. The loss function then maximizes the log likelihood of this probability. You can convince yourself that this simplifies to: + +$$ +L_i = \sum_j y_{ij} \log(\sigma(f_j)) + (1 - y_{ij}) \log(1 - \sigma(f_j)) +$$ + +where the labels $y_{ij}$ are assumed to be either 1 (positive) or 0 (negative), and $\sigma(\cdot)$ is the sigmoid function. The expression above can look scary but the gradient on $f$ is in fact extremely simple and intuitive: $\partial{L_i} / \partial{f_j} = y_{ij} - \sigma(f_j)$ (as you can double check yourself by taking the derivatives). + +**Regression** is the task of predicting real-valued quantities, such as the price of houses or the length of something in an image. For this task, it is common to compute the loss between the predicted quantity and the true answer and then measure the L2 squared norm, or L1 norm of the difference. The L2 norm squared would compute the loss for a single example of the form: + +$$ +L_i = \Vert f - y_i \Vert_2^2 +$$ + +The reason the L2 norm is squared in the objective is that the gradient becomes much simpler, without changing the optimal parameters since squaring is a monotonic operation. The L1 norm would be formulated by summing the absolute value along each dimension: + +$$ +L_i = \Vert f - y_i \Vert_1 = \sum_j \mid f_j - (y_i)_j \mid +$$ + +where the sum $\sum_j$ is a sum over all dimensions of the desired prediction, if there is more than one quantity being predicted. Looking at only the j-th dimension of the i-th example and denoting the difference between the true and the predicted value by $\delta_{ij}$, the gradient for this dimension (i.e. $\partial{L_i} / \partial{f_j}$) is easily derived to be either $\delta_{ij}$ with the L2 norm, or $sign(\delta_{ij})$. That is, the gradient on the score will either be directly proportional to the difference in the error, or it will be fixed and only inherit the sign of the difference. + +*Word of caution*: It is important to note that the L2 loss is much harder to optimize than a more stable loss such as Softmax. 
Intuitively, it requires a very fragile and specific property from the network to output exactly one correct value for each input (and its augmentations). Notice that this is not the case with Softmax, where the precise value of each score is less important: It only matters that their magnitudes are appropriate. Additionally, the L2 loss is less robust because outliers can introduce huge gradients. When faced with a regression problem, first consider if it is absolutely inadequate to quantize the output into bins. For example, if you are predicting star rating for a product, it might work much better to use 5 independent classifiers for ratings of 1-5 stars instead of a regression loss. Classification has the additional benefit that it can give you a distribution over the regression outputs, not just a single output with no indication of its confidence. If you're certain that classification is not appropriate, use the L2 but be careful: For example, the L2 is more fragile and applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea. + +> When faced with a regression task, first consider if it is absolutely necessary. Instead, have a strong preference to discretizing your outputs to bins and perform classification over them whenever possible. + +**Structured prediction**. The structured loss refers to a case where the labels can be arbitrary structures such as graphs, trees, or other complex objects. Usually it is also assumed that the space of structures is very large and not easily enumerable. The basic idea behind the structured SVM loss is to demand a margin between the correct structure $y_i$ and the highest-scoring incorrect structure. It is not common to solve this problem as a simple unconstrained optimization problem with gradient descent. Instead, special solvers are usually devised so that the specific simplifying assumptions of the structure space can be taken advantage of. We mention the problem briefly but consider the specifics to be outside of the scope of the class. + + + +## Summary + +In summary: + +- The recommended preprocessing is to center the data to have mean of zero, and normalize its scale to [-1, 1] along each feature +- Initialize the weights by drawing them from a gaussian distribution with standard deviation of $\sqrt{2/n}$, where $n$ is the number of inputs to the neuron. E.g. in numpy: `w = np.random.randn(n) * sqrt(2.0/n)`. +- Use L2 regularization and dropout (the inverted version) +- Use batch normalization +- We discussed different tasks you might want to perform in practice, and the most common loss functions for each task + +We've now preprocessed the data and set up and initialized the model. In the next section we will look at the learning process and its dynamics. + +--- +

+번역: 서종한 (salopge) +

diff --git a/neural-networks-2.md b/neural-networks-2.md index 6c064227..b64589ff 100644 --- a/neural-networks-2.md +++ b/neural-networks-2.md @@ -29,57 +29,57 @@ There are three common forms of data preprocessing a data matrix `X`, where we w In case of images, the relative scales of pixels are already approximately equal (and in range from 0 to 255), so it is not strictly necessary to perform this additional preprocessing step.
Common data preprocessing pipeline. Left: Original toy, 2-dimensional input data. Middle: The data is zero-centered by subtracting the mean in each dimension. The data cloud is now centered around the origin. Right: Each dimension is additionally scaled by its standard deviation. The red lines indicate the extent of the data - they are of unequal length in the middle, but of equal length on the right.
**PCA and Whitening** is another form of preprocessing. In this process, the data is first centered as described above. Then, we can compute the covariance matrix that tells us about the correlation structure in the data:

-```python
+~~~python
# Assume input data matrix X of size [N x D]
X -= np.mean(X, axis = 0) # zero-center the data (important)
cov = np.dot(X.T, X) / X.shape[0] # get the data covariance matrix
-```
+~~~

The (i,j) element of the data covariance matrix contains the *covariance* between i-th and j-th dimension of the data. In particular, the diagonal of this matrix contains the variances. Furthermore, the covariance matrix is symmetric and [positive semi-definite](http://en.wikipedia.org/wiki/Positive-definite_matrix#Negative-definite.2C_semidefinite_and_indefinite_matrices). We can compute the SVD factorization of the data covariance matrix:

-```python
+~~~python
U,S,V = np.linalg.svd(cov)
-```
+~~~

where the columns of `U` are the eigenvectors and `S` is a 1-D array of the singular values (which are equal to the eigenvalues squared since `cov` is symmetric and positive semi-definite). To decorrelate the data, we project the original (but zero-centered) data into the eigenbasis:

-```python
+~~~python
Xrot = np.dot(X, U) # decorrelate the data
-```
+~~~

Notice that the columns of `U` are a set of orthonormal vectors (norm of 1, and orthogonal to each other), so they can be regarded as basis vectors. The projection therefore corresponds to a rotation of the data in `X` so that the new axes are the eigenvectors. If we were to compute the covariance matrix of `Xrot`, we would see that it is now diagonal. A nice property of `np.linalg.svd` is that in its returned value `U`, the eigenvector columns are sorted by their eigenvalues. We can use this to reduce the dimensionality of the data by only using the top few eigenvectors, and discarding the dimensions along which the data has no variance. This is also sometimes referred to as [Principal Component Analysis (PCA)](http://en.wikipedia.org/wiki/Principal_component_analysis) dimensionality reduction:

-```python
+~~~python
Xrot_reduced = np.dot(X, U[:,:100]) # Xrot_reduced becomes [N x 100]
-```
+~~~

After this operation, we would have reduced the original dataset of size [N x D] to one of size [N x 100], keeping the 100 dimensions of the data that contain the most variance. It is very often the case that you can get very good performance by training linear classifiers or neural networks on the PCA-reduced datasets, obtaining savings in both space and time.

The last transformation you may see in practice is **whitening**. The whitening operation takes the data in the eigenbasis and divides every dimension by the eigenvalue to normalize the scale. The geometric interpretation of this transformation is that if the input data is a multivariable gaussian, then the whitened data will be a gaussian with zero mean and identity covariance matrix. This step would take the form:

-```python
+~~~python
# whiten the data:
# divide by the eigenvalues (which are square roots of the singular values)
Xwhite = Xrot / np.sqrt(S + 1e-5)
-```
+~~~

*Warning: Exaggerating noise.* Note that we're adding 1e-5 (or a small constant) to prevent division by zero. One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input.
This can in practice be mitigated by stronger smoothing (i.e. increasing 1e-5 to be a larger number).
PCA / Whitening. Left: Original toy, 2-dimensional input data. Middle: After performing PCA. The data is centered at zero and then rotated into the eigenbasis of the data covariance matrix. This decorrelates the data (the covariance matrix becomes diagonal). Right: Each dimension is additionally divided by the square root of its eigenvalue, transforming the data covariance matrix into the identity matrix. Geometrically, this corresponds to stretching and squeezing the data into an isotropic gaussian blob.
We can also try to visualize these transformations with CIFAR-10 images. The training set of CIFAR-10 is of size 50,000 x 3072, where every image is stretched out into a 3072-dimensional row vector. We can then compute the [3072 x 3072] covariance matrix and calculate its SVD factorization (which can be relatively expensive). What do the computed eigenvectors look like visually? An image might help:
Left: An example set of 49 images. 2nd from Left: The top 144 out of 3072 eigenvectors. The top eigenvectors account for most of the variance in the data, and we can see that they correspond to lower frequencies in the images. 2nd from Right: The 49 images reduced with PCA, using the 144 eigenvectors shown here. That is, instead of expressing every image as a 3072-dimensional vector where each element is the brightness of a particular pixel at some location and channel, every image above is represented with only a 144-dimensional vector, where each element measures how strongly each eigenvector contributes to making up the image. In order to visualize what image information has been retained in the 144 numbers, we must rotate back into the "pixel" basis of 3072 numbers. Since U is a rotation, this can be achieved by multiplying by U.transpose()[:144,:], and then visualizing the resulting 3072 numbers as the image. You can see that the images are slightly blurrier, reflecting the fact that the top eigenvectors capture lower frequencies. However, most of the information is still preserved. Right: Visualization of the "white" representation, where the variance along every one of the 144 dimensions is squashed to be equal. Here, the whitened 144 numbers are rotated back to the image pixel basis by multiplying by U.transpose()[:144,:]. The lower frequencies (which accounted for most of the variance) are now negligible, while the higher frequencies (which originally accounted for relatively little variance) become exaggerated.
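The reduce-and-reconstruct round trip described in this caption is compact enough to sketch in code. The following is a hedged illustration (not code from the notes), assuming `X` holds zero-centered image rows and `U` comes from `np.linalg.svd(cov)` as in the snippets above:

~~~python
import numpy as np

def pca_roundtrip(X, U, k=144):
    # project onto the top-k eigenvectors, then rotate back to the pixel basis
    Xrot_reduced = np.dot(X, U[:, :k])           # [N x k] coefficients in the eigenbasis
    X_recon = np.dot(Xrot_reduced, U[:, :k].T)   # same as multiplying by U.transpose()[:k,:]
    return X_recon                               # slightly blurred versions of the rows of X
~~~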
@@ -100,21 +100,21 @@ We have seen how to construct a Neural Network architecture, and how to preproce

**Calibrating the variances with 1/sqrt(n)**. One problem with the above suggestion is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that we can normalize the variance of each neuron's output to 1 by scaling its weight vector by the square root of its *fan-in* (i.e. its number of inputs). That is, the recommended heuristic is to initialize each neuron's weight vector as: `w = np.random.randn(n) / sqrt(n)`, where `n` is the number of its inputs. This ensures that all neurons in the network initially have approximately the same output distribution and empirically improves the rate of convergence.

-The sketch of the derivation is as follows: Consider the inner product \\(s = \sum\_i^n w\_i x\_i\\) between the weights \\(w\\) and input \\(x\\), which gives the raw activation of a neuron before the non-linearity. We can examine the variance of \\(s\\):
+The sketch of the derivation is as follows: Consider the inner product $s = \sum_i^n w_i x_i$ between the weights $w$ and input $x$, which gives the raw activation of a neuron before the non-linearity. We can examine the variance of $s$:

$$
\begin{align}
-\text{Var}(s) &= \text{Var}(\sum\_i^n w\_ix\_i) \\\\
-&= \sum\_i^n \text{Var}(w\_ix\_i) \\\\
-&= \sum\_i^n [E(w\_i)]^2\text{Var}(x\_i) + E[(x\_i)]^2\text{Var}(w\_i) + \text{Var}(x\_i)\text{Var}(w\_i) \\\\
-&= \sum\_i^n \text{Var}(x\_i)\text{Var}(w\_i) \\\\
+\text{Var}(s) &= \text{Var}(\sum_i^n w_ix_i) \\\\
+&= \sum_i^n \text{Var}(w_ix_i) \\\\
+&= \sum_i^n [E(w_i)]^2\text{Var}(x_i) + [E(x_i)]^2\text{Var}(w_i) + \text{Var}(x_i)\text{Var}(w_i) \\\\
+&= \sum_i^n \text{Var}(x_i)\text{Var}(w_i) \\\\
&= \left( n \text{Var}(w) \right) \text{Var}(x)
\end{align}
$$

-where in the first 2 steps we have used [properties of variance](http://en.wikipedia.org/wiki/Variance). In third step we assumed zero mean inputs and weights, so \\(E[x\_i] = E[w\_i] = 0\\). Note that this is not generally the case: For example ReLU units will have a positive mean. In the last step we assumed that all \\(w\_i, x\_i\\) are identically distributed. From this derivation we can see that if we want \\(s\\) to have the same variance as all of its inputs \\(x\\), then during initialization we should make sure that the variance of every weight \\(w\\) is \\(1/n\\). And since \\(\text{Var}(aX) = a^2\text{Var}(X)\\) for a random variable \\(X\\) and a scalar \\(a\\), this implies that we should draw from unit gaussian and then scale it by \\(a = \sqrt{1/n}\\), to make its variance \\(1/n\\). This gives the initialization `w = np.random.randn(n) / sqrt(n)`.
+where in the first 2 steps we have used [properties of variance](http://en.wikipedia.org/wiki/Variance). In the third step we assumed zero-mean inputs and weights, so $E[x_i] = E[w_i] = 0$. Note that this is not generally the case: for example, ReLU units will have a positive mean. In the last step we assumed that all $w_i, x_i$ are identically distributed. From this derivation we can see that if we want $s$ to have the same variance as all of its inputs $x$, then during initialization we should make sure that the variance of every weight $w$ is $1/n$. And since $\text{Var}(aX) = a^2\text{Var}(X)$ for a random variable $X$ and a scalar $a$, this implies that we should draw from a unit gaussian and then scale it by $a = \sqrt{1/n}$ to make its variance $1/n$. This gives the initialization `w = np.random.randn(n) / sqrt(n)`.
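As a quick empirical companion to this derivation (a hedged sketch, not part of the notes), one can check in numpy that the `1/sqrt(n)` scaling keeps the variance of the raw activation $s$ roughly equal to the input variance:

~~~python
import numpy as np

n, trials = 512, 10000
x = np.random.randn(trials, n)                # unit-gaussian inputs, Var(x) ~= 1
W = np.random.randn(trials, n) / np.sqrt(n)   # the calibrated initialization
s = np.sum(W * x, axis=1)                     # one raw activation per trial
print(np.var(x), np.var(s))                   # both should come out close to 1.0
~~~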
-A similar analysis is carried out in [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot et al. In this paper, the authors end up recommending an initialization of the form \\( \text{Var}(w) = 2/(n\_{in} + n \_{out}) \\) where \\(n\_{in}, n\_{out}\\) are the number of units in the previous layer and the next layer. This is motivated by based on a compromise and an equivalent analysis of the backpropagated gradients. A more recent paper on this topic, [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](http://arxiv-web3.library.cornell.edu/abs/1502.01852) by He et al., derives an initialization specifically for ReLU neurons, reaching the conclusion that the variance of neurons in the network should be \\(2.0/n\\). This gives the initialization `w = np.random.randn(n) * sqrt(2.0/n)`, and is the current recommendation for use in practice in the specific case of neural networks with ReLU neurons.
+A similar analysis is carried out in [Understanding the difficulty of training deep feedforward neural networks](http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf) by Glorot et al. In this paper, the authors end up recommending an initialization of the form $\text{Var}(w) = 2/(n_{in} + n_{out})$ where $n_{in}, n_{out}$ are the number of units in the previous layer and the next layer. This is motivated by a compromise between this analysis and an equivalent analysis of the backpropagated gradients. A more recent paper on this topic, [Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification](http://arxiv-web3.library.cornell.edu/abs/1502.01852) by He et al., derives an initialization specifically for ReLU neurons, reaching the conclusion that the variance of neurons in the network should be $2.0/n$. This gives the initialization `w = np.random.randn(n) * sqrt(2.0/n)`, and is the current recommendation for use in practice in the specific case of neural networks with ReLU neurons.

**Sparse initialization**. Another way to address the uncalibrated variances problem is to set all weight matrices to zero, but to break symmetry every neuron is randomly connected (with weights sampled from a small gaussian as above) to a fixed number of neurons below it. A typical number of neurons to connect to may be as small as 10.

@@ -130,22 +130,22 @@ A similar analysis is carried out in [Understanding the difficulty of training d

There are several ways of controlling the capacity of Neural Networks to prevent overfitting:

-**L2 regularization** is perhaps the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight \\(w\\) in the network, we add the term \\(\frac{1}{2} \lambda w^2\\) to the objective, where \\(\lambda\\) is the regularization strength. It is common to see the factor of \\(\frac{1}{2}\\) in front because then the gradient of this term with respect to the parameter \\(w\\) is simply \\(\lambda w\\) instead of \\(2 \lambda w\\). The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors.
As we discussed in the Linear Classification section, due to multiplicative interactions between weights and inputs this has the appealing property of encouraging the network to use all of its inputs a little rather that some of its inputs a lot. Lastly, notice that during gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly: `W += -lambda * W` towards zero.
+**L2 regularization** is perhaps the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight $w$ in the network, we add the term $\frac{1}{2} \lambda w^2$ to the objective, where $\lambda$ is the regularization strength. It is common to see the factor of $\frac{1}{2}$ in front because then the gradient of this term with respect to the parameter $w$ is simply $\lambda w$ instead of $2 \lambda w$. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. As we discussed in the Linear Classification section, due to multiplicative interactions between weights and inputs this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot. Lastly, notice that during the gradient descent parameter update, using the L2 regularization ultimately means that every weight is decayed linearly towards zero: `W += -lambda * W`.

-**L1 regularization** is another relatively common form of regularization, where for each weight \\(w\\) we add the term \\(\lambda \mid w \mid\\) to the objective. It is possible to combine the L1 regularization with the L2 regularization: \\(\lambda\_1 \mid w \mid + \lambda\_2 w^2\\) (this is called [Elastic net regularization](http://web.stanford.edu/~hastie/Papers/B67.2%20%282005%29%20301-320%20Zou%20&%20Hastie.pdf)). The L1 regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e. very close to exactly zero). In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the "noisy" inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1.
+**L1 regularization** is another relatively common form of regularization, where for each weight $w$ we add the term $\lambda \mid w \mid$ to the objective. It is possible to combine the L1 regularization with the L2 regularization: $\lambda_1 \mid w \mid + \lambda_2 w^2$ (this is called [Elastic net regularization](http://web.stanford.edu/~hastie/Papers/B67.2%20%282005%29%20301-320%20Zou%20&%20Hastie.pdf)). The L1 regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e. very close to exactly zero). In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the "noisy" inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1.

-**Max norm constraints**.
Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector \\(\vec{w}\\) of every neuron to satisfy \\(\Vert \vec{w} \Vert\_2 < c\\). Typical values of \\(c\\) are on orders of 3 or 4. Some people report improvements when using this form of regularization. One of its appealing properties is that network cannot "explode" even when the learning rates are set too high because the updates are always bounded.
+**Max norm constraints**. Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector $\vec{w}$ of every neuron to satisfy $\Vert \vec{w} \Vert_2 < c$. Typical values of $c$ are on the order of 3 or 4. Some people report improvements when using this form of regularization. One of its appealing properties is that the network cannot "explode" even when the learning rates are set too high, because the updates are always bounded.
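As a concrete illustration of this projection step (a hedged sketch assuming each row of a hypothetical `W` holds one neuron's incoming weight vector; not code from the notes), the clamp after a parameter update might look like:

~~~python
import numpy as np

def clamp_max_norm(W, c=3.0):
    # rescale only the rows whose L2 norm exceeds the bound c
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.minimum(1.0, c / (norms + 1e-8))
~~~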
-**Dropout** is an extremely effective, simple and recently introduced regularization technique by Srivastava et al. in [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf) (pdf) that complements the other methods (L1, L2, maxnorm). While training, dropout is implemented by only keeping a neuron active with some probability \\(p\\) (a hyperparameter), or setting it to zero otherwise.
+**Dropout** is an extremely effective, simple and recently introduced regularization technique by Srivastava et al. in [Dropout: A Simple Way to Prevent Neural Networks from Overfitting](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf) (pdf) that complements the other methods (L1, L2, maxnorm). While training, dropout is implemented by keeping a neuron active only with some probability $p$ (a hyperparameter), and setting it to zero otherwise.
Figure taken from the Dropout paper that illustrates the idea. During training, Dropout can be interpreted as sampling a Neural Network within the full Neural Network, and only updating the parameters of the sampled network based on the input data. (However, the exponentially many possible sampled networks are not independent, because they share the parameters.) During testing there is no dropout applied; the interpretation is that we evaluate an averaged prediction across the exponentially-sized ensemble of all sub-networks (more about ensembles in the next section).
Vanilla dropout in an example 3-layer Neural Network would be implemented as follows:

-```python
+~~~python
""" Vanilla Dropout: Not recommended implementation (see notes below) """

p = 0.5 # probability of keeping a unit active. higher = less dropout
@@ -170,15 +170,15 @@ def predict(X):
  H1 = np.maximum(0, np.dot(W1, X) + b1) * p # NOTE: scale the activations
  H2 = np.maximum(0, np.dot(W2, H1) + b2) * p # NOTE: scale the activations
  out = np.dot(W3, H2) + b3
-```
+~~~

In the code above, inside the `train_step` function we have performed dropout twice: on the first hidden layer and on the second hidden layer. It is also possible to perform dropout right on the input layer, in which case we would also create a binary mask for the input `X`. The backward pass remains unchanged, but of course has to take into account the generated masks `U1,U2`.

-Crucially, note that in the `predict` function we are not dropping anymore, but we are performing a scaling of both hidden layer outputs by \\(p\\). This is important because at test time all neurons see all their inputs, so we want the outputs of neurons at test time to be identical to their expected outputs at training time. For example, in case of \\(p = 0.5\\), the neurons must halve their outputs at test time to have the same output as they had during training time (in expectation). To see this, consider an output of a neuron \\(x\\) (before dropout). With dropout, the expected output from this neuron will become \\(px + (1-p)0\\), because the neuron's output will be set to zero with probability \\(1-p\\). At test time, when we keep the neuron always active, we must adjust \\(x \rightarrow px\\) to keep the same expected output. It can also be shown that performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction.
+Crucially, note that in the `predict` function we are not dropping anymore, but we are performing a scaling of both hidden layer outputs by $p$. This is important because at test time all neurons see all their inputs, so we want the outputs of neurons at test time to be identical to their expected outputs at training time. For example, in case of $p = 0.5$, the neurons must halve their outputs at test time to have the same output as they had during training time (in expectation). To see this, consider an output of a neuron $x$ (before dropout). With dropout, the expected output from this neuron will become $px + (1-p)0$, because the neuron's output will be set to zero with probability $1-p$. At test time, when we keep the neuron always active, we must adjust $x \rightarrow px$ to keep the same expected output. It can also be shown that performing this attenuation at test time can be related to the process of iterating over all the possible binary masks (and therefore all the exponentially many sub-networks) and computing their ensemble prediction.

-The undesirable property of the scheme presented above is that we must scale the activations by \\(p\\) at test time. Since test-time performance is so critical, it is always preferable to use **inverted dropout**, which performs the scaling at train time, leaving the forward pass at test time untouched. Additionally, this has the appealing property that the prediction code can remain untouched when you decide to tweak where you apply dropout, or if at all.
Inverted dropout looks as follows:
+The undesirable property of the scheme presented above is that we must scale the activations by $p$ at test time. Since test-time performance is so critical, it is always preferable to use **inverted dropout**, which performs the scaling at train time, leaving the forward pass at test time untouched. Additionally, this has the appealing property that the prediction code can remain untouched when you decide to tweak where you apply dropout, or if at all. Inverted dropout looks as follows:

-```python
+~~~python
"""
Inverted Dropout: Recommended implementation example.
We drop and scale at train time and don't do anything at test time.
"""
@@ -204,47 +204,47 @@ def predict(X):
  H1 = np.maximum(0, np.dot(W1, X) + b1) # no scaling necessary
  H2 = np.maximum(0, np.dot(W2, H1) + b2)
  out = np.dot(W3, H2) + b3
-```
+~~~

There has been a large amount of research after the first introduction of dropout that tries to understand the source of its power in practice, and its relation to the other regularization techniques. Recommended further reading for an interested reader includes:

- [Dropout paper](http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf) by Srivastava et al. 2014.
- [Dropout Training as Adaptive Regularization](http://papers.nips.cc/paper/4882-dropout-training-as-adaptive-regularization.pdf): "we show that the dropout regularizer is first-order equivalent to an L2 regularizer applied after scaling the features by an estimate of the inverse diagonal Fisher information matrix".

-**Theme of noise in forward pass**. Dropout falls into a more general category of methods that introduce stochastic behavior in the forward pass of the network. During testing, the noise is marginalized over *analytically* (as is the case with dropout when multiplying by \\(p\\)), or *numerically* (e.g. via sampling, by performing several forward passes with different random decisions and then averaging over them). An example of other research in this direction includes [DropConnect](http://cs.nyu.edu/~wanli/dropc/), where a random set of weights is instead set to zero during forward pass. As foreshadowing, Convolutional Neural Networks also take advantage of this theme with methods such as stochastic pooling, fractional pooling, and data augmentation. We will go into details of these methods later.
+**Theme of noise in forward pass**. Dropout falls into a more general category of methods that introduce stochastic behavior in the forward pass of the network. During testing, the noise is marginalized over *analytically* (as is the case with dropout when multiplying by $p$), or *numerically* (e.g. via sampling, by performing several forward passes with different random decisions and then averaging over them). An example of other research in this direction includes [DropConnect](http://cs.nyu.edu/~wanli/dropc/), where a random set of weights is instead set to zero during the forward pass. As foreshadowing, Convolutional Neural Networks also take advantage of this theme with methods such as stochastic pooling, fractional pooling, and data augmentation. We will go into details of these methods later.
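To make the expectation argument above concrete, here is a small Monte-Carlo check (a hedged sketch, not from the notes) that the train-time scaling of inverted dropout preserves, in expectation, the activations that the plain test-time forward pass produces:

~~~python
import numpy as np

p = 0.5                     # keep probability, as in the snippets above
a = np.random.randn(1000)   # some fixed hidden-layer activations
# average many sampled train-time passes of inverted dropout over these activations
out = np.mean([(np.random.rand(*a.shape) < p) / p * a for _ in range(2000)], axis=0)
print(np.max(np.abs(out - a)))  # small; the sample mean approaches the test-time output a
~~~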
**Bias regularization**. As we already mentioned in the Linear Classification section, it is not common to regularize the bias parameters because they do not interact with the data through multiplicative interactions, and therefore do not have the interpretation of controlling the influence of a data dimension on the final objective. However, in practical applications (and with proper data preprocessing) regularizing the bias rarely leads to significantly worse performance. This is likely because there are very few bias terms compared to all the weights, so the classifier can "afford to" use the biases if it needs them to obtain a better data loss.

**Per-layer regularization**. It is not very common to regularize different layers to different amounts (except perhaps the output layer). Relatively few results regarding this idea have been published in the literature.

-**In practice**: It is most common to use a single, global L2 regularization strength that is cross-validated. It is also common to combine this with dropout applied after all layers. The value of \\(p = 0.5\\) is a reasonable default, but this can be tuned on validation data.
+**In practice**: It is most common to use a single, global L2 regularization strength that is cross-validated. It is also common to combine this with dropout applied after all layers. The value of $p = 0.5$ is a reasonable default, but this can be tuned on validation data.

### Loss functions

-We have discussed the regularization loss part of the objective, which can be seen as penalizing some measure of complexity of the model. The second part of an objective is the *data loss*, which in a supervised learning problem measures the compatibility between a prediction (e.g. the class scores in classification) and the ground truth label. The data loss takes the form of an average over the data losses for every individual example. That is, \\(L = \frac{1}{N} \sum\_i L\_i\\) where \\(N\\) is the number of training data. Lets abbreviate \\(f = f(x\_i; W)\\) to be the activations of the output layer in a Neural Network. There are several types of problems you might want to solve in practice:
+We have discussed the regularization loss part of the objective, which can be seen as penalizing some measure of complexity of the model. The second part of an objective is the *data loss*, which in a supervised learning problem measures the compatibility between a prediction (e.g. the class scores in classification) and the ground truth label. The data loss takes the form of an average over the data losses for every individual example. That is, $L = \frac{1}{N} \sum_i L_i$ where $N$ is the number of training examples. Let's abbreviate $f = f(x_i; W)$ to be the activations of the output layer in a Neural Network. There are several types of problems you might want to solve in practice:

**Classification** is the case that we have so far discussed at length. Here, we assume a dataset of examples and a single correct label (out of a fixed set) for each example. The two most commonly seen cost functions in this setting are the SVM (e.g. the Weston Watkins formulation):

$$
-L\_i = \sum\_{j\neq y\_i} \max(0, f\_j - f\_{y\_i} + 1)
+L_i = \sum_{j\neq y_i} \max(0, f_j - f_{y_i} + 1)
$$

-As we briefly alluded to, some people report better performance with the squared hinge loss (i.e. instead using \\(\max(0, f\_j - f\_{y\_i} + 1)^2\\)). The second common choice is the Softmax classifier that uses the cross-entropy loss:
+As we briefly alluded to, some people report better performance with the squared hinge loss (i.e. instead using $\max(0, f_j - f_{y_i} + 1)^2$). The second common choice is the Softmax classifier that uses the cross-entropy loss:

$$
-L\_i = -\log\left(\frac{e^{f\_{y\_i}}}{ \sum\_j e^{f\_j} }\right)
+L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right)
$$
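For a single example, both losses are a few lines of numpy. The sketch below is illustrative (the names `f` and `y` are assumptions, not from the notes) and includes the usual max-shift for numerical stability in the softmax case:

~~~python
import numpy as np

f = np.array([1.0, -2.0, 0.5])            # class scores for one example (3 classes)
y = 0                                     # index of the correct class

margins = np.maximum(0, f - f[y] + 1.0)   # Weston Watkins hinge terms
margins[y] = 0                            # the sum runs over j != y_i
svm_loss = np.sum(margins)

shifted = f - np.max(f)                   # shift scores so the largest is 0
softmax_loss = -shifted[y] + np.log(np.sum(np.exp(shifted)))  # cross-entropy loss
~~~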
**Problem: Large number of classes**. When the set of labels is very large (e.g. words in the English dictionary, or ImageNet, which contains 22,000 categories), it may be helpful to use *Hierarchical Softmax* (see one explanation [here](http://arxiv.org/pdf/1310.4546.pdf) (pdf)). The hierarchical softmax decomposes labels into a tree. Each label is then represented as a path along the tree, and a Softmax classifier is trained at every node of the tree to disambiguate between the left and right branch. The structure of the tree strongly impacts the performance and is generally problem-dependent.

-**Attribute classification**. Both losses above assume that there is a single correct answer \\(y\_i\\). But what if \\(y\_i\\) is a binary vector where every example may or may not have a certain attribute, and where the attributes are not exclusive? For example, images on Instagram can be thought of as labeled with a certain subset of hashtags from a large set of all hashtags, and an image may contain multiple. A sensible approach in this case is to build a binary classifier for every single attribute independently. For example, a binary classifier for each category independently would take the form:
+**Attribute classification**. Both losses above assume that there is a single correct answer $y_i$. But what if $y_i$ is a binary vector where every example may or may not have a certain attribute, and where the attributes are not exclusive? For example, images on Instagram can be thought of as labeled with a certain subset of hashtags from a large set of all hashtags, and a single image may carry several of them. A sensible approach in this case is to build a binary classifier for every single attribute independently. For example, a binary classifier for each category independently would take the form:

$$
-L\_i = \sum\_j \max(0, 1 - y\_{ij} f\_j)
+L_i = \sum_j \max(0, 1 - y_{ij} f_j)
$$

-where the sum is over all categories \\(j\\), and \\(y\_{ij}\\) is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute, and the score vector \\(f\_j\\) will be positive when the class is predicted to be present and negative otherwise. Notice that loss is accumulated if a positive example has score less than +1, or when a negative example has score greater than -1.
+where the sum is over all categories $j$, and $y_{ij}$ is either +1 or -1 depending on whether the i-th example is labeled with the j-th attribute. The score $f_j$ will be positive when the class is predicted to be present and negative otherwise. Notice that the loss accumulates whenever a positive example has a score less than +1, or a negative example has a score greater than -1.

An alternative to this loss would be to train a logistic regression classifier for every attribute independently. A binary logistic regression classifier has only two classes (0,1), and calculates the probability of class 1 as:

@@ -252,33 +252,33 @@ $$
P(y = 1 \mid x; w, b) = \frac{1}{1 + e^{-(w^Tx +b)}} = \sigma (w^Tx + b)
$$

-Since the probabilities of class 1 and 0 sum to one, the probability for class 0 is \\(P(y = 0 \mid x; w, b) = 1 - P(y = 1 \mid x; w,b)\\). Hence, an example is classified as a positive example (y = 1) if \\(\sigma (w^Tx + b) > 0.5\\), or equivalently if the score \\(w^Tx +b > 0\\). The loss function then maximizes the log likelihood of this probability. You can convince yourself that this simplifies to:
+Since the probabilities of class 1 and 0 sum to one, the probability for class 0 is $P(y = 0 \mid x; w, b) = 1 - P(y = 1 \mid x; w,b)$.
Hence, an example is classified as a positive example (y = 1) if $\sigma (w^Tx + b) > 0.5$, or equivalently if the score $w^Tx + b > 0$. The loss function then maximizes the log likelihood of this probability. You can convince yourself that this simplifies to:

$$
-L\_i = \sum\_j y\_{ij} \log(\sigma(f\_j)) + (1 - y\_{ij}) \log(1 - \sigma(f\_j))
+L_i = \sum_j y_{ij} \log(\sigma(f_j)) + (1 - y_{ij}) \log(1 - \sigma(f_j))
$$

-where the labels \\(y\_{ij}\\) are assumed to be either 1 (positive) or 0 (negative), and \\(\sigma(\cdot)\\) is the sigmoid function. The expression above can look scary but the gradient on \\(f\\) is in fact extremely simple and intuitive: \\(\partial{L\_i} / \partial{f\_j} = y\_{ij} - \sigma(f\_j)\\) (as you can double check yourself by taking the derivatives).
+where the labels $y_{ij}$ are assumed to be either 1 (positive) or 0 (negative), and $\sigma(\cdot)$ is the sigmoid function. The expression above can look scary but the gradient on $f$ is in fact extremely simple and intuitive: $\partial{L_i} / \partial{f_j} = y_{ij} - \sigma(f_j)$ (as you can double-check yourself by taking the derivatives).
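A hedged numpy sketch of this attribute-wise logistic objective (illustrative names; written as a quantity to *minimize*, i.e. the negative of the log likelihood above, so the gradient flips sign to $\sigma(f_j) - y_{ij}$):

~~~python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f = np.array([2.0, -1.0, 0.3])   # scores for 3 independent attributes
y = np.array([1.0, 0.0, 1.0])    # binary labels, 1 = attribute present

s = sigmoid(f)
loss = -np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))  # negative log likelihood
dloss_df = s - y                 # gradient of the minimized loss w.r.t. the scores
~~~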
**Regression** is the task of predicting real-valued quantities, such as the price of houses or the length of something in an image. For this task, it is common to compute the loss between the predicted quantity and the true answer and then measure the L2 squared norm, or L1 norm of the difference. The L2 norm squared would compute the loss for a single example of the form:

$$
-L\_i = \Vert f - y\_i \Vert\_2^2
+L_i = \Vert f - y_i \Vert_2^2
$$

The reason the L2 norm is squared in the objective is that the gradient becomes much simpler, without changing the optimal parameters since squaring is a monotonic operation. The L1 norm would be formulated by summing the absolute value along each dimension:

$$
-L\_i = \Vert f - y\_i \Vert\_1 = \sum\_j \mid f\_j - (y\_i)\_j \mid
+L_i = \Vert f - y_i \Vert_1 = \sum_j \mid f_j - (y_i)_j \mid
$$

-where the sum \\(\sum\_j\\) is a sum over all dimensions of the desired prediction, if there is more than one quantity being predicted. Looking at only the j-th dimension of the i-th example and denoting the difference between the true and the predicted value by \\(\delta\_{ij}\\), the gradient for this dimension (i.e. \\(\partial{L\_i} / \partial{f\_j}\\)) is easily derived to be either \\(\delta\_{ij}\\) with the L2 norm, or \\(sign(\delta\_{ij})\\). That is, the gradient on the score will either be directly proportional to the difference in the error, or it will be fixed and only inherit the sign of the difference.
+where the sum $\sum_j$ is a sum over all dimensions of the desired prediction, if there is more than one quantity being predicted. Looking at only the j-th dimension of the i-th example and denoting the difference between the true and the predicted value by $\delta_{ij}$, the gradient for this dimension (i.e. $\partial{L_i} / \partial{f_j}$) is easily derived to be either $\delta_{ij}$ with the L2 norm, or $\text{sign}(\delta_{ij})$ with the L1 norm. That is, the gradient on the score will either be directly proportional to the difference in the error, or it will be fixed and only inherit the sign of the difference.

*Word of caution*: It is important to note that the L2 loss is much harder to optimize than a more stable loss such as Softmax. Intuitively, it requires a very fragile and specific property from the network to output exactly one correct value for each input (and its augmentations). Notice that this is not the case with Softmax, where the precise value of each score is less important: It only matters that their magnitudes are appropriate. Additionally, the L2 loss is less robust because outliers can introduce huge gradients. When faced with a regression problem, first consider whether quantizing the output into bins is truly inadequate. For example, if you are predicting the star rating for a product, it might work much better to use 5 independent classifiers for ratings of 1-5 stars instead of a regression loss. Classification has the additional benefit that it can give you a distribution over the regression outputs, not just a single output with no indication of its confidence. If you're certain that classification is not appropriate, use the L2 but be careful: For example, the L2 is more fragile, and applying dropout in the network (especially in the layer right before the L2 loss) is not a great idea.

> When faced with a regression task, first consider if it is absolutely necessary. Instead, have a strong preference for discretizing your outputs into bins and performing classification over them whenever possible.

-**Structured prediction**. The structured loss refers to a case where the labels can be arbitrary structures such as graphs, trees, or other complex objects. Usually it is also assumed that the space of structures is very large and not easily enumerable. The basic idea behind the structured SVM loss is to demand a margin between the correct structure \\(y\_i\\) and the highest-scoring incorrect structure. It is not common to solve this problem as a simple unconstrained optimization problem with gradient descent. Instead, special solvers are usually devised so that the specific simplifying assumptions of the structure space can be taken advantage of. We mention the problem briefly but consider the specifics to be outside of the scope of the class.
+**Structured prediction**. The structured loss refers to a case where the labels can be arbitrary structures such as graphs, trees, or other complex objects. Usually it is also assumed that the space of structures is very large and not easily enumerable. The basic idea behind the structured SVM loss is to demand a margin between the correct structure $y_i$ and the highest-scoring incorrect structure. It is not common to solve this problem as a simple unconstrained optimization problem with gradient descent. Instead, special solvers are usually devised so that the specific simplifying assumptions of the structure space can be taken advantage of. We mention the problem briefly but consider the specifics to be outside of the scope of the class.

@@ -287,7 +287,7 @@ where the sum \\(\sum\_j\\) is a sum over all dimensions of the desired predicti

In summary:

- The recommended preprocessing is to center the data to have mean of zero, and normalize its scale to [-1, 1] along each feature
-- Initialize the weights by drawing them from a gaussian distribution with standard deviation of \\(\sqrt{2/n}\\), where \\(n\\) is the number of inputs to the neuron. E.g. in numpy: `w = np.random.randn(n) * sqrt(2.0/n)`.
+- Initialize the weights by drawing them from a gaussian distribution with standard deviation of $\sqrt{2/n}$, where $n$ is the number of inputs to the neuron. E.g. in numpy: `w = np.random.randn(n) * sqrt(2.0/n)`.
- Use L2 regularization and dropout (the inverted version)
- Use batch normalization
- We discussed different tasks you might want to perform in practice, and the most common loss functions for each task

diff --git a/neural-networks-3.md b/neural-networks-3.md
index e30706ee..3fbac7cb 100644
--- a/neural-networks-3.md
+++ b/neural-networks-3.md
@@ -5,376 +5,422 @@ permalink: /neural-networks-3/

Table of Contents:

-- [Gradient checks](#gradcheck)
-- [Sanity checks](#sanitycheck)
-- [Babysitting the learning process](#baby)
-  - [Loss function](#loss)
-  - [Train/val accuracy](#accuracy)
-  - [Weights:Updates ratio](#ratio)
-  - [Activation/Gradient distributions per layer](#distr)
-  - [Visualization](#vis)
-- [Parameter updates](#update)
-  - [First-order (SGD), momentum, Nesterov momentum](#sgd)
-  - [Annealing the learning rate](#anneal)
-  - [Second-order methods](#second)
-  - [Per-parameter adaptive learning rates (Adagrad, RMSProp)](#ada)
-- [Hyperparameter Optimization](#hyper)
-- [Evaluation](#eval)
-  - [Model Ensembles](#ensemble)
-- [Summary](#summary)
-- [Additional References](#add)
+- [Gradient checks](#gradcheck)
+- [Sanity checks](#sanitycheck)
+- [Babysitting the learning process](#baby)
+  - [Loss function](#loss)
+  - [Train/val accuracy](#accuracy)
+  - [Weights:Updates ratio](#ratio)
+  - [Activation/Gradient distributions per layer](#distr)
+  - [Visualization](#vis)
+- [Parameter updates](#update)
+  - [First-order (SGD), momentum, Nesterov momentum](#sgd)
+  - [Annealing the learning rate](#anneal)
+  - [Second-order methods](#second)
+  - [Per-parameter adaptive learning rates (Adagrad, RMSProp)](#ada)
+- [Hyperparameter Optimization](#hyper)
+- [Evaluation](#eval)
+  - [Model Ensembles](#ensemble)
+- [Summary](#summary)
+- [Additional References](#add)

## Learning

-In the previous sections we've discussed the static parts of a Neural Networks: how we can set up the network connectivity, the data, and the loss function. This section is devoted to the dynamics, or in other words, the process of learning the parameters and finding good hyperparameters.
+The previous sections discussed the static parts of a Neural Network: how many layers to stack and how many units to put in each layer (the network connectivity), how to prepare the data, and which loss function to choose. This section introduces the dynamic parts: the process of learning the parameters and of finding good hyperparameters.

-### Gradient Checks
-In theory, performing a gradient check is as simple as comparing the analytic gradient to the numerical gradient. In practice, the process is much more involved and error prone. Here are some tips, tricks, and issues to watch out for:
+### Gradient Checks
+
+In theory, a gradient check sounds very simple: compare the numerically computed gradient against the analytically computed one. In practice, you will find that the process is much more involved and that errors creep in out of nowhere. Here are a few tips, tricks, and issues to watch out for.
+
+**Even among finite-difference approximations, some formulas are more accurate than others (use the centered formula)**.
To numerically approximate the gradient $$\frac{df(x)}{dx}$$, you would probably first think of the following finite difference approximation:

$$
\frac{df(x)}{dx} = \frac{f(x + h) - f(x)}{h} \hspace{0.1in} \text{(bad, do not use)}
$$

-where \\(h\\) is a very small number, in practice approximately 1e-5 or so. In practice, it turns out that it is much better to use the *centered* difference formula of the form:
+where $$h$$ is a very small number, usually around 1e-5. Empirically, the *centered* difference formula below is much better than the one above:

$$
\frac{df(x)}{dx} = \frac{f(x + h) - f(x - h)}{2h} \hspace{0.1in} \text{(use instead)}
$$

-This requires you to evaluate the loss function twice to check every single dimension of the gradient (so it is about 2 times as expensive), but the gradient approximation turns out to be much more precise. To see this, you can use Taylor expansion of \\(f(x+h)\\) and \\(f(x-h)\\) and verify that the first formula has an error on order of \\(O(h)\\), while the second formula only has error terms on order of \\(O(h^2)\\) (i.e. it is a second order approximation).
+Of course, this formula requires computing $$f(x-h)$$ in addition to $$f(x+h)$$, roughly doubling the cost compared to the first expression, but it gives a much more accurate approximation. Considering the Taylor expansions of $$f(x+h)$$ and $$f(x-h)$$ (around $$x$$) quickly shows why: the first formula has an error on the order of $$O(h)$$, while the error terms of the second are on the order of $$O(h^2)$$ (i.e. it is a second-order approximation).
+- Translator's note: (1) From the Taylor expansion $$f(x + h) = f(x) + hf'(x) + O(h^2)$$ it follows that $$f'(x) - \frac{f(x+h)-f(x)}{h} = O(h)$$. (2) Since $$h$$ is usually a vector, $$O(\|h\|)$$ would be more precise than $$O(h)$$, but the norm appears to have been dropped for convenience.

+**Use relative error for the comparison**. Which details should we check when comparing the (analytically computed) reference value $$f'_a$$ of the gradient with the numerical approximation $$f'_n$$? How do we detect that the two are not compatible? The easiest idea would be to track the absolute difference $$\mid f'_a - f'_n \mid$$, or its square, and declare a gradient error whenever that value crosses some threshold. But the absolute difference is problematic. Suppose, say, that it is 1e-4. If $$f'_a$$ and $$f'_n$$ are both around 1.0, a difference of 1e-4 is an excellent approximation, and we can take $$f'_a \approx f'_n$$. But what if both gradients are around 1e-5 or smaller? Then 1e-4 is a huge difference, and the approximation must count as a failure. It is therefore always more appropriate to consider the *relative error*, which weighs the difference against the magnitude of the two gradients:

-**Use relative error for the comparison**. What are the details of comparing the numerical gradient \\(f'\_n\\) and analytic gradient \\(f'\_a\\)? That is, how do we know if the two are not compatible? You might be temped to keep track of the difference \\(\mid f'\_a - f'\_n \mid \\) or its square and define the gradient check as failed if that difference is above a threshold. However, this is problematic. For example, consider the case where their difference is 1e-4. This seems like a very appropriate difference if the two gradients are about 1.0, so we'd consider the two gradients to match. But if the gradients were both on order of 1e-5 or lower, then we'd consider 1e-4 to be a huge difference and likely a failure. Hence, it is always more appropriate to consider the *relative error*:

$$
-\frac{\mid f'\_a - f'\_n \mid}{\max(\mid f'\_a \mid, \mid f'\_n \mid)}
+\frac{\mid f'_a - f'_n \mid}{\max(\mid f'_a \mid, \mid f'_n \mid)}
$$

-which considers their ratio of the differences to the ratio of the absolute values of both gradients. Notice that normally the relative error formula only includes one of the two terms (either one), but I prefer to max (or add) both to make it symmetric and to prevent dividing by zero in the case where one of the two is zero (which can often happen, especially with ReLUs). However, one must explicitly keep track of the case where both are zero and pass the gradient check in that edge case. In practice:
+The usual relative error formula has only one of $$f'_a$$ or $$f'_n$$ in the denominator, but I prefer the maximum of the two.
That keeps the formula symmetric, and it prevents dividing by zero when one of the two is exactly 0 (which happens often, especially with ReLUs). What if $$f'_a$$ and $$f'_n$$ are both exactly 0? In that case the gradient check should pass without even inspecting the relative error; check whether your code is organized to handle this edge case.

+A useful guide for practical situations:
+
+- relative error > 1e-2 means the gradient computation is probably wrong.
+- 1e-2 > relative error > 1e-4 should make you feel uncomfortable.
+- 1e-4 > relative error is okay for objectives with kinks. But if there are no kinks (as with objectives that use tanh or softmax), then 1e-4 is too high.
+- relative error of 1e-7 or less: you should feel happy.
+
+One more thing to keep in mind: the deeper the network, the larger the relative errors become. If you gradient-check the input data of, say, a 10-layer network, a relative error of 1e-2 may be okay, because the errors accumulate on the way up. Conversely, an error of 1e-2 for a single differentiable function very likely indicates an incorrect gradient.
+
+**Use double precision**. A common mistake is to use single-precision floating point variables to compute the gradient check. With single precision, the relative error often comes out large (around 1e-2) even when the gradient computation is correct. In my experience I have seen the relative error improve from 1e-2 to 1e-8 just by switching to double precision.

-- relative error > 1e-2 usually means the gradient is probably wrong
-- 1e-2 > relative error > 1e-4 should make you feel uncomfortable
-- 1e-4 > relative error is usually okay for objectives with kinks. But if there are no kinks (e.g. use of tanh nonlinearities and softmax), then 1e-4 is too high.
-- 1e-7 and less you should be happy.

+**Stick around active range of floating point**. To write more careful code and make fewer mistakes, it is a good idea to read ["What Every Computer Scientist Should Know About Floating-Point Arithmetic"](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html). For example, in neural networks it is common to normalize the loss function over the batch (translator's note: this seems to refer to dividing the summed gradient by the batch size). But if the per-datapoint gradients are already very small, *additionally* dividing by the number of data points produces even smaller numbers and invites many more numerical problems. I therefore like to keep printing the computed values of $$f'_a$$ and $$f'_n$$ and to check that the numbers being compared are not too small (roughly 1e-10 or smaller in magnitude is worrying). If they are, you can multiply the loss by a suitable constant to bring the floating-point representation into a "nicer" range (one where the float exponent is 0).

-Also keep in mind that the deeper the network, the higher the relative errors will be. So if you are gradient checking the input data for a 10-layer network, a relative error of 1e-2 might be okay because the errors build up on the way. Conversely, an error of 1e-2 for a single differentiable function likely indicates incorrect gradient.

-**Use double precision**. A common pitfall is using single precision floating point to compute gradient check. It is often that case that you might get high relative errors (as high as 1e-2) even with a correct gradient implementation. In my experience I've sometimes seen my relative errors plummet from 1e-2 to 1e-8 by switching to double precision.

+**Kinks in the objective**. One source of inaccuracy to keep in mind during gradient checking is the problem of *kinks*, a term for the non-differentiable parts of an objective function. They can arise from the ReLU function ($$max(0,x)$$), the SVM objective, maxout neurons, and so on. Roughly, here is the trouble a kink can cause: suppose we gradient-check the ReLU function at $$x = -1e6$$. Since $$x < 0$$, the analytic gradient $$f'_a$$ is exactly $$0$$. The numerically computed gradient, however, can suddenly return a non-zero value, because $$f(x+h)$$ may cross over the kink (for instance if $$h > 1e-6$$). You might ask whether such a pathological case really deserves attention; in fact it is very common. For example, an SVM for CIFAR-10 has 50,000 examples and 9 $$max(0,x)$$ terms per example, so you end up facing 450,000 ReLU-like terms. Moreover, attaching an SVM classifier to a neural network adds many more kinks because of the ReLUs.
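Putting the recommendations so far together (centered formula, relative error with the max in the denominator, double precision), a minimal sketch of a scalar gradient check might look as follows; the toy objective and the names are assumptions for illustration, not code from the notes:

~~~python
import numpy as np

def rel_error(fa, fn):
    # relative error with the max of both magnitudes in the denominator
    return abs(fa - fn) / max(abs(fa), abs(fn), 1e-12)

f = np.tanh                 # a toy, kink-free objective
x, h = 0.7, 1e-5            # float64 throughout (double precision)
grad_numerical = (f(x + h) - f(x - h)) / (2 * h)  # centered formula
grad_analytic = 1.0 - np.tanh(x) ** 2             # d/dx tanh(x)
print(rel_error(grad_analytic, grad_numerical))   # tiny, around 1e-10
~~~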
-**Stick around active range of floating point**. It's a good idea to read through ["What Every Computer Scientist Should Know About Floating-Point Arithmetic"](http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html), as it may demystify your errors and enable you to write more careful code. For example, in neural nets it can be common to normalize the loss function over the batch. However, if your gradients per datapoint are very small, then *additionally* dividing them by the number of data points is starting to give very small numbers, which in turn will lead to more numerical issues. This is why I like to always print the raw numerical/analytic gradient, and make sure that the numbers you are comparing are not extremely small (e.g. roughly 1e-10 and smaller in absolute value is worrying). If they are you may want to temporarily scale your loss function up by a constant to bring them to a "nicer" range where floats are more dense - ideally on the order of 1.0, where your float exponent is 0.

+Fortunately, it is possible to tell whether a kink was crossed while evaluating the loss. Imagine keeping a record of which of $$x$$ and $$y$$ "won" in every expression of the form $$max(x,y)$$. If the identity of at least one winner changes between computing $$f(x+h)$$ and $$f(x-h)$$, then a kink was crossed and the numerical gradient may not be an exact value.

-**Kinks in the objective**. One source of inaccuracy to be aware of during gradient checking is the problem of *kinks*. Kinks refer to non-differentiable parts of an objective function, introduced by functions such as ReLU (\\(max(0,x)\\)), or the SVM loss, Maxout neurons, etc. Consider gradient checking the ReLU function at \\(x = -1e6\\). Since \\(x < 0\\), the analytic gradient at this point is exactly zero. However, the numerical gradient would suddenly compute a non-zero gradient because \\(f(x+h)\\) might cross over the kink (e.g. if \\(h > 1e-6\\)) and introduce a non-zero contribution. You might think that this is a pathological case, but in fact this case can be very common. For example, an SVM for CIFAR-10 contains up to 450,000 \\(max(0,x)\\) terms because there are 50,000 examples and each example yields 9 terms to the objective. Moreover, a Neural Network with an SVM classifier will contain many more kinks due to ReLUs.

+**Use only few datapoints**. One fix related to the kink problem is to use less data: if the loss function contains kinks (as it does when ReLUs or margin losses are used), fewer datapoints mean fewer kinks, so the finite difference approximation is less likely to cross one while being evaluated. Moreover, gradient-checking on only ~2 or 3 datapoints effectively amounts to checking nearly the whole batch, and is much faster and more efficient. (Translator's note: though it seems that a smaller batch size might cause problems elsewhere..)

-Note that it is possible to know if a kink was crossed in the evaluation of the loss. This can be done by keeping track of the identities of all "winners" in a function of form \\(max(x,y)\\); That is, was x or y higher during the forward pass. If the identity of at least one winner changes when evaluating \\(f(x+h)\\) and then \\(f(x-h)\\), then a kink was crossed and the numerical gradient will not be exact.

-**Use only few datapoints**. One fix to the above problem of kinks is to use fewer datapoints, since loss functions that contain kinks (e.g. due to use of ReLUs or margin losses etc.) will have fewer kinks with fewer datapoints, so it is less likely for you to cross one when you perform the finite different approximation. Moreover, if your gradcheck for only ~2 or 3 datapoints then you would almost certainly gradcheck for an entire batch. Using very few datapoints also makes your gradient check faster and more efficient.

-**Be careful with the step size h**.
It is not necessarily the case that smaller is better, because when \\(h\\) is much smaller, you may start running into numerical precision problems. Sometimes when the gradient doesn't check, it is possible that you change \\(h\\) to be 1e-4 or 1e-6 and suddenly the gradient will be correct. This [wikipedia article](http://en.wikipedia.org/wiki/Numerical_differentiation) contains a chart that plots the value of **h** on the x-axis and the numerical gradient error on the y-axis.
+**Be careful with the step size h**. Smaller is not necessarily better: when $$h$$ is much smaller, you can run into numerical precision problems. Sometimes, when the gradient check does not pass, adjusting $$h$$ to something like 1e-4 or 1e-6 can suddenly make it pass. The linked [wikipedia article](http://en.wikipedia.org/wiki/Numerical_differentiation) contains a chart that plots the numerical gradient error against the value of **h**.

-**Gradcheck during a "characteristic" mode of operation**. It is important to realize that a gradient check is performed at a particular (and usually random), single point in the space of parameters. Even if the gradient check succeeds at that point, it is not immediately certain that the gradient is correctly implemented globally. Additionally, a random initialization might not be the most "characteristic" point in the space of parameters and may in fact introduce pathological situations where the gradient seems to be correctly implemented but isn't. For instance, an SVM with very small weight initialization will assign almost exactly zero scores to all datapoints and the gradients will exhibit a particular pattern across all datapoints. An incorrect implementation of the gradient could still produce this pattern and not generalize to a more characteristic mode of operation where some scores are larger than others. Therefore, to be safe it is best to use a short **burn-in** time during which the network is allowed to learn and perform the gradient check after the loss starts to go down. The danger of performing it at the first iteration is that this could introduce pathological edge cases and mask an incorrect implementation of the gradient.

-**Don't let the regularization overwhelm the data**. It is often the case that a loss function is a sum of the data loss and the regularization loss (e.g. L2 penalty on weights). One danger to be aware of is that the regularization loss may overwhelm the data loss, in which case the gradients will be primarily coming from the regularization term (which usually has a much simpler gradient expression). This can mask an incorrect implementation of the data loss gradient. Therefore, it is recommended to turn off regularization and check the data loss alone first, and then the regularization term second and independently. One way to perform the latter is to hack the code to remove the data loss contribution. Another way is to increase the regularization strength so as to ensure that its effect is non-negligible in the gradient check, and that an incorrect implementation would be spotted.

+**Gradcheck during a "characteristic" mode of operation**. Remember that a gradient check is performed at one particular (usually random) point in parameter space. Even if the check succeeds at that point, it is hard to be sure that the gradient is computed correctly elsewhere. What is more, a random initialization might not be the most "characteristic" point in parameter space, and may give rise to pathological situations in which the gradient appears to be correctly implemented but in fact is not. For example, an SVM initialized with very small weights will assign scores of nearly exactly zero to every datapoint, and the gradients will then show one particular pattern across all datapoints. An incorrect gradient implementation could keep producing that pattern and still fail to generalize to a more characteristic mode of operation (e.g. one where some scores are larger than others).
Therefore, to be safe, it is best to use a short **burn-in** during which the network is allowed to learn, and to perform the gradient check after the loss has started to go down. In short, performing the check from the very first iteration risks masking an incorrect gradient implementation behind pathological errors that only exist at that point.

-**Remember to turn off dropout/augmentations**. When performing gradient check, remember to turn off any non-deterministic effects in the network, such as dropout, random data augmentations, etc. Otherwise these can clearly introduce huge errors when estimating the numerical gradient. The downside of turning off these effects is that you wouldn't be gradient checking them (e.g. it might be that dropout isn't backpropagated correctly). Therefore, a better solution might be to force a particular random seed before evaluating both \\(f(x+h)\\) and \\(f(x-h)\\), and when evaluating the analytic gradient.

-**Check only few dimensions**. In practice the gradients can have sizes of million parameters. In these cases it is only practical to check some of the dimensions of the gradient and assume that the others are correct. **Be careful**: One issue to be careful with is to make sure to gradient check a few dimensions for every separate parameter. In some applications, people combine the parameters into a single large parameter vector for convenience. In these cases, for example, the biases could only take up a tiny number of parameters from the whole vector, so it is important to not sample at random but to take this into account and check that all parameters receive the correct gradients.

+**Don't let the regularization overwhelm the data**. Often, the loss function is a sum of the data loss and the regularization loss (e.g. an L2 penalty on the weights). One danger to be aware of is that the regularization loss can overwhelm the data loss, in which case the gradients come mostly from the regularization term (which usually has a much simpler gradient expression). This can hide an incorrect implementation of the data loss gradient. I therefore recommend turning regularization off and checking the data loss alone first, and then checking the regularization term separately. How to check only the regularization term? One way is to hack the code and remove the data loss contribution. Another is to increase the regularization strength so that its effect is non-negligible in the gradient check, so that an incorrect gradient in the regularization part would be detected.

+**Remember to turn off dropout/augmentations**. While performing the gradient check, be sure to turn off any non-deterministic effects in the network, such as dropout, random data augmentations, and so on. Naturally, leaving them on can introduce large errors into the numerical gradient approximation. The downside of turning these effects off is that you then cannot gradient-check them (e.g. dropout might not be backpropagated correctly). A better solution may therefore be to fix the seed to a specific value before computing $$f(x+h)$$ and $$f(x-h)$$, and before computing the analytic gradient.

+**Check only few dimensions**. With real data, a gradient can have millions of parameter values. In such cases it may only be practical to check a few dimensions of the gradient and to trust that the others are computed correctly. **Be careful**: perform the few-dimension gradient check for every separate parameter. In some use cases people concatenate the parameters into one large parameter vector for convenience. In that case, for instance, the biases may occupy only a tiny fraction of the whole vector, so it is important not to sample blindly but to account for this and verify that all parameters receive the correct gradients.

-### Before learning: sanity checks Tips/Tricks
-Here are a few sanity checks you might consider running before you plunge into expensive optimization:

+### Before learning: sanity checks Tips/Tricks

-- **Look for correct loss at chance performance.** Make sure you're getting the loss you expect when you initialize with small parameters. It's best to first check the data loss alone (so set regularization strength to zero).
For example, for CIFAR-10 with a Softmax classifier we would expect the initial loss to be 2.302, because we expect a diffuse probability of 0.1 for each class (since there are 10 classes), and Softmax loss is the negative log probability of the correct class so: -ln(0.1) = 2.302. For The Weston Watkins SVM, we expect all desired margins to be violated (since all scores are approximately zero), and hence expect a loss of 9 (since margin is 1 for each wrong class). If you're not seeing these losses there might be issue with initialization.
-- As a second sanity check, increasing the regularization strength should increase the loss
-- **Overfit a tiny subset of data**. Lastly and most importantly, before training on the full dataset try to train on a tiny portion (e.g. 20 examples) of your data and make sure you can achieve zero cost. For this experiment it's also best to set regularization to zero, otherwise this can prevent you from getting zero cost. Unless you pass this sanity check with a small dataset it is not worth proceeding to the full dataset. Note that it may happen that you can overfit very small dataset but still have an incorrect implementation. For instance, if your datapoints' features are random due to some bug, then it will be possible to overfit your small training set but you will never notice any generalization when you fold it your full dataset.
+비용이 큰(expensive) 최적화에 본격적으로 뛰어들기 전에, 다음 확인 절차들을 돌려볼 만하다.
+
+- **무작위 성능에서 올바른 손실값이 나오는지 확인하라 (Look for correct loss at chance performance.)**
+파라미터를 작은 값으로 초기화했을 때 기대되는 손실함수값(loss)을 얻는지 확인하라. 먼저 데이터 손실함수 (data loss) 하나만 확인하는 것이 가장 낫다 (따라서 정규화 강도(regularization strength)는 영으로 설정하여라). 예를 들어, CIFAR-10에 Softmax 분류기를 이용할 경우 초기 손실함수값을 2.302로 기대할 수 있는데, 왜냐하면, -ln(0.1) = 2.302 -- 각 클래스에 확률이 0.1로 분산되었을 테고 Softmax 손실함수는 올바른 분류 확률에 음의 로그를 취한 값이기 때문이다. Weston Watkins SVM을 사용할 경우에는, (모든 점수(score)가 어림잡아 0이기 때문에) 고려되는 모든 마진값(margin)이 위반될 테니 9의 손실값을 기대할 수 있다 (마진값은 각각 잘못 분류된 클래스마다 1이다). 이런 손실값들이 나오지 않으면 초기화에 문제가 있을 수 있다.
+- 두 번째 확인 절차로서, 정규화 강도를 올릴수록 손실함수값이 올라가야 한다.
+- **자료의 작은 부분집합으로 과적합해 보라 (Overfit a tiny subset of data)**. 마지막으로 가장 중요한 사항인데, 전체 데이터셋으로 훈련을 시작하기 전에, 작은 부분으로 훈련을 시도하여 보고 (한 20개의 자료 정도), 0의 비용(cost)을 달성할 수 있는지 확인하여 보라. 이 실험에서도 역시 정규화 강도는 0으로 설정하는 것이 가장 나으며, 그렇지 않으면 0의 비용을 얻을 수 없을 것이다. 작은 자료에서의 이러한 확인 과정이 제대로 끝나지 않으면 전체 데이터셋으로 나아가는 것은 무가치하다. 하나 강조할 것은, 아주 작은 데이터셋에 성공적으로 과적합하더라도 여전히 구현(implementation)이 잘못되어 있을 수 있다는 점이다. 예를 들어, 가지고 있는 데이터 포인트(datapoint)들의 특성(feature)들이 어떤 버그 때문에 임의로(randomly) 선정된 경우, 작은 훈련 집합(training set)에의 과적합은 성공할지라도 그게 전체 데이터셋으로 일반화되지 않을 수도 있다.

-### Babysitting the learning process

-There are multiple useful quantities you should monitor during training of a neural network. These plots are the window into the training process and should be utilized to get intuitions about different hyperparameter settings and how they should be changed for more efficient learning.
+### 학습 과정 돌보기 (Babysitting the learning process)
+
+신경망을 훈련하는 중에 몇몇 쓸모 있는 값(quantity)은 모니터링해야 한다. 이런 도표들은 학습 과정을 지켜보는 창문이다. 좀더 효율적인 학습을 위한 하이퍼파라미터(hyperparameter) 조정도 여기서 직관적 영감을 얻는다.

-The x-axis of the plots below are always in units of epochs, which measure how many times every example has been seen during training in expectation (e.g. one epoch means that every example has been seen once). It is preferable to track epochs rather than iterations since the number of iterations depends on the arbitrary setting of batch size.
+도표의 x축은 언제나 에폭(epoch)을 단위로 한다. 에폭(epoch)은 각 자료(example)가 몇 번이나 학습(SGD iteration--역자 주)에 사용되었는가를 재는 용어이다. 
(이를테면 1 에폭이 지났다는 것은 모든 자료가 한 번씩 SGD iteration에 사용되었음을 뜻한다.) x축으로 SGD 반복 횟수(iteration)를 쓸 수도 있겠지만 에폭이 더 선호되는 편이다. 반복 횟수(iteration number)는 배치 사이즈(batch size)의 선택에 따라 임의로 바뀔 수 있기 때문이다.

-#### Loss function

-The first quantity that is useful to track during training is the loss, as it is evaluated on the individual batches during the forward pass. Below is a cartoon diagram showing the loss over time, and especially what the shape might tell you about the learning rate:
+#### 손실 함수 (Loss function)
+
+손실 함수(loss)는 forward pass 동안 개개의 배치(batch)에서 계산되고 따라서 훈련(training) 과정에서 추적하기 용이하다. 아래는 시간에 따른 손실 그래프의 모양을 여러 학습 속도(learning rate)에 따라 그려본 것이다. 각각의 모양이 시사하는 바도 함께 적었다:
- Left: A cartoon depicting the effects of different learning rates. With low learning rates the improvements will be linear. With high learning rates they will start to look more exponential. Higher learning rates will decay the loss faster, but they get stuck at worse values of loss (green line). This is because there is too much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in a nice spot in the optimization landscape. Right: An example of a typical loss function over time, while training a small network on CIFAR-10 dataset. This loss function looks reasonable (it might indicate a slightly too small learning rate based on its speed of decay, but it's hard to say), and also indicates that the batch size might be a little too low (since the cost is a little too noisy).
+ 좌측: 훈련 과정에서 학습 속도의 영향. 낮은 학습 속도로는 선형적인 향상이 이루어질 것이다. 높은 학습 속도에서는 좀더 지수적인(exponential) 향상이 보일 것이다. 더 높은 학습 속도는 손실의 감소를 가속할 것이나, 더 나쁜 손실값에 빠지게 할 수도 있다 (초록 선). 그 이유는 최적화에 너무 많은 "에너지"가 가해져서 파라미터값들이 혼돈스러운 형태로 움직이고 (최적화 목적함수 모양에서) 좋은 곳에 정착하기가 힘들어지기 때문이다. 우측: 전형적인 손실 함수의 예. x축은 시간(epoch)이고 CIFAR-10 데이터셋에서 작은 신경망을 훈련하였다. 이 손실함수의 모양은 적절해 보이고 (손실 감소의 속도를 보았을 때, 학습 속도가 약간 작은 감이 있으나 뭐라 말하기 어렵다) 배치 사이즈는 조금 작은 듯하다 (비용(cost)에 노이즈가 너무 많다).
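+이런 류의 도표는 훈련 루프에서 배치별 손실값을 리스트에 모아 두기만 하면 쉽게 그릴 수 있다. 아래는 그 아이디어의 간단한 스케치이다 (여기서의 손실 곡선은 설명을 위해 인위적으로 만든 가짜 데이터이고, 변수 이름들도 예시일 뿐이다):
+
+~~~python
+import numpy as np
+import matplotlib.pyplot as plt
+
+# 실제로는 훈련 루프에서 배치마다 계산된 손실값을 리스트에 추가하면 된다.
+# 여기서는 설명을 위해 지수적으로 감소하는 가짜 손실 곡선을 생성한다.
+num_iters = 1000
+loss_history = [2.3 * np.exp(-0.005 * t) + 0.05 * np.random.rand()
+                for t in range(num_iters)]
+
+plt.plot(loss_history)   # 선형 스케일에서는 하키 스틱 모양
+plt.yscale('log')        # 로그 스케일에서는 대략 직선이 되어 해석이 쉽다
+plt.xlabel('iteration'); plt.ylabel('loss')
+plt.show()
+~~~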
-The amount of "wiggle" in the loss is related to the batch size. When the batch size is 1, the wiggle will be relatively high. When the batch size is the full dataset, the wiggle will be minimal because every gradient update should be improving the loss function monotonically (unless the learning rate is set too high).
+손실 함수의 "씰룩거림"은 배치 사이즈와 연관이 있다. 만일 배치 사이즈가 1이면 훨씬 더 많이 씰룩거릴 것이다. 만일 배치 사이즈가 전체 데이터셋이면 이 씰룩거림은 최소화될 것인데, 왜냐하면 모든 그라디언트 업데이트가 손실함수를 단조적으로 향상시킬 것이기 때문이다 (학습 속도가 너무 크지만 않다면).

-Some people prefer to plot their loss functions in the log domain. Since learning progress generally takes an exponential form shape, the plot appears more as a slightly more interpretable straight line, rather than a hockey stick. Additionally, if multiple cross-validated models are plotted on the same loss graph, the differences between them become more apparent.
+어떤 사람들은 손실함수의 로그값의 그래프를 선호하기도 한다. 일반적으로 학습 과정은 어떤 지수적인 모양(하키 스틱 모양)을 취하고 있기 때문에, 로그 손실 그래프는 좀 더 해석이 용이한 직선의 모양처럼 보인다. 부가적인 사항으로, 만약 여러 개의 교차검증 모형(의 손실 그래프)을 같은 그래프 위에 그리면, (로그 손실 그래프로 보면) 그들 사이의 차이가 좀 더 명백해지는 장점이 있다.

-Sometimes loss functions can look funny [lossfunctions.tumblr.com](http://lossfunctions.tumblr.com/).
+가끔 손실 함수 모양이 우스꽝스러울 때도 있다. [lossfunctions.tumblr.com](http://lossfunctions.tumblr.com/).

-#### Train/Val accuracy

-The second important quantity to track while training a classifier is the validation/training accuracy. This plot can give you valuable insights into the amount of overfitting in your model:
+#### 훈련/검증 정확도 (Train/Val accuracy)
+
+훈련/검증 정확도(training/validation accuracy)는 분류기 훈련 시 추적해야 할 또 다른 중요한 값이다. 이 플롯은 당신의 모형이 과적합(overfitting) 중인지를 발견할 수 있는 값진 인사이트를 제공한다:
- The gap between the training and validation accuracy indicates the amount of overfitting. Two possible cases are shown in the diagram on the left. The blue validation error curve shows very small validation accuracy compared to the training accuracy, indicating strong overfitting (note, it's possible for the validation accuracy to even start to go down after some point). When you see this in practice you probably want to increase regularization (stronger L2 weight penalty, more dropout, etc.) or collect more data. The other possible case is when the validation accuracy tracks the training accuracy fairly well. This case indicates that your model capacity is not high enough: make the model larger by increasing the number of parameters.
+ 훈련/검증 정확도의 차이는 과적합의 정도를 가리킬 수 있다. 가능한 두 경우가 왼쪽 그림에 나타나 있다. 파란색 (검증 오류) 곡선은 훈련 정확도에 비하여 매우 낮은 검증 정확도를 보여주고 있는데, 이는 강한 과적합의 가능성을 시사한다 (검증 정확도가 어떤 지점 이후로는 오히려 떨어지기 시작할 수도 있다). 실제로 당신이 이 현상을 보게 되면 아마 정규화(regularization)를 강화하거나 (더 강한 L2 벌점(penalty)이나 드랍아웃 등) 데이터를 더 모으고 싶을 것이다. 다른 가능성으로는 검증 정확도가 훈련 정확도를 꽤 잘 따라가는 것이다. 이것은 당신의 모델의 수용량(capacity)이 충분히 높지 않음을 시사한다: 파라미터(웨이트)의 개수를 늘려서 모형을 더 크게 만들어 봐라.
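+두 정확도는 예컨대 다음과 같은 식으로 매 에폭마다 기록해 둘 수 있다. 아래는 간단한 스케치일 뿐이며, `train_one_epoch`이나 `model.predict` 같은 이름은 설명을 위해 임의로 가정한 인터페이스이다:
+
+~~~python
+import numpy as np
+
+# 가정: model.predict(X)는 예측된 클래스 라벨 벡터를 돌려주고,
+# train_one_epoch(...)은 모형을 한 에폭만큼 학습시킨다.
+train_acc_history, val_acc_history = [], []
+for epoch in xrange(num_epochs):
+    train_one_epoch(model, X_train, y_train)
+    train_acc_history.append(np.mean(model.predict(X_train) == y_train))
+    val_acc_history.append(np.mean(model.predict(X_val) == y_val))
+    # 두 곡선의 간격이 계속 벌어지면 과적합의 신호이다:
+    # 정규화를 강화하거나 데이터를 더 모으는 것을 고려하라.
+~~~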
-#### Ratio of weights:updates

-The last quantity you might want to track is the ratio of the update magnitudes to to the value magnitudes. Note: *updates*, not the raw gradients (e.g. in vanilla sgd this would be the gradient multiplied by the learning rate). You might want to evaluate and track this ratio for every set of parameters independently. A rough heuristic is that this ratio should be somewhere around 1e-3. If it is lower than this then the learning rate might be too low. If it is higher then the learning rate is likely too high. Here is a specific example:
+#### 웨이트의 현재값과 변화량의 비율 (Ratio of weights:updates)

-```python
+마지막으로, 웨이트의 현재 크기와 업데이트로 인한 변화량의 크기를 비교해 볼 수도 있다. (Note: 그냥 날 것의 그라디언트 값이 아니라, 웨이트의 *변화량*이다 (이를테면 vanilla SGD에서는 학습 속도(learning rate)와 그라디언트의 곱이다).) 모든 파라미터 집합마다 독립적으로 이 비율을 계산하고 추적해 보는 것이 좋다. 대충 짚자면 이 비율은 1e-3 근처여야 한다. 이보다 낮으면 학습 속도(learning rate)가 너무 낮은 것이다. 이보다 크면 학습 속도가 너무 크다. 특정한 예를 들자면 아래와 같다:
+
+~~~python
# assume parameter vector W and its gradient vector dW
param_scale = np.linalg.norm(W.ravel())
update = -learning_rate*dW # simple SGD update
update_scale = np.linalg.norm(update.ravel())
W += update # the actual update
print update_scale / param_scale # want ~1e-3
-```
+~~~

-Instead of tracking the min or the max, some people prefer to compute and track the norm of the gradients and their updates instead. These metrics are usually correlated and often give approximately the same results.
+최솟값이나 최댓값을 추적할 수도 있고, 그라디언트와 업데이트값의 놈(norm)을 계산하고 추적할 수도 있다. 이 지표들은 대개 연관성이 높아서 거의 비슷한 결과를 준다.

-#### Activation / Gradient distributions per layer

-An incorrect initialization can slow down or even completely stall the learning process. Luckily, this issue can be diagnosed relatively easily. One way to do so is to plot activation/gradient histograms for all layers of the network. Intuitively, it is not a good sign to see any strange distributions - e.g. with tanh neurons we would like to see a distribution of neuron activations between the full range of [-1,1], instead of seeing all neurons outputting zero, or all neurons being completely saturated at either -1 or 1.
+#### 층별 활성값 및 그라디언트의 분포 (Activation / Gradient distributions per layer)
+
+올바르지 않은 초기값 설정(initialization)은 학습 과정을 느리게 하거나 완전히 망칠 수 있다. 운 좋게도 이 이슈는 상대적으로 쉽게 분석할 수 있다. 한 방법은 활성값/그라디언트값의 히스토그램을 망(network)의 모든 층(layer)마다 그려보는 것이다. 직관적으로 생각해 보면, 만일 이상한 분포가 나오면 좋은 징조가 아닐 수 있다 - 이를테면, tanh 뉴런(neuron)에서는 활성값이 [-1,1]의 전 범위에 걸쳐 분산되어 있는 모습을 보고 싶다. 혹시 모든 활성값이 0을 내놓거나 -1 혹은 1에 집중되어 있으면 문제가 있는 것이다.

-#### First-layer Visualizations

-Lastly, when one is working with image pixels it can be helpful and satisfying to plot the first-layer features visually:
+#### 첫번째 층의 시각화 (First-layer Visualizations)
+
+마지막으로, 만일 당신이 이미지 픽셀에 관련된 일을 한다면 첫 층의 특징(feature)들을 시각화하는 것이 많은 도움이 될 수도 있다.
- Examples of visualized weights for the first layer of a neural network. Left: Noisy features indicate could be a symptom: Unconverged network, improperly set learning rate, very low weight regularization penalty. Right: Nice, smooth, clean and diverse features are a good indication that the training is proceeding well.
+ 신경망 첫 층의 웨이트값(weight)을 시각화한 예. 좌측: 특징값(feature)에 잡음(noise)이 많을 때 나타날 수 있는 증상: 수렴하지 않은 망(network), 적절하지 않은 학습 속도(learning rate), 매우 낮은 정규화 페널티(regularization penalty). 우측: 부드럽고 깨끗하며 다양한 특징값들이 보이는 경우 훈련이 잘 진행되고 있다는 지표일 수 있다.
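+이런 시각화는 예컨대 아래와 같이 첫 층의 웨이트를 이미지 격자로 그려서 만들 수 있다. CIFAR-10처럼 입력이 32x32x3 이미지이고 `W1`의 각 행이 한 뉴런의 웨이트라고 가정한 스케치이다:
+
+~~~python
+import numpy as np
+import matplotlib.pyplot as plt
+
+def show_first_layer_weights(W1, rows=10, cols=10):
+    # 가정: W1의 각 행(길이 3072)이 32x32x3 이미지 한 장으로 재배열된다.
+    for i in xrange(min(W1.shape[0], rows * cols)):
+        w = W1[i].reshape(32, 32, 3)
+        w = (w - w.min()) / (w.max() - w.min() + 1e-8)  # 보기 좋게 [0,1]로 스케일
+        plt.subplot(rows, cols, i + 1)
+        plt.imshow(w)
+        plt.axis('off')
+    plt.show()
+~~~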
-### Parameter updates

-Once the analytic gradient is computed with backpropagation, the gradients are used to perform a parameter update. There are several approaches for performing the update, which we discuss next.
+### 파라미터값의 업데이트 (Parameter updates)
+
+수식적으로 그라디언트값은 역전파(backpropagation)로 계산되고 이는 파라미터값 업데이트를 위해 사용된다. 업데이트를 수행하는 몇 가지 접근법들이 있는데 후술하겠다.
+
+딥 네트워크에서의 최적화 문제는 지금 가장 활발히 연구가 진행되고 있는 분야이다. 이 섹션에서는 (당신이 자주 보았을) 공통적으로 자주 쓰이는 테크닉과 그것들의 직관적인 아이디어를 살펴 본다. 디테일한 사항은 수업의 범위를 넘으므로 다루지 않는다. 흥미 있는 독자는 후에 등장할 몇 참고문헌을 봐도 좋다.

-We note that optimization for deep networks is currently a very active area of research. In this section we highlight some established and common techniques you may see in practice, briefly describe their intuition, but leave a detailed analysis outside of the scope of the class. We provide some further pointers for an interested reader.

-#### SGD and bells and whistles

-**Vanilla update**. The simplest form of update is to change the parameters along the negative gradient direction (since the gradient indicates the direction of increase, but we usually wish to minimize a loss function). Assuming a vector of parameters `x` and the gradient `dx`, the simplest update has the form:
+#### SGD와 그 외 (SGD and bells and whistles)

-```python
+**바닐라 업데이트 (Vanilla update)**. 가장 간단한 업데이트 형태는 그라디언트의 반대방향으로 파라미터를 업데이트하는 것이다 (왜냐하면 그라디언트는 손실함수가 증가하는 방향을 가리키는데, 우리는 손실함수를 최소화하고 싶기 때문이다). 파라미터의 벡터를 `x`라 하고 그라디언트를 `dx`라 쓰면, 가장 간단한 업데이트는 다음과 같다:
+
+~~~python
# Vanilla update
x += - learning_rate * dx
-```
+~~~

-where `learning_rate` is a hyperparameter - a fixed constant. When evaluated on the full dataset, and when the learning rate is low enough, this is guaranteed to make non-negative progress on the loss function.
+여기서 학습속도 `learning_rate` 는 하이퍼파라미터(hyperparameter)이고 고정된 상수이다. 만일 `dx`가 전체 데이터셋에서 계산되고 학습 속도가 충분히 작다면, 이 업데이트는 손실함수를 나쁘게 만들지 않는다는 것(non-negative progress)이 보장된다.

-**Momentum update** is another approach that almost always enjoys better converge rates on deep networks. This update can be motivated from a physical perspective of the optimization problem. In particular, the loss can be interpreted as a the height of a hilly terrain (and therefore also to the potential energy since \\(U = mgh\\) and therefore \\( U \propto h \\) ). Initializing the parameters with random numbers is equivalent to setting a particle with zero initial velocity at some location. The optimization process can then be seen as equivalent to the process of simulating the parameter vector (i.e. a particle) as rolling on the landscape.
+**모멘텀 업데이트 (Momentum update)**는 딥 네트워크에서 거의 언제나 바닐라 업데이트보다 더 잘 수렴하는 접근법이다. 이 방법은 최적화 문제(optimization problem)를 물리학적 관점에서 바라보는 데서 유래했다. 자세히 말하자면, 손실함수는 구릉지대에서 높이에 해당한다 (그래서 포텐셜 에너지에도 대응되는데 $$U = mgh$$이고 따라서 $$ U \propto h $$이다). 파라미터의 초기값을 임의로 정하는 것은 입자를 어떤 위치에서 0의 속도로 세팅하는 것과 똑같다. 이 상황에서 최적화 과정은 파라미터 벡터(즉 입자)를 '굴리는' 과정과 동일하다고 볼 수 있다.

-Since the force on the particle is related to the gradient of potential energy (i.e. \\(F = - \nabla U \\) ), the **force** felt by the particle is precisely the (negative) **gradient** of the loss function. Moreover, \\(F = ma \\) so the (negative) gradient is in this view proportional to the acceleration of the particle. Note that this is different from the SGD update shown above, where the gradient directly integrates the position. 
Instead, the physics view suggests an update in which the gradient only directly influences the velocity, which in turn has an effect on the position:
+입자에 작용하는 힘(force)은 포텐셜 에너지의 그라디언트 (즉 $$F = - \nabla U $$ )와 관련되어 있으므로, 입자가 느끼는 **힘**은 정확하게 손실함수의 그라디언트(의 반대부호)이다. 게다가 $$F = ma$$이므로 그 그라디언트(의 반대부호)는 입자에 작용하는 가속도에 비례한다. 위에서의 SGD와 다른 점을 발견했는가? SGD는 위치값(현재 파라미터값 - 역자 주)에 그라디언트가 직접 합쳐진다. 모멘텀 업데이트는, 물리학적 관점에서, 그라디언트가 오직 속도(velocity)에만 직접적으로 영향을 주고 속도가 위치값(position)에 영향을 줄 것을 제안하고 있다:

-```python
+~~~python
# Momentum update
v = mu * v - learning_rate * dx # integrate velocity
x += v # integrate position
-```
+~~~
+
+여기서 우리는 새로운 변수 `v`를 도입하고 0으로 초기화했다. `mu`는 또 하나의 하이퍼파라미터(hyperparameter)이다.
+정확한 용어는 아니지만 우리는 이 `mu`를 *모멘텀(운동량)*이라 부르기로 한다. (보통 0.9로 설정한다) 사실 마찰 계수라고 부르는 쪽이 더 `mu`에 맞기는 하다. 이 변수는 입자의 현재 속도 및 운동에너지를 효과적으로 감소시키도록 도와준다. 이게 없다면 아마 입자는 언덕의 아래쪽에 절대 멈추지 못할 것이다. 만약 모멘텀을 교차검증(cross-validation)으로 선택한다면 보통 [0.5, 0.9, 0.95, 0.99]로 설정한다. 에폭에 따라 모멘텀의 크기를 조정하면 최적화(optimization)에 더 이로울 수도 있다. 이를테면 시작할 때는 0.5의 모멘텀으로 시작하되 몇 번의 에폭을 지나면 0.99로 설정할 수도 있다. 이는 학습 속도의 스케줄을 담금질(annealing)하는 것과도 비슷하다. (뒤에 논의할 예정이다)

> 모멘텀 업데이트를 쓰면, (파라미터 벡터가 업데이트되는) 속도의 방향은 그라디언트들이 많이 향하는 방향으로 축적될 것이다.

-> With Momentum update, the parameter vector will build up velocity in any direction that has consistent gradient.

-**Nesterov Momentum** is a slightly different version of the momentum update has recently been gaining popularity. It enjoys stronger theoretical converge guarantees for convex functions and in practice it also consistenly works slightly better than standard momentum.
+최근에 많은 주목을 받은 **Nesterov 모멘텀 (Nesterov Momentum)** 은 모멘텀 업데이트와 조금 다르다. 볼록함수(convex function)에서는 이 업데이트가 강력한 이론적 성질을 갖고 있고, 실제상황에서도 보통의 모멘텀 방법론보다 (많은 경우에서) 조금 더 낫다고 한다.

-The core idea behind Nesterov momentum is that when the current parameter vector is at some position `x`, then looking at the momentum update above, we know that the momentum term alone (i.e. ignoring the second term with the gradient) is about to nudge the parameter vector by `mu * v`. Therefore, if we are about to compute the gradient, we can treat the future approximate position `x + mu * v` as a "lookahead" - this is a point in the vicinity of where we are soon going to end up. Hence, it makes sense to compute the gradient at `x + mu * v` instead of at the "old/stale" position `x`.
+Nesterov 모멘텀의 핵심 아이디어는 다음과 같다. 만약 현재 파라미터 벡터가 `x`라는 어떤 위치에 있다고 치고 위의 모멘텀 업데이트를 보자. 만일 위의 integrate velocity 과정에서 뒷항 없이 `v = mu * v` 만 있다고 가정하면, 다음 위치로 `x + mu * v`가 "예견"될 것이다. 그러므로 이전의/오래된 위치 `x` 대신 예견된 위치 `x + mu * v`에서 그라디언트를 계산하는 것이 합리적일 수 있다.
- Nesterov momentum. Instead of evaluating gradient at the current position (red circle), we know that our momentum is about to carry us to the tip of the green arrow. With Nesterov momentum we therefore instead evaluate the gradient at this "looked-ahead" position. + Nesterov 모멘텀. 지금 위치(붉은색 원)에서 모멘텀에 의해 연두색 화살표의 끝점으로 이동할 상황이다. Nesterov 모멘텀은 현재 위치에서 그라디언트를 계산하는 것이 아니라 이 "예견된" 위치(화살표 끝점)에서 그라디언트를 계산한다.
-That is, in a slightly awkward notation, we would like to do the following:
+(표기가 조금 어색하지만) 다른 말로 하면, 다음과 같이 계산하고 싶은 것이다:

-```python
+~~~python
x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v
-```
+~~~

-However, in practice people prefer to express the update to look as similar to vanilla SGD or to the previous momentum update as possible. This is possible to achieve by manipulating the update above with a variable transform `x_ahead = x + mu * v`, and then expressing the update in terms of `x_ahead` instead of `x`. That is, the parameter vector we are actually storing is always the ahead version. The equations in terms of `x_ahead` (but renaming it back to `x`) then become:
+실제 용례에서 사람들은 위 식을 재서술하여 바닐라 SGD나 이전의 모멘텀 업데이트의 꼴처럼 고칠 때가 있다. 이를테면 `x_ahead = x + mu * v`라는 변수 변환을 쓰고, 업데이트를 `x`의 관점이 아닌 `x_ahead`의 관점에서 서술하면 (그리고 `x_ahead`를 `x`로 고쳐쓰면) 아래와 같다. 사족을 달자면 이제 우리가 저장하는 파라미터 벡터는 언제나 "예견된" 버전이다.

-```python
+~~~python
v_prev = v # back this up
v = mu * v - learning_rate * dx # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v # position update changes form
-```
+~~~
+

-We recommend this further reading to understand the source of these equations and the mathematical formulation of Nesterov's Accelerated Momentum (NAG):
+위 식들의 출처와 Nesterov's Accelerated Momentum (NAG)의 수학적 서술에 대해 더 알아보고 싶으면 아래를 참조하라.

- [Advances in optimizing Recurrent Networks](http://arxiv.org/pdf/1212.0901v2.pdf) by Yoshua Bengio, Section 3.5.
- [Ilya Sutskever's thesis](http://www.cs.utoronto.ca/~ilya/pubs/ilya_sutskever_phd_thesis.pdf) (pdf) contains a longer exposition of the topic in section 7.2

-#### Annealing the learning rate

-In training deep networks, it is usually helpful to anneal the learning rate over time. Good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into deeper, but narrower parts of the loss function. Knowing when to decay the learning rate can be tricky: Decay it slowly and you'll be wasting computation bouncing around chaotically with little improvement for a long time. But decay it too aggressively and the system will cool too quickly, unable to reach the best position it can. There are three common types of implementing the learning rate decay:
+#### 학습 속도 담금질 (Annealing the learning rate)

+깊은 신경망의 훈련에서 시간에 따라 학습 속도를 담금질(anneal, 조정)하는 것은 대개 도움이 된다. 이 직관을 기억해 두면 도움이 된다: 높은 학습 속도에서는, 전체 시스템이 너무 높은 운동 에너지를 갖고 있어서 파라미터 벡터가 혼돈스럽게 튀고, (손실 함수의) 좁고 깊숙한 골짜기 안으로 쏙 들어가서 정착하기 힘들다.
+학습 속도를 언제 줄일지 아는 것은 까다로운(tricky) 문제다. 너무 천천히 줄이면 오랜 시간 동안 거의 개선 없이 혼돈스럽게 왔다갔다 하며 계산만 낭비할 것이다. 
그렇지만 너무 빨리 줄이면 전체 시스템이 너무 빨리 식을 것이고, 갈 수 있는 최적의 장소에 도달하지 못할 수 있다. 학습속도를 감소시키는 방법은 보통 다음 세 가지가 있다.

-- **Step decay**: Reduce the learning rate by some factor every few epochs. Typical values might be reducing the learning rate by a half every 5 epochs, or by 0.1 every 20 epochs. These numbers depend heavily on the type of problem and the model. One heuristic you may see in practice is to watch the validation error while training with a fixed learning rate, and reduce the learning rate by a constant (e.g. 0.5) whenever the validation error stops improving.
-- **Exponential decay.** has the mathematical form \\(\alpha = \alpha\_0 e^{-k t}\\), where \\(\alpha\_0, k\\) are hyperparameters and \\(t\\) is the iteration number (but you can also use units of epochs).
-- **1/t decay** has the mathematical form \\(\alpha = \alpha\_0 / (1 + k t )\\) where \\(a\_0, k\\) are hyperparameters and \\(t\\) is the iteration number.

-In practice, we find that the step decay dropout is slightly preferable because the hyperparameters it involves (the fraction of decay and the step timings in units of epochs) are more interpretable than the hyperparameter \\(k\\). Lastly, if you can afford the computational budget, err on the side of slower decay and train for a longer time.

+- **계단식 감소 (step decay)**: 몇 에폭마다 일정량만큼 학습 속도를 줄인다. 전형적으로는 5 에폭마다 반으로 줄이거나 20 에폭마다 1/10씩 줄이기도 한다. 이 숫자들은 전적으로 문제와 모형의 타입에 의존한다. 실전에서는, 우선 고정된 학습 속도로 검증오차(validation error)를 살펴보다가, 검증오차가 개선되지 않을 때마다 학습 속도를 일정량(이를테면 0.5 정도) 감소시키는 방법을 택하기도 한다.
+- **지수적 감소 (exponential decay)**는 $$\alpha = \alpha_0 e^{-k t}$$ 꼴을 뜻한다. 여기서 $$\alpha_0, k$$는 초모수(hyperparameter)이고 $$t$$는 반복 횟수이다 (물론 에폭을 단위로 해도 된다.)
+- **1/t 감소**는 $$\alpha = \alpha_0 / (1 + k t )$$ 꼴을 뜻하고 여기서 $$\alpha_0, k$$는 초모수이고 $$t$$는 반복 횟수이다.
+
+실전에서는 계단식 감소 방식이 조금 더 선호될 만한데, 관련된 초모수들(몇 에폭마다 감소시킬지, 그리고 감소율)이 $$k$$에 비해서 해석이 더 쉽기 때문이다. 마지막으로, 계산 자원이 충분하다면, 감소율을 좀 더 낮춰서 오랜 시간동안 (모형을) 훈련시켜라.

-#### Second order methods

-A second, popular group of methods for optimization in context of deep learning is based on [Newton's method](http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization), which iterates the following update:
+#### 이차 근사 방법들 (Second order methods)
+
+딥러닝의 맥락에서 두 번째로 대중적인 최적화 방법은 [뉴턴 방법(Newton's method)](http://en.wikipedia.org/wiki/Newton%27s_method_in_optimization)인데 다음과 같은 업데이트 방식을 뜻한다:

$$
x \leftarrow x - [H f(x)]^{-1} \nabla f(x)
$$

-Here, \\(H f(x)\\) is the [Hessian matrix](http://en.wikipedia.org/wiki/Hessian_matrix), which is a square matrix of second-order partial derivatives of the function. The term \\(\nabla f(x)\\) is the gradient vector, as seen in Gradient Descent. Intuitively, the Hessian describes the local curvature of the loss function, which allows us to perform a more efficient update. In particular, multiplying by the inverse Hessian leads the optimization to take more aggressive steps in directions of shallow curvature and shorter steps in directions of steep curvature. Note, crucially, the absence of any learning rate hyperparameters in the update formula, which the proponents of these methods cite this as a large advantage over first-order methods.
+여기서 $$H f(x)$$는 [헤시안 행렬(Hessian matrix)](http://en.wikipedia.org/wiki/Hessian_matrix)로, (다변수 함수의) 2차 미분으로 이루어진 정방행렬을 뜻한다. $$\nabla f(x)$$ 항은 (그라디언트 감소 Gradient Descent에서 보았던) 그라디언트 벡터이다. 직관적으로 헤시안 행렬은 어떤 함수의 국지적인 곡률(curvature)을 뜻하고 이 정보로 우리는 더 효율적인 업데이트를 수행할 수 있다. 특별히, 헤시안 행렬의 역행렬을 곱함으로써, 휨이 약한 방향으로는 더 공격적으로 그리고 휨이 강한 방향으로는 짧게짧게 움직일 수 있다. 일차 근사 방법에 비해 뉴턴 방법이 가지는 강점은, 위의 업데이트 공식을 보면 학습 속도(learning rate)에 대한 초모수(hyperparameter)가 없다는 것이다.

-However, the update above is impractical for most deep learning applications because computing (and inverting) the Hessian in its explicit form is a very costly process in both space and time. For instance, a Neural Network with one million parameters would have a Hessian matrix of size [1,000,000 x 1,000,000], occupying approximately 3725 gigabytes of RAM. Hence, a large variety of *quasi-Newton* methods have been developed that seek to approximate the inverse Hessian. Among these, the most popular is [L-BFGS](http://en.wikipedia.org/wiki/Limited-memory_BFGS), which uses the information in the gradients over time to form the approximation implicitly (i.e. the full matrix is never computed). 
+그렇지만 위의 업데이트는 거의 모든 실제 상황에서는 쓸모가 없는 게, 공식 그대로(explicitly) 헤시안 행렬을 계산한다면 (역행렬을 취하는 일 포함하여) 상상도 못할 시간과 메모리가 필요하다. 예를 들면, 모수가 백만 개 정도인 신경망은 [1,000,000 x 1,000,000] 크기의 헤시안 행렬을 필요로 하고 이는 약 3725GB의 램(RAM)을 필요로 한다. 그 결과로 다양한 *유사-뉴턴* 방법이 역-헤시안 행렬을 근사하기 위해 고안되었다. 이 방법론들 중 [L-BFGS](http://en.wikipedia.org/wiki/Limited-memory_BFGS)가 가장 대중적이다. L-BFGS는 시간(iteration)에 따른 그라디언트의 변화를 (간접적으로) 근사에 이용한다. 즉, 전체 행렬은 절대 계산되지 않는다.

-However, even after we eliminate the memory concerns, a large downside of a naive application of L-BFGS is that it must be computed over the entire training set, which could contain millions of examples. Unlike mini-batch SGD, getting L-BFGS to work on mini-batches is more tricky and an active area of research.
+그렇다고 해도, 메모리 걱정을 없앴다고 할지라도, L-BFGS를 그냥 적용하자면 큰 단점이 하나 있는데 바로 전체 훈련 집합(training set)을 대상으로 계산하여야 한다는 점이다. 수백만 개체가 있는 그 데이터셋 말이다. 미니배치(mini-batch) SGD와는 달리, L-BFGS가 미니배치에서 작동하게 하는 방법은 좀더 꼼수를 필요로 하며 활발한 연구 분야이다.

-**In practice**, it is currently not common to see L-BFGS or similar second-order methods applied to large-scale Deep Learning and Convolutional Neural Networks. Instead, SGD variants based on (Nesterov's) momentum are more standard because they are simpler and scale more easily.
+**실제 상황에서는**, 지금까지는, L-BFGS나 다른 이차 근사 방법이 대규모 딥러닝이나 CNN에서 사용되지는 않는 게 보통이다. 표준적으로는 SGD와 그 변종들 (모멘텀이나 Nesterov's 모멘텀)이 훨씬 간단하고 계산도 빨라서 많이 사용된다.

-Additional references:
+추가 참고문헌:

-- [Large Scale Distributed Deep Networks](http://research.google.com/archive/large_deep_networks_nips2012.html) is a paper from the Google Brain team, comparing L-BFGS and SGD variants in large-scale distributed optimization.
-- [SFO](http://arxiv.org/abs/1311.2115) algorithm strives to combine the advantages of SGD with advantages of L-BFGS.
+- [Large Scale Distributed Deep Networks](http://research.google.com/archive/large_deep_networks_nips2012.html)은 Google Brain team이 출판하였다. 대규모 분산 최적화 (large-scale distributed optimization)에서 L-BFGS와 SGD(의 변형 방법론들)을 비교하였다.
+- [SFO](http://arxiv.org/abs/1311.2115) 알고리즘은 SGD와 L-BFGS의 장점을 혼합하고자 노력하였다.

-#### Per-parameter adaptive learning rate methods

-All previous approaches we've discussed so far manipulated the learning rate globally and equally for all parameters. Tuning the learning rates is an expensive process, so much work has gone into devising methods that can adaptively tune the learning rates, and even do so per parameter. Many of these methods may still require other hyperparameter settings, but the argument is that they are well-behaved for a broader range of hyperparameter values than the raw learning rate. In this section we highlight some common adaptive methods you may encounter in practice:
+#### 파라미터별 데이터-맞춤 학습 속도 (Per-parameter adaptive learning rates)

+지금까지 논의된 접근법들은 모든 파라미터에 똑같은 학습 속도를 적용하였다. 학습 속도의 튜닝(tuning)은 계산이 많은(expensive) 작업인지라, 데이터에 맞추어(adaptively) 자동으로 학습 속도를 정하는 방법을 찾고자 많은 사람들이 노력하였다. 파라미터별로 학습 속도를 다르게 하고 이를 데이터-맞춤으로 정하려는 노력들 또한 있었다. 이러한 방법들은 보통 또 다른 초모수(hyperparameter) 세팅이 필요하긴 하지만, 이 초모수는 넓은 범위에서 잘 작동하는 편이라 일반적인 학습 속도 튜닝보다는 덜 까다롭다. 이번 절에서는 실전에서 마주칠 수도 있는 주요 데이터-맞춤 방법들을 조망해본다:
+

+**Adagrad**는 데이터-맞춤 학습속도 조정 방법 중 하나이고 [Duchi et al.](http://jmlr.org/papers/v12/duchi11a.html)에서 처음 제안되었다. 
+
+~~~python
# Assume the gradient dx and parameter vector x
cache += dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
-```
+~~~

-Notice that the variable `cache` has size equal to the size of the gradient, and keeps track of per-parameter sum of squared gradients. This is then used to normalize the parameter update step, element-wise. Notice that the weights that receive high gradients will have their effective learning rate reduced, while weights that receive small or infrequent updates will have their effective learning rate increased. Amusingly, the square root operation turns out to be very important and without it the algorithm performs much worse. The smoothing term `eps` (usually set somewhere in range from 1e-4 to 1e-8) avoids division by zero. A downside of Adagrad is that in case of Deep Learning, the monotonic learning rate usually proves too aggressive and stops learning too early.
+위에서 변수 `cache`는 그라디언트 벡터의 사이즈와 동일한 사이즈를 갖고 있다. `cache`의 각 성분은 (해당 성분에 대응하는) 그라디언트의 제곱값들을 계속 누적하고 있고, 파라미터 업데이트에서, 성분별로, 일종의 표준화 기능을 수행한다. 주목할 점은, 높은 그라디언트값을 갖는 웨이트값(weight)들은 점점 실질적인 학습속도(effective learning rate)가 감소하는 반면, 그라디언트 값이 낮거나 업데이트가 거의 없는 웨이트값들은 실질 학습속도가 증가한다는 것이다. 놀랍게도 제곱근(square root) 연산이 여기서 중요한 비중을 차지한다. 제곱근이 없다면 알고리즘의 성능이 많이 나빠진다. 변수 `eps`는 분모가 너무 0에 가깝지 않도록 안정화 역할을 하고 주로 1e-4에서 1e-8의 값이 할당된다. Adagrad의 단점이 있다면, 딥러닝의 경우, 단조 감소하는 학습 속도가 보통 너무 급격해서(aggressive) 학습이 너무 일찍 멈추곤 한다는 것이다.

-**RMSprop.** RMSprop is a very effective, but currently unpublished adaptive learning rate method. Amusingly, everyone who uses this method in their work currently cites [slide 29 of Lecture 6](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf) of Geoff Hinton's Coursera class. The RMSProp update adjusts the Adagrad method in a very simple way in an attempt to reduce its aggressive, monotonically decreasing learning rate. In particular, it uses a moving average of squared gradients instead, giving:
+**RMSprop.** RMSprop는 매우 효과적이지만 아직 출판되지 않은 데이터-맞춤 학습속도 조정 방법이다. 재미있게도, 이 방법을 쓰는 사람들은 현재 모두 Geoff Hinton의 Coursera 강의 슬라이드 [slide 29 of Lecture 6](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)를 인용한다. (역자 주: 2016년 8월 현재에도 검색결과 논문을 찾지는 못하였습니다. Goodfellow et al.의 책 [Deep Learning](http://www.deeplearningbook.org)의 8장에 줄글로 설명이 있습니다.) RMSProp 업데이트는 Adagrad를 간단히 조정하여 급진적이고 단조감소하는 학습속도를 경감시켰다. 어떻게? Adagrad처럼 제곱 그라디언트를 모두 누적하는 대신, 제곱 그라디언트의 이동평균(moving average)을 사용한다:

-```python
+~~~python
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
-```
+~~~

-Here, `decay_rate` is a hyperparameter and typical values are [0.9, 0.99, 0.999]. Notice that the `x+=` update is identical to Adagrad, but the `cache` variable is a "leaky". Hence, RMSProp still modulates the learning rate of each weight based on the magnitudes of its gradients, which has a beneficial equalizing effect, but unlike Adagrad the updates do not get monotonically smaller.
+여기서 `decay_rate`는 초모수이고 보통 [0.9, 0.99, 0.999] 중 하나의 값을 취한다. 주목할 점은 `x +=` 업데이트는 Adagrad와 동일하지만, `cache`가 조금씩 "새는(leaky)" 변수라는 것이다. 따라서 RMSProp은 여전히 각 웨이트값을 그것의 과거 그라디언트 크기에 따라 조정하여 성분별로 실질 학습속도를 비슷하게 만드는 효과는 갖고 있지만, Adagrad처럼 학습 속도가 단조적으로 줄지는 않는다.

-**Adam.** [Adam](http://arxiv.org/abs/1412.6980) is a recently proposed update that looks a bit like RMSProp with momentum. The (simplified) update looks as follows:
+**Adam.** [Adam](http://arxiv.org/abs/1412.6980)은 최근에 제안된 방법인데 RMSProp에 모멘텀(momentum)을 혼합한 것처럼 보인다. 
간단하게 쓰면 업데이트는 다음과 같다:

-```python
+~~~python
m = beta1*m + (1-beta1)*dx
v = beta2*v + (1-beta2)*(dx**2)
x += - learning_rate * m / (np.sqrt(v) + eps)
-```
+~~~

-Notice that the update looks exactly as RMSProp update, except the "smooth" version of the gradient `m` is used instead of the raw (and perhaps noisy) gradient vector `dx`. Recommended values in the paper are `eps = 1e-8`, `beta1 = 0.9`, `beta2 = 0.999`. In practice Adam is currently recommended as the default algorithm to use, and often works slightly better than RMSProp. However, it is often also worth trying SGD+Nesterov Momentum as an alternative. The full Adam update also includes a *bias correction* mechanism, which compensates for the fact that in the first few time steps the vectors `m,v` are both initialized and therefore biased at zero, before they fully "warm up". We refer the reader to the paper for the details, or the course slides where this is expanded on.
+업데이트는 RMSProp의 업데이트 방식과 정확히 같아 보이는데, 그냥 (노이즈가 껴있을 수도 있는) 그라디언트 `dx` 대신에 "안정화된" 버전인 `m`이 사용되었다는 점이 다르다. 논문에 따르면 추천되는 초모수값들은 `eps = 1e-8`, `beta1 = 0.9`, `beta2 = 0.999`이다. 실전에서 Adam은 기본 알고리즘으로 추천되고 있고, 종종 RMSProp보다 조금 더 잘 작동한다. 그러나 SGD+Nesterov Momentum도 대안으로 해볼만 하다. Adam 업데이트 절차에는 *편향 보정(bias correction)* 메커니즘이 반영되어 있는데, 벡터 `m,v`가 나중에 완벽하게 "워밍업" 되기 전에 (iteration의 처음 몇 스텝에서) 초기화되어 0에 편향되어 있다는 점을 보상하기 위해서이다. 자세한 사항은 논문이나 강의 코스 슬라이드를 참조하라.

-Additional References:
+추가 참고문헌:

-- [Unit Tests for Stochastic Optimization](http://arxiv.org/abs/1312.6055) proposes a series of tests as a standardized benchmark for stochastic optimization.
+- [Unit Tests for Stochastic Optimization](http://arxiv.org/abs/1312.6055)는 (지금까지 제안된) 확률적 최적화(stochastic optimization) 방법들을 평가하는 표준적인 테스트들을 제안하고 있다.
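+참고로, 이 절에서 소개한 업데이트 규칙들을 한데 모아 함수 꼴로 쓰면 대략 다음과 같은 모양일 수 있다. 정확한 구현이라기보다는 스케치이며, 함수 이름들은 예시이고 상태 변수들(`v`, `cache`, `m`)은 모두 파라미터 `x`와 같은 모양의 0 벡터로 초기화되어 있다고 가정한다:
+
+~~~python
+import numpy as np
+
+def momentum_update(x, dx, v, learning_rate=1e-2, mu=0.9):
+    v = mu * v - learning_rate * dx          # 속도를 먼저 업데이트
+    return x + v, v
+
+def rmsprop_update(x, dx, cache, learning_rate=1e-3, decay_rate=0.99, eps=1e-8):
+    cache = decay_rate * cache + (1 - decay_rate) * dx**2
+    return x - learning_rate * dx / (np.sqrt(cache) + eps), cache
+
+def adam_update(x, dx, m, v, t, learning_rate=1e-3,
+                beta1=0.9, beta2=0.999, eps=1e-8):
+    m = beta1 * m + (1 - beta1) * dx
+    v = beta2 * v + (1 - beta2) * (dx**2)
+    mt = m / (1 - beta1**t)                  # 편향 보정 (t는 1부터 시작)
+    vt = v / (1 - beta2**t)
+    return x - learning_rate * mt / (np.sqrt(vt) + eps), m, v
+~~~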
- Animations that may help your intuitions about the learning process dynamics. Left: Contours of a loss surface and time evolution of different optimization algorithms. Notice the "overshooting" behavior of momentum-based methods, which make the optimization look like a ball rolling down the hill. Right: A visualization of a saddle point in the optimization landscape, where the curvature along different dimension has different signs (one dimension curves up and another down). Notice that SGD has a very hard time breaking symmetry and gets stuck on the top. Conversely, algorithms such as RMSprop will see very low gradients in the saddle direction. Due to the denominator term in the RMSprop update, this will increase the effective learning rate along this direction, helping RMSProp proceed. Images credit: Alec Radford.
+ 이 동영상이 학습 과정에서의 동역학(dynamics)을 직관적으로 이해하는 데 도움이 되길 바란다.
+ 왼쪽: 손실 함수의 등고선 위에서 각 최적화 알고리즘들의 시간(iteration)에 따른 변화. 모멘텀-기반 방법론들의 "오버슈팅(overshooting)" 행동들을 주목하라. 이게 최적화를 마치 언덕을 내려가는 공처럼 보이게 만든다. 오른쪽: 목적함수에 안장점(saddle point)이 있을 때의 시각화. 안장점은 그라디언트가 0이지만 헤시안 행렬의 고유치(eigenvalue)에 양수/음수가 섞여 있을 때 발생한다. SGD는 안장점에서 빠져나오는 데 매우 힘든 시간을 겪는다. 반대로, RMSprop 같은 알고리즘들은 안장의 방향으로 매우 작은 그라디언트를 마주하게 되지만, 분모-표준화 성질 덕분에 이 방향의 실질 학습속도가 높아질 수 있고 따라서 이 방향으로 빠져나올 수 있다. Images credit: Alec Radford.
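+덧붙여, 앞의 '학습 속도 담금질' 절에서 언급한 세 가지 감소 스케줄을 코드로 옮기면 대략 아래와 같은 모양일 수 있다. 함수 이름과 상수 값들은 임의의 예시이다:
+
+~~~python
+import numpy as np
+
+def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=5):
+    # 몇 에폭마다 일정 비율로 학습 속도를 깎는 계단식 감소
+    return lr0 * (drop ** np.floor(epoch / float(epochs_per_drop)))
+
+def exp_decay(lr0, t, k=0.01):
+    return lr0 * np.exp(-k * t)    # alpha = alpha_0 * e^{-kt}
+
+def one_over_t_decay(lr0, t, k=0.01):
+    return lr0 / (1.0 + k * t)     # alpha = alpha_0 / (1 + kt)
+~~~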
-### Hyperparameter optimization

-As we've seen, training Neural Networks can involve many hyperparameter settings. The most common hyperparameters in context of Neural Networks include:
+### 초모수 최적화 (Hyperparameter optimization)
+
+일전에 본 대로, 신경망(neural network)의 훈련에는 많은 초모수(hyperparameter) 설정이 관련된다. 신경망 관련 논의에서 가장 빈번하게 등장하는 초모수는 다음과 같다:

-- the initial learning rate
-- learning rate decay schedule (such as the decay constant)
-- regularization strength (L2 penalty, dropout strength)
+- 학습속도의 초기값(the initial learning rate)
+- 학습속도 경감 계획, 이를테면 경감 상수 (learning rate decay schedule (such as the decay constant))
+- L2나 드랍아웃 페널티의 정규화 강도 (regularization strength (L2 penalty, dropout strength))

-But as saw, there are many more relatively less sensitive hyperparameters, for example in per-parameter adaptive learning methods, the setting of momentum and its schedule, etc. In this section we describe some additional tips and tricks for performing the hyperparameter search:
+그렇지만 역시 본 대로, 덜 민감한 초모수들도 있는데, 이들은 파라미터별 데이터-맞춤 학습 방법, 모멘텀이나 관련 스케줄 등에서 등장하였다. 이번 절에서는 초모수 최적화를 수행하기 위한 추가적인 팁이나 트릭들을 언급한다:

-**Implementation**. Larger Neural Networks typically require a long time to train, so performing hyperparameter search can take many days/weeks. It is important to keep this in mind since it influences the design of your code base. One particular design is to have a **worker** that continuously samples random hyperparameters and performs the optimization. During the training, the worker will keep track of the validation performance after every epoch, and writes a model checkpoint (together with miscellaneous training statistics such as the loss over time) to a file, preferably on a shared file system. It is useful to include the validation performance directly in the filename, so that it is simple to inspect and sort the progress. Then there is a second program which we will call a **master**, which launches or kills workers across a computing cluster, and may additionally inspect the checkpoints written by workers and plot their training statistics, etc.
+**코드 구성 단계에서 (Implementation)**. 큰 신경망은 대개 긴 학습시간이 걸리고, 따라서 초모수 검색에는 며칠, 몇 주가 걸릴 수도 있다. 코드를 짤 때 이 점을 염두에 두는 것이 중요하다 (코드 베이스의 구성이 달라질 수도 있다). 하나 가능한 코드 구성은, 초모수를 계속 임의로 선택하여 최적화를 수행하는 **일꾼**을 만드는 것이다. 이 일꾼에게 훈련 과정에서 매 에폭 뒤의 검증 성능을 쭉 추적하여 모형의 체크포인트들을 (다른 훈련 통계량들, 이를테면 시간에 따른 손실함수값들과 함께) 파일에 저장케 하라. 공유 파일 시스템 위에 저장하면 더 좋다. 검증 성능을 아예 직접 파일 이름에 써 놓는 것도 괜찮다. 그러면 진행 상황을 살펴보고 정렬하기가 간단해진다. 그리고 **마스터**라 불릴 두번째 프로그램을 만들어서 계산 클러스터별로 일꾼들을 개시(launch)하거나 끝내(kill)게 하라. 혹은 마스터는 일꾼이 작성한 체크포인트들을 조사하고 훈련 통계량들로 그림을 그릴 수도 있다.

-**Prefer one validation fold to cross-validation**. In most cases a single validation set of respectable size substantially simplifies the code base, without the need for cross-validation with multiple folds. You'll hear people say they "cross-validated" a parameter, but many times it is assumed that they still only used a single validation set.
+**교차검증보다는 단일한 검증 집합 (Prefer one validation fold to cross-validation)**. 많은 경우에, 적당한 크기의 검증 집합을 설정해 두어 한 번만 검증하는 것이, 여러 번의 교차검증보다 코드를 단순화시킨다. 사람들이 "교차검증" 했다고 얘기해도, 많은 경우에 그 사람들은 단일한 검증 집합만 썼을 것이다.

-**Hyperparameter ranges**. Search for hyperparameters on log scale. For example, a typical sampling of the learning rate would look as follows: `learning_rate = 10 ** uniform(-6, 1)`. That is, we are generating a random random with a uniform distribution, but then raising it to the power of 10. The same strategy should be used for the regularization strength. 
Intuitively, this is because learning rate and regularization strength have multiplicative effects on the training dynamics. For example, a fixed change of adding 0.01 to a learning rate has huge effects on the dynamics if the learning rate is 0.001, but nearly no effect if the learning rate when it is 10. This is because the learning rate multiplies the computed gradient in the update. Therefore, it is much more natural to consider a range of learning rate multiplied or divided by some value, than a range of learning rate added or subtracted to by some value. Some parameters (e.g. dropout) are instead usually searched in the original scale (e.g. `dropout = uniform(0,1)`).
+**초모수의 범위 (Hyperparameter ranges)**. 로그 스케일로 초모수를 찾아라. 예를 들어, 학습 속도의 선정은 전형적으로 다음과 같이 보일 수도 있다: `learning_rate = 10 ** uniform(-6, 1)`. 다시 말하면, 균등분포에서 난수를 뽑은 뒤에 이를 10의 지수(exponent)로 사용하는 것이다. 같은 전략이 정규화 강도 검색에도 사용되어야 한다. 왜냐고? 직관적으로, 학습 속도와 정규화 강도는 학습 동역학에 배수적인(multiplicative) 효과가 있기 때문이다 - 학습 속도는 업데이트에서 그라디언트에 곱해지는 수이다. 이를테면, 최초 학습 속도가 0.001이면 여기에 0.01을 더할 경우 동역학에 큰 영향을 미치지만 최초 학습 속도가 10인 경우에는 거의 영향이 없다. 그러므로 학습 속도의 범위는 어떤 값을 계속 곱하거나 나누는 것이 (빼거나 더하는 것보다) 더 자연스럽다. 대신에, 어떤 초모수들(이를테면 드랍아웃)은 보통의 스케일에서 검색된다 (예. `dropout = uniform(0,1)`).

-**Prefer random search to grid search**. As argued by Bergstra and Bengio in [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf), "randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid". As it turns out, this is also usually easier to implement.
+**그리드 검색보다는 임의 검색 (Prefer random search to grid search)**은 Bergstra and Bengio가 쓴 다음 논문에서 논의되었다: [Random Search for Hyper-Parameter Optimization](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf), "randomly chosen trials are more efficient for hyper-parameter optimization than trials on a grid". 그리고 밝혀진 대로, 이게 더 구현하기 쉽다.
- Core illustration from Random Search for Hyper-Parameter Optimization by Bergstra and Bengio. It is very often the case that some of the hyperparameters matter much more than others (e.g. top hyperparam vs. left one in this figure). Performing random search rather than grid search allows you to much more precisely discover good values for the important ones.
+ Bergstra and Bengio의 논의의 핵심을 도식화하였다. (Random Search for Hyper-Parameter Optimization). 어떤 초모수는 다른 것보다 훨씬 중요할 때가 많다 (예. 이 그림에서 위쪽 축의 초모수 vs. 왼쪽 축의 초모수). 그리드 검색 대신 임의 검색을 수행하면 중요한 초모수의 좋은 값을 훨씬 더 정밀하게 찾을 수 있다.
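+위 논의를 코드로 옮기면, 일꾼(worker) 하나가 수행하는 임의 검색 루프는 대략 다음과 같은 스케치가 된다. `train_and_eval`은 주어진 초모수로 (짧게) 훈련한 뒤 검증 정확도를 돌려준다고 가정한 가상의 함수이다:
+
+~~~python
+import numpy as np
+
+best_setting, best_val_acc = None, -1.0
+for trial in xrange(100):
+    lr = 10 ** np.random.uniform(-6, 1)      # 학습 속도는 로그 스케일에서 샘플
+    reg = 10 ** np.random.uniform(-5, 5)     # 정규화 강도도 로그 스케일
+    dropout = np.random.uniform(0, 1)        # 드랍아웃은 보통 스케일
+    val_acc = train_and_eval(lr, reg, dropout)   # 가정: 훈련 후 검증 정확도 반환
+    if val_acc > best_val_acc:
+        best_setting, best_val_acc = (lr, reg, dropout), val_acc
+~~~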
-**Careful with best values on border**. Sometimes it can happen that you're searching for a hyperparameter (e.g. learning rate) in a bad range. For example, suppose we use `learning_rate = 10 ** uniform(-6, 1)`. Once we receive the results, it is important to double check that the final learning rate is not at the edge of this interval, or otherwise you may be missing more optimal hyperparameter setting beyond the interval.
+**가장 좋은 값이 경계에 있으면 조심하라 (Careful with best values on border)**. 가끔은 초모수 검색 범위 (이를테면 학습 속도) 가 나쁘게 설정되었을 수도 있다. 이를테면, `learning_rate = 10 ** uniform(-6, 1)`을 사용한다고 가정하여 보자. 한번 결과를 받았으면, 최종 학습 속도가 이 구간의 가장자리에 있지는 않은지 다시 확인하는 것이 중요하다. 그렇지 않으면 당신은 (구간 밖에 있는) 더 최적의 초모수 설정을 놓치고 있을지도 모른다.

-**Stage your search from coarse to fine**. In practice, it can be helpful to first search in coarse ranges (e.g. 10 ** [-6, 1]), and then depending on where the best results are turning up, narrow the range. Also, it can be helpful to perform the initial coarse search while only training for 1 epoch or even less, because many hyperparameter settings can lead the model to not learn at all, or immediately explode with infinite cost. The second stage could then perform a narrower search with 5 epochs, and the last stage could perform a detailed search in the final range for many more epochs (for example).
+**성긴 검색에서 촘촘한 검색으로 (Stage your search from coarse to fine)**. 실전에서는, 처음에는 널찍한 범위에서 검색을 하다가 (예. 10 ** [-6, 1]), 좋은 결과가 어디에서 발생하냐에 따라 범위를 좁힐 수도 있다. 또한, 처음의 성긴 검색에서는 1 에폭이나 혹은 더 적게만 훈련하는 게 도움이 될 수도 있는데, 왜냐하면 많은 초모수 세팅에서는 하나도 학습하는 게 없을 수도 있거나 즉시 무한대의 손실함수값으로 폭발할 수도 있기 때문이다. 두 번째 단계는 좀더 좁은 범위에서의 검색을, 5 에폭 정도로, 할 수 있을 것이다. 그리고 마지막 검색에서는 좁은 범위에서 많은 에폭의 훈련을 수행해도 좋겠다.

-**Bayesian Hyperparameter Optimization** is a whole area of research devoted to coming up with algorithms that try to more efficiently navigate the space of hyperparameters. The core idea is to appropriately balance the exploration - exploitation trade-off when querying the performance at different hyperparameters. Multiple libraries have been developed based on these models as well, among some of the better known ones are [Spearmint](https://github.com/JasperSnoek/spearmint), [SMAC](http://www.cs.ubc.ca/labs/beta/Projects/SMAC/), and [Hyperopt](http://jaberg.github.io/hyperopt/). However, in practical settings with ConvNets it is still relatively difficult to beat random search in a carefully-chosen intervals. See some additional from-the-trenches discussion [here](http://nlpers.blogspot.com/2014/10/hyperparameter-search-bayesian.html).
+**베이지안 초모수 최적화 (Bayesian Hyperparameter Optimization)**는 초모수 공간을 좀 더 효율적으로 항해하는 알고리즘을 고안하기 위한 연구 분야이다. 핵심 아이디어는 초모수들의 성능을 평가할 때 탐험(exploration)-활용(exploitation)의 상충(trade-off)에서 적절한 균형을 찾는 것이다. 많은 라이브러리들이 이 모형에 기반하여 개발되었고 그 중에 잘 알려진 것은 [Spearmint](https://github.com/JasperSnoek/spearmint), [SMAC](http://www.cs.ubc.ca/labs/beta/Projects/SMAC/), 그리고 [Hyperopt](http://jaberg.github.io/hyperopt/)이다. 그러나, ConvNet 관련된 실전 세팅에서는 아직 조심스레 선택된 구간에서의 임의 검색을 이기기가 상대적으로 어렵다. 실전 경험에서 나온(from-the-trenches) 추가적인 논의는 [여기](http://nlpers.blogspot.com/2014/10/hyperparameter-search-bayesian.html)를 참조하라.

-## Evaluation
+
+## 평가

-### Model Ensembles

-In practice, one reliable approach to improving the performance of Neural Networks by a few percent is to train multiple independent models, and at test time average their predictions. As the number of models in the ensemble increases, the performance typically monotonically improves (though with diminishing returns). 
Moreover, the improvements are more dramatic with higher model variety in the ensemble. There are a few approaches to forming an ensemble:
+### 모형 앙상블 (Model Ensembles)
+
+실전에서, 신경망(neural network)의 성능을 몇 퍼센트 끌어올릴 수 있는 믿을 만한 방법이 하나 있는데 바로 여러 개의 독립적인 모형을 만들고 테스트 때 그들의 평균 예측을 취하는 것이다. 앙상블에 관여하는 모형이 많아지면, 보통 성능은 단조적으로 개선된다 (비록 개선 정도가 점점 떨어질지라도). 게다가, 앙상블 내에서 모형의 다양함이 늘어날수록 성능의 개선은 더 극적이다. 아래는 앙상블을 구축하는 몇 가지 방법이다:

-- **Same model, different initializations**. Use cross-validation to determine the best hyperparameters, then train multiple models with the best set of hyperparameters but with different random initialization. The danger with this approach is that the variety is only due to initialization.
-- **Top models discovered during cross-validation**. Use cross-validation to determine the best hyperparameters, then pick the top few (e.g. 10) models to form the ensemble. This improves the variety of the ensemble but has the danger of including suboptimal models. In practice, this can be easier to perform since it doesn't require additional retraining of models after cross-validation
-- **Different checkpoints of a single model**. If training is very expensive, some people have had limited success in taking different checkpoints of a single network over time (for example after every epoch) and using those to form an ensemble. Clearly, this suffers from some lack of variety, but can still work reasonably well in practice. The advantage of this approach is that is very cheap.
-- **Running average of parameters during training**. Related to the last point, a cheap way of almost always getting an extra percent or two of performance is to maintain a second copy of the network's weights in memory that maintains an exponentially decaying sum of previous weights during training. This way you're averaging the state of the network over last several iterations. You will find that this "smoothed" version of the weights over last few steps almost always achieves better validation error. The rough intuition to have in mind is that the objective is bowl-shaped and your network is jumping around the mode, so the average has a higher chance of being somewhere nearer the mode.

+- **같은 모형, 다른 초기화 (Same model, different initializations)**. 교차 검증으로 최고의 초모수를 결정한 다음에, 같은 초모수를 이용하되 초기값을 임의로 다르게 하여 여러 모형을 훈련한다. 이 접근법의 위험은, 모형의 다양성이 오직 다양한 초기값에서만 온다는 것이다.
+- **교차 검증 동안 발견되는 최고의 모형들 (Top models discovered during cross-validation)**. 교차 검증으로 최고의 초모수(들)를 결정한 다음에, 몇 개의 최고 모형을 선정하여 (예. 10개) 이들로 앙상블을 구축한다. 이 방법은 앙상블 내의 다양성을 증대시키나, 준-최적 모형을 포함할 수도 있는 위험이 있다. 실전에서는 이를 수행하는 게 (위보다) 쉬운 편인데, 교차 검증 뒤에 추가적인 모형의 재훈련이 필요없기 때문이다.
+- **한 모형에서 다른 체크포인트들을 (Different checkpoints of a single model)**. 만약 훈련이 매우 값비싸면, 단일한 네트워크의 체크포인트들을 (이를테면 매 에폭 후) 모아 앙상블을 구축하여 제한적인 성공을 거둔 사람들도 있다. 명백하게 이 방법은 다양성이 떨어지지만, 실전에서는 합리적으로 잘 작동할 수 있다. 이 방법은 매우 간편하고 저렴하다는 것이 장점이다.
+- **훈련 동안의 모수값들에 평균을 취하기 (Running average of parameters during training)**. 훈련 동안 (시간에 따른) 웨이트 값들의 지수 하강 합(exponentially decaying sum)을 저장하는 제 2의 네트워크를 만들면 거의 언제나 1~2 퍼센트의 이득을 값싸게 취할 수 있다. 이 방식으로 당신은 최근 몇 iteration 동안의 네트워크에 평균을 취한다고 생각할 수도 있다. 마지막 몇 스텝 동안의 웨이트값들을 이렇게 "안정화" 시킴으로써 당신은 거의 언제나 더 나은 검증 오차를 얻을 수 있다. 거친 직관으로 생각하자면, 목적함수는 볼(bowl)-모양이고 당신의 네트워크는 극값(mode) 주변을 맴돌 것이므로, 평균을 취하면 극값에 더 가까운 어딘가에 다다를 기회가 더 많아질 것이다.

-One disadvantage of model ensembles is that they take longer to evaluate on test example. 
An interested reader may find the recent work from Geoff Hinton on ["Dark Knowledge"](https://www.youtube.com/watch?v=EK61htlw8hY) inspiring, where the idea is to "distill" a good ensemble back to a single model by incorporating the ensemble log likelihoods into a modified objective.
+모형 앙상블의 단점이 하나 있다면 테스트 샘플에 모형을 적용할 때 평가(evaluation)에 더 시간이 걸린다는 점이다. 흥미로운 독자는 Geoff Hinton의 ["Dark Knowledge"](https://www.youtube.com/watch?v=EK61htlw8hY)에서 영감을 얻을 수도 있겠다. 여기서의 아이디어는 좋은 앙상블 모형을 하나의 모형으로 "증류"하는 것인데, 앙상블 모형의 로그-가능도(log-likelihood)를 어떤 변형된 목적함수로 통합하는 작업과 관련이 있다.

-## Summary

-To train a Neural Network:
+## 요약 (Summary)

-- Gradient check your implementation with a small batch of data and be aware of the pitfalls.
-- As a sanity check, make sure your initial loss is reasonable, and that you can achieve 100% training accuracy on a very small portion of the data
-- During training, monitor the loss, the training/validation accuracy, and if you're feeling fancier, the magnitude of updates in relation to parameter values (it should be ~1e-3), and when dealing with ConvNets, the first-layer weights.
-- The two recommended updates to use are either SGD+Nesterov Momentum or Adam.
-- Decay your learning rate over the period of the training. For example, halve the learning rate after a fixed number of epochs, or whenever the validation accuracy tops off.
-- Search for good hyperparameters with random search (not grid search). Stage your search from coarse (wide hyperparameter ranges, training only for 1-5 epochs), to fine (narrower rangers, training for many more epochs)
-- Form model ensembles for extra performance

+신경망(neural network)을 훈련하기 위하여:
+
+- 코드를 짜는 중간중간에 작은 배치로 그라디언트를 체크하고, 빠지기 쉬운 함정(pitfall)들을 숙지하라.
+- 코드가 제대로 돌아가는지 확인하는 방법으로, 손실함수값의 초기값이 합리적인지 그리고 데이터의 아주 작은 일부분으로 100%의 훈련 정확도를 달성할 수 있는지 확인하라.
+- 훈련 동안, 손실함수와 훈련/검증 정확도를 계속 살펴보고, (좀 더 욕심을 낸다면) 현재 파라미터 값 대비 업데이트 값 또한 살펴보라 (대충 ~1e-3 정도 되어야 한다). 만약 ConvNet을 다루고 있다면, 첫 층의 웨이트값도 살펴보라.
+- 업데이트 방법으로 추천하는 건 SGD+Nesterov Momentum 혹은 Adam이다.
+- 학습 속도를 훈련 동안 계속 하강시켜라. 예를 들면, 정해진 에폭 수 뒤에 (혹은 검증 정확도가 상승하다가 하강세로 꺾이면) 학습 속도를 반으로 깎아라.
+- 초모수 검색은 그리드 검색이 아닌 임의 검색으로 수행하라. 처음에는 성긴 규모에서 탐색하다가 (넓은 초모수 범위, 1-5 에폭 정도만 학습), 점점 촘촘하게 검색하라 (좁은 범위, 더 많은 에폭에서 학습).
+- 추가적인 개선을 위하여 모형 앙상블을 구축하라.

-## Additional References
+
+## 추가 참고문헌

- [SGD](http://research.microsoft.com/pubs/192769/tricks-2012.pdf) tips and tricks from Leon Bottou
- [Efficient BackProp](http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf) (pdf) from Yann LeCun
- [Practical Recommendations for Gradient-Based Training of Deep Architectures](http://arxiv.org/pdf/1206.5533v2.pdf) from Yoshua Bengio
+
+---
+

+번역: 최영근 ygchoistat +

diff --git a/neural-networks-case-study.md b/neural-networks-case-study.md index aa9246aa..3df291f5 100644 --- a/neural-networks-case-study.md +++ b/neural-networks-case-study.md @@ -23,7 +23,7 @@ In this section we'll walk through a complete implementation of a toy Neural Net Lets generate a classification dataset that is not easily linearly separable. Our favorite example is the spiral dataset, which can be generated as follows: -```python +~~~python N = 100 # number of points per class D = 2 # dimensionality K = 3 # number of classes @@ -37,10 +37,10 @@ for j in xrange(K): y[ix] = j # lets visualize the data: plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.Spectral) -``` +~~~
The toy spiral data consists of three classes (blue, red, yellow) that are not linearly separable.
@@ -56,11 +56,11 @@ Normally we would want to preprocess the dataset so that each feature has zero m Lets first train a Softmax classifier on this classification dataset. As we saw in the previous sections, the Softmax classifier has a linear score function and uses the cross-entropy loss. The parameters of the linear classifier consist of a weight matrix `W` and a bias vector `b` for each class. Lets first initialize these parameters to be random numbers: -```python +~~~python # initialize parameters randomly W = 0.01 * np.random.randn(D,K) b = np.zeros((1,K)) -``` +~~~ Recall that we `D = 2` is the dimensionality and `K = 3` is the number of classes. @@ -69,108 +69,108 @@ Recall that we `D = 2` is the dimensionality and `K = 3` is the number of classe Since this is a linear classifier, we can compute all class scores very simply in parallel with a single matrix multiplication: -```python +~~~python # compute class scores for a linear classifier scores = np.dot(X, W) + b -``` +~~~ In this example we have 300 2-D points, so after this multiplication the array `scores` will have size [300 x 3], where each row gives the class scores corresponding to the 3 classes (blue, red, yellow). ### Compute the loss -The second key ingredient we need is a loss function, which is a differentiable objective that quantifies our unhappiness with the computed class scores. Intuitively, we want the correct class to have a higher score than the other classes. When this is the case, the loss should be low and otherwise the loss should be high. There are many ways to quantify this intuition, but in this example lets use the cross-entropy loss that is associated with the Softmax classifier. Recall that if \\(f\\) is the array of class scores for a single example (e.g. array of 3 numbers here), then the Softmax classifier computes the loss for that example as: +The second key ingredient we need is a loss function, which is a differentiable objective that quantifies our unhappiness with the computed class scores. Intuitively, we want the correct class to have a higher score than the other classes. When this is the case, the loss should be low and otherwise the loss should be high. There are many ways to quantify this intuition, but in this example lets use the cross-entropy loss that is associated with the Softmax classifier. Recall that if $f$ is the array of class scores for a single example (e.g. array of 3 numbers here), then the Softmax classifier computes the loss for that example as: $$ -L\_i = -\log\left(\frac{e^{f\_{y\_i}}}{ \sum\_j e^{f\_j} }\right) +L_i = -\log\left(\frac{e^{f_{y_i}}}{ \sum_j e^{f_j} }\right) $$ -We can see that the Softmax classifier interprets every element of \\(f\\) as holding the (unnormalized) log probabilities of the three classes. We exponentiate these to get (unnormalized) probabilities, and then normalize them to get probabilites. Therefore, the expression inside the log is the normalized probability of the correct class. Note how this expression works: this quantity is always between 0 and 1. When the probability of the correct class is very small (near 0), the loss will go towards (postiive) infinity. Conversely, when the correct class probability goes towards 1, the loss will go towards zero because \\(log(1) = 0\\). Hence, the expression for \\(L\_i\\) is low when the correct class probability is high, and it's very high when it is low. 
+We can see that the Softmax classifier interprets every element of $f$ as holding the (unnormalized) log probabilities of the three classes. We exponentiate these to get (unnormalized) probabilities, and then normalize them to get probabilities. Therefore, the expression inside the log is the normalized probability of the correct class. Note how this expression works: this quantity is always between 0 and 1. When the probability of the correct class is very small (near 0), the loss will go towards (positive) infinity. Conversely, when the correct class probability goes towards 1, the loss will go towards zero because $\log(1) = 0$. Hence, the expression for $L_i$ is low when the correct class probability is high, and it's very high when it is low.

Recall also that the full Softmax classifier loss is then defined as the average cross-entropy loss over the training examples and the regularization:

$$
-L = \underbrace{ \frac{1}{N} \sum\_i L\_i }\_\text{data loss} + \underbrace{ \frac{1}{2} \lambda \sum\_k\sum\_l W\_{k,l}^2 }\_\text{regularization loss} \\\\
+L = \underbrace{ \frac{1}{N} \sum_i L_i }_\text{data loss} + \underbrace{ \frac{1}{2} \lambda \sum_k\sum_l W_{k,l}^2 }_\text{regularization loss} \\\\
$$

Given the array of `scores` we've computed above, we can compute the loss. First, the way to obtain the probabilities is straight forward:

-```python
+~~~python
# get unnormalized probabilities
exp_scores = np.exp(scores)
# normalize them for each example
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
-```
+~~~

We now have an array `probs` of size [300 x 3], where each row now contains the class probabilities. In particular, since we've normalized them every row now sums to one. We can now query for the log probabilities assigned to the correct classes in each example:

-```python
+~~~python
corect_logprobs = -np.log(probs[range(num_examples),y])
-```
+~~~

The array `correct_logprobs` is a 1D array of just the probabilities assigned to the correct classes for each example. The full loss is then the average of these log probabilities and the regularization loss:

-```python
+~~~python
# compute the loss: average cross-entropy loss and regularization
data_loss = np.sum(corect_logprobs)/num_examples
reg_loss = 0.5*reg*np.sum(W*W)
loss = data_loss + reg_loss
-```
+~~~

-In this code, the regularization strength \\(\lambda\\) is stored inside the `reg`. The convenience factor of `0.5` multiplying the regularization will become clear in a second. Evaluating this in the beginning (with random parameters) might give us `loss = 1.1`, which is `np.log(1.0/3)`, since with small initial random weights all probabilities assigned to all classes are about one third. We now want to make the loss as low as possible, with `loss = 0` as the absolute lower bound. But the lower the loss is, the higher are the probabilities assigned to the correct classes for all examples. 
### Computing the Analytic Gradient with Backpropagation

-We have a way of evaluating the loss, and now we have to minimize it. We'll do so with gradient descent. That is, we start with random parameters (as shown above), and evaluate the gradient of the loss function with respect to the parameters, so that we know how we should change the parameters to decrease the loss. Lets introduce the intermediate variable \\(p\\), which is a vector of the (normalized) probabilities. The loss for one example is:
+We have a way of evaluating the loss, and now we have to minimize it. We'll do so with gradient descent. That is, we start with random parameters (as shown above), and evaluate the gradient of the loss function with respect to the parameters, so that we know how we should change the parameters to decrease the loss. Let's introduce the intermediate variable $p$, which is a vector of the (normalized) probabilities. The loss for one example is:

$$
-p\_k = \frac{e^{f\_k}}{ \sum\_j e^{f\_j} } \hspace{1in} L\_i =-\log\left(p\_{y\_i}\right)
+p_k = \frac{e^{f_k}}{ \sum_j e^{f_j} } \hspace{1in} L_i = -\log\left(p_{y_i}\right)
$$

-We now wish to understand how the computed scores inside \\(f\\) should change to decrease the loss \\(L\_i\\) that this example contributes to the full objective. In other words, we want to derive the gradient \\( \partial L\_i / \partial f\_k \\). The loss \\(L\_i\\) is computed from \\(p\\), which in turn depends on \\(f\\). It's a fun exercise to the reader to use the chain rule to derive the gradient, but it turns out to be extremely simple and interpretible in the end, after a lot of things cancel out:
+We now wish to understand how the computed scores inside $f$ should change to decrease the loss $L_i$ that this example contributes to the full objective. In other words, we want to derive the gradient $ \partial L_i / \partial f_k $. The loss $L_i$ is computed from $p$, which in turn depends on $f$. It's a fun exercise for the reader to use the chain rule to derive the gradient, but it turns out to be extremely simple and interpretable in the end, after a lot of things cancel out:

$$
-\frac{\partial L\_i }{ \partial f\_k } = p\_k - \mathbb{1}(y\_i = k)
+\frac{\partial L_i }{ \partial f_k } = p_k - \mathbb{1}(y_i = k)
$$

-Notice how elegant and simple this expression is. Suppose the probabilities we computed were `p = [0.2, 0.3, 0.5]`, and that the correct class was the middle one (with probability 0.3). According to this derivation the gradient on the scores would be `df = [0.2, -0.7, 0.5]`. Recalling what the interpretation of the gradient, we see that this result is highly intuitive: increasing the first or last element of the score vector `f` (the scores of the incorrect classes) leads to an *increased* loss (due to the positive signs +0.2 and +0.5) - and increasing the loss is bad, as expected. However, increasing the score of the correct class has *negative* influence on the loss. The gradient of -0.7 is telling us that increasing the correct class score would lead to a decrease of the loss \\(L\_i\\), which makes sense.
+Notice how elegant and simple this expression is. Suppose the probabilities we computed were `p = [0.2, 0.3, 0.5]`, and that the correct class was the middle one (with probability 0.3). According to this derivation the gradient on the scores would be `df = [0.2, -0.7, 0.5]`.
Recalling the interpretation of the gradient, we see that this result is highly intuitive: increasing the first or last element of the score vector `f` (the scores of the incorrect classes) leads to an *increased* loss (due to the positive signs +0.2 and +0.5) - and increasing the loss is bad, as expected. However, increasing the score of the correct class has *negative* influence on the loss. The gradient of -0.7 is telling us that increasing the correct class score would lead to a decrease of the loss $L_i$, which makes sense.

All of this boils down to the following code. Recall that `probs` stores the probabilities of all classes (as rows) for each example. To get the gradient on the scores, which we call `dscores`, we proceed as follows:

-```python
+~~~python
dscores = probs
dscores[range(num_examples),y] -= 1
dscores /= num_examples
-```
+~~~

Lastly, we had that `scores = np.dot(X, W) + b`, so armed with the gradient on `scores` (stored in `dscores`), we can now backpropagate into `W` and `b`:

-```python
+~~~python
dW = np.dot(X.T, dscores)
db = np.sum(dscores, axis=0, keepdims=True)
dW += reg*W # don't forget the regularization gradient
-```
+~~~

-Where we see that we have backpropped through the matrix multiply operation, and also added the contribution from the regularization. Note that the regularization gradient has the very simple form `reg*W` since we used the constant `0.5` for its loss contribution (i.e. \\(\frac{d}{dw} ( \frac{1}{2} \lambda w^2) = \lambda w\\). This is a common convenience trick that simplifies the gradient expression.
+Here we see that we have backpropagated through the matrix multiply operation, and also added the contribution from the regularization. Note that the regularization gradient has the very simple form `reg*W` since we used the constant `0.5` for its loss contribution (i.e. $\frac{d}{dw} ( \frac{1}{2} \lambda w^2) = \lambda w$). This is a common convenience trick that simplifies the gradient expression.

### Performing a parameter update

Now that we've evaluated the gradient we know how every parameter influences the loss function. We will now perform a parameter update in the *negative* gradient direction to *decrease* the loss:

-```python
+~~~python
# perform a parameter update
W += -step_size * dW
b += -step_size * db
-```
+~~~

### Putting it all together: Training a Softmax Classifier

Putting all of this together, here is the full code for training a Softmax classifier with Gradient descent:

-```python
+~~~python
# Train a Linear Classifier

# initialize parameters randomly

@@ -214,11 +214,11 @@ for i in xrange(200):
  # perform a parameter update
  W += -step_size * dW
  b += -step_size * db
-```
+~~~

Running this prints the output:

-```
+~~~
iteration 0: loss 1.096956
iteration 10: loss 0.917265
iteration 20: loss 0.851503

@@ -239,21 +239,21 @@ iteration 160: loss 0.786431
iteration 170: loss 0.786373
iteration 180: loss 0.786331
iteration 190: loss 0.786302
-```
+~~~

We see that we've converged to something after about 190 iterations. We can evaluate the training set accuracy:

-```python
+~~~python
# evaluate training set accuracy
scores = np.dot(X, W) + b
predicted_class = np.argmax(scores, axis=1)
print 'training accuracy: %.2f' % (np.mean(predicted_class == y))
-```
+~~~

This prints **49%**. Not very good at all, but also not surprising given that the dataset is constructed so it is not linearly separable. We can also plot the learned decision boundaries:
- +
Linear classifier fails to learn the toy spiral dataset.
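(An aside added here: the notes don't show the plotting code behind these decision-boundary figures. A minimal sketch, assuming `matplotlib` is available and reusing the `X`, `y`, `W`, `b` defined above, might look like:)

~~~python
import matplotlib.pyplot as plt

# evaluate the classifier on a grid of points covering the data, then
# color each grid point by its predicted class to reveal the boundaries
step = 0.02  # grid resolution (an arbitrary choice)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))
grid = np.c_[xx.ravel(), yy.ravel()]        # [num_grid_points x 2]
Z = np.argmax(np.dot(grid, W) + b, axis=1)  # predicted class per grid point
plt.contourf(xx, yy, Z.reshape(xx.shape), alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, s=40)
plt.show()
~~~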
@@ -264,58 +264,58 @@ This prints **49%**. Not very good at all, but also not surprising given that th

Clearly, a linear classifier is inadequate for this dataset and we would like to use a Neural Network. One additional hidden layer will suffice for this toy data. We will now need two sets of weights and biases (for the first and second layers):

-```python
+~~~python
# initialize parameters randomly
h = 100 # size of hidden layer
W = 0.01 * np.random.randn(D,h)
b = np.zeros((1,h))
W2 = 0.01 * np.random.randn(h,K)
b2 = np.zeros((1,K))
-```
+~~~

The forward pass to compute scores now changes form:

-```python
+~~~python
# evaluate class scores with a 2-layer Neural Network
hidden_layer = np.maximum(0, np.dot(X, W) + b) # note, ReLU activation
scores = np.dot(hidden_layer, W2) + b2
-```
+~~~

Notice that the only change from before is one extra line of code, where we first compute the hidden layer representation and then the scores based on this hidden layer. Crucially, we've also added a non-linearity, which in this case is a simple ReLU that thresholds the activations on the hidden layer at zero.

Everything else remains the same. We compute the loss based on the scores exactly as before, and get the gradient for the scores `dscores` exactly as before. However, the way we backpropagate that gradient into the model parameters now changes form, of course. First let's backpropagate the second layer of the Neural Network. This looks identical to the code we had for the Softmax classifier, except we're replacing `X` (the raw data) with the variable `hidden_layer`:

-```python
+~~~python
# backpropagate the gradient to the parameters
# first backprop into parameters W2 and b2
dW2 = np.dot(hidden_layer.T, dscores)
db2 = np.sum(dscores, axis=0, keepdims=True)
-```
+~~~

However, unlike before we are not yet done, because `hidden_layer` is itself a function of other parameters and the data! We need to continue backpropagation through this variable. Its gradient can be computed as:

-```python
+~~~python
dhidden = np.dot(dscores, W2.T)
-```
+~~~

-Now we have the gradient on the outputs of the hidden layer. Next, we have to backpropagate the ReLU non-linearity. This turns out to be easy because ReLU during the backward pass is effectively a switch. Since \\(r = max(0, x)\\), we have that \\(\frac{dr}{dx} = 1(x > 0) \\). Combined with the chain rule, we see that the ReLU unit lets the gradient pass through unchanged if its input was greater than 0, but *kills it* if its input was less than zero during the forward pass. Hence, we can backpropagate the ReLU in place simply with:
+Now we have the gradient on the outputs of the hidden layer. Next, we have to backpropagate the ReLU non-linearity. This turns out to be easy because ReLU during the backward pass is effectively a switch. Since $r = \max(0, x)$, we have that $\frac{dr}{dx} = \mathbb{1}(x > 0)$. Combined with the chain rule, we see that the ReLU unit lets the gradient pass through unchanged if its input was greater than 0, but *kills it* if its input was less than zero during the forward pass. Hence, we can backpropagate the ReLU in place simply with:

-```python
+~~~python
# backprop the ReLU non-linearity
dhidden[hidden_layer <= 0] = 0
-```
+~~~

And now we finally continue to the first layer weights and biases:

-```python
+~~~python
# finally into W,b
dW = np.dot(X.T, dhidden)
db = np.sum(dhidden, axis=0, keepdims=True)
-```
+~~~

We're done! We have the gradients `dW,db,dW2,db2` and can perform the parameter update. Everything else remains unchanged.
The full code looks very similar:

-```python
+~~~python
# initialize parameters randomly
h = 100 # size of hidden layer
W = 0.01 * np.random.randn(D,h)

@@ -373,11 +373,11 @@ for i in xrange(10000):
  b += -step_size * db
  W2 += -step_size * dW2
  b2 += -step_size * db2
-```
+~~~

This prints:

-```
+~~~
iteration 0: loss 1.098744
iteration 1000: loss 0.294946
iteration 2000: loss 0.259301

@@ -388,22 +388,22 @@ iteration 6000: loss 0.245491
iteration 7000: loss 0.245400
iteration 8000: loss 0.245335
iteration 9000: loss 0.245292
-```
+~~~

The training accuracy is now:

-```python
+~~~python
# evaluate training set accuracy
hidden_layer = np.maximum(0, np.dot(X, W) + b)
scores = np.dot(hidden_layer, W2) + b2
predicted_class = np.argmax(scores, axis=1)
print 'training accuracy: %.2f' % (np.mean(predicted_class == y))
-```
+~~~

This prints **98%**! We can also visualize the decision boundaries:
- +
Neural Network classifier crushes the spiral dataset.
diff --git a/optimization-1.md b/optimization-1.md
index bc0bab7c..fafaffeb 100644
--- a/optimization-1.md
+++ b/optimization-1.md
@@ -3,96 +3,100 @@ layout: page
permalink: /optimization-1/
---

-Table of Contents:
-
-- [Introduction](#intro)
-- [Visualizing the loss function](#vis)
-- [Optimization](#optimization)
-  - [Strategy #1: Random Search](#opt1)
-  - [Strategy #2: Random Local Search](#opt2)
-  - [Strategy #3: Following the gradient](#opt3)
-- [Computing the gradient](#gradcompute)
-  - [Numerically with finite differences](#numerical)
-  - [Analytically with calculus](#analytic)
-- [Gradient descent](#gd)
-- [Summary](#summary)
+목차:
+
+- [소개](#intro)
+- [손실함수(Loss Function)의 시각화(Visualization)](#vis)
+- [최적화(Optimization)](#optimization)
+  - [전략 #1: 무작위 탐색 (Random Search)](#opt1)
+  - [전략 #2: 무작위 국소 탐색 (Random Local Search)](#opt2)
+  - [전략 #3: 그라디언트(gradient) 따라가기](#opt3)
+- [그라디언트(Gradient) 계산](#gradcompute)
+  - [유한 차이(Finite Difference)를 이용한 수치적인 방법](#numerical)
+  - [미분을 이용한 해석적인 방법](#analytic)
+- [그라디언트 하강(Gradient Descent)](#gd)
+- [요약](#summary)

-### Introduction
-In the previous section we introduced two key components in context of the image classification task:
+### 소개

-1. A (parameterized) **score function** mapping the raw image pixels to class scores (e.g. a linear function)
-2. A **loss function** that measured the quality of a particular set of parameters based on how well the induced scores agreed with the ground truth labels in the training data. We saw that there are many ways and versions of this (e.g. Softmax/SVM).
+이전 섹션에서 이미지 분류(image classification) 문제의 두 가지 핵심요소를 소개했다.

+1. 원 이미지의 픽셀들을 넣으면 분류 스코어(class score)를 계산해주는 파라미터화된(parameterized) **스코어함수(score function)** (예를 들어, 선형함수).
+2. 학습(training) 데이터에 어떤 특정 파라미터(parameter/weight)들을 가지고 스코어함수(score function)를 적용시켰을 때, 실제 레이블과 얼마나 잘 일치하는지에 따라 그 파라미터(parameter/weight)들의 질을 측정하는 **손실함수(loss function)**. 여러 종류의 손실함수(예를 들어, Softmax/SVM)가 있다.

-Concretely, recall that the linear function had the form \\( f(x\_i, W) = W x\_i \\) and the SVM we developed was formulated as:
+구체적으로 말하자면, 다음과 같은 형식을 가진 선형함수 $$ f(x_i, W) = W x_i $$를 스코어함수(score function)로 쓸 때, 앞에서 다룬 바와 같이 SVM은 다음과 같은 수식으로 표현할 수 있다:

$$
-L = \frac{1}{N} \sum\_i \sum\_{j\neq y\_i} \left[ \max(0, f(x\_i; W)\_j - f(x\_i; W)\_{y\_i} + 1) \right] + \alpha R(W)
+L = \frac{1}{N} \sum_i \sum_{j\neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + 1) \right] + \alpha R(W)
$$

-We saw that a setting of the parameters \\(W\\) that produced predictions for examples \\(x\_i\\) consistent with their ground truth labels \\(y\_i\\) would also have a very low loss \\(L\\). We are now going to introduce the third and last key component: **optimization**. Optimization is the process of finding the set of parameters \\(W\\) that minimize the loss function.
+예시 $$x_i$$에 대한 예측값이 실제 값(레이블, labels) $$y_i$$와 같도록 설정된 파라미터(parameter/weight) $$W$$는 손실(loss)값 $$L$$ 또한 매우 낮게 나온다는 것을 알아보았다. 이제 세번째이자 마지막 핵심요소인 **최적화(optimization)**에 대해서 알아보자. 최적화(optimization)는 손실함수(loss function)를 최소화시키는 파라미터(parameter/weight, $$W$$)들을 찾는 과정을 뜻한다.

-**Foreshadowing:** Once we understand how these three core components interact, we will revisit the first component (the parameterized function mapping) and extend it to functions much more complicated than a linear mapping: First entire Neural Networks, and then Convolutional Neural Networks. The loss functions and the optimization process will remain relatively unchanged.
+**예고:** 이 세 가지 핵심요소가 어떻게 상호작용하는지 이해한 후에는, 첫번째 요소(파라미터화된 함수)로 다시 돌아가서 선형함수보다 훨씬 더 복잡한 형태로 확장시켜볼 것이다. 처음엔 신경망(Neural Networks)으로, 다음엔 컨볼루션 신경망(Convolutional Neural Networks)으로. 손실함수(loss function)와 최적화(optimization) 과정은 거의 변하지 않을 것이다.

-### Visualizing the loss function
-The loss functions we'll look at in this class are usually defined over very high-dimensional spaces (e.g. in CIFAR-10 a linear classifier weight matrix is of size [10 x 3073] for a total of 30,730 parameters), making them difficult to visualize. However, we can still gain some intuitions about one by slicing through the high-dimensional space along rays (1 dimension), or along planes (2 dimensions). For example, we can generate a random weight matrix \\(W\\) (which corresponds to a single point in the space), then march along a ray and record the loss function value along the way. That is, we can generate a random direction \\(W\_1\\) and compute the loss along this direction by evaluating \\(L(W + a W\_1)\\) for different values of \\(a\\). This process generates a simple plot with the value of \\(a\\) as the x-axis and the value of the loss function as the y-axis. We can also carry out the same procedure with two dimensions by evaluating the loss \\( L(W + a W\_1 + b W\_2) \\) as we vary \\(a, b\\). In a plot, \\(a, b\\) could then correspond to the x-axis and the y-axis, and the value of the loss function can be visualized with a color:
+### 손실함수(loss function)의 시각화

+이 강의에서 우리가 다루는 손실함수(loss function)들은 대체로 고차원 공간에서 정의된다. 예를 들어, CIFAR-10의 선형분류기(linear classifier)의 경우 파라미터(parameter/weight) 행렬은 크기가 [10 x 3073]이고 총 30,730개의 파라미터(parameter/weight)가 있다. 따라서, 시각화하기가 어려운 면이 있다. 하지만, 고차원 공간을 1차원 직선이나 2차원 평면으로 잘라서 보면 약간의 직관을 얻을 수 있다. 예를 들어, 무작위로 파라미터(parameter/weight) 행렬 $$W$$를 하나 뽑는다고 가정해보자. (이는 사실 고차원 공간의 한 점인 셈이다.) 이제 이 점을 직선 하나를 따라 이동시키면서 손실함수(loss function) 값을 기록해보자. 즉, 무작위로 뽑은 방향 $$W_1$$을 잡고, 이 방향을 따라 가면서 손실함수(loss function)를 계산하는데, 구체적으로 말하면 $$L(W + a W_1)$$에 여러 개의 $$a$$ 값(역자 주: 1차원 스칼라)을 넣어 계산해보는 것이다. 이 과정을 통해 우리는 $$a$$ 값을 x축, 손실함수(loss function) 값을 y축에 놓고 간단한 그래프를 그릴 수 있다. 또한 이와 비슷한 것을 2차원으로도 할 수 있다. 여러 $$a, b$$ 값에 따라 $$ L(W + a W_1 + b W_2) $$을 계산하고(역자 주: $$W_2$$ 역시 $$W_1$$과 같은 식으로 뽑은 무작위 방향), $$a, b$$를 각각 x축과 y축에 대응시키고, 손실함수(loss function) 값은 색을 이용해 그리면 된다. (역자 주: 이 과정을 코드로 표현한 간단한 스케치를 아래 그림 설명 다음에 덧붙여 두었다.)
- - - + + +
- Loss function landscape for the Multiclass SVM (without regularization) for one single example (left,middle) and for a hundred examples (right) in CIFAR-10. Left: one-dimensional loss by only varying a. Middle, Right: two-dimensional loss slice, Blue = low loss, Red = high loss. Notice the piecewise-linear structure of the loss function. The losses for multiple examples are combined with average, so the bowl shape on the right is the average of many piece-wise linear bowls (such as the one in the middle).
+ Regularization 없는 멀티클래스 SVM의 손실함수(loss function) 지형을 CIFAR-10 데이터의 1개의 예시(왼쪽, 가운데)와 100개의 예시(오른쪽)에 적용시켜 그려본 그림들. 왼쪽: 여러 a 값에 따른 1차원 손실(loss) 곡선. 가운데, 오른쪽: 2차원 손실(loss) 평면. 파란색은 낮은 손실(loss)을 뜻하고, 빨간색은 높은 손실(loss)을 뜻한다. 손실함수(loss function)가 부분적으로 선형(piecewise linear)인 것이 특징이다. 특히, 오른쪽 그림은 여러 예시를 통해 구한 손실(loss)들을 평균낸 것인데, 밥공기 모양인 것이 특징이다. 이는 가운데 그림 같은 각진 모양의 밥공기 여러 개를 평균낸 모양인 셈이다.
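(역자 주: 원문에는 없지만, 위에서 설명한 1차원 자르기 과정을 코드로 표현하면 대략 다음과 같다. `numpy`를 `np`로 불러왔고, 파라미터 행렬을 받아 손실값을 돌려주는 함수 `L(W)`가 이미 정의되어 있다고 가정한 스케치이다.)

~~~python
# 무작위 방향 W1을 따라 손실함수를 1차원으로 잘라서 기록해보는 스케치
W = np.random.randn(10, 3073) * 0.001   # 무작위 시작점 (고차원 공간의 한 점)
W1 = np.random.randn(10, 3073)          # 무작위로 뽑은 방향
a_values = np.arange(-1.0, 1.0, 0.05)
losses = [L(W + a * W1) for a in a_values]  # x축을 a, y축을 losses로 그래프를 그리면 된다
~~~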
-We can explain the piecewise-linear structure of the loss function by examing the math. For a single example we have:
+손실함수(loss function)의 부분적으로 선형(piecewise linear)인 구조는 수식을 통해 설명할 수 있다. 예시가 하나인 경우에 다음과 같이 쓸 수 있다.

$$
-L\_i = \sum\_{j\neq y\_i} \left[ \max(0, w\_j^Tx\_i - w\_{y\_i}^Tx\_i + 1) \right]
+L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + 1) \right]
$$

-It is clear from the equation that the data loss for each example is a sum of (zero-thresholded due to the \\(\max(0,-)\\) function) linear functions of \\(W\\). Moreover, each row of \\(W\\) (i.e. \\(w\_j\\)) sometimes has a positive sign in front of it (when it corresponds to a wrong class for an example), and sometimes a negative sign (when it corresponds to the correct class for that example). To make this more explicit, consider a simple dataset that contains three 1-dimensional points and three classes. The full SVM loss (without regularization) becomes:
+수식에서 명백히 볼 수 있듯이, 각 예시의 손실(loss)값은 ($$\max(0,-)$$ 함수로 인해 0에서 막혀있는) $$W$$의 선형함수들의 합으로 표현된다. $$W$$의 각 행(즉, $$w_j$$) 앞에는 때때로 (잘못된 분류일 때, 즉 $$j\neq y_i$$인 경우) 플러스가 붙고, 때때로 (옳은 분류일 때) 마이너스가 붙는다. 더 명확히 표현하기 위해, 3개의 1차원 점들과 3개의 클래스가 있는 간단한 데이터셋을 생각해보자. Regularization 없는 전체 SVM 손실(loss)은 다음과 같다.

$$
\begin{align}
-L\_0 = & \max(0, w\_1^Tx\_0 - w\_0^Tx\_0 + 1) + \max(0, w\_2^Tx\_0 - w\_0^Tx\_0 + 1) \\\\
-L\_1 = & \max(0, w\_0^Tx\_1 - w\_1^Tx\_1 + 1) + \max(0, w\_2^Tx\_1 - w\_1^Tx\_1 + 1) \\\\
-L\_2 = & \max(0, w\_0^Tx\_2 - w\_2^Tx\_2 + 1) + \max(0, w\_1^Tx\_2 - w\_2^Tx\_2 + 1) \\\\
-L = & (L\_0 + L\_1 + L\_2)/3
+L_0 = & \max(0, w_1^Tx_0 - w_0^Tx_0 + 1) + \max(0, w_2^Tx_0 - w_0^Tx_0 + 1) \\\\
+L_1 = & \max(0, w_0^Tx_1 - w_1^Tx_1 + 1) + \max(0, w_2^Tx_1 - w_1^Tx_1 + 1) \\\\
+L_2 = & \max(0, w_0^Tx_2 - w_2^Tx_2 + 1) + \max(0, w_1^Tx_2 - w_2^Tx_2 + 1) \\\\
+L = & (L_0 + L_1 + L_2)/3
\end{align}
$$

-Since these examples are 1-dimensional, the data \\(x\_i\\) and weights \\(w\_j\\) are numbers. Looking at, for instance, \\(w\_0\\), some terms above are linear functions of \\(w\_0\\) and each is clamped at zero. We can visualize this as follows:
+이 예시들이 1차원이기 때문에, 데이터 $$x_i$$와 파라미터(parameter/weight) $$w_j$$는 숫자(역자 주: 즉, 스칼라. 따라서 위 수식에서 전치행렬을 뜻하는 $$T$$ 표시는 필요없음)이다. 예를 들어 $$w_0$$를 보면, 몇몇 항들은 $$w_0$$의 선형함수이고 각각은 0에서 꺾인다. 이를 다음과 같이 시각화할 수 있다.
- +
- 1-dimensional illustration of the data loss. The x-axis is a single weight and the y-axis is the loss. The data loss is a sum of multiple terms, each of which is either independent of a particular weight, or a linear function of it that is thresholded at zero. The full SVM data loss is a 30,730-dimensional version of this shape.
+ 손실(loss)을 1차원으로 표현한 그림. x축은 파라미터(parameter/weight) 하나이고, y축은 손실(loss)이다. 손실(loss)은 여러 항들의 합인데, 그 각각은 특정 파라미터(parameter/weight) 값과 무관하거나, 0에 막혀있는 그 파라미터(parameter/weight)의 선형함수이다. 전체 SVM 손실은 이 모양의 30,730차원 버전이다. (역자 주: 이 꺾인 모양을 직접 계산해볼 수 있는 간단한 코드를 바로 아래에 덧붙여 두었다.)
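(역자 주: 원문에 없는 보충 예시. 위 그림과 같은 꺾인 직선 모양을 직접 확인해보고 싶다면, 임의로 정한 1차원 데이터 3개에 대해 $$w_0$$ 값만 바꿔가며 위 수식의 손실을 계산해보면 된다. 아래 코드의 데이터와 파라미터 값은 설명을 위해 임의로 정한 것이다.)

~~~python
import numpy as np

x = [1.0, -2.0, 3.0]      # 1차원 데이터 3개 (정답 클래스는 각각 0, 1, 2)
w1, w2 = 0.5, -0.3        # w_1, w_2는 고정해두고 w_0만 움직여본다

def svm_loss_1d(w0):
    # 위 수식의 L_0, L_1, L_2를 그대로 계산한 것
    L0 = max(0, w1*x[0] - w0*x[0] + 1) + max(0, w2*x[0] - w0*x[0] + 1)
    L1 = max(0, w0*x[1] - w1*x[1] + 1) + max(0, w2*x[1] - w1*x[1] + 1)
    L2 = max(0, w0*x[2] - w2*x[2] + 1) + max(0, w1*x[2] - w2*x[2] + 1)
    return (L0 + L1 + L2) / 3.0

losses = [svm_loss_1d(w0) for w0 in np.arange(-2.0, 2.0, 0.05)]  # w0에 대해 부분 선형
~~~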
-As an aside, you may have guessed from its bowl-shaped appearance that the SVM cost function is an example of a [convex function](http://en.wikipedia.org/wiki/Convex_function) There is a large amount of literature devoted to efficiently minimizing these types of functions, and you can also take a Stanford class on the topic ( [convex optimization](http://stanford.edu/~boyd/cvxbook/) ). Once we extend our score functions \\(f\\) to Neural Networks our objective functions will become non-convex, and the visualizations above will not feature bowls but complex, bumpy terrains.
+옆길로 새면, 아마도 밥공기 모양을 보고 SVM 손실함수(loss function)가 일종의 [볼록함수](http://en.wikipedia.org/wiki/Convex_function)라고 생각했을 것이다. 이런 형태의 함수를 효율적으로 최소화하는 문제에 대한 엄청난 양의 연구 성과들이 있고, 스탠포드 강좌 중에도 이 주제를 다룬 것이 있다([볼록함수 최적화](http://stanford.edu/~boyd/cvxbook/)). 하지만 이 스코어함수(score function) $$f$$를 신경망(neural networks)으로 확장시키면, 목적함수(역자 주: 손실함수(loss function))는 더이상 볼록함수가 아니게 되고, 위와 같은 시각화를 해봐도 밥공기 모양 대신 울퉁불퉁하고 복잡한 모양이 보일 것이다.

-*Non-differentiable loss functions*. As a technical note, you can also see that the *kinks* in the loss function (due to the max operation) technically make the loss function non-differentiable because at these kinks the gradient is not defined. However, the [subgradient](http://en.wikipedia.org/wiki/Subderivative) still exists and is commonly used instead. In this class will use the terms *subgradient* and *gradient* interchangeably.
+*미분이 불가능한 손실함수(loss function)*. 기술적인 설명을 덧붙이자면, $$\max(0,-)$$ 함수 때문에 손실함수(loss function)에 *꺾임*이 생기는데, 이 때문에 손실함수(loss function)는 미분이 불가능해진다. 왜냐하면, 그 꺾이는 부분에서 그라디언트(gradient)가 정의되지 않기 때문이다. 하지만, [서브그라디언트(subgradient)](http://en.wikipedia.org/wiki/Subderivative)가 존재하고, 대체로 이를 그라디언트(gradient) 대신 이용한다. 앞으로 이 강의에서는 *그라디언트(gradient)*와 *서브그라디언트(subgradient)*를 구분하지 않고 쓸 것이다.

-### Optimization
-To reiterate, the loss function lets us quantify the quality of any particular set of weights **W**. The goal of optimization is to find **W** that minimizes the loss function. We will now motivate and slowly develop an approach to optimizing the loss function. For those of you coming to this class with previous experience, this section might seem odd since the working example we'll use (the SVM loss) is a convex problem, but keep in mind that our goal is to eventually optimize Neural Networks where we can't easily use any of the tools developed in the Convex Optimization literature.
+### 최적화

+정리하면, 손실함수(loss function)는 파라미터(parameter/weight) 행렬 **W**의 질을 측정한다. 최적화의 목적은 이 손실함수(loss function)를 최소화시키는 **W**를 찾아내는 것이다. 다음 단락부터 손실함수(loss function)를 최적화하는 방법에 대해서 찬찬히 살펴볼 것이다. 이전에 경험이 있는 사람들이 보면 이 섹션은 좀 이상하다고 생각할지도 모르겠다. 왜냐하면, 여기서 쓰인 예(즉, SVM 손실(loss))가 볼록함수 문제이기 때문이다. 하지만, 우리의 궁극적인 목적은 신경망(neural networks)을 최적화시키는 것이고, 거기에는 볼록함수 최적화를 위해 고안된 방법들이 쉽사리 통하지 않는다.

-#### Strategy #1: A first very bad idea solution: Random search
-Since it is so simple to check how good a given set of parameters **W** is, the first (very bad) idea that may come to mind is to simply try out many different random weights and keep track of what works best. This procedure might look as follows:
+#### 전략 #1: 매우 나쁜 첫번째 방법: 무작위 탐색 (Random search)

+주어진 파라미터(parameter/weight) **W**가 얼마나 좋은지를 측정하는 것은 매우 간단하기 때문에, 처음 떠오르는 (매우 나쁜) 생각은, 단순히 무작위로 파라미터(parameter/weight)를 골라서 넣어보고, 넣어 본 값들 중 제일 좋은 값을 기록하는 것이다. 그 과정은 다음과 같다.

-```python
-# assume X_train is the data where each column is an example (e.g. 3073 x 50,000)
-# assume Y_train are the labels (e.g. 1D array of 50,000)
-# assume the function L evaluates the loss function
+~~~python
+# X_train의 각 열(column)이 예제 하나에 해당하는 행렬이라고 생각하자. (예를 들어, 3073 x 50,000짜리)
+# Y_train은 레이블값이 저장된 어레이(array)라고 하자. (즉, 길이 50,000짜리 1차원 어레이)
+# 그리고 함수 L이 손실함수라고 하자.

bestloss = float("inf") # Python assigns the highest possible float value
for num in xrange(1000):

@@ -112,35 +116,36 @@ for num in xrange(1000):
# in attempt 5 the loss was 8.943151, best 8.857370
# in attempt 6 the loss was 8.605604, best 8.605604
# ... (truncated: continues for 1000 lines)
-```
+~~~

-In the code above, we see that we tried out several random weight vectors **W**, and some of them work better than others. We can take the best weights **W** found by this search and try it out on the test set:
+위의 코드에서, 여러 개의 무작위 파라미터(parameter/weight) **W**를 넣어봤고, 그 중 몇몇은 다른 것들보다 좋았다. 그래서 그 중 제일 좋은 파라미터(parameter/weight) **W**를 테스트 데이터에 넣어보면 된다.

-```python
-# Assume X_test is [3073 x 10000], Y_test [10000 x 1]
-scores = Wbest.dot(Xte_cols) # 10 x 10000, the class scores for all test examples
-# find the index with max score in each column (the predicted class)
+~~~python
+# X_test는 크기가 [3073 x 10000]인 행렬, Y_test는 크기가 [10000 x 1]인 어레이라고 하자.
+scores = Wbest.dot(Xte_cols) # 모든 테스트데이터 예제(1만개)에 대한 각 클래스(10개)별 점수를 모아놓은, 크기가 10 x 10000인 행렬
+# 각 열(column)에서 가장 높은 점수에 해당하는 클래스를 찾자. (즉, 예측 클래스)
Yte_predict = np.argmax(scores, axis = 0)
-# and calculate accuracy (fraction of predictions that are correct)
+# 그리고 정확도를 계산하자. (예측 성공률)
np.mean(Yte_predict == Yte)
-# returns 0.1555
-```
+# 정확도 값이 0.1555라고 한다.
+~~~

-With the best **W** this gives an accuracy of about **15.5%**. Given that guessing classes completely at random achieves only 10%, that's not a very bad outcome for a such a brain-dead random search solution!
+이 방법으로 얻은 최선의 **W**는 정확도 **15.5%**를 보인다. 완전 무작위 찍기가 단 10%의 정확도를 보이므로, 무식한 방법 치고는 그리 나쁜 결과는 아니다.

-**Core idea: iterative refinement**. Of course, it turns out that we can do much better. The core idea is that finding the best set of weights **W** is a very difficult or even impossible problem (especially once **W** contains weights for entire complex neural networks), but the problem of refining a specific set of weights **W** to be slightly better is significantly less difficult. In other words, our approach will be to start with a random **W** and then iteratively refine it, making it slightly better each time.
+**핵심 아이디어: 반복적 향상**. 물론 이보다 더 좋은 방법들이 있다. 여기서 핵심 아이디어는, 최선의 파라미터(parameter/weight) **W**를 찾는 것은 매우 어렵거나 때로는 불가능한 문제(특히 복잡한 신경망(neural network) 전체를 구현할 경우)이지만, 어떤 주어진 파라미터(parameter/weight) **W**를 조금 개선시키는 일은 훨씬 덜 힘들다는 점이다. 다시 말해, 우리의 접근법은 무작위로 뽑은 **W**에서 출발해서 매번 조금씩 개선시키는 것을 반복하는 것이다.

-> Our strategy will be to start with random weights and iteratively refine them over time to get lower loss
+> 우리의 전략은 무작위로 뽑은 파라미터(parameter/weight)로부터 시작해서, 반복적으로 조금씩 개선시켜 손실(loss)을 낮추는 것이다.

-**Blindfolded hiker analogy.** One analogy that you may find helpful going forward is to think of yourself as hiking on a hilly terrain with a blindfold on, and trying to reach the bottom. In the example of CIFAR-10, the hills are 30,730-dimensional, since the dimensions of **W** are 3073 x 10. At every point on the hill we achieve a particular loss (the height of the terrain).
+**눈 가리고 하산하는 것에 비유.** 앞으로 도움이 될 만한 비유는, 경사진 지형에서 눈가리개를 하고 점점 아래로 내려오는 자기 자신을 생각해보는 것이다. CIFAR-10의 예시에서, 그 언덕들은 (**W**가 3073 x 10 차원이므로) 30,730차원이다. 언덕의 각 지점에서의 고도가 손실함수(loss function)의 손실값(loss) 역할을 한다.
-#### Strategy #2: Random Local Search
-The first strategy you may think of is to to try to extend one foot in a random direction and then take a step only if it leads downhill. Concretely, we will start out with a random \\(W\\), generate random perturbations \\( \delta W \\) to it and if the loss at the perturbed \\(W + \delta W\\) is lower, we will perform an update. The code for this procedure is as follows:
+#### 전략 #2: 무작위 국소 탐색 (Random Local Search)

+처음 떠오르는 전략은, 시작점에서 무작위로 방향을 정해서 발을 살짝 뻗어 더듬어보고, 그것이 내리막길일 때만 한 발짝 내딛는 것이다. 구체적으로 말하면, 임의의 $$W$$에서 시작하고, 또다른 임의의 방향 $$ \delta W $$으로 살짝 움직여본다. 만약에 움직여 간 자리($$W + \delta W$$)에서의 손실값(loss)이 더 낮으면, 거기로 움직이고 다시 탐색을 시작한다. 이 과정을 코드로 짜면 다음과 같다.

-```python
-W = np.random.randn(10, 3073) * 0.001 # generate random starting W
+~~~python
+W = np.random.randn(10, 3073) * 0.001 # 시작 파라미터를 무작위로 고른다.
bestloss = float("inf")
for i in xrange(1000):
  step_size = 0.0001

@@ -150,90 +155,93 @@ for i in xrange(1000):
    W = Wtry
    bestloss = loss
  print 'iter %d loss is %f' % (i, bestloss)
-```
+~~~

-Using the same number of loss function evaluations as before (1000), this approach achieves test set classification accuracy of **21.4%**. This is better, but still wasteful and computationally expensive.
+이전과 같은 횟수(즉, 1000번)만큼 손실함수(loss function)를 계산하고도, 이 방법을 테스트 데이터에 적용해보니 분류 정확도가 **21.4%**로 나왔다. 발전하긴 했지만, 여전히 낭비가 많고 계산량이 큰 방법이다.

-#### Strategy #3: Following the Gradient
-In the previous section we tried to find a direction in the weight-space that would improve our weight vector (and give us a lower loss). It turns out that there is no need to randomly search for a good direction: we can compute the *best* direction along which we should change our weight vector that is mathematically guaranteed to be the direction of the steepest descend (at least in the limit as the step size goes towards zero). This direction will be related to the **gradient** of the loss function. In our hiking analogy, this approach roughly corresponds to feeling the slope of the hill below our feet and stepping down the direction that feels steepest.
+#### 전략 #3: 그라디언트(gradient) 따라가기

+이전 섹션에서, 파라미터(parameter/weight) 공간에서 파라미터(parameter/weight) 벡터를 향상시키는 (즉, 손실값을 더 낮추는) 방향을 찾는 시도를 해봤다. 그런데 사실, 좋은 방향을 무작위로 탐색할 필요가 없다는 것이 밝혀져 있다. (적어도 스텝 크기가 0으로 수렴하는 극한에서는) 가장 가파르게 감소한다고 수학적으로 보장된 *최선의* 방향을 구할 수 있고, 이 방향을 따라 파라미터(parameter/weight) 벡터를 움직이면 된다. 이 방향이 손실함수(loss function)의 **그라디언트(gradient)**와 관계있다. 눈 가리고 하산하는 것에 비유할 때, 발 밑 지형을 잘 더듬어보고 가장 가파르다는 느낌을 주는 방향으로 내려가는 것에 비견할 수 있다.

-In one-dimensional functions, the slope is the instantaneous rate of change of the function at any point you might be interested in. The gradient is a generalization of slope for functions that don't take a single number but a vector of numbers. Additionally, the gradient is just a vector of slopes (more commonly referred to as **derivatives**) for each dimension in the input space. The mathematical expression for the derivative of a 1-D function with respect its input is:
+1차원 함수의 경우, 어떤 점에서의 기울기는 그 점에서의 함수값의 순간 증가율을 나타낸다. 그라디언트(gradient)는 이 기울기라는 것을, 변수가 하나가 아니라 여러 개인 경우로 일반화시킨 것이다. 덧붙여 설명하면, 그라디언트(gradient)는 입력 공간의 각 차원에 해당하는 기울기(**미분**이라고 더 많이 불린다)들의 벡터이다. 1차원 함수의 미분을 수식으로 쓰면 다음과 같다.

$$
\frac{df(x)}{dx} = \lim_{h\ \to 0} \frac{f(x + h) - f(x)}{h}
$$

-When the functions of interest take a vector of numbers instead of a single number, we call the derivatives **partial derivatives**, and the gradient is simply the vector of partial derivatives in each dimension.
+함수가 숫자 하나가 아닌 벡터를 입력으로 받는 경우 (역자 주: x가 벡터인 경우), 우리는 이 미분들을 **편미분**이라고 부르고, 그라디언트(gradient)는 단순히 각 차원의 편미분들을 모아놓은 벡터이다.

-### Computing the gradient
-There are two ways to compute the gradient: A slow, approximate but easy way (**numerical gradient**), and a fast, exact but more error-prone way that requires calculus (**analytic gradient**). We will now present both.
+### 그라디언트(gradient) 계산

+그라디언트(gradient) 계산법은 크게 2가지가 있다: 느리고 근사값이지만 쉬운 방법(**수치 그라디언트**)과, 빠르고 정확하지만 미분이 필요하고 실수하기 쉬운 방법(**해석적 그라디언트**)이다. 여기서 둘 다 다룰 것이다.

-#### Computing the gradient numerically with finite differences
-The formula given above allows us to compute the gradient numerically. Here is a generic function that takes a function `f`, a vector `x` to evaluate the gradient on, and returns the gradient of `f` at `x`:
+#### 유한 차이(Finite Difference)를 이용하여 수치적으로 그라디언트(gradient) 계산하기

+위에 주어진 수식을 이용하면 그라디언트(gradient)를 수치적으로 계산할 수 있다. 여기, 임의의 함수 `f`와 이 함수에 입력값으로 넣을 벡터 `x`가 주어졌을 때, `x`에서 `f`의 그라디언트(gradient)를 계산해주는 범용 함수가 있다:

-```python
+~~~python
def eval_numerical_gradient(f, x):
-  """
-  a naive implementation of numerical gradient of f at x
-  - f should be a function that takes a single argument
-  - x is the point (numpy array) to evaluate the gradient at
-  """
+  """
+  함수 f의 x에서의 그라디언트를 매우 단순하게 구현한 것.
+  - f는 입력값 1개를 받는 함수여야 한다.
+  - x는 그라디언트를 계산할 지점인 numpy 어레이(array)
+    (역자 주: 그라디언트는 당연하게도 어디서 계산하느냐에 따라 달라지므로, 함수 f뿐 아니라 x도 정해줘야 함).
+  """

-  fx = f(x) # evaluate function value at original point
+  fx = f(x) # 원래 지점 x에서 함수값 구하기.
  grad = np.zeros(x.shape)
  h = 0.00001

-  # iterate over all indexes in x
+  # x의 모든 인덱스를 다 돌면서 계산하기.
  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:

-    # evaluate function at x+h
+    # 함수 값을 x+h에서 계산하기.
    ix = it.multi_index
    old_value = x[ix]
-    x[ix] = old_value + h # increment by h
+    x[ix] = old_value + h # 변화량 h만큼 증가시키기.
    fxh = f(x) # evaluate f(x + h)
-    x[ix] = old_value # restore to previous value (very important!)
+    x[ix] = old_value # 이전 값을 다시 가져온다. (매우 중요!)

-    # compute the partial derivative
-    grad[ix] = (fxh - fx) / h # the slope
-    it.iternext() # step to next dimension
+    # 편미분 계산
+    grad[ix] = (fxh - fx) / h # 기울기
+    it.iternext() # 다음 차원으로 넘어가서 반복.

  return grad
-```
+~~~

-Following the gradient formula we gave above, the code above iterates over all dimensions one by one, makes a small change `h` along that dimension and calculates the partial derivative of the loss function along that dimension by seeing how much the function changed. The variable `grad` holds the full gradient in the end.
+이 코드는, 위에 주어진 그라디언트(gradient) 식을 이용해서 모든 차원을 하나씩 돌아가면서 그 방향으로 작은 변화 `h`를 줬을 때 함수값이 얼마나 변하는지를 보고, 그 방향의 편미분 값을 계산한다. 변수 `grad`에 전체 그라디언트(gradient) 값이 최종적으로 저장된다.

-**Practical considerations**. Note that in the mathematical formulation the gradient is defined in the limit as **h** goes towards zero, but in practice it is often sufficient to use a very small value (such as 1e-5 as seen in the example). Ideally, you want to use the smallest step size that does not lead to numerical issues. Additionally, in practice it often works better to compute the numeric gradient using the **centered difference formula**: \\( [f(x+h) - f(x-h)] / 2 h \\) . See [wiki](http://en.wikipedia.org/wiki/Numerical_differentiation) for details.
+**실제 고려할 사항**. **h**가 0으로 수렴할 때의 극한값이 그라디언트(gradient)의 수학적인 정의인데, 실제로는 (이 예시에 나온 1e-5 같이) 충분히 작은 값이면 되는 경우가 많다. 이상적으로는, 수치적인 문제를 일으키지 않는 수준에서 가장 작은 값을 쓰면 된다. 덧붙여서, 실제 활용할 때는 x를 양 방향으로 흔들어서 구하는 **중심차분(centered difference) 수식**이 더 좋은 경우가 많다: $$ [f(x+h) - f(x-h)] / 2h $$.
다음 [위키](http://en.wikipedia.org/wiki/Numerical_differentiation)를 보면 자세한 것을 알 수 있다. (역자 주: 중심차분을 적용한 간단한 스케치를 아래의 스텝 크기 그림 다음에 덧붙여 두었다.)

-We can use the function given above to compute the gradient at any point and for any function. Lets compute the gradient for the CIFAR-10 loss function at some random point in the weight space:
+위에서 만든 함수를 이용하면, 아무 함수의 아무 지점에서나 그라디언트(gradient)를 계산할 수 있다. 무작위로 뽑은 파라미터(parameter/weight) 값에서 CIFAR-10 손실함수(loss function)의 그라디언트(gradient)를 구해보자:

-```python
+~~~python

-# to use the generic code above we want a function that takes a single argument
-# (the weights in our case) so we close over X_train and Y_train
+# 위의 범용 코드를 쓰려면 함수가 입력값 하나(이 경우 파라미터)만 받아야 함.
+# 따라서 X_train과 Y_train은 입력값으로 치지 않고, W 하나만 입력값으로 받도록 함수를 다시 정의.
def CIFAR10_loss_fun(W):
  return L(X_train, Y_train, W)

-W = np.random.rand(10, 3073) * 0.001 # random weight vector
-df = eval_numerical_gradient(CIFAR10_loss_fun, W) # get the gradient
-```
+W = np.random.rand(10, 3073) * 0.001 # 무작위 파라미터 벡터.
+df = eval_numerical_gradient(CIFAR10_loss_fun, W) # 그라디언트를 구했다.
+~~~

-The gradient tells us the slope of the loss function along every dimension, which we can use to make an update:
+그라디언트(gradient)는 각 차원에서의 CIFAR-10 손실함수(loss function)의 기울기를 알려주는데, 이를 이용해서 파라미터(parameter/weight)를 업데이트할 수 있다.

-```python
-loss_original = CIFAR10_loss_fun(W) # the original loss
+~~~python
+loss_original = CIFAR10_loss_fun(W) # 기존 손실값
print 'original loss: %f' % (loss_original, )

-# lets see the effect of multiple step sizes
+# 스텝 크기가 주는 영향에 대해 알아보자.
for step_size_log in [-10, -9, -8, -7, -6, -5,-4,-3,-2,-1]:
  step_size = 10 ** step_size_log
-  W_new = W - step_size * df # new position in the weight space
+  W_new = W - step_size * df # 파라미터(parameter/weight) 공간 상의 새 파라미터 값
  loss_new = CIFAR10_loss_fun(W_new)
  print 'for step size %f new loss: %f' % (step_size, loss_new)

@@ -249,97 +257,103 @@ for step_size_log in [-10, -9, -8, -7, -6, -5,-4,-3,-2,-1]:
# for step size 1.000000e-03 new loss: 254.086573
# for step size 1.000000e-02 new loss: 2539.370888
# for step size 1.000000e-01 new loss: 25392.214036
-```
+~~~

-**Update in negative gradient direction**. In the code above, notice that to compute `W_new` we are making an update in the negative direction of the gradient `df` since we wish our loss function to decrease, not increase.
+**그라디언트의 반대 방향으로 업데이트하기**. 위 코드에서, 새로운 파라미터 `W_new`로 업데이트할 때 그라디언트(gradient) `df`의 반대 방향으로 움직인 것을 주목하자. 왜냐하면 우리가 원하는 것은 손실함수(loss function)의 증가가 아니라 감소이기 때문이다.

-**Effect of step size**. The gradient tells us the direction in which the function has the steepest rate of increase, but it does not tell us how far along this direction we should step. As we will see later in the course, choosing the step size (also called the *learning rate*) will become one of the most important (and most headache-inducing) hyperparameter settings in training a neural network. In our blindfolded hill-descent analogy, we feel the hill below our feet sloping in some direction, but the step length we should take is uncertain. If we shuffle our feet carefully we can expect to make consistent but very small progress (this corresponds to having a small step size). Conversely, we can choose to make a large, confident step in an attempt to descend faster, but this may not pay off. As you can see in the code example above, at some point taking a bigger step gives a higher loss as we "overstep".
+**스텝 크기가 미치는 영향**. 그라디언트(gradient)에서 알 수 있는 것은 함수값이 가장 빠르게 증가하는 방향이고, 그 방향으로 대체 얼만큼을 가야 하는지는 알려주지 않는다.
강의 뒤에서 다루게 되겠지만, 얼만큼 가야 하는지를 의미하는 스텝 크기(혹은 *학습 속도*라고도 함)는 신경망(neural network)을 학습시킬 때 있어 가장 중요한 (그래서 결정하기 까다로운) 하이퍼파라미터(hyperparameter)가 될 것이다. 눈 가리고 하산하는 비유에서, 우리는 발 밑으로 어느 방향이 가장 가파른지 느끼지만, 얼마나 발을 뻗어야 할지는 불확실하다. 발을 살살 휘저으면, 꾸준하지만 매우 조금씩밖에 못 내려갈 것이다. (이는 아주 작은 스텝 크기에 비견된다.) 반대로, 욕심껏 빨리 내려가려고 크고 과감하게 발을 내딛을 수도 있는데, 항상 뜻대로 되지는 않을지도 모른다. 위에 제시된 코드에서와 같이, 어느 수준 이상의 큰 스텝 크기는 오히려 손실값을 증가시킨다.
- +
- Visualizing the effect of step size. We start at some particular spot W and evaluate the gradient (or rather its negative - the white arrow) which tells us the direction of the steepest decrease in the loss function. Small steps are likely to lead to consistent but slow progress. Large steps can lead to better progress but are more risky. Note that eventually, for a large step size we will overshoot and make the loss worse. The step size (or as we will later call it - the learning rate) will become one of the most important hyperparameters that we will have to carefully tune.
+ 스텝 크기가 주는 영향을 시각적으로 보여주는 그림. 특정 지점 W에서 시작해서 그라디언트(혹은 거기에 -1을 곱한 값)를 계산한다. 이 그라디언트에 -1을 곱한 방향, 즉 하얀 화살표 방향이 손실함수(loss function)가 가장 빠르게 감소하는 방향이다. 그 방향으로 조금씩 가면 일관되지만 느리게 최적화가 진행된다. 반면에, 그 방향으로 너무 많이 가면 더 빨리 진행될 수 있지만 위험성도 크다. 스텝 크기가 점점 커지면, 결국에는 최소값을 지나쳐서 손실값이 더 커지는 지점까지 가게 될 것이다. 스텝 크기(나중에 학습속도라고 부를 것임)는 가장 중요한 하이퍼파라미터(hyperparameter) 중 하나라서 매우 조심스럽게 결정해야 할 것이다.
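(역자 주: 앞에서 언급한 중심차분(centered difference) 수식 $$[f(x+h) - f(x-h)] / 2h$$ 을 적용한 스케치. 원문에는 없는 코드로, 위의 `eval_numerical_gradient`와 같은 방식이되 내부 계산만 바꾼 것이며, `numpy`를 `np`로 불러왔다고 가정한다.)

~~~python
def eval_numerical_gradient_centered(f, x, h=1e-5):
    # 중심차분 [f(x+h) - f(x-h)] / 2h 로 수치 그라디언트를 구하는 스케치
    grad = np.zeros(x.shape)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h
        fxph = f(x)               # f(x + h)
        x[ix] = old_value - h
        fxmh = f(x)               # f(x - h)
        x[ix] = old_value         # 이전 값 복원 (매우 중요!)
        grad[ix] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad
~~~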
-**A problem of efficiency**. You may have noticed that evaluating the numerical gradient has complexity linear in the number of parameters. In our example we had 30730 parameters in total and therefore had to perform 30,731 evaluations of the loss function to evaluate the gradient and to perform only a single parameter update. This problem only gets worse, since modern Neural Networks can easily have tens of millions of parameters. Clearly, this strategy is not scalable and we need something better.
+**효율성의 문제**. 아마 눈치챘겠지만, 그라디언트(gradient)를 수치적으로 계산하는 데 드는 비용은 파라미터(parameter/weight)의 수에 따라 선형적으로 늘어난다. 위 예시에서, 총 30,730개의 파라미터(parameter/weight)가 있으므로 30,731번 손실함수값을 계산해야 그라디언트(gradient)를 구할 수 있고, 그래 봐야 딱 한 번 업데이트할 수 있다. 요즘 쓰이는 신경망(neural networks)들은 수천만 개의 파라미터(parameter/weight)를 갖는 경우도 흔한데, 그런 경우 이 문제는 매우 심각해진다. 당연하게도 이 전략은 확장성이 없으며, 더 나은 방법이 필요하다.

-#### Computing the gradient analytically with Calculus
-The numerical gradient is very simple to compute using the finite difference approximation, but the downside is that it is approximate (since we have to pick a small value of *h*, while the true gradient is defined as the limit as *h* goes to zero), and that it is very computationally expensive to compute. The second way to compute the gradient is analytically using Calculus, which allows us to derive a direct formula for the gradient (no approximations) that is also very fast to compute. However, unlike the numerical gradient it can be more error prone to implement, which is why in practice it is very common to compute the analytic gradient and compare it to the numerical gradient to check the correctnes of your implementation. This is called a **gradient check**.
+#### 미적분을 이용하여 해석적으로 그라디언트(gradient)를 계산하기

+수치적인 그라디언트(gradient)는 유한 차이(finite difference)를 이용해서 매우 단순하게 계산할 수 있지만, 단점은 근사값이라는 점과 (그라디언트의 진짜 정의는 *h*가 0으로 수렴할 때의 극한값인데, 여기서는 그냥 작은 *h* 값을 쓰기 때문에) 계산이 매우 비효율적이라는 점이다. 두번째 방법은 미적분을 이용해서 해석적으로 그라디언트(gradient)를 구하는 것인데, 이는 (근사치가 아닌) 정확한 수식을 이용하기 때문에 계산도 매우 빠르다. 하지만, 수치적으로 구한 그라디언트(gradient)와는 다르게 구현하는 데 실수하기 쉽다. 그래서, 실제 응용에서는 해석적으로 구한 다음에 수치적으로 구한 것과 비교해보고, 틀린 경우 고치는 것이 일반적이다. 이 과정을 **그라디언트 체크(gradient check)**라고 한다. (역자 주: 이를 코드로 표현한 간단한 스케치를 이 단락 아래쪽에 덧붙여 두었다.)

-Lets use the example of the SVM loss function for a single datapoint:
+SVM 손실함수(loss function)의 예를 들어 설명해보자.

$$
-L\_i = \sum\_{j\neq y\_i} \left[ \max(0, w\_j^Tx\_i - w\_{y\_i}^Tx\_i + \Delta) \right]
+L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]
$$

-We can differentiate the function with respect to the weights. For example, taking the gradient with respect to \\(w\_{y\_i}\\) we obtain:
+파라미터(parameter/weight)로 이 함수를 미분할 수 있다. 예를 들어, $w_{y_i}$로 미분하면 이렇게 된다:

$$
-\nabla\_{w\_{y\_i}} L\_i = - \left( \sum\_{j\neq y\_i} \mathbb{1}(w\_j^Tx\_i - w\_{y\_i}^Tx\_i + \Delta > 0) \right) x\_i
+\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i
$$

-where \\(\mathbb{1}\\) is the indicator function that is one if the condition inside is true or zero otherwise. While the expression may look scary when it is written out, when you're implementing this in code you'd simply count the number of classes that didn't meet the desired margin (and hence contributed to the loss function) and then the data vector \\(x\_i\\) scaled by this number is the gradient. Notice that this is the gradient only with respect to the row of \\(W\\) that corresponds to the correct class. For the other rows where \\(j \neq y\_i \\) the gradient is:
+여기서 $\mathbb{1}$은 지시함수(indicator function)로, 쉽게 말해 괄호 안의 조건이 충족되면 1, 아니면 0인 값을 갖는다.
이렇게 써놓으면 무시무시해 보이지만, 실제로 코드로 구현할 때는 원하는 차이(마진, margin)를 못 만족시키는, 따라서 손실함수(loss function)의 증가에 일조하는 클래스의 개수를 세고, 이 숫자를 데이터 벡터 $x_i$에 곱하면 이것이 바로 그라디언트(gradient)이다. 단, 이는 정답 클래스에 해당하는 $W$의 행으로 미분했을 때의 그라디언트(gradient)이다. $j \neq y_i $인 다른 행에 대한 그라디언트(gradient)는 다음과 같다.

$$
-\nabla\_{w\_j} L\_i = \mathbb{1}(w\_j^Tx\_i - w\_{y\_i}^Tx\_i + \Delta > 0) x\_i
+\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i
$$

-Once you derive the expression for the gradient it is straight-forward to implement the expressions and use them to perform the gradient update.
+일단 그라디언트(gradient) 수식을 구하고 나면, 이를 코드로 구현해서 그라디언트 업데이트를 수행하는 것은 간단하다.
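(역자 주: 위에서 말한 그라디언트 체크(gradient check)를 코드로 표현하면 대략 다음과 같다. 원문에 없는 스케치로, 해석적 그라디언트를 계산해주는 가상의 함수 `analytic_grad`와 앞서 만든 `eval_numerical_gradient`, `CIFAR10_loss_fun`이 있다고 가정한다.)

~~~python
# 해석적 그라디언트와 수치적 그라디언트를 상대 오차(relative error)로 비교
grad_analytic = analytic_grad(W)  # 위 미분 수식을 구현한 가상의 함수 (가정)
grad_numerical = eval_numerical_gradient(CIFAR10_loss_fun, W)
rel_error = (np.abs(grad_analytic - grad_numerical) /
             (np.maximum(np.abs(grad_analytic), np.abs(grad_numerical)) + 1e-8))
print 'max relative error: %e' % np.max(rel_error)  # 이 값이 충분히 작아야 구현을 믿을 수 있다
~~~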
-### Gradient Descent
-Now that we can compute the gradient of the loss function, the procedure of repeatedly evaluating the gradient and then performing a parameter update is called *Gradient Descent*. Its **vanilla** version looks as follows:
+### 그라디언트 하강 (Gradient Descent)

+이제 손실함수(loss function)의 그라디언트(gradient)를 계산할 줄 알게 되었다. 그라디언트(gradient)를 반복해서 계산하고 그때마다 파라미터(parameter/weight)를 업데이트하는 과정을 *그라디언트 하강(Gradient Descent)*이라고 한다. 가장 **단순한(vanilla)** 버전은 다음과 같다:

-```python
-# Vanilla Gradient Descent
+~~~python
+# 단순한 그라디언트 하강(vanilla gradient descent)

while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
-  weights += - step_size * weights_grad # perform parameter update
-```
+  weights += - step_size * weights_grad # 파라미터 업데이트(parameter update)
+~~~

-This simple loop is at the core of all Neural Network libraries. There are other ways of performing the optimization (e.g. LBFGS), but Gradient Descent is currently by far the most common and established way of optimizing Neural Network loss functions. Throughout the class we will put some bells and whistles on the details of this loop (e.g. the exact details of the update equation), but the core idea of following the gradient until we're happy with the results will remain the same.
+이 단순한 루프가 모든 신경망(neural network) 라이브러리의 중심에 있다. 다른 최적화 방법들(예컨대 LBFGS)도 있긴 하지만, 현재로서는 그라디언트 하강(gradient descent)이 신경망(neural network)의 손실함수(loss function)를 최적화하는 가장 보편적이고 정착된 방법이다. 이 강의에서 이 루프의 세부사항(예를 들어, 업데이트 수식이 정확히 어떻게 되는지 등)에 이것저것 덧붙이기는 하겠지만, 결과에 만족할 때까지 그라디언트(gradient)를 따라서 움직인다는 핵심 아이디어는 바뀌지 않을 것이다.

+**미니배치 그라디언트 하강 (Mini-batch gradient descent, MGD).** (ILSVRC challenge처럼) 대규모의 응용사례에서는 학습데이터(training data)가 수백만 개 주어질 수 있다. 따라서, 파라미터를 한 번 업데이트하려고 학습데이터(training data) 전체를 계산에 사용하는 것은 낭비이다. 이를 극복하기 위해 흔하게 쓰이는 방법은, 학습데이터(training data)의 **배치(batches)**만 이용해서 그라디언트(gradient)를 구하는 것이다. 예를 들어 최신 ConvNet에서는, 한 번에 전체 120만 개의 학습데이터 중 256개짜리 배치만을 이용해서 그라디언트(gradient)를 구하고 파라미터(parameter/weight) 업데이트를 한다. 다음 코드를 보자.

-**Mini-batch gradient descent.** In large-scale applications (such as the ILSVRC challenge), the training data can have on order of millions of examples. Hence, it seems wasteful to compute the full loss function over the entire training set in order to perform only a single parameter update. A very common approach to addressing this challenge is to compute the gradient over **batches** of the training data. For example, in current state of the art ConvNets, a typical batch contains 256 examples from the entire training set of 1.2 million. This batch is then used to perform a parameter update:

-```python
-# Vanilla Minibatch Gradient Descent
+~~~python
+# 단순한 미니배치(minibatch) 그라디언트(gradient) 업데이트

while True:
-  data_batch = sample_training_data(data, 256) # sample 256 examples
+  data_batch = sample_training_data(data, 256) # 예제 256개짜리 미니배치(mini-batch)
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
-  weights += - step_size * weights_grad # perform parameter update
-```
+  weights += - step_size * weights_grad # 파라미터 업데이트(parameter update)
+~~~

-The reason this works well is that the examples in the training data are correlated. To see this, consider the extreme case where all 1.2 million images in ILSVRC are in fact made up of exact duplicates of only 1000 unique images (one for each class, or in other words 1200 identical copies of each image). Then it is clear that the gradients we would compute for all 1200 identical copies would all be the same, and when we average the data loss over all 1.2 million images we would get the exact same loss as if we only evaluated on a small subset of 1000. In practice of course, the dataset would not contain duplicate images, the gradient from a mini-batch is a good approximation of the gradient of the full objective. Therefore, much faster convergence can be achieved in practice by evaluating the mini-batch gradients to perform more frequent parameter updates.
+이 방법이 먹히는 이유는 학습데이터의 예시들이 서로 상관관계가 있기 때문이다. 이것에 대해 알아보기 위해, ILSVRC의 120만 개 이미지들이 사실은 1천 개의 서로 다른 이미지들의 복사본이라는 극단적인 경우를 생각해보자. (즉, 한 클래스당 하나씩이고, 각각이 1천2백 번 복사된 것.) 그러면 명백한 것은, 이 1천2백 개의 동일한 복사본에 대해 계산하는 그라디언트(gradient) 값은 모두 똑같다는 점이다. 그렇다면 120만 개 이미지 전체에 대해 데이터 손실(loss)을 평균내서 구하는 것이나, 서로 다른 1천 개의 이미지에 대해서만 평균내서 구하는 것이나 똑같다. 실제로는 당연히 중복된 데이터가 주어지지는 않겠지만, 미니배치(mini-batch)에서만 계산하는 그라디언트(gradient)는 전체 목적함수(objective)의 그라디언트(gradient)에 대한 괜찮은 근사값이 된다. 따라서, 미니배치(mini-batch)에서 그라디언트(gradient)를 구해서 더 자주 파라미터(parameter/weight)를 업데이트하면 실제로 더 빠르게 수렴하게 된다.

-The extreme case of this is a setting where the mini-batch contains only a single example. This process is called **Stochastic Gradient Descent (SGD)** (or also sometimes **on-line** gradient descent). This is relatively less common to see because in practice due to vectorized code optimizations it can be computationally much more efficient to evaluate the gradient for 100 examples, than the gradient for one example 100 times. Even though SGD technically refers to using a single example at a time to evaluate the gradient, you will hear people use the term SGD even when referring to mini-batch gradient descent (i.e. mentions of MGD for "Minibatch Gradient Descent", or BGD for "Batch gradient descent" are rare to see), where it is usually assumed that mini-batches are used. The size of the mini-batch is a hyperparameter but it is not very common to cross-validate it. It is usually based on memory constraints (if any), or set to some value, e.g. 32, 64 or 128. We use powers of 2 in practice because many vectorized operation implementations work faster when their inputs are sized in powers of 2.
+이 방법의 극단적인 형태는 미니배치(mini-batch)가 데이터 달랑 한 개로 이루어졌을 때이다. 이는 **확률 그라디언트 하강(Stochastic Gradient Descent, SGD)** (혹은 **온라인(on-line)** 그라디언트 하강)이라고 불린다. 이것은 상대적으로 덜 쓰이는데, 그 이유는 우리가 프로그램을 짤 때 계산을 벡터/행렬로 만들어서 하기 때문에, 한 예제에 대해 100번 계산하는 것보다 100개의 예제에 대해 1번 계산하는 것이 더 빠르기 때문이다. SGD가 엄밀한 의미에서는 예제 하나짜리 미니배치(mini-batch)에서 그라디언트(gradient)를 계산하는 것을 가리키지만, 사실 미니배치(mini-batch)를 쓰는 경우에도 SGD라고 부르는 사람들이 많다 (즉, '미니배치 그라디언트 하강'을 뜻하는 MGD나 '배치 그라디언트 하강'을 뜻하는 BGD라는 표현은 보기 드물다).
미니배치(mini-batch)의 크기도 하이퍼파라미터(hyperparameter)이지만, 이것을 교차검증하는 일은 흔치 않다. 이것은 대체로 컴퓨터 메모리 크기의 한계에 따라 결정되거나, 몇몇 특정 값(예를 들어, 32, 64, 128 같은 것)으로 정한다. 2의 제곱수를 쓰는 이유는, 입력의 크기가 2의 제곱수일 때 많은 벡터 연산 구현이 더 빠르게 동작하기 때문이다.

-
-### Summary
+
+### 요약
- +
- Summary of the information flow. The dataset of pairs of (x,y) is given and fixed. The weights start out as random numbers and can change. During the forward pass the score function computes class scores, stored in vector f. The loss function contains two components: The data loss computes the compatibility between the scores f and the labels y. The regularization loss is only a function of the weights. During Gradient Descent, we compute the gradient on the weights (and optionally on data if we wish) and use them to perform a parameter update during Gradient Descent.
+ 정보 흐름 요약. (x,y)라는 고정된 데이터 쌍이 주어져 있다. 처음에는 무작위로 뽑은 파라미터(parameter/weight) 값으로 시작해서 바꿔나간다. 왼쪽에서 오른쪽으로 가면서, 스코어함수(score function)가 각 클래스의 점수를 계산하고 그 값이 f 벡터에 저장된다. 손실함수(loss function)는 두 부분으로 나뉘어 있다. 첫째, 데이터 손실(data loss)은 스코어 f가 레이블 y와 얼마나 잘 맞는지를 계산한다. 둘째, regularization 손실(regularization loss)은 파라미터(parameter/weight)만의 함수이다. 그라디언트 하강(Gradient Descent) 과정에서, 파라미터(parameter/weight)로 미분한 (혹은 원한다면 데이터 값으로 추가로 미분한. 역자 주: 필요에 따라 데이터 값으로도 미분하는 경우가 있음.) 그라디언트(gradient)를 계산하고, 이것을 이용해서 파라미터(parameter/weight) 값을 업데이트한다.
-In this section,
+이 섹션에서 다음을 다루었다.

-- We developed the intuition of the loss function as a **high-dimensional optimization landscape** in which we are trying to reach the bottom. The working analogy we developed was that of a blindfolded hiker who wishes to reach the bottom. In particular, we saw that the SVM cost function is piece-wise linear and bowl-shaped.
-- We motivated the idea of optimizing the loss function with **iterative refinement**, where we start with a random set of weights and refine them step by step until the loss is minimized.
-- We saw that the **gradient** of a function gives the steepest ascent direction and we discussed a simple but inefficient way of computing it numerically using the finite difference approximation (the finite difference being the value of *h* used in computing the numerical gradient).
-- We saw that the parameter update requires a tricky setting of the **step size** (or the **learning rate**) that must be set just right: if it is too low the progress is steady but slow. If it is too high the progress can be faster, but more risky. We will explore this tradeoff in much more detail in future sections.
-- We discussed the tradeoffs between computing the **numerical** and **analytic** gradient. The numerical gradient is simple but it is approximate and expensive to compute. The analytic gradient is exact, fast to compute but more error-prone since it requires the derivation of the gradient with math. Hence, in practice we always use the analytic gradient and then perform a **gradient check**, in which its implementation is compared to the numerical gradient.
-- We introduced the **Gradient Descent** algorithm which iteratively computes the gradient and performs a parameter update in loop.

+- 손실함수(loss function)를 바닥을 향해 내려가야 하는 **고차원의 최적화 지형**으로 보는 직관을 키웠다. 이에 대한 비유는 눈가린 등산객이 하산하는 것이었다. 특히, SVM의 손실함수(loss function)가 부분적으로 선형(piecewise linear)인 밥공기 모양이라는 것을 확인했다.
+- 손실함수(loss function)를 최적화한다는 개념을, 무작위 파라미터(parameter/weight)에서 시작해서 손실(loss)이 최소화될 때까지 한 걸음씩 개선해 나가는 **반복적 개선**의 아이디어로 풀어냈다.
+- 함수의 **그라디언트(gradient)**는 그 함수값이 가장 빠르게 증가하는 방향이라는 점을 알아봤고, 이것을 유한 차이(finite difference, 즉 미분할 때 *h* 값이 유한하다는 의미)를 이용하여 단순무식하게 수치적으로 어림잡아 계산하는 방법도 알아보았다.
+- 파라미터(parameter/weight)를 업데이트할 때, 한 번에 얼마나 움직여야 하는지를 뜻하는 **스텝 크기**(혹은 **학습속도**)를 딱 맞게 설정하는 것이 까다로운 문제라는 것도 알아보았다. 이 값이 너무 낮으면 진행은 꾸준하지만 느리고, 너무 높으면 더 빨라질 수 있지만 위험하다. 이 장단점에 대해 다음 섹션들에서 자세하게 알아볼 것이다.
+- 그라디언트(gradient)를 계산할 때의 **수치적**인 방법과 **해석적**인 방법의 장단점을 알아보았다. 수치적인 그라디언트(gradient)는 단순하지만, 근사값이고 계산이 비싸다. 해석적인 그라디언트(gradient)는 정확하고 빠르지만, 수식으로 미분을 해야 해서 실수하기 쉽다. 따라서 실제 응용에서는 해석적인 그라디언트(gradient)를 쓰고, **그라디언트 체크(gradient check)**라는, 수치적인 그라디언트(gradient)와 비교하여 검증하는 과정을 거친다.
+- 반복적으로 루프(loop)를 돌려서 그라디언트(gradient)를 계산하고 파라미터(parameter/weight)를 업데이트하는 **그라디언트 하강(Gradient Descent)** 알고리즘을 소개했다.

-**Coming up:** The core takeaway from this section is that the ability to compute the gradient of a loss function with respect to its weights (and have some intuitive understanding of it) is the most important skill needed to design, train and understand neural networks. In the next section we will develop proficiency in computing the gradient analytically using the chain rule, otherwise also refered to as **backpropagation**. This will allow us to efficiently optimize relatively arbitrary loss functions that express all kinds of Neural Networks, including Convolutional Neural Networks.
+**예고:** 이 섹션의 핵심은, 손실함수(loss function)를 파라미터(parameter/weight)로 미분하여 그라디언트(gradient)를 계산하는 법(과 그에 대한 직관적인 이해)이 신경망(neural network)을 디자인하고, 학습시키고, 이해하는 데 있어 가장 중요한 기술이라는 점이다.
다음 섹션에서는, 그라디언트(gradient)를 해석적으로 구할 때 연쇄법칙을 이용하는, **backpropagation**이라고도 불리는 효율적인 방법에 대해 알아보겠다. 이 방법을 쓰면 컨볼루션 신경망(Convolutional Neural Networks)을 포함한 모든 종류의 신경망(Neural Networks)에서 쓰이는, 상대적으로 임의의(arbitrary) 손실함수(loss function)도 효율적으로 최적화시킬 수 있다.

+---
+

+번역: stats2ml +

diff --git a/optimization-2.md b/optimization-2.md
index 13ecb414..723df3b9 100644
--- a/optimization-2.md
+++ b/optimization-2.md
@@ -5,69 +5,72 @@ permalink: /optimization-2/

Table of Contents:

-- [Introduction](#intro)
-- [Simple expressions, interpreting the gradient](#grad)
-- [Compound expressions, chain rule, backpropagation](#backprop)
-- [Intuitive understanding of backpropagation](#intuitive)
-- [Modularity: Sigmoid example](#sigmoid)
-- [Backprop in practice: Staged computation](#staged)
-- [Patterns in backward flow](#patters)
-- [Gradients for vectorized operations](#mat)
-- [Summary](#summary)
+- [소개(Introduction)](#intro)
+- [그라디언트(Gradient)에 대한 간단한 표현과 이해](#grad)
+- [복합 표현식(Compound Expression), 연쇄 법칙(Chain rule), Backpropagation](#backprop)
+- [Backpropagation에 대한 직관적인 이해](#intuitive)
+- [모듈성: 시그모이드(Sigmoid) 예제](#sigmoid)
+- [Backprop 실제: 단계별 계산](#staged)
+- [역방향 흐름의 패턴](#patters)
+- [벡터 기반의 그라디언트(Gradient) 계산](#mat)
+- [요약](#summary)

### Introduction

-**Motivation**. In this section we will develop expertise with an intuitive understanding of **backpropagation**, which is a way of computing gradients of expressions through recursive application of **chain rule**. Understanding of this process and its subtleties is critical for you to understand, and effectively develop, design and debug Neural Networks.
+**Motivation**. 이번 섹션에서 우리는 **Backpropagation**에 대한 직관적인 이해를 바탕으로 전문지식을 더 키우고자 한다. Backpropagation은 **연쇄 법칙(chain rule)**을 재귀적으로 적용하여 표현식(expression)의 그라디언트(gradient)를 계산하는 방법이다. Backpropagation 과정과 그 세부 요소들에 대한 이해는, 신경망을 이해하고 효과적으로 개발, 디자인, 디버그하는 데 매우 중요하다.

+**Problem statement**. 이번 섹션에서 공부할 핵심 문제는 다음과 같다: 주어진 함수 $$f(x)$$가 있고 $$x$$는 입력값들로 이루어진 벡터일 때, $$x$$에서의 $$f$$의 그라디언트, 즉 $$\nabla f(x)$$를 계산하고자 한다.

-**Problem statement**. The core problem studied in this section is as follows: We are given some function \\(f(x)\\) where \\(x\\) is a vector of inputs and we are interested in computing the gradient of \\(f\\) at \\(x\\) (i.e. \\(\nabla f(x)\\) ).

+**Motivation**. 우리가 이 문제에 관심을 기울이는 이유를 신경망 관점에서 좀더 구체적으로 살펴보자. $$f$$는 손실함수($$L$$)에 해당하고, 입력값 $$x$$는 학습데이터(training data)와 신경망의 파라미터(weight)라고 볼 수 있다. 예를 들면, 손실함수는 SVM loss 함수가 될 수 있고, 입력값은 학습데이터 $$(x_i,y_i), i=1 \ldots N$$와 파라미터(weight) 및 바이어스(bias) $$W,b$$로 볼 수 있다. 여기서 (보통의 기계학습에서 그러하듯) 학습데이터는 미리 주어져서 고정되어 있는 값으로 볼 수 있고, 파라미터(weight)는 신경망의 학습을 위해 실제로 우리가 컨트롤하는 값이다. 따라서 입력값 $$x_i$$에 대한 그라디언트를 계산하는 것이 쉬울지라도, 실제로는 보통 파라미터(parameter)에 대한 그라디언트만 계산하고, 그 그라디언트 값을 활용하여 파라미터를 업데이트한다. 하지만, 신경망이 어떻게 작동하는지 해석하고 시각화하는 부분에서는 입력값 $$x_i$$에 대한 그라디언트도 유용하게 활용될 수 있는데, 이 부분은 본 강의의 뒷부분에서 다룰 예정이다.

-**Motivation**. Recall that the primary reason we are interested in this problem is that in the specific case of Neural Networks, \\(f\\) will correspond to the loss function ( \\(L\\) ) and the inputs \\(x\\) will consist of the training data and the neural network weights. For example, the loss could be the SVM loss function and the inputs are both the training data \\((x\_i,y\_i), i=1 \ldots N\\) and the weights and biases \\(W,b\\). Note that (as is usually the case in Machine Learning) we think of the training data as given and fixed, and of the weights as variables we have control over. Hence, even though we can easily use backpropagation to compute the gradient on the input examples \\(x\_i\\), in practice we usually only compute the gradient for the parameters (e.g. \\(W,b\\)) so that we can use it to perform a parameter update.
However, as we will see later in the class the gradient on \\(x\_i\\) can still be useful sometimes, for example for purposes of visualization and interpreting what the Neural Network might be doing. -If you are coming to this class and you're comfortable with deriving gradients with chain rule, we would still like to encourage you to at least skim this section, since it presents a rarely developed view of backpropagation as backward flow in real-valued circuits and any insights you'll gain may help you throughout the class. +여러분이 이미 연쇄 법칙을 통해 그라디언트를 도출하는데 익숙하더라도 이 섹션을 간략히 훑어보기를 권장한다. 왜냐하면 이 섹션에서는 다른데서는 보기 힘든 Backpropagation에 대한 실제 숫자를 활용한 역방향 흐름(Backward Flow)에 대해 설명을 할 것이고, 이를 통해 여러분이 얻게 될 통찰력은 이번 강의 전체에 있어 도움이 될 것이라 생각하기 때문이다. -### Simple expressions and interpretation of the gradient -Lets start simple so that we can develop the notation and conventions for more complex expressions. Consider a simple multiplication function of two numbers \\(f(x,y) = x y\\). It is a matter of simple calculus to derive the partial derivative for either input: +### 그라디언트(Gradient)에 대한 간단한 표현과 이해 + +복잡한 모델에 대한 수식등을 만들기에 앞서 간단하게 시작을 해보자. x와 y 두 숫자의 곱을 계산하는 간단한 함수 f를 정의하자. $$f(x,y) = x y$$. 각각의 입력 변수에 대한 편미분은 간단한 수학으로 아래와 같이 구해 진다. : $$ -f(x,y) = x y \hspace{0.5in} \rightarrow \hspace{0.5in} \frac{\partial f}{\partial x} = y \hspace{0.5in} \frac{\partial f}{\partial y} = x +f(x,y) = x y \hspace{0.5in} \rightarrow \hspace{0.5in} \frac{\partial f}{\partial x} = y \hspace{0.5in} \frac{\partial f}{\partial y} = x $$ -**Interpretation**. Keep in mind what the derivatives tell you: They indicate the rate of change of a function with respect to that variable surrounding an infinitesimally small region near a particular point: +**Interpretation**. 미분이 여러분에게 시사하는 바를 명심하자 : 미분은 입력 변수 부근의 아주 작은(0에 매우 가까운) 변화에 대한 해당 함수 값의 변화량이다. : $$ \frac{df(x)}{dx} = \lim_{h\ \to 0} \frac{f(x + h) - f(x)}{h} $$ -A technical note is that the division sign on the left-hand sign is, unlike the division sign on the right-hand sign, not a division. Instead, this notation indicates that the operator \\( \frac{d}{dx} \\) is being applied to the function \\(f\\), and returns a different function (the derivative). A nice way to think about the expression above is that when \\(h\\) is very small, then the function is well-approximated by a straight line, and the derivative is its slope. In other words, the derivative on each variable tells you the sensitivity of the whole expression on its value. For example, if \\(x = 4, y = -3\\) then \\(f(x,y) = -12\\) and the derivative on \\(x\\) \\(\frac{\partial f}{\partial x} = -3\\). This tells us that if we were to increase the value of this variable by a tiny amount, the effect on the whole expression would be to decrease it (due to the negative sign), and by three times that amount. This can be seen by rearranging the above equation ( \\( f(x + h) = f(x) + h \frac{df(x)}{dx} \\) ). Analogously, since \\(\frac{\partial f}{\partial y} = 4\\), we expect that increasing the value of \\(y\\) by some very small amount \\(h\\) would also increase the output of the function (due to the positive sign), and by \\(4h\\). +위에 수식을 기술적인 관점에서 보면, 왼쪽에 있는 분수 기호(가로바)는 오른쪽 분수 기호와 달리 나누기를 뜻하지는 않는다. 대신 연산자 $$ \frac{d}{dx} $$ 가 함수 $$f$$에 적용 되어 미분 된 함수를 의미 하는 것이다. 위의 수식을 이해하는 가장 좋은 방법은 $$h$$가 매우 작으면 함수 $$f$$는 직선으로 근사(Approximated) 될 수 있고, 미분 값은 그 직선의 기울기를 뜻한다. 다시말해, 만약 $$x = 4, y = -3$$ 이면 $$f(x,y) = -12$$ 가 되고, $$x$$에 대한 편미분 값은 $$x$$ $$\frac{\partial f}{\partial x} = -3$$ 으로 얻어진다. 
이 말인즉슨, 우리가 $$x$$를 아주 조금 증가시키면 (미분 값이 음수이므로) 전체 함수 값은 그 증가량의 3배만큼 작아진다는 의미이다. 이것은 위의 수식을 ( $$ f(x + h) = f(x) + h \frac{df(x)}{dx} $$ )와 같이 재구성해 보면 간단히 확인할 수 있다. 비슷하게 $$\frac{\partial f}{\partial y} = 4$$ 이므로, $$y$$ 값을 아주 작은 $$h$$ 만큼 증가시킨다면 (이번에는 미분 값이 양수이므로) 전체 함수 값은 $$4h$$ 만큼 증가하게 될 것이다.

-> The derivative on each variable tells you the sensitivity of the whole expression on its value.

+> 각 변수에 대한 미분은 그 변수의 값에 대한 전체 표현식(Expression) 결과 값의 민감도(sensitivity)를 말해준다.

-As mentioned, the gradient \\(\nabla f\\) is the vector of partial derivatives, so we have that \\(\nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}] = [y, x]\\). Even though the gradient is technically a vector, we will often use terms such as *"the gradient on x"* instead of the technically correct phrase *"the partial derivative on x"* for simplicity.

+앞서 말했듯이, 그라디언트 $$\nabla f$$는 편미분 값들의 벡터이다. 수식으로 표현하면 $$\nabla f = [\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}] = [y, x]$$ 이다. 그라디언트는 기술적으로는 벡터이지만, 간결한 표현을 위해 *"x에 대한 편미분"* 이라는 정확한 표현 대신 *"x에 대한 그라디언트"* 와 같은 표현을 종종 쓰게 될 것이다.

-We can also derive the derivatives for the addition operation:

+덧셈 연산에 대해서도 미분값(그라디언트)을 구해보자:

$$ f(x,y) = x + y \hspace{0.5in} \rightarrow \hspace{0.5in} \frac{\partial f}{\partial x} = 1 \hspace{0.5in} \frac{\partial f}{\partial y} = 1 $$

-that is, the derivative on both \\(x,y\\) is one regardless of what the values of \\(x,y\\) are. This makes sense, since increasing either \\(x,y\\) would increase the output of \\(f\\), and the rate of that increase would be independent of what the actual values of \\(x,y\\) are (unlike the case of multiplication above). The last function we'll use quite a bit in the class is the *max* operation:

+위의 수식에서 볼 수 있듯이, $$x,y$$에 대한 미분은 $$x,y$$ 값에 관계없이 1이다. $$x,y$$ 중 어느 쪽이 증가해도 $$f$$가 증가하고, 그 증가율 또한 (앞서 살펴본 곱셈의 경우와 달리) $$x,y$$의 실제 값과는 무관하기 때문에 이는 자연스러운 결과이다. 마지막으로 살펴볼 함수는 이 수업에서 자주 쓰게 될 *max* 연산이다:

$$ f(x,y) = \max(x, y) \hspace{0.5in} \rightarrow \hspace{0.5in} \frac{\partial f}{\partial x} = \mathbb{1}(x >= y) \hspace{0.5in} \frac{\partial f}{\partial y} = \mathbb{1}(y >= x) $$

-That is, the (sub)gradient is 1 on the input that was larger and 0 on the other input. Intuitively, if the inputs are \\(x = 4,y = 2\\), then the max is 4, and the function is not sensitive to the setting of \\(y\\). That is, if we were to increase it by a tiny amount \\(h\\), the function would keep outputting 4, and therefore the gradient is zero: there is no effect. Of course, if we were to change \\(y\\) by a large amount (e.g. larger than 2), then the value of \\(f\\) would change, but the derivatives tell us nothing about the effect of such large changes on the inputs of a function; They are only informative for tiny, infinitesimally small changes on the inputs, as indicated by the \\(\lim\_{h \rightarrow 0}\\) in its definition.

+더 큰 입력 값에 대한 (서브)그라디언트는 1이고, 다른 입력 값에 대한 그라디언트는 0이 된다. 직관적으로 보면, $$x = 4, y = 2$$인 경우 max 값은 4이고, 이 함수는 현재의 $$y$$ 값에 영향을 받지 않는다. 바꾸어 말하면, $$y$$ 값을 아주 작은 값 $$h$$ 만큼 증가시키더라도 이 함수의 출력은 4로 유지되므로 그라디언트는 0이다 ($$y$$ 값의 영향이 없다). 물론 $$y$$ 값을 매우 크게 (예를 들면 2 이상으로) 증가시킨다면 함수 $$f$$ 값은 바뀌겠지만, 미분은 이런 큰 변화에 대해서는 아무것도 말해주지 않는다. 미분이라는 것은 본래 그 정의($$\lim_{h \rightarrow 0}$$)에서 보이듯 아주 작은 입력 값 변화에 대해서만 의미를 갖기 때문이다.

-### Compound expressions with chain rule

-Lets now start to consider more complicated expressions that involve multiple composed functions, such as \\(f(x,y,z) = (x + y) z\\).
This expression is still simple enough to differentiate directly, but we'll take a particular approach to it that will be helpful with understanding the intuition behind backpropagation. In particular, note that this expression can be broken down into two expressions: \\(q = x + y\\) and \\(f = q z\\). Moreover, we know how to compute the derivatives of both expressions separately, as seen in the previous section. \\(f\\) is just multiplication of \\(q\\) and \\(z\\), so \\(\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q\\), and \\(q\\) is addition of \\(x\\) and \\(y\\) so \\( \frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1 \\). However, we don't necessarily care about the gradient on the intermediate value \\(q\\) - the value of \\(\frac{\partial f}{\partial q}\\) is not useful. Instead, we are ultimately interested in the gradient of \\(f\\) with respect to its inputs \\(x,y,z\\). The **chain rule** tells us that the correct way to "chain" these gradient expressions together is through multiplication. For example, \\(\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} \\). In practice this is simply a multiplication of the two numbers that hold the two gradients. Lets see this with an example: +### 연쇄 법칙(Chain rule)을 이용한 복합 표현식 + +이제 $$f(x,y,z) = (x + y) z$$ 같은 다수의 복합 함수(composed functions)를 수반하는 더 복잡한 표현식을 고려해보자. 이 표현식은 여전히 바로 미분하기에 충분히 간단하지만, 우리는 이 식에 특별한 접근법을 적용할 것이다. 이는 backpropagation 뒤에 있는 직관을 이해하는데 도움이 될 것이다. 특히 이 식이 두 개의 표현식 $$q = x + y$$와 $$f = q z$$ 으로 분해될 수 있음에 주목하자. 게다가 이전 섹션에서 본 것처럼 우리는 두 식에 대한 미분값을 어떻게 따로따로 계산할지 알고 있다. $$f$$ 는 단지 $$q$$와 $$z$$의 곱이다. 따라서 $$\frac{\partial f}{\partial q} = z, \frac{\partial f}{\partial z} = q$$, 그리고 $$q$$는 $$x$$와 $$y$$의 합이므로 $$\frac{\partial q}{\partial x} = 1, \frac{\partial q}{\partial y} = 1$$이다. 하지만, 중간 결과값인 $$q$$에 대한 그라디언트($$\frac{\partial f}{\partial q}$$)를 신경쓸 필요가 없다. 대신 궁극적으로 입력 $$x,y,z$$에 대한 $$f$$의 그라디언트에 관심이 있다. **연쇄 법칙** 은 이러한 그라디언트 표현식들을 함께 연결시키는 적절한 방법이 곱하는 것이라는 것을 보여준다. 예를 들면, $$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \frac{\partial q}{\partial x} $$와 같이 표현할 수 있다. 실제로 이는 단순히 두 그라디언트를 담고 있는 두 수의 곱셈이다. 하나의 예를 통해 확인 해보자. -```python +~~~python # set some inputs x = -2; y = 5; z = -4 @@ -82,82 +85,84 @@ dfdq = z # df/dq = z, so gradient on q becomes -4 # now backprop through q = x + y dfdx = 1.0 * dfdq # dq/dx = 1. And the multiplication here is the chain rule! dfdy = 1.0 * dfdq # dq/dy = 1 -``` +~~~ -At the end we are left with the gradient in the variables `[dfdx,dfdy,dfdz]`, which tell us the sensitivity of the variables `x,y,z` on `f`!. This is the simplest example of backpropagation. Going forward, we will want to use a more concise notation so that we don't have to keep writing the `df` part. That is, for example instead of `dfdq` we would simply write `dq`, and always assume that the gradient is with respect to the final output. +결국 `[dfdx,dfdy,dfdz]` 변수들로 그라디언트가 표현되는데, 이는 `f`에 대한 변수 `x,y,z`의 민감도(sensitivity)를 보여준다. 이는 backpropagation의 가장 간단한 예이다. 더 나아가서 보다 간결한 표기법을 사용해서 `df` 파트를 계속 쓸 필요가 없도록 하고 싶을 것이다. 예를 들어 `dfdq` 대신에 단순히 `dq`를 쓰고 항상 그라디언트가 최종 출력에 관한 것이라 가정하는 것이다. -This computation can also be nicely visualized with a circuit diagram: +또한 이런 계산은 회로도를 가지고 다음과 같이 멋지게 시각화할 수 있다:
[그림: 실수 값 회로의 예. 입력 x=-2, y=5, z=-4에 대해 전방 전달 값 q = x+y = 3, f = qz = -12 (녹색)와, 후방 전달로 계산된 그라디언트 df/df=1, df/dz=3, df/dq=-4, df/dx=-4, df/dy=-4 (적색)가 표시되어 있다.]
- The real-valued "circuit" on left shows the visual representation of the computation. The forward pass computes values from inputs to output (shown in green). The backward pass then performs backpropagation which starts at the end and recursively applies the chain rule to compute the gradients (shown in red) all the way to the inputs of the circuit. The gradients can be thought of as flowing backwards through the circuit. + 좌측에 실수 값으로 표현되는 "회로"는 이 계산에 대한 시각 표현을 보여준다. 전방 전달(forward pass)은 입력부터 출력까지 값을 계산한다 (녹색으로 표시). 그리고 나서 후방 전달(backward pass)은 backpropagation을 수행하는데, 이는 끝에서 시작해서 반복적으로 연쇄 법칙을 적용해 회로 입력에 대한 모든 길에서 그라디언트 값(적색으로 표시)을 계산한다. 그라디언트 값은 회로를 통해 거꾸로 흐르는 것으로 볼 수 있다.
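위 회로에서 backprop으로 구한 그라디언트 값들이 맞는지는, 앞 섹션(optimization-1)에서 본 수치적 그라디언트(numerical gradient)와 비교해 간단히 확인해 볼 수 있다. 아래는 원문에는 없는 간단한 확인용 스케치로, 앞 코드의 결과 값 [dfdx, dfdy, dfdz] = [-4, -4, 3] 을 가정한 것이다:

~~~python
# f(x,y,z) = (x + y) * z 에 대해, backprop으로 구한 그라디언트를
# 중심 차분(centered difference)으로 구한 수치적 그라디언트와 비교해 본다.
def f(x, y, z):
    return (x + y) * z

x, y, z = -2.0, 5.0, -4.0
h = 1e-5

num_dfdx = (f(x + h, y, z) - f(x - h, y, z)) / (2 * h)  # df/dx 근사값
num_dfdy = (f(x, y + h, z) - f(x, y - h, z)) / (2 * h)  # df/dy 근사값
num_dfdz = (f(x, y, z + h) - f(x, y, z - h)) / (2 * h)  # df/dz 근사값

print num_dfdx, num_dfdy, num_dfdz  # 출력 (근사적으로) "-4.0 -4.0 3.0"
# 회로에서 구한 [dfdx, dfdy, dfdz] = [-4, -4, 3] 과 일치한다.
~~~

이처럼 해석적 그라디언트를 수치적 그라디언트와 비교해 보는 것을 그라디언트 체크(gradient check)라고 부른다.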
-### Intuitive understanding of backpropagation

-Notice that backpropagation is a beautifully local process. Every gate in a circuit diagram gets some inputs and can right away compute two things: 1. its output value and 2. the *local* gradient of its inputs with respect to its output value. Notice that the gates can do this completely independently without being aware of any of the details of the full circuit that they are embedded in. However, once the forward pass is over, during backpropagation the gate will eventually learn about the gradient of its output value on the final output of the entire circuit. Chain rule says that the gate should take that gradient and multiply it into every gradient it normally computes for all of its inputs.

+### Backpropagation에 대한 직관적 이해

+backpropagation이 굉장히 지역적인(local) 프로세스임에 주목하자. 회로도 내의 모든 게이트(gate)는 몇 개의 입력을 받아들이면 곧바로 두 가지를 계산할 수 있다: 1. 게이트의 출력 값, 2. 게이트 출력에 대한 입력들의 *지역적(local)* 그라디언트 값. 게이트들은 자신이 포함된 전체 회로의 세세한 부분을 전혀 모르더라도 이 값들을 완전히 독립적으로 계산할 수 있음에 주목하라. 하지만 일단 전방 전달이 끝나면, backpropagation 과정에서 게이트는 결국 전체 회로의 최종 출력에 대한 자신의 출력 값의 그라디언트를 알게 된다. 연쇄 법칙에 따르면 게이트는 이 그라디언트 값을 받아서, 모든 입력에 대해 평소처럼 계산한 각각의 지역적 그라디언트 값에 곱하면 된다.

-> This extra multiplication (for each input) due to the chain rule can turn a single and relatively useless gate into a cog in a complex circuit such as an entire neural network.

+> 연쇄 법칙 덕분에, 각 입력마다 이루어지는 이 추가적인 곱셈 하나가 그 자체로는 상대적으로 쓸모없는 개개의 게이트를 전체 신경망과 같은 복잡한 회로의 톱니바퀴(cog)로 바꿔 놓는다.

-Lets get an intuition for how this works by referring again to the example. The add gate received inputs [-2, 5] and computed output 3. Since the gate is computing the addition operation, its local gradient for both of its inputs is +1. The rest of the circuit computed the final value, which is -12. During the backward pass in which the chain rule is applied recursively backwards through the circuit, the add gate (which is an input to the multiply gate) learns that the gradient for its output was -4. If we anthropomorphize the circuit as wanting to output a higher value (which can help with intuition), then we can think of the circuit as "wanting" the output of the add gate to be lower (due to negative sign), and with a *force* of 4. To continue the recurrence and to chain the gradient, the add gate takes that gradient and multiplies it to all of the local gradients for its inputs (making the gradient on both **x** and **y** 1 * -4 = -4). Notice that this has the desired effect: If **x,y** were to decrease (responding to their negative gradient) then the add gate's output would decrease, which in turn makes the multiply gate's output increase.

+다시 위 예를 통해 이것이 어떻게 동작하는지에 대한 직관을 얻어 보자. 덧셈 게이트는 입력 [-2, 5]를 받아 3을 출력한다. 이 게이트는 덧셈 연산을 하고 있기 때문에 두 입력에 대한 게이트의 지역적 그라디언트 값은 +1이 된다. 회로의 나머지 부분을 통해 최종 출력 값으로 -12가 나온다. 연쇄 법칙이 회로를 역으로 가로질러 반복적으로 적용되는 후방 전달 과정 동안, (곱셈 게이트의 입력인) 덧셈 게이트는 자신의 출력 값에 대한 그라디언트 값이 -4였다는 것을 알게 된다. 만약 회로가 더 높은 값을 출력하기를 원하는 것으로 의인화하면 (이는 직관에 도움이 될 수 있다), 이 회로가 덧셈 게이트의 출력 값이 4의 *힘*으로 낮아지길 (음의 부호이기 때문에) "원하는" 것으로 볼 수 있다. 반복을 지속하고 그라디언트 값을 연결하기 위해 덧셈 게이트는 이 그라디언트 값을 받아서 모든 입력들에 대한 지역적 그라디언트 값에 곱한다 (**x**와 **y**에 대한 그라디언트 값이 1 * -4 = -4가 되도록). 이것이 원하는 효과를 낳는다는 사실에 주목하자. 만약 **x,y**가 (음의 그라디언트 값에 대한 반응으로) 감소한다면, 이 덧셈 게이트의 출력은 감소할 것이고, 이는 다시 곱셈 게이트의 출력이 증가하도록 만들 것이다.

-Backpropagation can thus be thought of as gates communicating to each other (through the gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as to make the final output value higher.
+따라서 backpropagation은, 최종 출력 값이 더 커지도록 게이트들이 (그라디언트 신호를 통해) 자신들의 출력이 증가하길 원하는지 감소하길 원하는지, 그리고 얼마나 강하게 원하는지를 서로 소통하는 과정으로 볼 수 있다.

-### Modularity: Sigmoid example

-The gates we introduced above are relatively arbitrary. Any kind of differentiable function can act as a gate, and we can group multiple gates into a single gate, or decompose a function into multiple gates whenever it is convenient. Lets look at another expression that illustrates this point:

+### 모듈성: 시그모이드(Sigmoid) 예제

+위에서 본 게이트들은 상대적으로 임의로 선택된 것이다. 미분 가능하기만 하다면 어떤 종류의 함수도 게이트 역할을 할 수 있으며, 필요한 경우 여러 개의 게이트를 묶어서 하나의 게이트로 만들거나, 하나의 함수를 여러 개의 게이트로 분해할 수도 있다. 이러한 요점을 보여주는 다른 표현식을 살펴보자:

$$ -f(w,x) = \frac{1}{1+e^{-(w\_0x\_0 + w\_1x\_1 + w\_2)}} +f(w,x) = \frac{1}{1+e^{-(w_0x_0 + w_1x_1 + w_2)}} $$

-as we will see later in the class, this expression describes a 2-dimensional neuron (with inputs **x** and weights **w**) that uses the *sigmoid activation* function. But for now lets think of this very simply as just a function from inputs *w,x* to a single number. The function is made up of multiple gates. In addition to the ones described already above (add, mul, max), there are four more:

+나중에 수업에서 보겠지만, 이 표현식은 *시그모이드 활성(sigmoid activation)* 함수를 사용하는 2차원 뉴런(입력 **x**와 가중치 **w**를 갖는)을 나타낸다. 그러나 지금은 이를 매우 단순하게, *w,x*를 입력으로 받아 하나의 숫자를 출력하는 함수 정도로 생각하자. 이 함수는 여러 개의 게이트로 구성된다. 위에서 이미 설명한 게이트들(덧셈, 곱셈, 최대)에 더해 네 종류의 게이트가 더 있다:

$$ -f(x) = \frac{1}{x} -\hspace{1in} \rightarrow \hspace{1in} -\frac{df}{dx} = -1/x^2 +f(x) = \frac{1}{x} +\hspace{1in} \rightarrow \hspace{1in} +\frac{df}{dx} = -1/x^2 \\\\ -f\_c(x) = c + x -\hspace{1in} \rightarrow \hspace{1in} -\frac{df}{dx} = 1 +f_c(x) = c + x +\hspace{1in} \rightarrow \hspace{1in} +\frac{df}{dx} = 1 \\\\ f(x) = e^x -\hspace{1in} \rightarrow \hspace{1in} +\hspace{1in} \rightarrow \hspace{1in} \frac{df}{dx} = e^x \\\\ -f\_a(x) = ax -\hspace{1in} \rightarrow \hspace{1in} +f_a(x) = ax +\hspace{1in} \rightarrow \hspace{1in} \frac{df}{dx} = a $$

-Where the functions \\(f\_c, f\_a\\) translate the input by a constant of \\(c\\) and scale the input by a constant of \\(a\\), respectively. These are technically special cases of addition and multiplication, but we introduce them as (new) unary gates here since we do need the gradients for the constants. \\(c,a\\). The full circuit then looks as follows:

+여기서 $$f_c, f_a$$는 각각 입력을 상수 $$c$$만큼 이동시키고, 상수 $$a$$만큼 크기를 조정하는 함수이다. 이 함수들은 기술적으로는 덧셈과 곱셈의 특별한 경우이지만, 여기서는 상수 $$c,a$$에 대한 그라디언트가 필요하기에 (새로운) 단항(unary) 게이트로 소개한다. 그러면 전체 회로는 다음과 같이 나타난다:
[그림: 시그모이드 뉴런 회로의 예. 입력 w0=2.00, x0=-1.00, w1=-3.00, x1=-2.00, w2=-3.00 이 곱셈/덧셈 게이트를 거쳐 내적 1.00을 만들고, 이어서 *-1, exp, +1, 1/x 게이트를 차례로 통과해 최종 출력 0.73이 된다. 각 노드에는 전방 전달 값과 함께 후방 전달로 계산된 그라디언트 값(예: 최종 출력 0.73에 대해 1.00, 내적 노드에 대해 0.20)이 표시되어 있다.]
- Example circuit for a 2D neuron with a sigmoid activation function. The inputs are [x0,x1] and the (learnable) weights of the neuron are [w0,w1,w2]. As we will see later, the neuron computes a dot product with the input and then its activation is softly squashed by the sigmoid function to be in range from 0 to 1. + 시그모이드 활성 함수를 갖는 2차원 뉴런에 대한 예시 회로. 입력은 [x0,x1]이고 뉴런의 (학습 가능한) 파라미터 값들은 [w0,w1,w2]이다. 나중에 보겠지만, 뉴런은 입력을 가지고 내적을 계산하고 이 입력의 활성 함수 출력 값은 0부터 1사이의 범위에 들어가도록 시그모이드 함수에 의해 압착(squash)이 된다.
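위 그림의 값들은 게이트 하나하나를 직접 따라가며 재현해 볼 수 있다. 아래는 원문에는 없는 간단한 전방 전달 스케치로, 그림에 표시된 입력 값들(w0=2, x0=-1, w1=-3, x1=-2, w2=-3)을 그대로 가정한 것이다:

~~~python
import math

# 그림의 입력 값들
w0, x0 = 2.0, -1.0
w1, x1 = -3.0, -2.0
w2 = -3.0

# 전방 전달: 게이트를 하나씩 차례로 통과한다
dot = w0*x0 + w1*x1 + w2  # 내적; = 1.00 (곱셈 게이트 2개와 덧셈 게이트 2개)
neg = -1.0 * dot          # = -1.00 (*-1 게이트)
e   = math.exp(neg)       # = 0.37  (exp 게이트)
den = 1.0 + e             # = 1.37  (+1 게이트)
f   = 1.0 / den           # = 0.73  (1/x 게이트)

print f            # 출력 "0.731..." - 그림의 최종 출력 0.73과 일치
print (1 - f) * f  # 출력 "0.196..." - 그림에서 내적 노드에 표시된 그라디언트 0.20과 일치
~~~

바로 아래에서 보겠지만, 이 게이트들을 하나의 시그모이드 게이트로 묶으면 이 지역 그라디언트를 $$(1 - \sigma(x))\sigma(x)$$ 한 줄로 계산할 수 있다.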
-In the example above, we see a long chain of function applications that operates on the result of the dot product between **w,x**. The function that these operations implement is called the *sigmoid function* \\(\sigma(x)\\). It turns out that the derivative of the sigmoid function with respect to its input simplifies if you perform the derivation (after a fun tricky part where we add and subtract a 1 in the numerator): +위 예제에서 **w,x** 사이의 내적의 결과로 동작하는 함수 적용(function applications)의 긴 체인을 보았다. 이러한 연산을 제공하는 함수를 *시그모이드 함수(sigmoid function)* $$\sigma(x)$$ 라고 한다. 만약 (분자에 1을 더하고 다시 빼는 재미있지만 까다로운 과정을 거친 후에)미분을 한다면 입력에 대한 시그모이드 함수의 미분값은 단순화할 수 있는 것으로 알려져 있다. $$ \sigma(x) = \frac{1}{1+e^{-x}} \\\\ -\rightarrow \hspace{0.3in} \frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1+e^{-x})^2} = \left( \frac{1 + e^{-x} - 1}{1 + e^{-x}} \right) \left( \frac{1}{1+e^{-x}} \right) +\rightarrow \hspace{0.3in} \frac{d\sigma(x)}{dx} = \frac{e^{-x}}{(1+e^{-x})^2} = \left( \frac{1 + e^{-x} - 1}{1 + e^{-x}} \right) \left( \frac{1}{1+e^{-x}} \right) = \left( 1 - \sigma(x) \right) \sigma(x) $$ -As we see, the gradient turns out to simplify and becomes surprisingly simple. For example, the sigmoid expression receives the input 1.0 and computes the ouput 0.73 during the forward pass. The derivation above shows that the *local* gradient would simply be (1 - 0.73) * 0.73 ~= 0.2, as the circuit computed before (see the image above), except this way it would be done with a single, simple and efficient expression (and with less numerical issues). Therefore, in any real practical application it would be very useful to group these operations into a single gate. Lets see the backprop for this neuron in code: +보이는 것처럼 그라디언트는 단순화되면서 놀라울만큼 간단해진다.예를 들어 시그모이드 표현은 전방 전달(forward pass) 과정에서 입력 1.0을 받아 출력 0.73을 계산한다. 단일의 단순하고 효율적인 표현식을 이용해 (그리고 더 적은 수치적인 문제를 갖고) 계산하는 방식을 제외하고서, 마치 이전에 본 회로가 계산했던 것(위 그림을 보라)과 비슷하게 위의 미분은 *지역(local)* 그라디언트 값이 단순히 (1 - 0.73) * 0.73 ~= 0.2 가 됨을 보여준다. 그러므로 어떤 실제 실용적인 적용에서 그러한 연산들을 단일 게이트로 묶어주는 것은 매우 유용하다고 할 수 있다. 코드에서 이 뉴런에 대한 backprop를 살펴보자: -```python +~~~python w = [2,-3,-3] # assume some random weights and data x = [-1, -2] @@ -170,24 +175,25 @@ ddot = (1 - f) * f # gradient on dot variable, using the sigmoid gradient deriva dx = [w[0] * ddot, w[1] * ddot] # backprop into x dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot] # backprop into w # we're done! we have the gradients on the inputs to the circuit -``` +~~~ -**Implementation protip: staged backpropagation**. As shown in the code above, in practice it is always helpful to break down the forward pass into stages that are easily backpropped through. For example here we created an intermediate variable `dot` which holds the output of the dot product between `w` and `x`. During backward pass we then successively compute (in reverse order) the corresponding variables (e.g. `ddot`, and ultimately `dw, dx`) that hold the gradients of those variables. +**구현 팁(protip): 단계적 backpropagation**. 위 코드에서 볼 수 있듯이, 전방 전달(forward pass)를 쉽게 backprop되는 단계들로 잘게 분해하는 것은 실질적으로 항상 도움이 된다. 예를 들어 우리는 여기서 `w`와 `x` 사이의 내적의 결과를 담는 중간 변수 `dot`를 만들었다. 그리고나서 후방 전달(backward pass) 과정에서 그러한 변수들의 그라디언트 값들을 담은 해당 변수들(예: `ddot` 및 궁극적으로는 `dw, dx`)을 성공적으로 계산한다(역순으로). -The point of this section is that the details of how the backpropagation is performed, and which parts of the forward function we think of as gates, is a matter of convenience. It helps to be aware of which parts of the expression have easy local gradients, so that they can be chained together with the least amount of code and effort. 
+이 섹션에서 요점은 어떻게 backpropagation이 수행되는 지와 전방 함수(forward function)의 어느 부분을 게이트로 취급해야할 지에 대한 세부사항은 편의성 문제라는 것이다. 이는 표현식의 어느 부분들이 쉬운 지역 그라디언트를 가지며, 가장 적은 코드의 양과 노력으로 이들을 함께 묶을 수 있는지를 이해하는데 도움이 된다. -### Backprop in practice: Staged computation -Lets see this with another example. Suppose that we have a function of the form: +### 실제 backprop: 단계적 계산 + +또 다른 예제를 통해 확인해보자. 다음과 같은 형태의 함수가 있다고 가정하자: $$ f(x,y) = \frac{x + \sigma(y)}{\sigma(x) + (x+y)^2} $$ -To be clear, this function is completely useless and it's not clear why you would ever want to compute its gradient, except for the fact that it is a good example of backpropagation in practice. It is very important to stress that if you were to launch into performing the differentiation with respect to either \\(x\\) or \\(y\\), you would end up with very large and complex expressions. However, it turns out that doing so is completely unnecessary because we don't need to have an explicit function written down that evaluates the gradient. We only have to know how to compute it. Here is how we would structure the forward pass of such expression: +명확히 말하면, 실제 backpropagation의 좋은 예제라는 사실 외에는 이 함수는 완전히 쓸모가 없으며 따라서 왜 여러분이 이 함수의 그라디언트를 그토록 계산해야 하는지 그 이유도 뚜렷하지 않다. 만약 여러분들이 $x$ 또는 $y$에 관해서 미분을 수행한다면 결국 매우 크고 복잡한 식을 얻게 될 것이다. 하지만, 그라디언트를 계산하는 명확한 함수(explicit function)를 쓸 필요가 없기 때문에 그렇게 미분하는 것은 완전히 불필요한 것으로 알려져있다. 우리는 단지 어떻게 이를 계산하는지만 알면 된다. 다음은 우리가 어떻게 그러한 표현식에 대해 전방 전달(forward pass)을 구조화 하는지를 나타낸 것이다: -```python +~~~python x = 3 # example values y = -4 @@ -200,15 +206,15 @@ xpysqr = xpy**2 #(5) den = sigx + xpysqr # denominator #(6) invden = 1.0 / den #(7) f = num * invden # done! #(8) -``` +~~~ -Phew, by the end of the expression we have computed the forward pass. Notice that we have structured the code in such way that it contains multiple intermediate variables, each of which are only simple expressions for which we already know the local gradients. Therefore, computing the backprop pass is easy: We'll go backwards and for every variable along the way in the forward pass (`sigy, num, sigx, xpy, xpysqr, den, invden`) we will have the same variable, but one that begins with a `d`, which will hold the gradient of the output of the circuit with respect to that variable. Additionally, note that every single piece in our backprop will involve computing the local gradient of that expression, and chaining it with the gradient on that expression with a multiplication. For each row, we also highlight which part of the forward pass it refers to: +표현식의 마지막에서 전방 전달(forward pass)을 계산했다. 각각이 단순한 표현식들인 다수의 중간 변수들을 포함하는 방식으로 코드를 구조화한 것에 주목하자, 우리는 이미 이 표현식들에 대한 지역 그라디언트 값을 알고 있다. 그러므로, backprop 전달을 계산하는 것은 쉬운 일이다: 전방 전달 과정의 모든 변수들(`sigy, num, sigx, xpy, xpysqr, den, invden`)에 대해 역방향으로 가면서 똑같은 변수들을 볼 것이다, 다만 해당 변수에 대한 회로 출력의 그라디언트를 담는 것을 나타내기 위해 변수명 앞에 `d`를 붙인다. 추가로, backprop에서 모든 단일 조각이 이 표현식에 대한 지역 그라디언트을 계산하고 곱셈 형태로 이 그라디언트 값을 연결하는 과정을 수반할 것이다. 각 행마다 전방 전달 과정에서 어느 부분에 해당하는지 표시한 것이다: -```python +~~~python # backprop f = num * invden dnum = invden # gradient on numerator #(8) dinvden = num #(8) -# backprop invden = 1.0 / den +# backprop invden = 1.0 / den dden = (-1.0 / (den**2)) * dinvden #(7) # backprop den = sigx + xpysqr dsigx = (1) * dden #(6) @@ -226,15 +232,16 @@ dsigy = (1) * dnum #(2) # backprop sigy = 1.0 / (1 + math.exp(-y)) dy += ((1 - sigy) * sigy) * dsigy #(1) # done! phew -``` +~~~ -Notice a few things: +몇 가지 주의할 점: -**Cache forward pass variables**. 
To compute the backward pass it is very helpful to have some of the variables that were used in the forward pass. In practice you want to structure your code so that you cache these variables, and so that they are available during backpropagation. If this is too difficult, it is possible (but wasteful) to recompute them. +**전방 전달 변수들을 저장(cache)하라**. 후방 전달을 계산하기 위해 전방 전달에서 사용한 일부 변수들을 가지고 있는 것은 정말 유용하다. 실제로 여러분은 이 변수들을 저장해서 backpropagation 동안 이용할 수 있도록 코드를 구성하고 싶을 것이다. 이것이 너무 어려운 일이라면, 이 변수들을 다시 계산할 수 있다(물론 비효율적이지만). -**Gradients add up at forks**. The forward expression involves the variables **x,y** multiple times, so when we perform backpropagation we must be careful to use `+=` instead of `=` to accumulate the gradient on these variables (otherwise we would overwrite it). This follows the *multivariable chain rule* in Calculus, which states that if a variable branches out to different parts of the circuit, then the gradients that flow back to it will add. +**갈래길에서 그라디언트는 더해진다**. 전방 표현식은 변수 **x,y** 를 여러번 수반하므로, backpropagation을 수행할 때 이 변수들에 대한 그라디언트 값을 축적하기 위해 `=` 대신 `+=`를 사용해야 하는 점에 주의해야 한다 (그렇게 하지 않으면 덮어쓰게 된다). 이는 Calculus에 나오는 *다변수 연쇄 법칙(multivariate chain rule)*을 따른다, Calculus에는 하나의 변수가 회로의 다른 부분들로 가지를 뻗어나가면, 반환하는 그라디언트는 더해질 것이라고 명시되어 있다. + ### Patterns in backward flow It is interesting to note that in many cases the backward-flowing gradient can be interpreted on an intuitive level. For example, the three most commonly used gates in neural networks (*add,mul,max*), all have very simple interpretations in terms of how they act during backpropagation. Consider this example circuit: @@ -253,18 +260,19 @@ The **add gate** always takes the gradient on its output and distributes it equa The **max gate** routes the gradient. Unlike the add gate which distributed the gradient unchanged to all its inputs, the max gate distributes the gradient (unchanged) to exactly one of its inputs (the input that had the highest value during the forward pass). This is because the local gradient for a max gate is 1.0 for the highest value, and 0.0 for all other values. In the example circuit above, the max operation routed the gradient of 2.00 to the **z** variable, which had a higher value than **w**, and the gradient on **w** remains zero. -The **multiply gate** is a little less easy to interpret. Its local gradients are the input values (except switched), and this is multiplied by the gradient on its output during the chain rule. In the example above, the gradient on **x** is -8.00, which is -4.00 x 2.00. +The **multiply gate** is a little less easy to interpret. Its local gradients are the input values (except switched), and this is multiplied by the gradient on its output during the chain rule. In the example above, the gradient on **x** is -8.00, which is -4.00 x 2.00. -*Unintuitive effects and their consequences*. Notice that if one of the inputs to the multiply gate is very small and the other is very big, then the multiply gate will do something slightly unintuitive: it will assign a relatively huge gradient to the small input and a tiny gradient to the large input. Note that in linear classifiers where the weights are dot producted \\(w^Tx\_i\\) (multiplied) with the inputs, this implies that the scale of the data has an effect on the magnitude of the gradient for the weights. 
For example, if you multiplied all input data examples \\(x\_i\\) by 1000 during preprocessing, then the gradient on the weights will be 1000 times larger, and you'd have to lower the learning rate by that factor to compensate. This is why preprocessing matters a lot, sometimes in subtle ways! And having intuitive understanding for how the gradients flow can help you debug some of these cases. +*Unintuitive effects and their consequences*. Notice that if one of the inputs to the multiply gate is very small and the other is very big, then the multiply gate will do something slightly unintuitive: it will assign a relatively huge gradient to the small input and a tiny gradient to the large input. Note that in linear classifiers where the weights are dot producted $$w^Tx_i$$ (multiplied) with the inputs, this implies that the scale of the data has an effect on the magnitude of the gradient for the weights. For example, if you multiplied all input data examples $$x_i$$ by 1000 during preprocessing, then the gradient on the weights will be 1000 times larger, and you'd have to lower the learning rate by that factor to compensate. This is why preprocessing matters a lot, sometimes in subtle ways! And having intuitive understanding for how the gradients flow can help you debug some of these cases. + ### Gradients for vectorized operations The above sections were concerned with single variables, but all concepts extend in a straight-forward manner to matrix and vector operations. However, one must pay closer attention to dimensions and transpose operations. **Matrix-Matrix multiply gradient**. Possibly the most tricky operation is the matrix-matrix multiplication (which generalizes all matrix-vector and vector-vector) multiply operations: -```python +~~~python # forward pass W = np.random.randn(5, 10) X = np.random.randn(10, 3) @@ -274,13 +282,14 @@ D = W.dot(X) dD = np.random.randn(*D.shape) # same shape as D dW = dD.dot(X.T) #.T gives the transpose of the matrix dX = W.T.dot(dD) -``` +~~~ *Tip: use dimension analysis!* Note that you do not need to remember the expressions for `dW` and `dX` because they are easy to re-derive based on dimensions. For instance, we know that the gradient on the weights `dW` must be of the same size as `W` after it is computed, and that it must depend on matrix multiplication of `X` and `dD` (as is the case when both `X,W` are single numbers and not matrices). There is always exactly one way of achieving this so that the dimensions work out. For example, `X` is of size [10 x 3] and `dD` of size [5 x 3], so if we want `dW` and `W` has shape [5 x 10], then the only way of achieving this is with `dD.dot(X.T)`, as shown above. **Work with small, explicit examples**. Some people may find it difficult at first to derive the gradient updates for some vectorized expressions. Our recommendation is to explicitly write out a minimal vectorized example, derive the gradient on paper and then generalize the pattern to its efficient, vectorized form. + ### Summary - We developed intuition for what the gradients mean, how they flow backwards in the circuit, and how they communicate which part of the circuit should increase or decrease and with what force to make the final output higher. diff --git a/python-numpy-tutorial.md b/python-numpy-tutorial.md index 65e630a7..8355cf01 100644 --- a/python-numpy-tutorial.md +++ b/python-numpy-tutorial.md @@ -21,58 +21,53 @@ Python: Numpy --> -This tutorial was contributed by [Justin Johnson](http://cs.stanford.edu/people/jcjohns/). 
+이 튜토리얼은 [Justin Johnson](http://cs.stanford.edu/people/jcjohns/) 에 의해 작성되었습니다. -We will use the Python programming language for all assignments in this course. -Python is a great general-purpose programming language on its own, but with the -help of a few popular libraries (numpy, scipy, matplotlib) it becomes a powerful -environment for scientific computing. +cs231n 수업의 모든 과제에서는 프로그래밍 언어로 파이썬을 사용할 것입니다. +파이썬은 그 자체만으로도 훌륭한 범용 프로그래밍 언어이지만, 몇몇 라이브러리(numpy, scipy, matplotlib)의 도움으로 +계산과학 분야에서 강력한 개발 환경을 갖추게 됩니다. -We expect that many of you will have some experience with Python and numpy; -for the rest of you, this section will serve as a quick crash course both on -the Python programming language and on the use of Python for scientific -computing. +많은 분들이 파이썬과 numpy를 경험 해보셨을 거라고 생각합니다. 경험하지 못했을지라도 이 문서를 통해 +'프로그래밍 언어로서의 파이썬'과 '파이썬을 계산과학에 활용하는 법'을 빠르게 훑을 수 있습니다. -Some of you may have previous knowledge in Matlab, in which case we also recommend the [numpy for Matlab users](http://wiki.scipy.org/NumPy_for_Matlab_Users) page. +만약 Matlab을 사용해보셨다면, [Matlab 사용자를 위한 numpy](http://wiki.scipy.org/NumPy_for_Matlab_Users) 페이지를 추천해 드립니다. -You can also find an [IPython notebook version of this tutorial here](https://github.com/kuleshov/cs228-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb) created by [Volodymyr Kuleshov](http://web.stanford.edu/~kuleshov/) and [Isaac Caswell](https://symsys.stanford.edu/viewing/symsysaffiliate/21335) for [CS 228](http://cs.stanford.edu/~ermon/cs228/index.html). +또한 [CS 228](http://cs.stanford.edu/~ermon/cs228/index.html) 수업을 위해 [Volodymyr Kuleshov](http://web.stanford.edu/~kuleshov/) 와 [Isaac Caswell](https://symsys.stanford.edu/viewing/symsysaffiliate/21335) 가 만든 [이 튜토리얼의 IPython notebook 버전](https://github.com/kuleshov/cs228-material/blob/master/tutorials/python/cs228-python-tutorial.ipynb) 도 참조할 수 있습니다. -Table of contents: +목차: -- [Python](#python) - - [Basic data types](#python-basic) - - [Containers](#python-containers) - - [Lists](#python-lists) - - [Dictionaries](#python-dicts) - - [Sets](#python-sets) - - [Tuples](#python-tuples) - - [Functions](#python-functions) - - [Classes](#python-classes) +- [파이썬](#python) + - [기본 자료형](#python-basic) + - [컨테이너](#python-containers) + - [리스트](#python-lists) + - [딕셔너리](#python-dicts) + - [집합](#python-sets) + - [튜플](#python-tuples) + - [함수](#python-functions) + - [클래스](#python-classes) - [Numpy](#numpy) - - [Arrays](#numpy-arrays) - - [Array indexing](#numpy-array-indexing) - - [Datatypes](#numpy-datatypes) - - [Array math](#numpy-math) - - [Broadcasting](#numpy-broadcasting) + - [배열](#numpy-arrays) + - [배열 인덱싱](#numpy-array-indexing) + - [데이터 타입](#numpy-datatypes) + - [배열 연산](#numpy-math) + - [브로드캐스팅](#numpy-broadcasting) - [SciPy](#scipy) - - [Image operations](#scipy-image) - - [MATLAB files](#scipy-matlab) - - [Distance between points](#scipy-dist) + - [이미지 작업](#scipy-image) + - [MATLAB 파일](#scipy-matlab) + - [두 점 사이의 거리](#scipy-dist) - [Matplotlib](#matplotlib) - [Plotting](#matplotlib-plotting) - [Subplots](#matplotlib-subplots) - - [Images](#matplotlib-images) + - [이미지](#matplotlib-images) -## Python -Python is a high-level, dynamically typed multiparadigm programming language. -Python code is often said to be almost like pseudocode, since it allows you -to express very powerful ideas in very few lines of code while being very -readable. 
As an example, here is an implementation of the classic quicksort -algorithm in Python: +## 파이썬 +파이썬은 고급 프로그래밍 언어(사람이 이해하기 쉽게 작성된 언어)이며, 다중패러다임을 지원하는 동적 프로그래밍 언어입니다. +짧지만 가독성 높은 코드 몇 줄로 수준 높은 아이디어들을 표현할 수 있기에 파이썬 코드는 거의 수도코드처럼 보인다고도 합니다. +아래는 quicksort 알고리즘을 파이썬으로 구현한 예시입니다: -```python +~~~python def quicksort(arr): if len(arr) <= 1: return arr @@ -81,292 +76,289 @@ def quicksort(arr): middle = [x for x in arr if x == pivot] right = [x for x in arr if x > pivot] return quicksort(left) + middle + quicksort(right) - + print quicksort([3,6,8,10,1,2,1]) -# Prints "[1, 1, 2, 3, 6, 8, 10]" -``` +# 출력 "[1, 1, 2, 3, 6, 8, 10]" +~~~ -### Python versions -There are currently two different supported versions of Python, 2.7 and 3.4. -Somewhat confusingly, Python 3.0 introduced many backwards-incompatible changes -to the language, so code written for 2.7 may not work under 3.4 and vice versa. -For this class all code will use Python 2.7. +### 파이썬 버전 +현재 파이썬에는 두 가지 버전이 있습니다. 파이썬 2.7 그리고 파이썬 3.4입니다. +혼란스럽게도, 파이썬3은 기존 파이썬2와 호환되지 않게 변경된 부분이 있습니다. +그러므로 파이썬 2.7로 쓰여진 코드는 3.4환경에서 동작하지 않고 그 반대도 마찬가지입니다. +이 수업에선 파이썬 2.7을 사용합니다. -You can check your Python version at the command line by running +커맨드라인에 아래의 명령어를 입력해서 현재 설치된 파이썬 버전을 확인할 수 있습니다. `python --version`. -### Basic data types -Like most languages, Python has a number of basic types including integers, -floats, booleans, and strings. These data types behave in ways that are -familiar from other programming languages. +### 기본 자료형 + +다른 프로그래밍 언어들처럼, 파이썬에는 정수, 실수, 불리언, 문자열 같은 기본 자료형이 있습니다. +파이썬 기본 자료형 역시 다른 프로그래밍 언어와 유사합니다. -**Numbers:** Integers and floats work as you would expect from other languages: +**숫자:** 다른 언어와 마찬가지로 파이썬의 정수형(Integers)과 실수형(floats) 데이터 타입 역시 동일한 역할을 합니다 : -```python +~~~python x = 3 -print type(x) # Prints "" -print x # Prints "3" -print x + 1 # Addition; prints "4" -print x - 1 # Subtraction; prints "2" -print x * 2 # Multiplication; prints "6" -print x ** 2 # Exponentiation; prints "9" +print type(x) # 출력 "" +print x # 출력 "3" +print x + 1 # 덧셈; 출력 "4" +print x - 1 # 뺄셈; 출력 "2" +print x * 2 # 곱셈; 출력 "6" +print x ** 2 # 제곱; 출력 "9" x += 1 -print x # Prints "4" +print x # 출력 "4" x *= 2 -print x # Prints "8" +print x # 출력 "8" y = 2.5 -print type(y) # Prints "" -print y, y + 1, y * 2, y ** 2 # Prints "2.5 3.5 5.0 6.25" -``` -Note that unlike many languages, Python does not have unary increment (`x++`) -or decrement (`x--`) operators. +print type(y) # 출력 "" +print y, y + 1, y * 2, y ** 2 # 출력 "2.5 3.5 5.0 6.25" +~~~ +다른 언어들과는 달리, 파이썬에는 증감 단항연산자(`x++`, `x--`)가 없습니다. -Python also has built-in types for long integers and complex numbers; -you can find all of the details -[in the documentation](https://docs.python.org/2/library/stdtypes.html#numeric-types-int-float-long-complex). +파이썬 역시 long 정수형과 복소수 데이터 타입이 구현되어 있습니다. +자세한 사항은 [문서](https://docs.python.org/2/library/stdtypes.html#numeric-types-int-float-long-complex)에서 찾아볼 수 있습니다. -**Booleans:** Python implements all of the usual operators for Boolean logic, -but uses English words rather than symbols (`&&`, `||`, etc.): +**불리언(Booleans):** 파이썬에는 논리 자료형의 모든 연산자가 구현되어 있습니다. +그렇지만 기호(`&&`, `||`, 등.) 
대신 영어 단어로 구현되어 있습니다 : -```python +~~~python t = True f = False -print type(t) # Prints "" -print t and f # Logical AND; prints "False" -print t or f # Logical OR; prints "True" -print not t # Logical NOT; prints "False" -print t != f # Logical XOR; prints "True" -``` - -**Strings:** Python has great support for strings: - -```python -hello = 'hello' # String literals can use single quotes -world = "world" # or double quotes; it does not matter. -print hello # Prints "hello" -print len(hello) # String length; prints "5" -hw = hello + ' ' + world # String concatenation -print hw # prints "hello world" -hw12 = '%s %s %d' % (hello, world, 12) # sprintf style string formatting -print hw12 # prints "hello world 12" -``` - -String objects have a bunch of useful methods; for example: - -```python +print type(t) # 출력 "" +print t and f # 논리 AND; 출력 "False" +print t or f # 논리 OR; 출력 "True" +print not t # 논리 NOT; 출력 "False" +print t != f # 논리 XOR; 출력 "True" +~~~ + +**문자열:** 파이썬은 문자열과 연관된 다양한 기능을 지원합니다: + +~~~python +hello = 'hello' # String 문자열을 표현할 땐 따옴표나 +world = "world" # 쌍따옴표가 사용됩니다; 어떤 걸 써도 상관없습니다. +print hello # 출력 "hello" +print len(hello) # 문자열 길이; 출력 "5" +hw = hello + ' ' + world # 문자열 연결 +print hw # 출력 "hello world" +hw12 = '%s %s %d' % (hello, world, 12) # sprintf 방식의 문자열 서식 지정 +print hw12 # 출력 "hello world 12" +~~~ + +문자열 객체에는 유용한 메소드들이 많습니다; 예를 들어: + +~~~python s = "hello" -print s.capitalize() # Capitalize a string; prints "Hello" -print s.upper() # Convert a string to uppercase; prints "HELLO" -print s.rjust(7) # Right-justify a string, padding with spaces; prints " hello" -print s.center(7) # Center a string, padding with spaces; prints " hello " -print s.replace('l', '(ell)') # Replace all instances of one substring with another; - # prints "he(ell)(ell)o" -print ' world '.strip() # Strip leading and trailing whitespace; prints "world" -``` -You can find a list of all string methods [in the documentation](https://docs.python.org/2/library/stdtypes.html#string-methods). +print s.capitalize() # 문자열을 대문자로 시작하게 함; 출력 "Hello" +print s.upper() # 모든 문자를 대문자로 바꿈; 출력 "HELLO" +print s.rjust(7) # 문자열 오른쪽 정렬, 빈공간은 여백으로 채움; 출력 " hello" +print s.center(7) # 문자열 가운데 정렬, 빈공간은 여백으로 채움; 출력 " hello " +print s.replace('l', '(ell)') # 첫 번째 인자로 온 문자열을 두 번째 인자 문자열로 바꿈; + # 출력 "he(ell)(ell)o" +print ' world '.strip() # 문자열 앞뒤 공백 제거; 출력 "world" +~~~ +모든 문자열 메소드는 [문서](https://docs.python.org/2/library/stdtypes.html#string-methods)에서 찾아볼 수 있습니다. -### Containers -Python includes several built-in container types: lists, dictionaries, sets, and tuples. + +### 컨테이너 +파이썬은 다음과 같은 컨테이너 타입이 구현되어 있습니다: 리스트, 딕셔너리, 집합, 튜플 -#### Lists -A list is the Python equivalent of an array, but is resizeable -and can contain elements of different types: - -```python -xs = [3, 1, 2] # Create a list -print xs, xs[2] # Prints "[3, 1, 2] 2" -print xs[-1] # Negative indices count from the end of the list; prints "2" -xs[2] = 'foo' # Lists can contain elements of different types -print xs # Prints "[3, 1, 'foo']" -xs.append('bar') # Add a new element to the end of the list -print xs # Prints "[3, 1, 'foo', 'bar']" -x = xs.pop() # Remove and return the last element of the list -print x, xs # Prints "bar [3, 1, 'foo']" -``` -As usual, you can find all the gory details about lists -[in the documentation](https://docs.python.org/2/tutorial/datastructures.html#more-on-lists). 
- -**Slicing:** -In addition to accessing list elements one at a time, Python provides -concise syntax to access sublists; this is known as *slicing*: - -```python -nums = range(5) # range is a built-in function that creates a list of integers -print nums # Prints "[0, 1, 2, 3, 4]" -print nums[2:4] # Get a slice from index 2 to 4 (exclusive); prints "[2, 3]" -print nums[2:] # Get a slice from index 2 to the end; prints "[2, 3, 4]" -print nums[:2] # Get a slice from the start to index 2 (exclusive); prints "[0, 1]" -print nums[:] # Get a slice of the whole list; prints ["0, 1, 2, 3, 4]" -print nums[:-1] # Slice indices can be negative; prints ["0, 1, 2, 3]" -nums[2:4] = [8, 9] # Assign a new sublist to a slice -print nums # Prints "[0, 1, 8, 9, 4]" -``` -We will see slicing again in the context of numpy arrays. - -**Loops:** You can loop over the elements of a list like this: - -```python + +#### 리스트 +리스트는 파이썬에서 배열 같은 존재입니다. 그렇지만 배열과 달리 크기 변경이 가능하고 +서로 다른 자료형일지라도 하나의 리스트에 저장될 수 있습니다: + +~~~python +xs = [3, 1, 2] # 리스트 생성 +print xs, xs[2] # 출력 "[3, 1, 2] 2" +print xs[-1] # 인덱스가 음수일 경우 리스트의 끝에서부터 세어짐; 출력 "2" +xs[2] = 'foo' # 리스트는 자료형이 다른 요소들을 저장할 수 있습니다 +print xs # 출력 "[3, 1, 'foo']" +xs.append('bar') # 리스트의 끝에 새 요소 추가 +print xs # 출력 "[3, 1, 'foo', 'bar']" +x = xs.pop() # 리스트의 마지막 요소 삭제하고 반환 +print x, xs # 출력 "bar [3, 1, 'foo']" +~~~ +마찬가지로, 리스트에 대해 자세한 사항은 [문서](https://docs.python.org/2/tutorial/datastructures.html#more-on-lists)에서 찾아볼 수 있습니다. + +**슬라이싱:** +리스트의 요소로 한 번에 접근하는 것 이외에도, 파이썬은 리스트의 일부분에만 접근하는 간결한 문법을 제공합니다; +이를 *슬라이싱*이라고 합니다: + +~~~python +nums = range(5) # range는 파이썬에 구현되어 있는 함수이며 정수들로 구성된 리스트를 만듭니다 +print nums # 출력 "[0, 1, 2, 3, 4]" +print nums[2:4] # 인덱스 2에서 4(제외)까지 슬라이싱; 출력 "[2, 3]" +print nums[2:] # 인덱스 2에서 끝까지 슬라이싱; 출력 "[2, 3, 4]" +print nums[:2] # 처음부터 인덱스 2(제외)까지 슬라이싱; 출력 "[0, 1]" +print nums[:] # 전체 리스트 슬라이싱; 출력 ["0, 1, 2, 3, 4]" +print nums[:-1] # 슬라이싱 인덱스는 음수도 가능; 출력 ["0, 1, 2, 3]" +nums[2:4] = [8, 9] # 슬라이스된 리스트에 새로운 리스트 할당 +print nums # 출력 "[0, 1, 8, 9, 4]" +~~~ +numpy 배열 부분에서 다시 슬라이싱을 보게 될 것입니다. + +**반복문:** 아래와 같이 리스트의 요소들을 반복해서 조회할 수 있습니다: + +~~~python animals = ['cat', 'dog', 'monkey'] for animal in animals: print animal -# Prints "cat", "dog", "monkey", each on its own line. -``` +# 출력 "cat", "dog", "monkey", 한 줄에 하나씩 출력. +~~~ -If you want access to the index of each element within the body of a loop, -use the built-in `enumerate` function: +만약 반복문 내에서 리스트 각 요소의 인덱스에 접근하고 싶다면, 'enumerate' 함수를 사용하세요: -```python +~~~python animals = ['cat', 'dog', 'monkey'] for idx, animal in enumerate(animals): print '#%d: %s' % (idx + 1, animal) -# Prints "#1: cat", "#2: dog", "#3: monkey", each on its own line -``` +# 출력 "#1: cat", "#2: dog", "#3: monkey", 한 줄에 하나씩 출력. +~~~ -**List comprehensions:** -When programming, frequently we want to transform one type of data into another. -As a simple example, consider the following code that computes square numbers: +**리스트 comprehensions:** +프로그래밍을 하다 보면, 자료형을 변환해야 하는 경우가 자주 있습니다. 
+간단한 예를 들자면, 숫자의 제곱을 계산하는 다음의 코드를 보세요: -```python + +~~~python nums = [0, 1, 2, 3, 4] squares = [] for x in nums: squares.append(x ** 2) -print squares # Prints [0, 1, 4, 9, 16] -``` +print squares # 출력 [0, 1, 4, 9, 16] +~~~ -You can make this code simpler using a **list comprehension**: +**리스트 comprehension**을 이용해 이 코드를 더 간단하게 만들 수 있습니다: -```python +~~~python nums = [0, 1, 2, 3, 4] squares = [x ** 2 for x in nums] -print squares # Prints [0, 1, 4, 9, 16] -``` +print squares # 출력 [0, 1, 4, 9, 16] +~~~ -List comprehensions can also contain conditions: +리스트 comprehensions에 조건을 추가할 수도 있습니다: -```python +~~~python nums = [0, 1, 2, 3, 4] even_squares = [x ** 2 for x in nums if x % 2 == 0] -print even_squares # Prints "[0, 4, 16]" -``` +print even_squares # 출력 "[0, 4, 16]" +~~~ -#### Dictionaries -A dictionary stores (key, value) pairs, similar to a `Map` in Java or -an object in Javascript. You can use it like this: - -```python -d = {'cat': 'cute', 'dog': 'furry'} # Create a new dictionary with some data -print d['cat'] # Get an entry from a dictionary; prints "cute" -print 'cat' in d # Check if a dictionary has a given key; prints "True" -d['fish'] = 'wet' # Set an entry in a dictionary -print d['fish'] # Prints "wet" + +#### 딕셔너리 +자바의 '맵', 자바스크립트의 '오브젝트'와 유사하게, 파이썬의 '딕셔너리'는 (열쇠, 값) 쌍을 저장합니다. +아래와 같은 방식으로 딕셔너리를 사용할 수 있습니다: + +~~~python +d = {'cat': 'cute', 'dog': 'furry'} # 새로운 딕셔너리를 만듭니다 +print d['cat'] # 딕셔너리의 값을 받음; 출력 "cute" +print 'cat' in d # 딕셔너리가 주어진 열쇠를 가졌는지 확인; 출력 "True" +d['fish'] = 'wet' # 딕셔너리의 값을 지정 +print d['fish'] # 출력 "wet" # print d['monkey'] # KeyError: 'monkey' not a key of d -print d.get('monkey', 'N/A') # Get an element with a default; prints "N/A" -print d.get('fish', 'N/A') # Get an element with a default; prints "wet" -del d['fish'] # Remove an element from a dictionary -print d.get('fish', 'N/A') # "fish" is no longer a key; prints "N/A" -``` -You can find all you need to know about dictionaries -[in the documentation](https://docs.python.org/2/library/stdtypes.html#dict). +print d.get('monkey', 'N/A') # 딕셔너리의 값을 받음. 존재하지 않는 다면 'N/A'; 출력 "N/A" +print d.get('fish', 'N/A') # 딕셔너리의 값을 받음. 존재하지 않는 다면 'N/A'; 출력 "wet" +del d['fish'] # 딕셔너리에 저장된 요소 삭제 +print d.get('fish', 'N/A') # "fish"는 더 이상 열쇠가 아님; 출력 "N/A" +~~~ +딕셔너리에 관해 더 알고 싶다면 [문서](https://docs.python.org/2/library/stdtypes.html#dict)를 참조하세요. -**Loops:** It is easy to iterate over the keys in a dictionary: +**반복문:** 딕셔너리의 열쇠는 쉽게 반복될 수 있습니다: -```python +~~~python d = {'person': 2, 'cat': 4, 'spider': 8} for animal in d: legs = d[animal] print 'A %s has %d legs' % (animal, legs) -# Prints "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs" -``` +# 출력 "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs", 한 줄에 하나씩 출력. +~~~ -If you want access to keys and their corresponding values, use the `iteritems` method: +만약 열쇠와, 그에 상응하는 값에 접근하고 싶다면, 'iteritems' 메소드를 사용하세요: -```python +~~~python d = {'person': 2, 'cat': 4, 'spider': 8} for animal, legs in d.iteritems(): print 'A %s has %d legs' % (animal, legs) -# Prints "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs" -``` +# 출력 "A person has 2 legs", "A spider has 8 legs", "A cat has 4 legs", 한 줄에 하나씩 출력. +~~~ -**Dictionary comprehensions:** -These are similar to list comprehensions, but allow you to easily construct -dictionaries. For example: +**딕셔너리 comprehensions:** +리스트 comprehensions과 유사한 딕셔너리 comprehensions을 통해 손쉽게 딕셔너리를 만들 수 있습니다. 
+예시: -```python +~~~python nums = [0, 1, 2, 3, 4] even_num_to_square = {x: x ** 2 for x in nums if x % 2 == 0} -print even_num_to_square # Prints "{0: 0, 2: 4, 4: 16}" -``` +print even_num_to_square # 출력 "{0: 0, 2: 4, 4: 16}" +~~~ -#### Sets -A set is an unordered collection of distinct elements. As a simple example, consider -the following: -```python +#### 집합 +집합은 순서 구분이 없고 서로 다른 요소 간의 모임입니다. 예시: + +~~~python animals = {'cat', 'dog'} -print 'cat' in animals # Check if an element is in a set; prints "True" -print 'fish' in animals # prints "False" -animals.add('fish') # Add an element to a set -print 'fish' in animals # Prints "True" -print len(animals) # Number of elements in a set; prints "3" -animals.add('cat') # Adding an element that is already in the set does nothing -print len(animals) # Prints "3" +print 'cat' in animals # 요소가 집합에 포함되어 있는지 확인; 출력 "True" +print 'fish' in animals # 출력 "False" +animals.add('fish') # 요소를 집합에 추가 +print 'fish' in animals # 출력 "True" +print len(animals) # 집합에 포함된 요소의 수; 출력 "3" +animals.add('cat') # 이미 포함되어있는 요소를 추가할 경우 아무 변화 없음 +print len(animals) # 출력 "3" animals.remove('cat') # Remove an element from a set -print len(animals) # Prints "2" -``` - -As usual, everything you want to know about sets can be found -[in the documentation](https://docs.python.org/2/library/sets.html#set-objects). +print len(animals) # 출력 "2" +~~~ +마찬가지로, 집합에 관해 더 알고 싶다면 [문서](https://docs.python.org/2/library/sets.html#set-objects)를 참조하세요. -**Loops:** -Iterating over a set has the same syntax as iterating over a list; -however since sets are unordered, you cannot make assumptions about the order -in which you visit the elements of the set: +**반복문:** +집합을 반복하는 구문은 리스트 반복 구문과 동일합니다; +그러나 집합은 순서가 없어서, 어떤 순서로 반복될지 추측할 순 없습니다: -```python +~~~python animals = {'cat', 'dog', 'fish'} for idx, animal in enumerate(animals): print '#%d: %s' % (idx + 1, animal) -# Prints "#1: fish", "#2: dog", "#3: cat" -``` +# 출력 "#1: fish", "#2: dog", "#3: cat", 한 줄에 하나씩 출력. +~~~ -**Set comprehensions:** -Like lists and dictionaries, we can easily construct sets using set comprehensions: +**집합 comprehensions:** +리스트, 딕셔너리와 마찬가지로 집합 comprehensions을 통해 손쉽게 집합을 만들수 있습니다. -```python +~~~python from math import sqrt nums = {int(sqrt(x)) for x in range(30)} -print nums # Prints "set([0, 1, 2, 3, 4, 5])" -``` +print nums # 출력 "set([0, 1, 2, 3, 4, 5])" +~~~ -#### Tuples -A tuple is an (immutable) ordered list of values. -A tuple is in many ways similar to a list; one of the most important differences is that -tuples can be used as keys in dictionaries and as elements of sets, while lists cannot. -Here is a trivial example: - -```python -d = {(x, x + 1): x for x in range(10)} # Create a dictionary with tuple keys -t = (5, 6) # Create a tuple -print type(t) # Prints "" -print d[t] # Prints "5" -print d[(1, 2)] # Prints "1" -``` -[The documentation](https://docs.python.org/2/tutorial/datastructures.html#tuples-and-sequences) has more information about tuples. + +#### 튜플 +튜플은 요소 간 순서가 있으며 값이 변하지 않는 리스트입니다. +튜플은 많은 면에서 리스트와 유사합니다; 가장 중요한 차이점은 튜플은 '딕셔너리의 열쇠'와 '집합의 요소'가 될 수 있지만, 리스트는 불가능하다는 점입니다. +여기 간단한 예시가 있습니다: + +~~~python +d = {(x, x + 1): x for x in range(10)} # 튜플을 열쇠로 하는 딕셔너리 생성 +t = (5, 6) # 튜플 생성 +print type(t) # 출력 "" +print d[t] # 출력 "5" +print d[(1, 2)] # 출력 "1" +~~~ +[문서](https://docs.python.org/2/tutorial/datastructures.html#tuples-and-sequences)에 튜플에 관한 더 많은 정보가 있습니다. -### Functions -Python functions are defined using the `def` keyword. For example: -```python +### 함수 +파이썬 함수는 'def' 키워드를 통해 정의됩니다. 
예시: + +~~~python def sign(x): if x > 0: return 'positive' @@ -377,119 +369,115 @@ def sign(x): for x in [-1, 0, 1]: print sign(x) -# Prints "negative", "zero", "positive" -``` +# 출력 "negative", "zero", "positive", 한 줄에 하나씩 출력. +~~~ -We will often define functions to take optional keyword arguments, like this: +가끔은 아래처럼 선택적으로 인자를 받는 함수를 정의할 때도 있습니다: -```python +~~~python def hello(name, loud=False): if loud: print 'HELLO, %s!' % name.upper() else: print 'Hello, %s' % name -hello('Bob') # Prints "Hello, Bob" -hello('Fred', loud=True) # Prints "HELLO, FRED!" -``` -There is a lot more information about Python functions -[in the documentation](https://docs.python.org/2/tutorial/controlflow.html#defining-functions). +hello('Bob') # 출력 "Hello, Bob" +hello('Fred', loud=True) # 출력 "HELLO, FRED!" +~~~ +파이썬 함수에 관한 더 많은 정보는 [문서](https://docs.python.org/2/tutorial/controlflow.html#defining-functions)를 참조하세요. -### Classes -The syntax for defining classes in Python is straightforward: +### 클래스 + +파이썬에서 클래스를 정의하는 구문은 복잡하지 않습니다: -```python +~~~python class Greeter(object): - - # Constructor + + # 생성자 def __init__(self, name): - self.name = name # Create an instance variable - - # Instance method + self.name = name # 인스턴스 변수 선언 + + # 인스턴스 메소드 def greet(self, loud=False): if loud: print 'HELLO, %s!' % self.name.upper() else: print 'Hello, %s' % self.name - -g = Greeter('Fred') # Construct an instance of the Greeter class -g.greet() # Call an instance method; prints "Hello, Fred" -g.greet(loud=True) # Call an instance method; prints "HELLO, FRED!" -``` -You can read a lot more about Python classes -[in the documentation](https://docs.python.org/2/tutorial/classes.html). + +g = Greeter('Fred') # Greeter 클래스의 인스턴스 생성 +g.greet() # 인스턴스 메소드 호출; 출력 "Hello, Fred" +g.greet(loud=True) # 인스턴스 메소드 호출; 출력 "HELLO, FRED!" +~~~ +파이썬 클래스에 관한 더 많은 정보는 [문서](https://docs.python.org/2/tutorial/classes.html)를 참조하세요. + ## Numpy -[Numpy](http://www.numpy.org/) is the core library for scientific computing in Python. -It provides a high-performance multidimensional array object, and tools for working with these -arrays. If you are already familiar with MATLAB, you might find -[this tutorial useful](http://wiki.scipy.org/NumPy_for_Matlab_Users) to get started with Numpy. +[Numpy](http://www.numpy.org/)는 파이썬이 계산과학분야에 이용될때 핵심 역할을 하는 라이브러리입니다. +Numpy는 고성능의 다차원 배열 객체와 이를 다룰 도구를 제공합니다. 만약 MATLAB에 익숙한 분이라면 Numpy 학습을 시작하는데 있어 +[이 튜토리얼](http://wiki.scipy.org/NumPy_for_Matlab_Users)이 유용할 것입니다. -### Arrays -A numpy array is a grid of values, all of the same type, and is indexed by a tuple of -nonnegative integers. The number of dimensions is the *rank* of the array; the *shape* -of an array is a tuple of integers giving the size of the array along each dimension. -We can initialize numpy arrays from nested Python lists, -and access elements using square brackets: +### 배열 +Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. 각각의 값들은 튜플(이때 튜플은 양의 정수만을 요소값으로 갖습니다.) 형태로 색인 됩니다. +*rank*는 배열이 몇 차원인지를 의미합니다; *shape*는 는 각 차원의 크기를 알려주는 정수들이 모인 튜플입니다. 
+ +파이썬의 리스트를 중첩해 Numpy 배열을 초기화 할 수 있고, 대괄호를 통해 각 요소에 접근할 수 있습니다: -```python +~~~python import numpy as np -a = np.array([1, 2, 3]) # Create a rank 1 array -print type(a) # Prints "" -print a.shape # Prints "(3,)" -print a[0], a[1], a[2] # Prints "1 2 3" -a[0] = 5 # Change an element of the array -print a # Prints "[5, 2, 3]" +a = np.array([1, 2, 3]) # rank가 1인 배열 생성 +print type(a) # 출력 "" +print a.shape # 출력 "(3,)" +print a[0], a[1], a[2] # 출력 "1 2 3" +a[0] = 5 # 요소를 변경 +print a # 출력 "[5, 2, 3]" -b = np.array([[1,2,3],[4,5,6]]) # Create a rank 2 array -print b.shape # Prints "(2, 3)" -print b[0, 0], b[0, 1], b[1, 0] # Prints "1 2 4" -``` +b = np.array([[1,2,3],[4,5,6]]) # rank가 2인 배열 생성 +print b.shape # 출력 "(2, 3)" +print b[0, 0], b[0, 1], b[1, 0] # 출력 "1 2 4" +~~~ -Numpy also provides many functions to create arrays: +리스트의 중첩이 아니더라도 Numpy는 배열을 만들기 위한 다양한 함수를 제공합니다. -```python +~~~python import numpy as np -a = np.zeros((2,2)) # Create an array of all zeros -print a # Prints "[[ 0. 0.] - # [ 0. 0.]]" - -b = np.ones((1,2)) # Create an array of all ones -print b # Prints "[[ 1. 1.]]" - -c = np.full((2,2), 7) # Create a constant array -print c # Prints "[[ 7. 7.] - # [ 7. 7.]]" - -d = np.eye(2) # Create a 2x2 identity matrix -print d # Prints "[[ 1. 0.] - # [ 0. 1.]]" - -e = np.random.random((2,2)) # Create an array filled with random values -print e # Might print "[[ 0.91940167 0.08143941] - # [ 0.68744134 0.87236687]]" -``` -You can read about other methods of array creation -[in the documentation](http://docs.scipy.org/doc/numpy/user/basics.creation.html#arrays-creation). +a = np.zeros((2,2)) # 모든 값이 0인 배열 생성 +print a # 출력 "[[ 0. 0.] + # [ 0. 0.]]" + +b = np.ones((1,2)) # 모든 값이 1인 배열 생성 +print b # 출력 "[[ 1. 1.]]" + +c = np.full((2,2), 7) # 모든 값이 특정 상수인 배열 생성 +print c # 출력 "[[ 7. 7.] + # [ 7. 7.]]" + +d = np.eye(2) # 2x2 단위행렬 생성 +print d # 출력 "[[ 1. 0.] + # [ 0. 1.]]" + +e = np.random.random((2,2)) # 임의의 값으로 채워진 배열 생성 +print e # 임의의 값 출력 "[[ 0.91940167 0.08143941] + # [ 0.68744134 0.87236687]]" +~~~ +배열 생성에 관한 다른 방법들은 [문서](http://docs.scipy.org/doc/numpy/user/basics.creation.html#arrays-creation)를 참조하세요. -### Array indexing -Numpy offers several ways to index into arrays. -**Slicing:** -Similar to Python lists, numpy arrays can be sliced. -Since arrays may be multidimensional, you must specify a slice for each dimension -of the array: +### 배열 인덱싱 +Numpy는 배열을 인덱싱하는 몇 가지 방법을 제공합니다. -```python +**슬라이싱:** +파이썬 리스트와 유사하게, Numpy 배열도 슬라이싱이 가능합니다. Numpy 배열은 다차원인 경우가 많기에, 각 차원별로 어떻게 슬라이스할건지 명확히 해야 합니다: + +~~~python import numpy as np # Create the following rank 2 array with shape (3, 4) @@ -498,209 +486,201 @@ import numpy as np # [ 9 10 11 12]] a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) -# Use slicing to pull out the subarray consisting of the first 2 rows -# and columns 1 and 2; b is the following array of shape (2, 2): +# 슬라이싱을 이용하여 첫 두 행과 1열, 2열로 이루어진 부분배열을 만들어 봅시다; +# b는 shape가 (2,2)인 배열이 됩니다: # [[2 3] # [6 7]] b = a[:2, 1:3] -# A slice of an array is a view into the same data, so modifying it -# will modify the original array. -print a[0, 1] # Prints "2" -b[0, 0] = 77 # b[0, 0] is the same piece of data as a[0, 1] -print a[0, 1] # Prints "77" -``` +# 슬라이싱된 배열은 원본 배열과 같은 데이터를 참조합니다, 즉 슬라이싱된 배열을 수정하면 +# 원본 배열 역시 수정됩니다. +print a[0, 1] # 출력 "2" +b[0, 0] = 77 # b[0, 0]은 a[0, 1]과 같은 데이터입니다 +print a[0, 1] # 출력 "77" +~~~ + +정수를 이용한 인덱싱과 슬라이싱을 혼합하여 사용할 수 있습니다. +하지만 이렇게 할 경우, 기존의 배열보다 낮은 rank의 배열이 얻어집니다. +이는 MATLAB이 배열을 다루는 방식과 차이가 있습니다. -You can also mix integer indexing with slice indexing. 
-However, doing so will yield an array of lower rank than the original array. -Note that this is quite different from the way that MATLAB handles array -slicing: +슬라이싱: -```python +~~~python import numpy as np -# Create the following rank 2 array with shape (3, 4) +# 아래와 같은 요소를 가지는 rank가 2이고 shape가 (3, 4)인 배열 생성 # [[ 1 2 3 4] # [ 5 6 7 8] # [ 9 10 11 12]] a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) -# Two ways of accessing the data in the middle row of the array. -# Mixing integer indexing with slices yields an array of lower rank, -# while using only slices yields an array of the same rank as the -# original array: -row_r1 = a[1, :] # Rank 1 view of the second row of a -row_r2 = a[1:2, :] # Rank 2 view of the second row of a -print row_r1, row_r1.shape # Prints "[5 6 7 8] (4,)" -print row_r2, row_r2.shape # Prints "[[5 6 7 8]] (1, 4)" +# 배열의 중간 행에 접근하는 두 가지 방법이 있습니다. +# 정수 인덱싱과 슬라이싱을 혼합해서 사용하면 낮은 rank의 배열이 생성되지만, +# 슬라이싱만 사용하면 원본 배열과 동일한 rank의 배열이 생성됩니다. +row_r1 = a[1, :] # 배열a의 두 번째 행을 rank가 1인 배열로 +row_r2 = a[1:2, :] # 배열a의 두 번째 행을 rank가 2인 배열로 +print row_r1, row_r1.shape # 출력 "[5 6 7 8] (4,)" +print row_r2, row_r2.shape # 출력 "[[5 6 7 8]] (1, 4)" -# We can make the same distinction when accessing columns of an array: +# 행이 아닌 열의 경우에도 마찬가지입니다: col_r1 = a[:, 1] col_r2 = a[:, 1:2] -print col_r1, col_r1.shape # Prints "[ 2 6 10] (3,)" -print col_r2, col_r2.shape # Prints "[[ 2] - # [ 6] - # [10]] (3, 1)" -``` - -**Integer array indexing:** -When you index into numpy arrays using slicing, the resulting array view -will always be a subarray of the original array. In contrast, integer array -indexing allows you to construct arbitrary arrays using the data from another -array. Here is an example: - -```python +print col_r1, col_r1.shape # 출력 "[ 2 6 10] (3,)" +print col_r2, col_r2.shape # 출력 "[[ 2] + # [ 6] + # [10]] (3, 1)" +~~~ + +**정수 배열 인덱싱:** +Numpy 배열을 슬라이싱하면, 결과로 얻어지는 배열은 언제나 원본 배열의 부분 배열입니다. +그러나 정수 배열 인덱싱을 한다면, 원본과 다른 배열을 만들 수 있습니다. +여기에 예시가 있습니다: + +~~~python import numpy as np a = np.array([[1,2], [3, 4], [5, 6]]) -# An example of integer array indexing. -# The returned array will have shape (3,) and -print a[[0, 1, 2], [0, 1, 0]] # Prints "[1 4 5]" +# 정수 배열 인덱싱의 예. 
+# 반환되는 배열의 shape는 (3,) +print a[[0, 1, 2], [0, 1, 0]] # 출력 "[1 4 5]" -# The above example of integer array indexing is equivalent to this: -print np.array([a[0, 0], a[1, 1], a[2, 0]]) # Prints "[1 4 5]" +# 위에서 본 정수 배열 인덱싱 예제는 다음과 동일합니다: +print np.array([a[0, 0], a[1, 1], a[2, 0]]) # 출력 "[1 4 5]" -# When using integer array indexing, you can reuse the same -# element from the source array: -print a[[0, 0], [1, 1]] # Prints "[2 2]" +# 정수 배열 인덱싱을 사용할 때, +# 원본 배열의 같은 요소를 재사용할 수 있습니다: +print a[[0, 0], [1, 1]] # 출력 "[2 2]" -# Equivalent to the previous integer array indexing example -print np.array([a[0, 1], a[0, 1]]) # Prints "[2 2]" -``` +# 위 예제는 다음과 동일합니다 +print np.array([a[0, 1], a[0, 1]]) # 출력 "[2 2]" +~~~ -One useful trick with integer array indexing is selecting or mutating one -element from each row of a matrix: +정수 배열 인덱싱을 유용하게 사용하는 방법 중 하나는 행렬의 각 행에서 하나의 요소를 선택하거나 바꾸는 것입니다: -```python +~~~python import numpy as np -# Create a new array from which we will select elements +# 요소를 선택할 새로운 배열 생성 a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]]) -print a # prints "array([[ 1, 2, 3], - # [ 4, 5, 6], - # [ 7, 8, 9], - # [10, 11, 12]])" +print a # 출력 "array([[ 1, 2, 3], + # [ 4, 5, 6], + # [ 7, 8, 9], + # [10, 11, 12]])" -# Create an array of indices +# 인덱스를 저장할 배열 생성 b = np.array([0, 2, 0, 1]) -# Select one element from each row of a using the indices in b -print a[np.arange(4), b] # Prints "[ 1 6 7 11]" -# Mutate one element from each row of a using the indices in b +# b에 저장된 인덱스를 이용해 각 행에서 하나의 요소를 선택합니다 +print a[np.arange(4), b] # 출력 "[ 1 6 7 11]" + +# b에 저장된 인덱스를 이용해 각 행에서 하나의 요소를 변경합니다 a[np.arange(4), b] += 10 -print a # prints "array([[11, 2, 3], - # [ 4, 5, 16], - # [17, 8, 9], - # [10, 21, 12]]) -``` +print a # 출력 "array([[11, 2, 3], + # [ 4, 5, 16], + # [17, 8, 9], + # [10, 21, 12]]) +~~~ -**Boolean array indexing:** -Boolean array indexing lets you pick out arbitrary elements of an array. -Frequently this type of indexing is used to select the elements of an array -that satisfy some condition. Here is an example: +**불리언 배열 인덱싱:** +불리언 배열 인덱싱을 통해 배열 속 요소를 취사선택할 수 있습니다. +불리언 배열 인덱싱은 특정 조건을 만족하게 하는 요소만 선택하고자 할 때 자주 사용됩니다. +다음은 그 예시입니다: -```python +~~~python import numpy as np a = np.array([[1,2], [3, 4], [5, 6]]) -bool_idx = (a > 2) # Find the elements of a that are bigger than 2; - # this returns a numpy array of Booleans of the same - # shape as a, where each slot of bool_idx tells - # whether that element of a is > 2. - -print bool_idx # Prints "[[False False] - # [ True True] - # [ True True]]" +bool_idx = (a > 2) # 2보다 큰 a의 요소를 찾습니다; + # 이 코드는 a와 shape가 같고 불리언 자료형을 요소로 하는 numpy 배열을 반환합니다, + # bool_idx의 각 요소는 동일한 위치에 있는 a의 + # 요소가 2보다 큰지를 말해줍니다. + +print bool_idx # 출력 "[[False False] + # [ True True] + # [ True True]]" -# We use boolean array indexing to construct a rank 1 array -# consisting of the elements of a corresponding to the True values -# of bool_idx -print a[bool_idx] # Prints "[3 4 5 6]" +# 불리언 배열 인덱싱을 통해 bool_idx에서 +# 참 값을 가지는 요소로 구성되는 +# rank 1인 배열을 구성할 수 있습니다. +print a[bool_idx] # 출력 "[3 4 5 6]" -# We can do all of the above in a single concise statement: -print a[a > 2] # Prints "[3 4 5 6]" -``` +# 위에서 한 모든것을 한 문장으로 할 수 있습니다: +print a[a > 2] # 출력 "[3 4 5 6]" +~~~ -For brevity we have left out a lot of details about numpy array indexing; -if you want to know more you should -[read the documentation](http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html). +튜토리얼을 간결히 하고자 numpy 배열 인덱싱에 관한 많은 내용을 생략했습니다. 
+조금 더 알고싶다면 [문서](http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html)를 참조하세요. -### Datatypes -Every numpy array is a grid of elements of the same type. -Numpy provides a large set of numeric datatypes that you can use to construct arrays. -Numpy tries to guess a datatype when you create an array, but functions that construct -arrays usually also include an optional argument to explicitly specify the datatype. -Here is an example: - -```python + +### 자료형 +Numpy 배열은 동일한 자료형을 가지는 값들이 격자판 형태로 있는 것입니다. +Numpy에선 배열을 구성하는 데 사용할 수 있는 다양한 숫자 자료형을 제공합니다. +Numpy는 배열이 생성될 때 자료형을 스스로 추측합니다, 그러나 배열을 생성할 때 명시적으로 특정 자료형을 지정할 수도 있습니다. 예시: + +~~~python import numpy as np -x = np.array([1, 2]) # Let numpy choose the datatype -print x.dtype # Prints "int64" +x = np.array([1, 2]) # Numpy가 자료형을 추측해서 선택 +print x.dtype # 출력 "int64" -x = np.array([1.0, 2.0]) # Let numpy choose the datatype -print x.dtype # Prints "float64" +x = np.array([1.0, 2.0]) # Numpy가 자료형을 추측해서 선택 +print x.dtype # 출력 "float64" -x = np.array([1, 2], dtype=np.int64) # Force a particular datatype -print x.dtype # Prints "int64" -``` -You can read all about numpy datatypes -[in the documentation](http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html). +x = np.array([1, 2], dtype=np.int64) # 특정 자료형을 명시적으로 지정 +print x.dtype # 출력 "int64" +~~~ +Numpy 자료형에 관한 자세한 사항은 [문서](http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)를 참조하세요. -### Array math -Basic mathematical functions operate elementwise on arrays, and are available -both as operator overloads and as functions in the numpy module: -```python +### 배열 연산 +기본적인 수학함수는 배열의 각 요소별로 동작하며 연산자를 통해 동작하거나 numpy 함수모듈을 통해 동작합니다: + +~~~python import numpy as np x = np.array([[1,2],[3,4]], dtype=np.float64) y = np.array([[5,6],[7,8]], dtype=np.float64) -# Elementwise sum; both produce the array +# 요소별 합; 둘 다 다음의 배열을 만듭니다 # [[ 6.0 8.0] # [10.0 12.0]] print x + y print np.add(x, y) -# Elementwise difference; both produce the array +# 요소별 차; 둘 다 다음의 배열을 만듭니다 # [[-4.0 -4.0] # [-4.0 -4.0]] print x - y print np.subtract(x, y) -# Elementwise product; both produce the array +# 요소별 곱; 둘 다 다음의 배열을 만듭니다 # [[ 5.0 12.0] # [21.0 32.0]] print x * y print np.multiply(x, y) -# Elementwise division; both produce the array +# 요소별 나눗셈; 둘 다 다음의 배열을 만듭니다 # [[ 0.2 0.33333333] # [ 0.42857143 0.5 ]] print x / y print np.divide(x, y) -# Elementwise square root; produces the array +# 요소별 제곱근; 다음의 배열을 만듭니다 # [[ 1. 1.41421356] # [ 1.73205081 2. ]] print np.sqrt(x) -``` +~~~ -Note that unlike MATLAB, `*` is elementwise multiplication, not matrix -multiplication. We instead use the `dot` function to compute inner -products of vectors, to multiply a vector by a matrix, and to -multiply matrices. `dot` is available both as a function in the numpy -module and as an instance method of array objects: +MATLAB과 달리, '*'은 행렬 곱이 아니라 요소별 곱입니다. Numpy에선 벡터의 내적, 벡터와 행렬의 곱, 행렬곱을 위해 '*'대신 'dot'함수를 사용합니다. 
`dot`은 Numpy 모듈 함수로서도 배열 객체의 인스턴스 메소드로서도 이용 가능한 함수입니다: -```python +~~~python import numpy as np x = np.array([[1,2],[3,4]]) @@ -709,346 +689,319 @@ y = np.array([[5,6],[7,8]]) v = np.array([9,10]) w = np.array([11, 12]) -# Inner product of vectors; both produce 219 +# 벡터의 내적; 둘 다 결과는 219 print v.dot(w) print np.dot(v, w) -# Matrix / vector product; both produce the rank 1 array [29 67] +# 행렬과 벡터의 곱; 둘 다 결과는 rank 1인 배열 [29 67] print x.dot(v) print np.dot(x, v) -# Matrix / matrix product; both produce the rank 2 array +# 행렬곱; 둘 다 결과는 rank 2인 배열 # [[19 22] # [43 50]] print x.dot(y) print np.dot(x, y) -``` +~~~ -Numpy provides many useful functions for performing computations on -arrays; one of the most useful is `sum`: +Numpy는 배열 연산에 유용하게 쓰이는 많은 함수를 제공합니다. 가장 유용한 것 중 하나는 `sum`입니다: -```python +~~~python import numpy as np x = np.array([[1,2],[3,4]]) -print np.sum(x) # Compute sum of all elements; prints "10" -print np.sum(x, axis=0) # Compute sum of each column; prints "[4 6]" -print np.sum(x, axis=1) # Compute sum of each row; prints "[3 7]" -``` -You can find the full list of mathematical functions provided by numpy -[in the documentation](http://docs.scipy.org/doc/numpy/reference/routines.math.html). +print np.sum(x) # 모든 요소를 합한 값을 연산; 출력 "10" +print np.sum(x, axis=0) # 각 열에 대한 합을 연산; 출력 "[4 6]" +print np.sum(x, axis=1) # 각 행에 대한 합을 연산; 출력 "[3 7]" +~~~ +Numpy가 제공하는 모든 수학함수의 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/routines.math.html)를 참조하세요. -Apart from computing mathematical functions using arrays, we frequently -need to reshape or otherwise manipulate data in arrays. The simplest example -of this type of operation is transposing a matrix; to transpose a matrix, -simply use the `T` attribute of an array object: +배열 연산을 하지 않더라도, 종종 배열의 모양을 바꾸거나 데이터를 처리해야 할 때가 있습니다. +가장 간단한 예는 행렬의 주 대각선을 기준으로 대칭되는 요소끼리 뒤바꾸는 것입니다; 이를 전치라고 하며 행렬을 전치하기 위해선, 간단하게 배열 객체의 `T` 속성을 사용하면 됩니다: -```python +~~~python import numpy as np x = np.array([[1,2], [3,4]]) -print x # Prints "[[1 2] +print x # 출력 "[[1 2] # [3 4]]" -print x.T # Prints "[[1 3] +print x.T # 출력 "[[1 3] # [2 4]]" -# Note that taking the transpose of a rank 1 array does nothing: +# rank 1인 배열을 전치할 경우 아무 일도 일어나지 않습니다: v = np.array([1,2,3]) -print v # Prints "[1 2 3]" -print v.T # Prints "[1 2 3]" -``` +print v # 출력 "[1 2 3]" +print v.T # 출력 "[1 2 3]" +~~~ +Numpy는 배열을 다루는 다양한 함수들을 제공합니다; 이러한 함수의 전체 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/routines.array-manipulation.html)를 참조하세요. -### Broadcasting -Broadcasting is a powerful mechanism that allows numpy to work with arrays of different -shapes when performing arithmetic operations. Frequently we have a smaller array and a -larger array, and we want to use the smaller array multiple times to perform some operation -on the larger array. -For example, suppose that we want to add a constant vector to each -row of a matrix. We could do it like this: +### 브로드캐스팅 +브로드캐스팅은 Numpy에서 shape가 다른 배열 간에도 산술 연산이 가능하게 하는 메커니즘입니다. 종종 작은 배열과 큰 배열이 있을 때, 큰 배열을 대상으로 작은 배열을 여러 번 연산하고자 할 때가 있습니다. 예를 들어, 행렬의 각 행에 상수 벡터를 더하는 걸 생각해보세요.
이는 다음과 같은 방식으로 처리될 수 있습니다: -```python +~~~python import numpy as np -# We will add the vector v to each row of the matrix x, -# storing the result in the matrix y +# 행렬 x의 각 행에 벡터 v를 더한 뒤, +# 그 결과를 행렬 y에 저장하고자 합니다 x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]]) v = np.array([1, 0, 1]) -y = np.empty_like(x) # Create an empty matrix with the same shape as x +y = np.empty_like(x) # x와 동일한 shape를 가지며 비어있는 행렬 생성 -# Add the vector v to each row of the matrix x with an explicit loop +# 명시적 반복문을 통해 행렬 x의 각 행에 벡터 v를 더하는 방법 for i in range(4): y[i, :] = x[i, :] + v -# Now y is the following +# 이제 y는 다음과 같습니다 # [[ 2 2 4] # [ 5 5 7] # [ 8 8 10] # [11 11 13]] print y -``` +~~~ -This works; however when the matrix `x` is very large, computing an explicit loop -in Python could be slow. Note that adding the vector `v` to each row of the matrix -`x` is equivalent to forming a matrix `vv` by stacking multiple copies of `v` vertically, -then performing elementwise summation of `x` and `vv`. We could implement this -approach like this: +위의 방식대로 하면 됩니다; 그러나 'x'가 매우 큰 행렬이라면, 파이썬의 명시적 반복문을 이용한 위 코드는 매우 느려질 수 있습니다. 벡터 'v'를 행렬 'x'의 각 행에 더하는 것은 'v'를 여러 개 복사해 수직으로 쌓은 행렬 'vv'를 만들고 이 'vv'를 'x'에 더하는것과 동일합니다. 이 과정을 아래의 코드로 구현할 수 있습니다: -```python +~~~python import numpy as np -# We will add the vector v to each row of the matrix x, -# storing the result in the matrix y +# 벡터 v를 행렬 x의 각 행에 더한 뒤, +# 그 결과를 행렬 y에 저장하고자 합니다 x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]]) v = np.array([1, 0, 1]) -vv = np.tile(v, (4, 1)) # Stack 4 copies of v on top of each other -print vv # Prints "[[1 0 1] - # [1 0 1] - # [1 0 1] - # [1 0 1]]" -y = x + vv # Add x and vv elementwise -print y # Prints "[[ 2 2 4 - # [ 5 5 7] - # [ 8 8 10] - # [11 11 13]]" -``` - -Numpy broadcasting allows us to perform this computation without actually -creating multiple copies of `v`. Consider this version, using broadcasting: - -```python +vv = np.tile(v, (4, 1)) # v의 복사본 4개를 위로 차곡차곡 쌓은 것이 vv +print vv # 출력 "[[1 0 1] + # [1 0 1] + # [1 0 1] + # [1 0 1]]" +y = x + vv # x와 vv의 요소별 합 +print y # 출력 "[[ 2 2 4 + # [ 5 5 7] + # [ 8 8 10] + # [11 11 13]]" +~~~ + +Numpy 브로드캐스팅을 이용한다면 이렇게 v의 복사본을 여러 개 만들지 않아도 동일한 연산을 할 수 있습니다. +아래는 브로드캐스팅을 이용한 예시 코드입니다: + +~~~python import numpy as np -# We will add the vector v to each row of the matrix x, -# storing the result in the matrix y +# 벡터 v를 행렬 x의 각 행에 더한 뒤, +# 그 결과를 행렬 y에 저장하고자 합니다 x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]]) v = np.array([1, 0, 1]) -y = x + v # Add v to each row of x using broadcasting -print y # Prints "[[ 2 2 4] - # [ 5 5 7] - # [ 8 8 10] - # [11 11 13]]" -``` - -The line `y = x + v` works even though `x` has shape `(4, 3)` and `v` has shape -`(3,)` due to broadcasting; this line works as if `v` actually had shape `(4, 3)`, -where each row was a copy of `v`, and the sum was performed elementwise. - -Broadcasting two arrays together follows these rules: - -1. If the arrays do not have the same rank, prepend the shape of the lower rank array - with 1s until both shapes have the same length. -2. The two arrays are said to be *compatible* in a dimension if they have the same - size in the dimension, or if one of the arrays has size 1 in that dimension. -3. The arrays can be broadcast together if they are compatible in all dimensions. -4. After broadcasting, each array behaves as if it had shape equal to the elementwise - maximum of shapes of the two input arrays. -5. 
In any dimension where one array had size 1 and the other array had size greater than 1, - the first array behaves as if it were copied along that dimension -If this explanation does not make sense, try reading the explanation -[from the documentation](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) -or [this explanation](http://wiki.scipy.org/EricsBroadcastingDoc). - -Functions that support broadcasting are known as *universal functions*. You can find -the list of all universal functions -[in the documentation](http://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs). - -Here are some applications of broadcasting: - -```python +y = x + v # 브로드캐스팅을 이용하여 v를 x의 각 행에 더하기 +print y # 출력 "[[ 2 2 4] + # [ 5 5 7] + # [ 8 8 10] + # [11 11 13]]" +~~~ + +`x`의 shape가 `(4, 3)`이고 `v`의 shape가 `(3,)`라도 브로드캐스팅으로 인해 `y = x + v`는 문제없이 수행됩니다; +이때 `v`는 `v`의 복사본이 차곡차곡 쌓인 shape `(4, 3)`의 배열처럼 간주되어 `x`와 동일한 shape가 되며, 이들 간의 요소별 덧셈 결과가 y에 저장됩니다. + +두 배열의 브로드캐스팅은 아래의 규칙을 따릅니다: + +1. 두 배열의 rank가 동일하지 않다면, 두 shape의 길이가 같아질 때까지 낮은 rank 배열의 shape 앞쪽을 1로 채웁니다. +2. 특정 차원에서 두 배열이 동일한 크기를 갖거나, 두 배열 중 하나의 크기가 1이라면 그 두 배열은 해당 차원에서 *compatible*하다고 여겨집니다. +3. 두 배열이 모든 차원에서 compatible하다면, 브로드캐스팅이 가능합니다. +4. 브로드캐스팅이 이뤄지면, 각 배열은 두 입력 배열 shape의 요소별 최댓값으로 이루어진 shape를 가진 것처럼 동작합니다. +5. 한 배열의 크기가 1이고 다른 배열의 크기가 1보다 큰 차원에서는, 크기가 1인 배열이 그 차원을 따라 복사되어 쌓인 것처럼 동작합니다. + +설명이 잘 이해되지 않는다면 [scipy 문서](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html)나 [scipy 위키](http://wiki.scipy.org/EricsBroadcastingDoc)를 참조하세요. + +브로드캐스팅을 지원하는 함수를 *universal functions*라고 합니다. +*universal functions* 목록은 [문서](http://docs.scipy.org/doc/numpy/reference/ufuncs.html#available-ufuncs)를 참조하세요. + +브로드캐스팅을 응용한 예시들입니다: + +~~~python import numpy as np -# Compute outer product of vectors -v = np.array([1,2,3]) # v has shape (3,) -w = np.array([4,5]) # w has shape (2,) -# To compute an outer product, we first reshape v to be a column -# vector of shape (3, 1); we can then broadcast it against w to yield -# an output of shape (3, 2), which is the outer product of v and w: +# 벡터의 외적을 계산 +v = np.array([1,2,3]) # v의 shape는 (3,) +w = np.array([4,5]) # w의 shape는 (2,) +# 외적을 계산하기 위해, 먼저 v를 shape가 (3,1)인 열벡터로 바꿔야 합니다; +# 그다음 이것을 w에 맞춰 브로드캐스팅한 뒤 결과물로 shape가 (3,2)인 행렬을 얻습니다, +# 이 행렬은 v와 w 외적의 결과입니다: # [[ 4 5] # [ 8 10] # [12 15]] print np.reshape(v, (3, 1)) * w -# Add a vector to each row of a matrix +# 벡터를 행렬의 각 행에 더하기 x = np.array([[1,2,3], [4,5,6]]) -# x has shape (2, 3) and v has shape (3,) so they broadcast to (2, 3), -# giving the following matrix: +# x는 shape가 (2, 3)이고 v는 shape가 (3,)이므로 이 둘을 브로드캐스팅하면 shape가 (2, 3)인 +# 아래와 같은 행렬이 나옵니다: # [[2 4 6] # [5 7 9]] print x + v -# Add a vector to each column of a matrix -# x has shape (2, 3) and w has shape (2,). -# If we transpose x then it has shape (3, 2) and can be broadcast -# against w to yield a result of shape (3, 2); transposing this result -# yields the final result of shape (2, 3) which is the matrix x with -# the vector w added to each column. Gives the following matrix: +# 벡터를 행렬의 각 열에 더하기 +# x는 shape가 (2, 3)이고 w는 shape가 (2,)입니다. +# x의 전치행렬은 shape가 (3,2)이며 이는 w와 브로드캐스팅이 가능하고 결과로 shape가 (3,2)인 행렬이 생깁니다; +# 이 행렬을 전치하면 shape가 (2,3)인 행렬이 나오며 +# 이는 행렬 x의 각 열에 벡터 w를 더한 결과와 동일합니다. +# 아래의 행렬입니다: # [[ 5 6 7] # [ 9 10 11]] print (x.T + w).T -# Another solution is to reshape w to be a row vector of shape (2, 1); -# we can then broadcast it directly against x to produce the same -# output. +# 다른 방법은 w를 shape가 (2,1)인 열벡터로 변환하는 것입니다; +# 그런 다음 이를 바로 x에 브로드캐스팅해 더하면 +# 동일한 결과가 나옵니다.
print x + np.reshape(w, (2, 1)) -# Multiply a matrix by a constant: -# x has shape (2, 3). Numpy treats scalars as arrays of shape (); -# these can be broadcast together to shape (2, 3), producing the -# following array: +# 행렬의 스칼라배: +# x 의 shape는 (2, 3)입니다. Numpy는 스칼라를 shape가 ()인 배열로 취급합니다; +# 그렇기에 스칼라 값은 (2,3) shape로 브로드캐스트 될 수 있고, +# 아래와 같은 결과를 만들어 냅니다: # [[ 2 4 6] # [ 8 10 12]] print x * 2 -``` +~~~ -Broadcasting typically makes your code more concise and faster, so you -should strive to use it where possible. +브로드캐스팅은 보통 코드를 간결하고 빠르게 해줍니다, 그러니 가능한 많이 사용하세요. ### Numpy Documentation -This brief overview has touched on many of the important things that you need to -know about numpy, but is far from complete. Check out the -[numpy reference](http://docs.scipy.org/doc/numpy/reference/) -to find out much more about numpy. +이 문서는 여러분이 numpy에 대해 알아야할 많은 중요한 사항들을 다루지만 완벽하진 않습니다. +numpy에 관한 더 많은 사항은 [numpy 레퍼런스](http://docs.scipy.org/doc/numpy/reference/)를 참조하세요. + ## SciPy -Numpy provides a high-performance multidimensional array and basic tools to -compute with and manipulate these arrays. -[SciPy](http://docs.scipy.org/doc/scipy/reference/) -builds on this, and provides -a large number of functions that operate on numpy arrays and are useful for -different types of scientific and engineering applications. -The best way to get familiar with SciPy is to -[browse the documentation](http://docs.scipy.org/doc/scipy/reference/index.html). -We will highlight some parts of SciPy that you might find useful for this class. +Numpy는 고성능의 다차원 배열 객체와 이를 다룰 도구를 제공합니다. +numpy를 바탕으로 만들어진 [SciPy](http://docs.scipy.org/doc/scipy/reference/)는, +numpy 배열을 다루는 많은 함수를 제공하며 다양한 과학, 공학분야에서 유용하게 사용됩니다. + +SciPy에 익숙해지는 최고의 방법은 [SciPy 공식 문서](http://docs.scipy.org/doc/scipy/reference/index.html)를 보는 것입니다. +이 문서에서는 scipy중 cs231n 수업에서 유용하게 쓰일 일부분만을 소개할것입니다. -### Image operations -SciPy provides some basic functions to work with images. -For example, it has functions to read images from disk into numpy arrays, -to write numpy arrays to disk as images, and to resize images. -Here is a simple example that showcases these functions: -```python +### 이미지 작업 +SciPy는 이미지를 다룰 기본적인 함수들을 제공합니다. +예를들자면, 디스크에 저장된 이미지를 numpy 배열로 읽어 들이는 함수가 있으며, +numpy 배열을 디스크에 이미지로 저장하는 함수도 있고, 이미지의 크기를 바꾸는 함수도 있습니다. +이 함수들의 간단한 사용 예시입니다: + +~~~python from scipy.misc import imread, imsave, imresize -# Read an JPEG image into a numpy array +# JPEG 이미지를 numpy 배열로 읽어들이기 img = imread('assets/cat.jpg') -print img.dtype, img.shape # Prints "uint8 (400, 248, 3)" - -# We can tint the image by scaling each of the color channels -# by a different scalar constant. The image has shape (400, 248, 3); -# we multiply it by the array [1, 0.95, 0.9] of shape (3,); -# numpy broadcasting means that this leaves the red channel unchanged, -# and multiplies the green and blue channels by 0.95 and 0.9 -# respectively. +print img.dtype, img.shape # 출력 "uint8 (400, 248, 3)" + +# 각각의 색깔 채널을 다른 상수값으로 스칼라배함으로써 +# 이미지의 색을 변화시킬 수 있습니다. +# 이미지의 shape는 (400, 248, 3)입니다; +# 여기에 shape가 (3,)인 배열 [1, 0.95, 0.9]를 곱합니다; +# numpy 브로드캐스팅에 의해 이 배열이 곱해지며 붉은색 채널은 변하지 않으며, +# 초록색, 파란색 채널에는 각각 0.95, 0.9가 곱해집니다 img_tinted = img * [1, 0.95, 0.9] -# Resize the tinted image to be 300 by 300 pixels. +# 색변경 이미지를 300x300픽셀로 크기 조절. img_tinted = imresize(img_tinted, (300, 300)) -# Write the tinted image back to disk +# 색변경 이미지를 디스크에 기록하기 imsave('assets/cat_tinted.jpg', img_tinted) -``` +~~~
- Left: The original image. - Right: The tinted and resized image. + Left: 원본 이미지. + Right: 색변경 & 크기변경 이미지.
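*(참고: 위 예제는 튜토리얼 작성 당시의 SciPy를 기준으로 합니다. 이후 버전의 SciPy에서는 `scipy.misc.imread`, `imsave`, `imresize`가 제거되었으므로, 아래는 `imageio`와 Pillow 패키지가 설치되어 있다고 가정하고 같은 작업을 수행하는 간단한 스케치입니다. 파이썬 3 기준이며, 파일 경로는 위 예제의 것을 그대로 사용했습니다.)*

~~~python
import numpy as np
import imageio
from PIL import Image

# JPEG 이미지를 numpy 배열로 읽어들이기 (scipy.misc.imread 대체, imageio 가정)
img = imageio.imread('assets/cat.jpg')
print(img.dtype, img.shape)  # 예: "uint8 (400, 248, 3)"

# 브로드캐스팅을 이용한 채널별 스칼라배 (원래 예제와 동일)
img_tinted = img * [1, 0.95, 0.9]

# Pillow로 크기 조절 (scipy.misc.imresize 대체); resize는 (너비, 높이) 튜플을 받습니다
img_tinted = np.array(Image.fromarray(np.uint8(img_tinted)).resize((300, 300)))

# 색변경 이미지를 디스크에 기록하기 (scipy.misc.imsave 대체)
imageio.imwrite('assets/cat_tinted.jpg', img_tinted)
~~~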
-### MATLAB files -The functions `scipy.io.loadmat` and `scipy.io.savemat` allow you to read and -write MATLAB files. You can read about them -[in the documentation](http://docs.scipy.org/doc/scipy/reference/io.html). + +### MATLAB 파일 +`scipy.io.loadmat`와 `scipy.io.savemat` 함수를 통해 +MATLAB 파일을 읽고 쓸 수 있습니다. 자세한 내용은 +[문서](http://docs.scipy.org/doc/scipy/reference/io.html)를 참조하세요. -### Distance between points -SciPy defines some useful functions for computing distances between sets of points. -The function `scipy.spatial.distance.pdist` computes the distance between all pairs -of points in a given set: +### 두 점 사이의 거리 +SciPy에는 점들 간의 거리를 계산하기 위한 유용한 함수들이 정의되어 있습니다. -```python +`scipy.spatial.distance.pdist` 함수는 주어진 점들 사이의 모든 거리를 계산합니다: + +~~~python import numpy as np from scipy.spatial.distance import pdist, squareform -# Create the following array where each row is a point in 2D space: +# 각 행이 2차원 공간에서의 한 점을 의미하는 행렬을 생성: # [[0 1] # [1 0] # [2 0]] x = np.array([[0, 1], [1, 0], [2, 0]]) print x -# Compute the Euclidean distance between all rows of x. -# d[i, j] is the Euclidean distance between x[i, :] and x[j, :], -# and d is the following array: +# x가 나타내는 모든 점 사이의 유클리디안 거리를 계산. +# d[i, j]는 x[i, :]와 x[j, :] 사이의 유클리디안 거리를 의미하며, +# d는 아래의 행렬입니다: # [[ 0. 1.41421356 2.23606798] # [ 1.41421356 0. 1. ] # [ 2.23606798 1. 0. ]] d = squareform(pdist(x, 'euclidean')) print d -``` -You can read all the details about this function -[in the documentation](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html). +~~~ +이 함수에 대한 자세한 사항은 [pdist 공식 문서](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html)를 참조하세요. -A similar function (`scipy.spatial.distance.cdist`) computes the distance between all pairs -across two sets of points; you can read about it -[in the documentation](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html). +`scipy.spatial.distance.cdist`는 위와 유사하게 서로 다른 두 점 집합 간의 모든 쌍에 대한 거리를 계산합니다. 자세한 사항은 [cdist 공식 문서](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html)를 참조하세요. + ## Matplotlib -[Matplotlib](http://matplotlib.org/) is a plotting library. -In this section give a brief introduction to the `matplotlib.pyplot` module, -which provides a plotting system similar to that of MATLAB. +[Matplotlib](http://matplotlib.org/)는 plotting 라이브러리입니다. +이번에는 MATLAB의 plotting 시스템과 유사한 기능을 제공하는 +`matplotlib.pyplot` 모듈을 간략히 소개하겠습니다. + ### Plotting -The most important function in matplotlib is `plot`, -which allows you to plot 2D data. Here is a simple example: +matplotlib에서 가장 중요한 함수는 2차원 데이터를 그릴 수 있게 해주는 `plot`입니다. +여기 간단한 예시가 있습니다: -```python +~~~python import numpy as np import matplotlib.pyplot as plt -# Compute the x and y coordinates for points on a sine curve +# 사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y = np.sin(x) -# Plot the points using matplotlib +# matplotlib를 이용해 점들을 그리기 plt.plot(x, y) -plt.show() # You must call plt.show() to make graphics appear. -``` +plt.show() # 그래프를 나타나게 하기 위해선 plt.show() 함수를 호출해야만 합니다. +~~~ -Running this code produces the following plot: +이 코드를 실행하면 아래의 그래프가 생성됩니다:
[그림: 사인 곡선 그래프]
-With just a little bit of extra work we can easily plot multiple lines -at once, and add a title, legend, and axis labels: +약간의 몇 가지 추가적인 작업을 통해 여러 개의 그래프와 제목, 범주, 축 이름을 한 번에 쉽게 나타낼 수 있습니다: -```python +~~~python import numpy as np import matplotlib.pyplot as plt -# Compute the x and y coordinates for points on sine and cosine curves +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y_sin = np.sin(x) y_cos = np.cos(x) -# Plot the points using matplotlib +# matplotlib를 이용해 점들을 그리기 plt.plot(x, y_sin) plt.plot(x, y_cos) plt.xlabel('x axis label') @@ -1056,57 +1009,59 @@ plt.ylabel('y axis label') plt.title('Sine and Cosine') plt.legend(['Sine', 'Cosine']) plt.show() -``` +~~~
[그림: 제목, 범례, 축 이름이 추가된 사인/코사인 그래프]
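*(참고: 그림을 화면에 띄우는 대신 파일로 저장하고 싶다면 `plt.show()` 대신 `plt.savefig`를 사용할 수 있습니다. 아래는 간단한 예시이며, 파일명 `sine_cosine.png`는 임의로 정한 것입니다.)*

~~~python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 3 * np.pi, 0.1)
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
plt.savefig('sine_cosine.png')  # 현재 그림을 PNG 파일로 저장
~~~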
-You can read much more about the `plot` function -[in the documentation](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot). +`plot`함수에 관한 더 많은 내용은 [문서](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.plot)를 참조하세요. + ### Subplots -You can plot different things in the same figure using the `subplot` function. -Here is an example: -```python +'subplot'함수를 통해 다른 내용도 동일한 그림 위에 나타낼 수 있습니다. +여기 간단한 예시가 있습니다: + +~~~python import numpy as np import matplotlib.pyplot as plt -# Compute the x and y coordinates for points on sine and cosine curves +# 사인과 코사인 곡선의 x,y 좌표를 계산 x = np.arange(0, 3 * np.pi, 0.1) y_sin = np.sin(x) y_cos = np.cos(x) -# Set up a subplot grid that has height 2 and width 1, -# and set the first such subplot as active. +# 높이가 2이고 너비가 1인 subplot 구획을 설정하고, +# 첫 번째 구획을 활성화. plt.subplot(2, 1, 1) -# Make the first plot +# 첫 번째 그리기 plt.plot(x, y_sin) plt.title('Sine') -# Set the second subplot as active, and make the second plot. +# 두 번째 subplot 구획을 활성화 하고 그리기 plt.subplot(2, 1, 2) plt.plot(x, y_cos) plt.title('Cosine') -# Show the figure. +# 그림 보이기. plt.show() -``` +~~~
[그림: 두 개의 subplot으로 나누어 그린 사인과 코사인 그래프]
-You can read much more about the `subplot` function -[in the documentation](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.subplot). +`subplot`함수에 관한 더 많은 내용은 +[문서](http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.subplot)를 참조하세요. -### Images -You can use the `imshow` function to show images. Here is an example: -```python +### 이미지 +`imshow`함수를 사용해 이미지를 나타낼 수 있습니다. 여기 예시가 있습니다: + +~~~python import numpy as np from scipy.misc import imread, imresize import matplotlib.pyplot as plt @@ -1114,20 +1069,26 @@ import matplotlib.pyplot as plt img = imread('assets/cat.jpg') img_tinted = img * [1, 0.95, 0.9] -# Show the original image +# 원본 이미지 나타내기 plt.subplot(1, 2, 1) plt.imshow(img) -# Show the tinted image +# 색변화된 이미지 나타내기 plt.subplot(1, 2, 2) -# A slight gotcha with imshow is that it might give strange results -# if presented with data that is not uint8. To work around this, we -# explicitly cast the image to uint8 before displaying it. +# imshow를 이용하며 주의할 점은 데이터의 자료형이 +# uint8이 아니라면 이상한 결과를 보여줄 수도 있다는 것입니다. +# 그러므로 이미지를 나타내기 전에 명시적으로 자료형을 uint8로 형변환 해줍니다. + plt.imshow(np.uint8(img_tinted)) plt.show() -``` +~~~
[그림: 원본 고양이 이미지(왼쪽)와 색변경된 이미지(오른쪽)]
+ +--- +

+번역: 강상훈 (sanghkaang) +

diff --git a/terminal-tutorial.md b/terminal-tutorial.md index 771efb68..3da07bc3 100644 --- a/terminal-tutorial.md +++ b/terminal-tutorial.md @@ -3,36 +3,41 @@ layout: page title: Terminal.com Tutorial permalink: /terminal-tutorial/ --- -For the assignments, we offer an option to use [Terminal](https://www.stanfordterminalcloud.com) for developing and testing your implementations. Notice that we're not using the main Terminal.com site but a subdomain which has been assigned specifically for this class. [Terminal](https://www.stanfordterminalcloud.com) is an online computing platform that allows us to access pre-configured command line environments. Note that, it's not required to use [Terminal](https://www.stanfordterminalcloud.com) for your assignments; however, it might make life easier with all the required dependencies and development toolkits configured for you. +과제를 진행하기 위해서, [Terminal](https://www.stanfordterminalcloud.com)을 사용하는 옵션을 제공합니다. Terminal에서 여러분의 결과물을 개발하고 테스트할 수 있습니다. 한 가지 유의해야 할 것은, Terminal.com의 메인 사이트가 아니라 cs231n 수업을 위해 특별히 할당된 서브도메인을 사용한다는 점입니다. [Terminal](https://www.stanfordterminalcloud.com)은 미리 설정된 커맨드 라인 환경(command line environment)에 접근할 수 있는 온라인 컴퓨팅 플랫폼입니다. 과제를 진행하기 위해서 반드시 [Terminal](https://www.stanfordterminalcloud.com)을 사용할 필요는 없습니다. 그러나 개발에 필요한 의존성 패키지와 개발도구들이 미리 설정되어 있기 때문에 수고를 덜 수 있습니다. -This tutorial lists the necessary steps of working on the assignments using Terminal. First of all, [sign up your own account](https://www.stanfordterminalcloud.com/signup). Log in [Terminal](https://www.stanfordterminalcloud.com) with the account that you have just created. +이 튜토리얼은 Terminal을 사용하여 과제를 진행하기 위한 필수적인 과정들을 설명합니다. 가장 먼저, [여러분의 계정을 만드세요](https://www.stanfordterminalcloud.com/signup). 그다음 방금 만든 계정으로 [Terminal](https://www.stanfordterminalcloud.com)에 로그인합니다. -For each assignment, we will provide you a link to a shared terminal snapshot. These snapshots are pre-configured command line environments with the starter code, where you can write your implementations and execute the code. +각각의 과제마다 공유된 Terminal 스냅 샷 링크를 제공합니다. 이 스냅 샷들에는 여러분이 구현을 작성하고 실행할 수 있는 시작 코드와 미리 설정된 command line 환경이 포함되어 있습니다. -Here's an example of what a snapshot page looked like for an assignment in 2015: +다음은 2015년 과제의 스냅 샷 페이지 예시입니다:
[이미지: 2015년 과제의 스냅 샷 페이지]
-Yours will look similar. Click the "Start" button on the lower right corner. This will clone the shared snapshot to your own account. Now you should be able to find the terminal under the [My Terminals](https://www.stanfordterminalcloud.com/terminals) tab. +여러분의 스냅 샷도 이와 비슷할 것입니다. 오른쪽 아래의 "Start" 버튼을 클릭합니다. 그럼 여러분의 계정에 공유된 스냅 샷이 복사됩니다. 이제 [My Terminals](https://www.stanfordterminalcloud.com/terminals) 탭에서 복사된 터미널을 찾을 수 있습니다.
[이미지: My Terminals 탭에 복사된 터미널]
-Yours will look similar. You are all set! To work on the assignments, click the link to your terminal (shown in the red box in the above image). This link will open up the user interface layer over an AWS machine. It will look something similar to this: +여러분의 화면도 이와 비슷할 것입니다. 이제 과제를 진행하기 위한 준비가 되었습니다! 과제를 진행하려면 여러분의 terminal로 연결되는 링크(위 이미지의 빨간색 상자)를 클릭합니다. 이 링크를 클릭하면 AWS 머신 위에서 동작하는 사용자 인터페이스 계층이 열리며, 다음과 비슷한 화면이 나타납니다:
[이미지: Terminal의 사용자 인터페이스]
-We have set up the Jupyter Notebook and other dependencies in the terminal. Launch a new console window with the small + sign (if you don't already have one), navigate around and look for the assignment folder and code. Launch a Jupyer notebook and work on the assignment. If your're a student enrolled in the class you will submit your assignment through Coursework: +terminal에는 Jupyter Notebook과 다른 필요요소들이 설치되어 있습니다. 조그마한 + 버튼을 눌러 새 콘솔 창을 엽니다(아직 열려있는 콘솔이 없는 경우). 그리고 과제 폴더와 코드를 찾은 뒤, Jupyter Notebook을 실행하고 과제를 진행합니다. cs231n에 등록한 학생이라면 Coursework를 통해 과제를 제출하면 됩니다:
[이미지: Coursework 과제 제출 화면]
-For more information about [Terminal](https://www.stanfordterminalcloud.com), check out the [FAQ](https://www.stanfordterminalcloud.com/faq) page. +[Terminal](https://www.stanfordterminalcloud.com)에 대한 더 많은 정보를 원하시면 [FAQ](https://www.stanfordterminalcloud.com/faq) 페이지를 방문해주세요. -**Important Note:** the usage of Terminal is charged on an hourly rate based on the instance type. A medium type instance costs $0.124 per hour. If you are enrolled in the class email Serena Yeung (syyeung@cs.stanford.edu) to request Terminal credits. We will send you $3 the first time around, and you can request more funds on a rolling basis when you run out. Please be responsible with the funds we allocate you. \ No newline at end of file +**중요:** 터미널 사용 시 사용하는 인스턴스 타입에 따라 시간당 사용요금이 부과됩니다. 미디엄 타입의 인스턴스 요금은 시간당 $0.124입니다. 수업에 등록한 학생이라면 Serena Yeung (syyeung@cs.stanford.edu)에게 메일을 보내 Terminal 크레딧을 요청하세요. 처음에는 $3를 보내드리며, 다 쓴 경우 추가로 요청할 수 있습니다. 할당된 금액은 책임감 있게 사용해주세요. + +--- +<p>

+번역: 김우정 (gnujoow) +

diff --git a/understanding-cnn.md b/understanding-cnn.md index 24eb909c..c931cff5 100644 --- a/understanding-cnn.md +++ b/understanding-cnn.md @@ -8,16 +8,18 @@ permalink: /understanding-cnn/ (this page is currently in draft form) ## Visualizing what ConvNets learn +## ConvNets이 무엇을 학습하는지의 시각화 Several approaches for understanding and visualizing Convolutional Networks have been developed in the literature, partly as a response to the common criticism that the learned features in a Neural Network are not interpretable. In this section we briefly survey some of these approaches and related work. +학계에서는 콘볼루션 네트워크(Convolutional Network)를 이해하고 시각화하려는 여러 가지 시도들이 있었는데, 이는 신경망(Neural Network)을 통해 학습된 특징(feature)이 해석 가능하지 않다는 일반적인 비판에 답하기 위함이다. 이번 섹션에서는 이와 관련된 접근법과 관련 연구들을 간단하게 알아보고자 한다. ### Visualizing the activations and first-layer weights **Layer Activations**. The most straight-forward visualization technique is to show the activations of the network during the forward pass. For ReLU networks, the activations usually start out looking relatively blobby and dense, but as the training progresses the activations usually become more sparse and localized. One dangerous pitfall that can be easily noticed with this visualization is that some activation maps may be all zero for many different inputs, which can indicate *dead* filters, and can be a symptom of high learning rates.
Typical-looking activations on the first CONV layer (left), and the 5th CONV layer (right) of a trained AlexNet looking at a picture of a cat. Every box shows an activation map corresponding to some filter. Notice that the activations are sparse (most values are zero, in this visualization shown in black) and mostly local.
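As a rough illustration of the *dead filter* check described above, here is a minimal numpy sketch. The activation tensor below is a random stand-in for activations you would actually collect from a forward pass over a batch of inputs; only the shape convention (N, C, H, W) is assumed:

~~~python
import numpy as np

# hypothetical batch of ReLU activations: N inputs, C filters, HxW maps
acts = np.maximum(0, np.random.randn(64, 96, 55, 55))

# a filter is suspect ("dead") if its activation map is zero for every input
max_per_filter = acts.max(axis=(0, 2, 3))   # strongest response of each filter
print(np.where(max_per_filter == 0)[0])     # indices of filters that never fire
~~~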
@@ -26,8 +28,8 @@ Several approaches for understanding and visualizing Convolutional Networks have **Conv/FC Filters.** The second common strategy is to visualize the weights. These are usually most interpretable on the first CONV layer which is looking directly at the raw pixel data, but it is possible to also show the filter weights deeper in the network. The weights are useful to visualize because well-trained networks usually display nice and smooth filters without any noisy patterns. Noisy patterns can be an indicator of a network that hasn't been trained for long enough, or possibly a very low regularization strength that may have led to overfitting.
Typical-looking filters on the first CONV layer (left), and the 2nd CONV layer (right) of a trained AlexNet. Notice that the first-layer weights are very nice and smooth, indicating a nicely converged network. The color/grayscale features are clustered because the AlexNet contains two separate streams of processing, and an apparent consequence of this architecture is that one stream develops high-frequency grayscale features and the other low-frequency color features. The 2nd CONV layer weights are not as interpretable, but it is apparent that they are still smooth, well-formed, and absent of noisy patterns.
@@ -38,7 +40,7 @@ Several approaches for understanding and visualizing Convolutional Networks have Another visualization technique is to take a large dataset of images, feed them through the network and keep track of which images maximally activate some neuron. We can then visualize the images to get an understanding of what the neuron is looking for in its receptive field. One such visualization (among others) is shown in [Rich feature hierarchies for accurate object detection and semantic segmentation](http://arxiv.org/abs/1311.2524) by Ross Girshick et al.:
Maximally activating images for some POOL5 (5th pool layer) neurons of an AlexNet. The activation values and the receptive field of the particular neuron are shown in white. (In particular, note that the POOL5 neurons are a function of a relatively large portion of the input image!) It can be seen that some neurons are responsive to upper bodies, text, or specular highlights.
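The bookkeeping behind this visualization is straightforward; here is a minimal sketch, where `forward_to_pool5` and `images` are hypothetical stand-ins for a real ConvNet forward pass and a real image dataset:

~~~python
import heapq
import numpy as np

def forward_to_pool5(img):
    # hypothetical stand-in returning POOL5 activations of shape (C, H, W)
    return np.random.rand(256, 6, 6)

images = [np.random.rand(227, 227, 3) for _ in range(1000)]  # fake dataset

top = []  # min-heap of (activation, image index) for one chosen unit
for i, img in enumerate(images):
    act = forward_to_pool5(img)[42, 3, 3]  # activation of one chosen unit
    heapq.heappush(top, (act, i))
    if len(top) > 9:
        heapq.heappop(top)  # keep only the 9 maximally activating images
~~~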
@@ -53,7 +55,7 @@ ConvNets can be interpreted as gradually transforming the images into a represen To produce an embedding, we can take a set of images and use the ConvNet to extract the CNN codes (e.g. in AlexNet the 4096-dimensional vector right before the classifier, and crucially, including the ReLU non-linearity). We can then plug these into t-SNE and get a 2-dimensional vector for each image. The corresponding images can then be visualized in a grid:
t-SNE embedding of a set of images based on their CNN codes. Images that are nearby each other are also close in the CNN representation space, which implies that the CNN "sees" them as being very similar. Notice that the similarities are more often class-based and semantic rather than pixel and color-based. For more details on how this visualization was produced, the associated code, and more related visualizations at different scales, refer to the t-SNE visualization of CNN codes.
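For reference, here is a minimal sketch of producing such an embedding. It assumes a matrix of CNN codes has already been extracted (random data stands in for it below) and uses scikit-learn's t-SNE implementation rather than the original authors' code:

~~~python
import numpy as np
from sklearn.manifold import TSNE

codes = np.random.rand(500, 4096)  # placeholder for real 4096-d CNN codes
xy = TSNE(n_components=2).fit_transform(codes)  # one 2-d point per image
# xy[i] is the location at which to draw image i in the grid
~~~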
@@ -64,7 +66,7 @@ To produce an embedding, we can take a set of images and use the ConvNet to extr Suppose that a ConvNet classifies an image as a dog. How can we be certain that it's actually picking up on the dog in the image as opposed to some contextual cues from the background or some other miscellaneous object? One way of investigating which part of the image some classification prediction is coming from is by plotting the probability of the class of interest (e.g. dog class) as a function of the position of an occluder object. That is, we iterate over regions of the image, set a patch of the image to be all zero, and look at the probability of the class. We can visualize the probability as a 2-dimensional heat map. This approach has been used in Matthew Zeiler's [Visualizing and Understanding Convolutional Networks](http://arxiv.org/abs/1311.2901):
Three input images (top). Notice that the occluder region is shown in grey. As we slide the occluder over the image we record the probability of the correct class and then visualize it as a heatmap (shown below each image). For instance, in the left-most image we see that the probability of Pomeranian plummets when the occluder covers the face of the dog, giving us some level of confidence that the dog's face is primarily responsible for the high classification score. Conversely, zeroing out other parts of the image is seen to have relatively negligible impact.
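A minimal sketch of the occlusion experiment itself, where `predict_prob` is a hypothetical function returning the probability of the class of interest for a single image:

~~~python
import numpy as np

def occlusion_heatmap(img, predict_prob, patch=20, stride=10):
    # slide a zero patch over the image, recording the class probability
    H, W = img.shape[:2]
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    heat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = img.copy()
            y, x = i * stride, j * stride
            occluded[y:y+patch, x:x+patch] = 0  # zero out one region
            heat[i, j] = predict_prob(occluded)
    return heat  # low probability marks regions the class score depends on
~~~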
diff --git a/video-lectures.md b/video-lectures.md new file mode 100644 index 00000000..3a7ae924 --- /dev/null +++ b/video-lectures.md @@ -0,0 +1,10 @@ +--- +layout: page +title: Video Lectures +permalink: /video-lectures/ +--- + +동영상 강의는 원래 강사인 Andrej Karpathy가 직접 유튜브에 올렸었지만, 몇 가지 문제로 인해 현재는 내려간 상태입니다. +그러나 [[이곳](https://archive.org/details/cs231n-CNNs)]에서 웹으로 강의를 듣거나 [토렌트 링크](https://archive.org/download/cs231n-CNNs/cs231n-CNNs_archive.torrent)를 통해 받을 수 있고, [새로 유튜브에 재생목록을 만들어주신 분](https://www.youtube.com/playlist?list=PLLvH2FwAQhnpj1WEB-jHmPuUeQ8mX-XXG)도 있습니다. 자막 파일은 [[여기](https://github.com/aikorea/cs231n/tree/master/captions)]에서 다운받으실 수 있습니다. (아직 진행 중이라 미완입니다.) + +유튜브에서 자동생성되어 매우 안 좋은 상태였던 영어 자막을 수정하고 한글로 번역까지 해 주는 작업은 **김영범 (rollis0825), 황재하 (jaywhang), 이지훈 (jihoonl), 김석우 (sandrokim), 이준수 (jslee), 조재민 (j-min)** 님께서 수고해 주시고 계십니다!